feat: add more
15
src/pytorch/.env.example
Normal file
@@ -0,0 +1,15 @@
# PyTorch version with CUDA support
PYTORCH_VERSION="2.6.0-cuda12.6-cudnn9-runtime"

# Jupyter configuration
JUPYTER_ENABLE_LAB="yes"
JUPYTER_TOKEN="pytorch"

# NVIDIA GPU configuration
NVIDIA_VISIBLE_DEVICES="all"
NVIDIA_DRIVER_CAPABILITIES="compute,utility"
GPU_COUNT=1

# Port overrides
JUPYTER_PORT_OVERRIDE=8888
TENSORBOARD_PORT_OVERRIDE=6006
153
src/pytorch/README.md
Normal file
@@ -0,0 +1,153 @@
# PyTorch

[English](./README.md) | [中文](./README.zh.md)

This service deploys PyTorch with CUDA support, Jupyter Lab, and TensorBoard for deep learning development.

## Services

- `pytorch`: PyTorch container with GPU support, Jupyter Lab, and TensorBoard.

## Prerequisites

**NVIDIA GPU Required**: This service requires an NVIDIA GPU with CUDA support and the NVIDIA Container Toolkit installed.

### Install NVIDIA Container Toolkit

**Linux (apt-based distributions):**

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```

**Windows (Docker Desktop):**

Ensure that WSL2 is installed with NVIDIA drivers and that Docker Desktop is configured to use the WSL2 backend.

## Environment Variables

| Variable Name              | Description                                | Default Value                   |
| -------------------------- | ------------------------------------------ | ------------------------------- |
| PYTORCH_VERSION            | Tag of the `pytorch/pytorch` image         | `2.6.0-cuda12.6-cudnn9-runtime` |
| JUPYTER_ENABLE_LAB         | Enable Jupyter Lab                         | `yes`                           |
| JUPYTER_TOKEN              | Jupyter access token                       | `pytorch`                       |
| NVIDIA_VISIBLE_DEVICES     | GPUs visible to the container              | `all`                           |
| NVIDIA_DRIVER_CAPABILITIES | Driver capabilities                        | `compute,utility`               |
| GPU_COUNT                  | Number of GPUs to allocate                 | `1`                             |
| JUPYTER_PORT_OVERRIDE      | Host port mapped to Jupyter Lab (8888)     | `8888`                          |
| TENSORBOARD_PORT_OVERRIDE  | Host port mapped to TensorBoard (6006)     | `6006`                          |

Copy `.env.example` to `.env` and adjust the values as needed for your use case.
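To double-check which values a `.env` file would override before starting the service, a small reader can help. The sketch below is a simplified illustration, not Docker Compose's actual parser: `parse_env` is a hypothetical helper that only handles `KEY=VALUE` lines, `#` comments, and one pair of surrounding quotes, and the sample values are made up.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


# Hypothetical overrides, in the same style as .env.example.
sample = """
# Port overrides
JUPYTER_PORT_OVERRIDE=8889
JUPYTER_TOKEN="change-me"
"""

print(parse_env(sample))
```

Variables that are missing from the parsed result fall back to the defaults listed in the table above.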
## Volumes

- `pytorch_notebooks`: Jupyter notebooks and scripts, mounted at `/workspace`.
- `pytorch_data`: Training data and datasets, mounted at `/data`.

## Usage

### Start the Service

```bash
docker-compose up -d
```

### Access Jupyter Lab

Open your browser and navigate to:

```text
http://localhost:8888
```

Log in with the token specified in `JUPYTER_TOKEN` (default: `pytorch`).
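Jupyter also accepts the token as a `token` query parameter, which skips the login form. The sketch below builds such a URL; `jupyter_lab_url` is a hypothetical helper, and its defaults simply mirror the default port and token documented above.

```python
from urllib.parse import urlencode


def jupyter_lab_url(host="localhost", port=8888, token="pytorch"):
    """Return a Jupyter Lab URL that carries the access token as a query parameter."""
    query = urlencode({"token": token})  # escapes characters that are unsafe in URLs
    return f"http://{host}:{port}/lab?{query}"


print(jupyter_lab_url())
```

If `JUPYTER_PORT_OVERRIDE` or `JUPYTER_TOKEN` has been changed, pass the new values as `port` and `token`.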
### Verify GPU Access

In a Jupyter notebook:

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
```

### Example Training Script

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
).to(device)

# Create dummy data
x = torch.randn(64, 784).to(device)
y = torch.randint(0, 10, (64,)).to(device)

# Training (a single optimization step)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

optimizer.zero_grad()  # clear any gradients left over from a previous step
output = model(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item()}")
```

### Access TensorBoard

The TensorBoard port is exposed, but TensorBoard itself must be started manually. First, write logs from your training code:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('/workspace/runs')
writer.add_scalar('example/loss', 0.5, global_step=0)  # log a sample value
writer.close()
```

Then start TensorBoard:

```bash
docker exec pytorch tensorboard --logdir=/workspace/runs --host=0.0.0.0
```

Access at: `http://localhost:6006`
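If a page does not load, it can help to check whether anything is listening on the published ports at all. The following sketch uses only the Python standard library; `port_open` is a hypothetical helper, and the ports assume the default `JUPYTER_PORT_OVERRIDE` and `TENSORBOARD_PORT_OVERRIDE` values.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, port in (("Jupyter Lab", 8888), ("TensorBoard", 6006)):
    status = "reachable" if port_open("localhost", port) else "not reachable"
    print(f"{name} on port {port}: {status}")
```

A closed TensorBoard port usually just means the `tensorboard` process has not been started yet.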
## Features

- **GPU Acceleration**: CUDA support for fast training
- **Jupyter Lab**: Interactive development environment
- **TensorBoard**: Visualization for training metrics
- **Pre-installed**: PyTorch, CUDA, and cuDNN ready to use
- **Persistent Storage**: Notebooks and data stored in volumes

## Notes

- An NVIDIA GPU is required: the Compose file reserves `GPU_COUNT` GPU device(s) for the container
- Recommended: 8 GB or more of VRAM for most deep learning tasks
- The container runs `pip install jupyter tensorboard` at startup, so the first start takes longer and needs network access
- Use a `pytorch/pytorch:*-devel` image tag (set via `PYTORCH_VERSION`) for building custom extensions
- For multi-GPU training, adjust `GPU_COUNT` and use `torch.nn.DataParallel` or, as the PyTorch documentation recommends, `torch.nn.parallel.DistributedDataParallel`

## License

PyTorch is released under a BSD-style license.
153
src/pytorch/README.zh.md
Normal file
@@ -0,0 +1,153 @@
# PyTorch

[English](./README.md) | [中文](./README.zh.md)

This service deploys a PyTorch deep learning development environment with support for CUDA, Jupyter Lab, and TensorBoard.

## Services

- `pytorch`: PyTorch container with support for GPU, Jupyter Lab, and TensorBoard.

## Prerequisites

**NVIDIA GPU required**: This service needs an NVIDIA GPU with CUDA support and an installed NVIDIA Container Toolkit.

### Install NVIDIA Container Toolkit

**Linux:**

```bash
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
```

**Windows (Docker Desktop):**

Make sure WSL2 is installed with NVIDIA drivers and that Docker Desktop is configured to use the WSL2 backend.

## Environment Variables

| Variable Name              | Description                | Default Value                   |
| -------------------------- | -------------------------- | ------------------------------- |
| PYTORCH_VERSION            | PyTorch image version      | `2.6.0-cuda12.6-cudnn9-runtime` |
| JUPYTER_ENABLE_LAB         | Enable Jupyter Lab         | `yes`                           |
| JUPYTER_TOKEN              | Jupyter access token       | `pytorch`                       |
| NVIDIA_VISIBLE_DEVICES     | GPUs to use                | `all`                           |
| NVIDIA_DRIVER_CAPABILITIES | Driver capabilities        | `compute,utility`               |
| GPU_COUNT                  | Number of GPUs to allocate | `1`                             |
| JUPYTER_PORT_OVERRIDE      | Jupyter Lab port           | `8888`                          |
| TENSORBOARD_PORT_OVERRIDE  | TensorBoard port           | `6006`                          |

Modify the `.env` file according to your actual needs.

## Volumes

- `pytorch_notebooks`: Jupyter notebooks and scripts.
- `pytorch_data`: Training data and datasets.

## Usage

### Start the Service

```bash
docker-compose up -d
```

### Access Jupyter Lab

Open in your browser:

```text
http://localhost:8888
```

Log in with the token specified in `JUPYTER_TOKEN` (default: `pytorch`).

### Verify GPU Access

In a Jupyter notebook:

```python
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
print(f"Number of GPUs: {torch.cuda.device_count()}")

if torch.cuda.is_available():
    print(f"GPU name: {torch.cuda.get_device_name(0)}")
```

### Example Training Script

```python
import torch
import torch.nn as nn
import torch.optim as optim

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple model
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
).to(device)

# Create dummy data
x = torch.randn(64, 784).to(device)
y = torch.randint(0, 10, (64,)).to(device)

# Training (a single optimization step)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters())

optimizer.zero_grad()  # clear any gradients left over from a previous step
output = model(x)
loss = criterion(output, y)
loss.backward()
optimizer.step()

print(f"Loss: {loss.item()}")
```

### Access TensorBoard

The TensorBoard port is exposed, but TensorBoard must be started manually:

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter('/workspace/runs')
```

Then start TensorBoard:

```bash
docker exec pytorch tensorboard --logdir=/workspace/runs --host=0.0.0.0
```

Access at: `http://localhost:6006`

## Features

- **GPU Acceleration**: CUDA support for fast training
- **Jupyter Lab**: Interactive development environment
- **TensorBoard**: Visualization of training metrics
- **Pre-installed**: PyTorch, CUDA, and cuDNN ready to use
- **Persistent Storage**: Notebooks and data stored in volumes

## Notes

- A GPU is required for best performance
- Recommended: 8 GB or more of VRAM for most deep learning tasks
- The container runs `pip install jupyter tensorboard` at startup, so the first start takes longer
- Use `pytorch/pytorch:*-devel` for building custom extensions
- For multi-GPU training, adjust `GPU_COUNT` and use `torch.nn.DataParallel`

## License

PyTorch is licensed under a BSD-style license.
48
src/pytorch/docker-compose.yaml
Normal file
@@ -0,0 +1,48 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  pytorch:
    <<: *default
    image: pytorch/pytorch:${PYTORCH_VERSION:-2.6.0-cuda12.6-cudnn9-runtime}
    container_name: pytorch
    ports:
      - "${JUPYTER_PORT_OVERRIDE:-8888}:8888"
      - "${TENSORBOARD_PORT_OVERRIDE:-6006}:6006"
    environment:
      NVIDIA_VISIBLE_DEVICES: ${NVIDIA_VISIBLE_DEVICES:-all}
      NVIDIA_DRIVER_CAPABILITIES: ${NVIDIA_DRIVER_CAPABILITIES:-compute,utility}
      JUPYTER_ENABLE_LAB: ${JUPYTER_ENABLE_LAB:-yes}
    command: >
      bash -c "pip install --no-cache-dir jupyter tensorboard &&
      jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
      --NotebookApp.token='${JUPYTER_TOKEN:-pytorch}'"
    volumes:
      - *localtime
      - *timezone
      - pytorch_notebooks:/workspace
      - pytorch_data:/data
    working_dir: /workspace
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 16G
        reservations:
          cpus: '2.0'
          memory: 8G
          devices:
            - driver: nvidia
              count: ${GPU_COUNT:-1}
              capabilities: [gpu]

volumes:
  pytorch_notebooks:
  pytorch_data:
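The Compose file relies on `${NAME:-default}` interpolation, which substitutes the variable's value when it is set and non-empty and the default otherwise. The sketch below imitates only that one form to show how the image tag and port mapping resolve; `interpolate` is a hypothetical helper and does not cover the other interpolation forms Compose supports.

```python
import re

# Matches ${NAME:-default}; the default may contain anything except "}".
_PATTERN = re.compile(r"\$\{(\w+):-([^}]*)\}")


def interpolate(template, env):
    """Expand ${NAME:-default} using env[NAME] when set and non-empty, else the default."""
    def repl(match):
        name, default = match.group(1), match.group(2)
        return env.get(name) or default
    return _PATTERN.sub(repl, template)


image = "pytorch/pytorch:${PYTORCH_VERSION:-2.6.0-cuda12.6-cudnn9-runtime}"
print(interpolate(image, {}))
print(interpolate("${JUPYTER_PORT_OVERRIDE:-8888}:8888", {"JUPYTER_PORT_OVERRIDE": "8889"}))
```

With an empty environment the first call yields the default image tag, and the second call maps host port 8889 to container port 8888.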