feat: add more

2025-10-06 21:48:39 +08:00
parent f330e00fa0
commit 3c609b5989
120 changed files with 7698 additions and 59 deletions
--- a/src/pytorch/.env.example
+++ b/src/pytorch/.env.example
@@ -0,0 +1,15 @@
+# PyTorch version with CUDA support
+PYTORCH_VERSION="2.6.0-cuda12.6-cudnn9-runtime"
+
+# Jupyter configuration
+JUPYTER_ENABLE_LAB="yes"
+JUPYTER_TOKEN="pytorch"
+
+# NVIDIA GPU configuration
+NVIDIA_VISIBLE_DEVICES="all"
+NVIDIA_DRIVER_CAPABILITIES="compute,utility"
+GPU_COUNT=1
+
+# Port overrides
+JUPYTER_PORT_OVERRIDE=8888
+TENSORBOARD_PORT_OVERRIDE=6006
--- a/src/pytorch/README.md
+++ b/src/pytorch/README.md
@@ -0,0 +1,153 @@
+# PyTorch
+
+[English](./README.md) | [中文](./README.zh.md)
+
+This service deploys PyTorch with CUDA support, Jupyter Lab, and TensorBoard for deep learning development.
+
+## Services
+
+- `pytorch`: PyTorch container with GPU support, Jupyter Lab, and TensorBoard.
+
+## Prerequisites
+
+**NVIDIA GPU Required**: This service requires an NVIDIA GPU with CUDA support and the NVIDIA Container Toolkit installed.
+
+### Install NVIDIA Container Toolkit
+
+**Linux (Debian/Ubuntu):**
+
+```bash
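+# Add NVIDIA's signing key and apt repository (Debian/Ubuntu)
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+# Install the toolkit, register it with Docker, and restart the daemon
+sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker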
+```
+
+**Windows (Docker Desktop):**
+
+Ensure that WSL2 is installed with NVIDIA drivers and that Docker Desktop is configured to use the WSL2 backend.
+
+## Environment Variables
+
+| Variable Name              | Description                | Default Value                   |
+| -------------------------- | -------------------------- | ------------------------------- |
+| PYTORCH_VERSION            | PyTorch image version      | `2.6.0-cuda12.6-cudnn9-runtime` |
+| JUPYTER_ENABLE_LAB         | Enable Jupyter Lab         | `yes`                           |
+| JUPYTER_TOKEN              | Jupyter access token       | `pytorch`                       |
+| NVIDIA_VISIBLE_DEVICES     | GPUs to use                | `all`                           |
+| NVIDIA_DRIVER_CAPABILITIES | Driver capabilities        | `compute,utility`               |
+| GPU_COUNT                  | Number of GPUs to allocate | `1`                             |
+| JUPYTER_PORT_OVERRIDE      | Jupyter Lab port           | `8888`                          |
+| TENSORBOARD_PORT_OVERRIDE  | TensorBoard port           | `6006`                          |
+
+Copy `.env.example` to `.env` and adjust the values as needed for your use case.
+
+## Volumes
+
+- `pytorch_notebooks`: Jupyter notebooks and scripts.
+- `pytorch_data`: Training data and datasets.
+
+## Usage
+
+### Start the Service
+
+```bash
+docker compose up -d
+```
+
+### Access Jupyter Lab
+
+Open your browser and navigate to:
+
+```text
+http://localhost:8888
+```
+
+Log in with the token specified in `JUPYTER_TOKEN` (default: `pytorch`).
+
+### Verify GPU Access
+
+In a Jupyter notebook:
+
+```python
+import torch
+
+print(f"PyTorch version: {torch.__version__}")
+print(f"CUDA available: {torch.cuda.is_available()}")
+print(f"CUDA version: {torch.version.cuda}")
+print(f"Number of GPUs: {torch.cuda.device_count()}")
+
+if torch.cuda.is_available():
+    print(f"GPU name: {torch.cuda.get_device_name(0)}")
+```
+
+### Example Training Script
+
+```python
+import torch
+import torch.nn as nn
+import torch.optim as optim
+
+# Set device
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+# Define a simple model
+model = nn.Sequential(
+    nn.Linear(784, 128),
+    nn.ReLU(),
+    nn.Linear(128, 10)
+).to(device)
+
+# Create dummy data
+x = torch.randn(64, 784).to(device)
+y = torch.randint(0, 10, (64,)).to(device)
+
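+# Run a single training step
+criterion = nn.CrossEntropyLoss()
+optimizer = optim.Adam(model.parameters())
+
+optimizer.zero_grad()          # clear gradients left over from any previous step
+output = model(x)              # forward pass
+loss = criterion(output, y)
+loss.backward()                # backpropagate
+optimizer.step()               # update the weights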
+
+print(f"Loss: {loss.item()}")
+```
+
+### Access TensorBoard
+
+The TensorBoard port is exposed, but TensorBoard itself must be started manually. First, write logs from a notebook:
+
+```python
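+from torch.utils.tensorboard import SummaryWriter
+writer = SummaryWriter('/workspace/runs')
+writer.add_scalar('loss', 0.5, global_step=0)  # sample value so TensorBoard has data to show
+writer.close()  # flush the event file to disk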
+```
+
+Then start TensorBoard:
+
+```bash
+docker exec pytorch tensorboard --logdir=/workspace/runs --host=0.0.0.0
+```
+
+Access at: `http://localhost:6006`
+
+## Features
+
+- **GPU Acceleration**: CUDA support for fast training
+- **Jupyter Lab**: Interactive development environment
+- **TensorBoard**: Visualization for training metrics
+- **Pre-installed**: PyTorch, CUDA, cuDNN ready to use
+- **Persistent Storage**: Notebooks and data stored in volumes
+
+## Notes
+
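+- An NVIDIA GPU is required; the Compose file reserves GPU devices for the container
+- Recommended: 8 GB+ VRAM for most deep learning tasks
+- Jupyter and TensorBoard are installed with `pip` when the container starts, so the first start takes longer and needs network access
+- Use a `pytorch/pytorch:*-devel` image tag when building custom extensions
+- For multi-GPU training, adjust `GPU_COUNT` and use `torch.nn.DataParallel`, or `torch.nn.parallel.DistributedDataParallel`, which the PyTorch documentation recommends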
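+
+As a minimal sketch of the multi-GPU note above, assuming `GPU_COUNT` has been raised so that more than one device is visible inside the container, a model can be wrapped in `torch.nn.DataParallel` roughly like this (the layer sizes are illustrative, not part of this service):
+
+```python
+import torch
+import torch.nn as nn
+
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model = nn.Linear(784, 10)
+
+# Wrap the model only when several GPUs are visible; otherwise use it as-is
+if torch.cuda.device_count() > 1:
+    model = nn.DataParallel(model)
+model = model.to(device)
+
+# With DataParallel, each forward pass splits the batch along dim 0 across the GPUs
+output = model(torch.randn(64, 784).to(device))
+print(output.shape)
+```
+
+On a single GPU or on CPU the wrapper is skipped, so the same snippet runs unchanged.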
+
+## License
+
+PyTorch is released under a BSD-style license.
--- a/src/pytorch/README.zh.md
+++ b/src/pytorch/README.zh.md
@@ -0,0 +1,153 @@
+# PyTorch
+
+[English](./README.md) | [中文](./README.zh.md)
+
+此服务用于部署支持 CUDA、Jupyter Lab 和 TensorBoard 的 PyTorch 深度学习开发环境。
+
+## 服务
+
+- `pytorch`: 支持 GPU、Jupyter Lab 和 TensorBoard 的 PyTorch 容器。
+
+## 先决条件
+
+**需要 NVIDIA GPU**: 此服务需要支持 CUDA 的 NVIDIA GPU 和已安装的 NVIDIA Container Toolkit。
+
+### 安装 NVIDIA Container Toolkit
+
+**Linux:**
+
+```bash
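+# 添加 NVIDIA 的签名密钥和 apt 软件源 (Debian/Ubuntu)
+curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
+curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
+# 安装工具包，将其注册到 Docker 并重启守护进程
+sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
+sudo nvidia-ctk runtime configure --runtime=docker
+sudo systemctl restart docker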
+```
+
+**Windows (Docker Desktop):**
+
+确保已安装带有 NVIDIA 驱动程序的 WSL2，并将 Docker Desktop 配置为使用 WSL2 后端。
+
+## 环境变量
+
+| 变量名                     | 说明             | 默认值                          |
+| -------------------------- | ---------------- | ------------------------------- |
+| PYTORCH_VERSION            | PyTorch 镜像版本 | `2.6.0-cuda12.6-cudnn9-runtime` |
+| JUPYTER_ENABLE_LAB         | 启用 Jupyter Lab | `yes`                           |
+| JUPYTER_TOKEN              | Jupyter 访问令牌 | `pytorch`                       |
+| NVIDIA_VISIBLE_DEVICES     | 使用的 GPU       | `all`                           |
+| NVIDIA_DRIVER_CAPABILITIES | 驱动程序功能     | `compute,utility`               |
+| GPU_COUNT                  | 分配的 GPU 数量  | `1`                             |
+| JUPYTER_PORT_OVERRIDE      | Jupyter Lab 端口 | `8888`                          |
+| TENSORBOARD_PORT_OVERRIDE  | TensorBoard 端口 | `6006`                          |
+
+请根据实际需求修改 `.env` 文件。
+
+## 卷
+
+- `pytorch_notebooks`: Jupyter 笔记本和脚本。
+- `pytorch_data`: 训练数据和数据集。
+
+## 使用方法
+
+### 启动服务
+
+```bash
+docker compose up -d
+```
+
+### 访问 Jupyter Lab
+
+在浏览器中打开:
+
+```text
+http://localhost:8888
+```
+
+使用 `JUPYTER_TOKEN` 中指定的令牌登录（默认: `pytorch`）。
+
+### 验证 GPU 访问
+
+在 Jupyter 笔记本中:
+
+```python
+import torch
+
+print(f"PyTorch version: {torch.__version__}")
+print(f"CUDA available: {torch.cuda.is_available()}")
+print(f"CUDA version: {torch.version.cuda}")
+print(f"Number of GPUs: {torch.cuda.device_count()}")
+
+if torch.cuda.is_available():
+    print(f"GPU name: {torch.cuda.get_device_name(0)}")
+```
+
+### 训练脚本示例
+
+```python
+import torch
+import torch.nn as nn
+import torch.optim as optim
+
+# 设置设备
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+
+# 定义简单模型
+model = nn.Sequential(
+    nn.Linear(784, 128),
+    nn.ReLU(),
+    nn.Linear(128, 10)
+).to(device)
+
+# 创建虚拟数据
+x = torch.randn(64, 784).to(device)
+y = torch.randint(0, 10, (64,)).to(device)
+
+# 执行一次训练步骤
+criterion = nn.CrossEntropyLoss()
+optimizer = optim.Adam(model.parameters())
+
+optimizer.zero_grad()          # 清除之前残留的梯度
+output = model(x)              # 前向传播
+loss = criterion(output, y)
+loss.backward()                # 反向传播
+optimizer.step()               # 更新权重
+
+print(f"Loss: {loss.item()}")
+```
+
+### 访问 TensorBoard
+
+TensorBoard 端口已暴露，但需要手动启动:
+
+```python
+from torch.utils.tensorboard import SummaryWriter
+writer = SummaryWriter('/workspace/runs')
+```
+
+然后启动 TensorBoard:
+
+```bash
+docker exec pytorch tensorboard --logdir=/workspace/runs --host=0.0.0.0
+```
+
+访问地址: `http://localhost:6006`
+
+## 功能
+
+- **GPU 加速**: CUDA 支持以实现快速训练
+- **Jupyter Lab**: 交互式开发环境
+- **TensorBoard**: 训练指标的可视化
+- **预安装**: PyTorch、CUDA、cuDNN 即可使用
+- **持久存储**: 笔记本和数据存储在卷中
+
+## 注意事项
+
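+- 需要 NVIDIA GPU: Compose 文件为容器预留了 GPU 设备
+- 推荐: 大多数深度学习任务需要 8GB+ 显存
+- 容器启动时通过 `pip` 安装 Jupyter 和 TensorBoard，因此首次启动较慢且需要网络访问
+- 构建自定义扩展时请使用 `pytorch/pytorch:*-devel` 镜像
+- 对于多 GPU 训练，调整 `GPU_COUNT` 并使用 `torch.nn.DataParallel`，或使用 PyTorch 文档推荐的 `torch.nn.parallel.DistributedDataParallel`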
+
+## 许可证
+
+PyTorch 使用 BSD 风格许可证授权。
--- a/src/pytorch/docker-compose.yaml
+++ b/src/pytorch/docker-compose.yaml
@@ -0,0 +1,48 @@
+x-default: &default
+  restart: unless-stopped
+  volumes:
+    - &localtime /etc/localtime:/etc/localtime:ro
+    - &timezone /etc/timezone:/etc/timezone:ro
+  logging:
+    driver: json-file
+    options:
+      max-size: 100m
+
+services:
+  pytorch:
+    <<: *default
+    image: pytorch/pytorch:${PYTORCH_VERSION:-2.6.0-cuda12.6-cudnn9-runtime}
+    container_name: pytorch
+    ports:
+      - "${JUPYTER_PORT_OVERRIDE:-8888}:8888"
+      - "${TENSORBOARD_PORT_OVERRIDE:-6006}:6006"
+    environment:
+      NVIDIA_VISIBLE_DEVICES: ${NVIDIA_VISIBLE_DEVICES:-all}
+      NVIDIA_DRIVER_CAPABILITIES: ${NVIDIA_DRIVER_CAPABILITIES:-compute,utility}
+      JUPYTER_ENABLE_LAB: ${JUPYTER_ENABLE_LAB:-yes}
+    command: >
+      bash -c "pip install --no-cache-dir jupyter tensorboard &&
+               jupyter lab --ip=0.0.0.0 --port=8888 --no-browser --allow-root
+               --NotebookApp.token='${JUPYTER_TOKEN:-pytorch}'"
+    volumes:
+      - *localtime
+      - *timezone
+      - pytorch_notebooks:/workspace
+      - pytorch_data:/data
+    working_dir: /workspace
+    deploy:
+      resources:
+        limits:
+          cpus: '4.0'
+          memory: 16G
+        reservations:
+          cpus: '2.0'
+          memory: 8G
+          devices:
+            - driver: nvidia
+              count: ${GPU_COUNT:-1}
+              capabilities: [gpu]
+
+volumes:
+  pytorch_notebooks:
+  pytorch_data: