feat: add more
src/ray/.env.example
@@ -0,0 +1,15 @@
# Ray version
RAY_VERSION="2.42.1-py312"

# Ray head node configuration
RAY_HEAD_NUM_CPUS=4
RAY_HEAD_MEMORY=8589934592 # 8GB in bytes

# Ray worker node configuration
RAY_WORKER_NUM_CPUS=2
RAY_WORKER_MEMORY=4294967296 # 4GB in bytes

# Port overrides
RAY_DASHBOARD_PORT_OVERRIDE=8265
RAY_CLIENT_PORT_OVERRIDE=10001
RAY_GCS_PORT_OVERRIDE=6379

src/ray/README.md
@@ -0,0 +1,142 @@
# Ray

[English](./README.md) | [中文](./README.zh.md)

This service deploys a Ray cluster with 1 head node and 2 worker nodes for distributed computing.

## Services

- `ray-head`: Ray head node with dashboard.
- `ray-worker-1`: First Ray worker node.
- `ray-worker-2`: Second Ray worker node.

## Environment Variables

| Variable Name               | Description                     | Default Value      |
| --------------------------- | ------------------------------- | ------------------ |
| RAY_VERSION                 | Ray image version               | `2.42.1-py312`     |
| RAY_HEAD_NUM_CPUS           | Head node CPU count             | `4`                |
| RAY_HEAD_MEMORY             | Head node memory (bytes)        | `8589934592` (8GB) |
| RAY_WORKER_NUM_CPUS         | Worker node CPU count           | `2`                |
| RAY_WORKER_MEMORY           | Worker node memory (bytes)      | `4294967296` (4GB) |
| RAY_DASHBOARD_PORT_OVERRIDE | Host port for Ray Dashboard     | `8265`             |
| RAY_CLIENT_PORT_OVERRIDE    | Host port for Ray Client server | `10001`            |
| RAY_GCS_PORT_OVERRIDE       | Host port for Ray GCS server    | `6379`             |

Copy `.env.example` to `.env` and adjust the values as needed for your use case.

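The memory variables take raw byte counts, and the defaults are exact binary gibibytes (8 × 1024³ and 4 × 1024³). The Python sketch below does that conversion; `gib_to_bytes` is a hypothetical helper for illustration, not part of Ray or Docker Compose.

```python
# Hypothetical helper (not part of Ray or Docker Compose): convert GiB to the
# byte counts expected by RAY_HEAD_MEMORY and RAY_WORKER_MEMORY.
GIB = 1024 ** 3  # one gibibyte in bytes


def gib_to_bytes(gib: float) -> int:
    """Return the number of bytes in `gib` gibibytes, truncated to a whole byte."""
    if gib < 0:
        raise ValueError("size must be non-negative")
    return int(gib * GIB)


if __name__ == "__main__":
    # The .env.example defaults are exactly 8 GiB and 4 GiB.
    print(f"RAY_HEAD_MEMORY={gib_to_bytes(8)}")    # RAY_HEAD_MEMORY=8589934592
    print(f"RAY_WORKER_MEMORY={gib_to_bytes(4)}")  # RAY_WORKER_MEMORY=4294967296
```

For example, a 6 GiB worker would use `RAY_WORKER_MEMORY=6442450944`. The container limits under `deploy.resources.limits.memory` in `docker-compose.yaml` are hardcoded (`8G`, `4G`) and do not follow these variables, so raise them as well if you increase the memory settings.
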
## Volumes

- `ray_storage`: Persists the head node's Ray temporary directory (`/tmp/ray`, which holds session logs). It is mounted only on `ray-head`; the worker containers do not share it.

## Usage

### Start the Cluster

```bash
docker-compose up -d
```

### Access Ray Dashboard

Open your browser and navigate to the following URL (use your value of `RAY_DASHBOARD_PORT_OVERRIDE` instead of `8265` if you changed it):

```text
http://localhost:8265
```

The dashboard shows cluster status, running jobs, and resource usage.

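The host port in that URL comes from the mapping `${RAY_DASHBOARD_PORT_OVERRIDE:-8265}:8265` in `docker-compose.yaml`. The Python sketch below mirrors Compose's `${VAR:-default}` rule (the default applies when the variable is unset or empty) to build the dashboard URL. It is an illustration only: `resolve` and `dashboard_url` are hypothetical helpers, and unlike Compose they do not read the `.env` file.

```python
import os
from typing import Mapping, Optional


def resolve(name: str, default: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Mimic Compose's ${NAME:-default}: fall back when NAME is unset or empty."""
    env = os.environ if env is None else env
    value = env.get(name, "")
    return value if value else default


def dashboard_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the host-side dashboard URL from RAY_DASHBOARD_PORT_OVERRIDE."""
    port = resolve("RAY_DASHBOARD_PORT_OVERRIDE", "8265", env)
    return f"http://localhost:{port}"


if __name__ == "__main__":
    print(dashboard_url({}))                                       # http://localhost:8265
    print(dashboard_url({"RAY_DASHBOARD_PORT_OVERRIDE": "9000"}))  # http://localhost:9000
```
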
### Connect from Python Client

```python
import ray

# Connect to the Ray cluster
ray.init(address="ray://localhost:10001")

# Run a simple task
@ray.remote
def hello_world():
    return "Hello from Ray!"

# Execute the task
result = ray.get(hello_world.remote())
print(result)

# Check cluster resources
print(ray.cluster_resources())
```

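Ray Client generally requires the client to run the same Ray version and the same Python minor version as the cluster, so with the default image tag `2.42.1-py312` you would install Ray 2.42.1 on Python 3.12 (for example `pip install "ray[client]==2.42.1"`). The Python sketch below derives those expectations from the tag, assuming the `<ray-version>-py<major><minor>` naming of the default tag; `parse_image_tag` and `client_matches` are hypothetical helpers and do not handle other tag variants.

```python
import re
import sys
from typing import Tuple

# Assumed tag format: "<ray-version>-py<major><minor>", e.g. "2.42.1-py312".
_TAG_RE = re.compile(r"^(?P<ray>\d+\.\d+\.\d+)-py(?P<major>\d)(?P<minor>\d+)$")


def parse_image_tag(tag: str) -> Tuple[str, Tuple[int, int]]:
    """Split an image tag into (ray_version, (python_major, python_minor))."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag format: {tag!r}")
    return match["ray"], (int(match["major"]), int(match["minor"]))


def client_matches(tag: str, ray_version: str, python: Tuple[int, int]) -> bool:
    """Return True when a client's Ray and Python versions match the image tag."""
    expected_ray, expected_py = parse_image_tag(tag)
    return ray_version == expected_ray and tuple(python) == expected_py


if __name__ == "__main__":
    print(parse_image_tag("2.42.1-py312"))  # ('2.42.1', (3, 12))
    local_py = (sys.version_info.major, sys.version_info.minor)
    print("local Python matches image:", local_py == parse_image_tag("2.42.1-py312")[1])
```

On a machine with Ray installed, you could call `client_matches(tag, ray.__version__, sys.version_info[:2])` before `ray.init` to catch a mismatch early.
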
### Distributed Computing Example

```python
import ray
import time

ray.init(address="ray://localhost:10001")

@ray.remote
def compute_task(x):
    time.sleep(1)
    return x * x

# Submit 100 tasks in parallel
results = ray.get([compute_task.remote(i) for i in range(100)])
print(f"Sum of squares: {sum(results)}")
```

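The example's result and rough runtime can be checked without a cluster. The sum of the squares of 0 through 99 is 328350, and with the default 8 CPUs (each task reserving Ray's default of 1 CPU), 100 one-second tasks run in about 13 waves, so roughly 13 seconds plus overhead instead of about 100 seconds sequentially. The Python sketch below is a back-of-the-envelope model with hypothetical helper names, not a Ray API.

```python
import math


def expected_sum_of_squares(n: int) -> int:
    """Closed form for 0**2 + 1**2 + ... + (n - 1)**2."""
    return (n - 1) * n * (2 * n - 1) // 6


def estimated_wall_time(num_tasks: int, task_seconds: float, num_cpus: int) -> float:
    """Idealized runtime: tasks run in waves of `num_cpus`, one CPU per task.

    Ignores scheduling, serialization, and Ray Client round-trip overhead.
    """
    waves = math.ceil(num_tasks / num_cpus)
    return waves * task_seconds


if __name__ == "__main__":
    # Cross-check the closed form against a direct sum.
    assert expected_sum_of_squares(100) == sum(i * i for i in range(100))
    print(expected_sum_of_squares(100))      # 328350
    print(estimated_wall_time(100, 1.0, 8))  # 13.0
```
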
### Using Ray Data

```python
import ray

ray.init(address="ray://localhost:10001")

# Create a dataset whose rows look like {"id": 0}, {"id": 1}, ...
ds = ray.data.range(1000)

# Process data in parallel; map() receives and returns one row dict at a time
result = ds.map(lambda row: {"id": row["id"] * 2}).take(10)
print(result)
```

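In Ray 2.x, `ray.data.range()` yields rows as dictionaries with a single `id` column, so a function passed to `Dataset.map` receives a dict and must return a dict. The plain-Python sketch below imitates that row-level contract locally to show what `take(10)` should contain (ten rows with ids 0, 2, ..., 18). It does not use Ray, says nothing about how Ray partitions or schedules the work, and its helper names are hypothetical.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def local_range(n: int) -> List[Row]:
    """Local stand-in for ray.data.range(n): rows shaped like {"id": i}."""
    return [{"id": i} for i in range(n)]


def local_map_take(rows: List[Row], fn: Callable[[Row], Row], limit: int) -> List[Row]:
    """Apply a row -> row function and keep the first `limit` rows (like map().take())."""
    return [fn(row) for row in rows[:limit]]


def double_id(row: Row) -> Row:
    """Same transformation as the Ray Data example: double the "id" column."""
    return {"id": row["id"] * 2}


if __name__ == "__main__":
    print(local_map_take(local_range(1000), double_id, 10))
    # [{'id': 0}, {'id': 2}, {'id': 4}, {'id': 6}, {'id': 8}, {'id': 10}, {'id': 12}, {'id': 14}, {'id': 16}, {'id': 18}]
```
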
## Features

- **Distributed Computing**: Scale Python applications across multiple nodes
- **Autoscaling**: Dynamic resource allocation (requires a Ray autoscaler setup; this Compose deployment runs a fixed set of nodes)
- **Ray Dashboard**: Web UI for monitoring and debugging
- **Ray Data**: Distributed data processing
- **Ray Train**: Distributed training for ML models
- **Ray Serve**: Model serving and deployment
- **Ray Tune**: Hyperparameter tuning

## Notes

- Workers connect to the head node automatically at `ray-head:6379` on startup
- The cluster has 1 head node (4 CPUs, 8GB RAM) and 2 workers (2 CPUs, 4GB RAM each)
- Total cluster resources: 8 CPUs, 16GB RAM
- Add more workers by duplicating the worker service definition
- For GPU support, use the `rayproject/ray-ml` image and configure the NVIDIA runtime
- Port 6379 is the GCS (Global Control Service) port that nodes use to join the cluster; in Ray 2.x the GCS does not use Redis unless you configure an external Redis for fault tolerance

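The totals follow from the per-node defaults: 4 + 2 × 2 = 8 CPUs and 8 + 2 × 4 = 16 GiB. The Python sketch below generalizes that arithmetic to any worker count, assuming every added worker uses the `RAY_WORKER_*` defaults; `cluster_totals` is a hypothetical helper. What `ray.cluster_resources()` reports may differ from these nominal figures (for example, it lists `object_store_memory` separately from `memory`).

```python
from typing import Tuple

GIB = 1024 ** 3  # one gibibyte in bytes


def cluster_totals(
    num_workers: int,
    head_cpus: int = 4,
    head_memory: int = 8 * GIB,
    worker_cpus: int = 2,
    worker_memory: int = 4 * GIB,
) -> Tuple[int, int]:
    """Return (total_cpus, total_memory_bytes) for one head plus `num_workers` workers."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    total_cpus = head_cpus + num_workers * worker_cpus
    total_memory = head_memory + num_workers * worker_memory
    return total_cpus, total_memory


if __name__ == "__main__":
    cpus, mem = cluster_totals(2)  # the default cluster
    print(cpus, mem // GIB)        # 8 16
    cpus, mem = cluster_totals(3)  # after adding ray-worker-3
    print(cpus, mem // GIB)        # 10 20
```
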
## Scaling

To add more worker nodes, add new service definitions under `services:` in `docker-compose.yaml`. The `<<: *default` merge key refers to the `x-default` anchor defined in that file, and you can copy the `volumes` and `deploy` blocks from `ray-worker-1` to give the new worker the same mounts and resource limits:

```yaml
ray-worker-3:
  <<: *default
  image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
  container_name: ray-worker-3
  command: ray start --address=ray-head:6379 --block
  depends_on:
    - ray-head
  environment:
    RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}
    RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}
```

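Writing each extra worker by hand gets repetitive. The Python sketch below renders the same service definition as the snippet above for any worker number, indented by two spaces so the output can be pasted under `services:`. `worker_service` is a hypothetical helper, and like the snippet it omits the `volumes` and `deploy` blocks.

```python
def worker_service(index: int, indent: str = "  ") -> str:
    """Render a Compose service block for ray-worker-<index>.

    The ${...} placeholders are plain text here; Docker Compose interpolates them.
    """
    if index < 1:
        raise ValueError("worker index must be >= 1")
    name = f"ray-worker-{index}"
    lines = [
        f"{name}:",
        "  <<: *default",
        "  image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}",
        f"  container_name: {name}",
        "  command: ray start --address=ray-head:6379 --block",
        "  depends_on:",
        "    - ray-head",
        "  environment:",
        "    RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}",
        "    RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}",
    ]
    # Indent every line so the block nests under the top-level `services:` key.
    return "\n".join(indent + line for line in lines)


if __name__ == "__main__":
    # Print definitions for workers 3 and 4, separated by a blank line.
    print("\n\n".join(worker_service(i) for i in (3, 4)))
```
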
## License

Ray is licensed under the Apache License 2.0.

src/ray/README.zh.md
@@ -0,0 +1,142 @@
# Ray

[English](./README.md) | [中文](./README.zh.md)

This service deploys a Ray cluster with 1 head node and 2 worker nodes for distributed computing.

## Services

- `ray-head`: Ray head node, with the dashboard.
- `ray-worker-1`: The first Ray worker node.
- `ray-worker-2`: The second Ray worker node.

## Environment Variables

| Variable Name               | Description                     | Default Value      |
| --------------------------- | ------------------------------- | ------------------ |
| RAY_VERSION                 | Ray image version               | `2.42.1-py312`     |
| RAY_HEAD_NUM_CPUS           | Head node CPU count             | `4`                |
| RAY_HEAD_MEMORY             | Head node memory (bytes)        | `8589934592` (8GB) |
| RAY_WORKER_NUM_CPUS         | Worker node CPU count           | `2`                |
| RAY_WORKER_MEMORY           | Worker node memory (bytes)      | `4294967296` (4GB) |
| RAY_DASHBOARD_PORT_OVERRIDE | Host port for Ray Dashboard     | `8265`             |
| RAY_CLIENT_PORT_OVERRIDE    | Host port for Ray Client server | `10001`            |
| RAY_GCS_PORT_OVERRIDE       | Host port for Ray GCS server    | `6379`             |

Copy `.env.example` to `.env` and modify it according to your actual needs.

## Volumes

- `ray_storage`: Persists the head node's Ray temporary directory (`/tmp/ray`, which holds session logs). It is mounted only on `ray-head`; the worker containers do not share it.

## Usage

### Start the Cluster

```bash
docker-compose up -d
```

### Access the Ray Dashboard

Open the following URL in your browser (use your value of `RAY_DASHBOARD_PORT_OVERRIDE` instead of `8265` if you changed it):

```text
http://localhost:8265
```

The dashboard shows cluster status, running jobs, and resource usage.

### Connect from a Python Client

```python
import ray

# Connect to the Ray cluster
ray.init(address="ray://localhost:10001")

# Run a simple task
@ray.remote
def hello_world():
    return "Hello from Ray!"

# Execute the task
result = ray.get(hello_world.remote())
print(result)

# Check cluster resources
print(ray.cluster_resources())
```

### Distributed Computing Example

```python
import ray
import time

ray.init(address="ray://localhost:10001")

@ray.remote
def compute_task(x):
    time.sleep(1)
    return x * x

# Submit 100 tasks in parallel
results = ray.get([compute_task.remote(i) for i in range(100)])
print(f"Sum of squares: {sum(results)}")
```

### Using Ray Data

```python
import ray

ray.init(address="ray://localhost:10001")

# Create a dataset whose rows look like {"id": 0}, {"id": 1}, ...
ds = ray.data.range(1000)

# Process data in parallel; map() receives and returns one row dict at a time
result = ds.map(lambda row: {"id": row["id"] * 2}).take(10)
print(result)
```

## Features

- **Distributed computing**: Scale Python applications across multiple nodes
- **Autoscaling**: Dynamic resource allocation (requires a Ray autoscaler setup; this Compose deployment runs a fixed set of nodes)
- **Ray Dashboard**: Web UI for monitoring and debugging
- **Ray Data**: Distributed data processing
- **Ray Train**: Distributed training of ML models
- **Ray Serve**: Model serving and deployment
- **Ray Tune**: Hyperparameter tuning

## Notes

- Worker nodes connect to the head node automatically at `ray-head:6379` on startup
- The cluster has 1 head node (4 CPUs, 8GB RAM) and 2 worker nodes (2 CPUs, 4GB RAM each)
- Total cluster resources: 8 CPUs, 16GB RAM
- Add more worker nodes by copying the worker node service definition
- For GPU support, use the `rayproject/ray-ml` image and configure the NVIDIA runtime
- Port 6379 is the GCS (Global Control Service) port that nodes use to join the cluster; in Ray 2.x the GCS does not use Redis unless you configure an external Redis for fault tolerance

## Scaling

To add more worker nodes, add new service definitions under `services:` in `docker-compose.yaml`. The `<<: *default` merge key refers to the `x-default` anchor defined in that file, and you can copy the `volumes` and `deploy` blocks from `ray-worker-1` to give the new worker the same mounts and resource limits:

```yaml
ray-worker-3:
  <<: *default
  image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
  container_name: ray-worker-3
  command: ray start --address=ray-head:6379 --block
  depends_on:
    - ray-head
  environment:
    RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}
    RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}
```

## License

Ray is licensed under the Apache License 2.0.

src/ray/docker-compose.yaml
@@ -0,0 +1,82 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  ray-head:
    <<: *default
    image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
    container_name: ray-head
    command: ray start --head --dashboard-host=0.0.0.0 --port=6379 --block
    ports:
      - "${RAY_DASHBOARD_PORT_OVERRIDE:-8265}:8265"
      - "${RAY_CLIENT_PORT_OVERRIDE:-10001}:10001"
      - "${RAY_GCS_PORT_OVERRIDE:-6379}:6379"
    environment:
      RAY_NUM_CPUS: ${RAY_HEAD_NUM_CPUS:-4}
      RAY_MEMORY: ${RAY_HEAD_MEMORY:-8589934592}
    volumes:
      - *localtime
      - *timezone
      - ray_storage:/tmp/ray
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G

  ray-worker-1:
    <<: *default
    image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
    container_name: ray-worker-1
    command: ray start --address=ray-head:6379 --block
    depends_on:
      - ray-head
    environment:
      RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}
      RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}
    volumes:
      - *localtime
      - *timezone
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G

  ray-worker-2:
    <<: *default
    image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
    container_name: ray-worker-2
    command: ray start --address=ray-head:6379 --block
    depends_on:
      - ray-head
    environment:
      RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}
      RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}
    volumes:
      - *localtime
      - *timezone
    deploy:
      resources:
        limits:
          cpus: '2.0'
          memory: 4G
        reservations:
          cpus: '1.0'
          memory: 2G

volumes:
  ray_storage: