# Ray
[English](./README.md) | [中文](./README.zh.md)

This service deploys a Ray cluster with 1 head node and 2 worker nodes for distributed computing.
## Services
- `ray-head`: Ray head node with dashboard.
- `ray-worker-1`: First Ray worker node.
- `ray-worker-2`: Second Ray worker node.
## Environment Variables
| Variable Name               | Description                | Default Value      |
| --------------------------- | -------------------------- | ------------------ |
| RAY_VERSION                 | Ray image version          | `2.42.1-py312`     |
| RAY_HEAD_NUM_CPUS           | Head node CPU count        | `4`                |
| RAY_HEAD_MEMORY             | Head node memory (bytes)   | `8589934592` (8GB) |
| RAY_WORKER_NUM_CPUS         | Worker node CPU count      | `2`                |
| RAY_WORKER_MEMORY           | Worker node memory (bytes) | `4294967296` (4GB) |
| RAY_DASHBOARD_PORT_OVERRIDE | Ray Dashboard port         | `8265`             |
| RAY_CLIENT_PORT_OVERRIDE    | Ray Client Server port     | `10001`            |
| RAY_GCS_PORT_OVERRIDE       | Ray GCS Server port        | `6379`             |

Please modify the `.env` file as needed for your use case.
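
For example, a minimal `.env` that pins the image and gives the workers a smaller footprint (the values below are illustrative, not recommendations):

```bash
# .env: illustrative overrides; names match the table above
RAY_VERSION=2.42.1-py312
RAY_WORKER_NUM_CPUS=1
# 2GB expressed in bytes
RAY_WORKER_MEMORY=2147483648
```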
## Volumes
- `ray_storage`: Shared storage for Ray temporary files.
## Usage
### Start the Cluster
```bash
docker-compose up -d
```
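You can verify that the head node and both workers are running with `docker-compose ps`.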
### Access Ray Dashboard
Open your browser and navigate to:
```text
http://localhost:8265
```
The dashboard shows cluster status, running jobs, and resource usage.
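
The same port also serves Ray's Job API, so you can submit batch jobs through the dashboard address. A minimal sketch (the entrypoint script `my_script.py` is a placeholder for your own code):

```python
from ray.job_submission import JobSubmissionClient

# The Job API is exposed by the dashboard on port 8265
client = JobSubmissionClient("http://localhost:8265")

# "my_script.py" is a placeholder; runtime_env uploads the current directory
job_id = client.submit_job(
    entrypoint="python my_script.py",
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```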
### Connect from Python Client
```python
import ray

# Connect to the Ray cluster
ray.init(address="ray://localhost:10001")

# Run a simple task
@ray.remote
def hello_world():
    return "Hello from Ray!"

# Execute the task
result = ray.get(hello_world.remote())
print(result)

# Check cluster resources
print(ray.cluster_resources())
```
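Note that Ray Client is version-sensitive: the `ray` package installed locally should match the cluster's Ray and Python versions (here `2.42.1` on Python 3.12), otherwise the connection may be rejected.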
### Distributed Computing Example
```python
import ray
import time

ray.init(address="ray://localhost:10001")

@ray.remote
def compute_task(x):
    time.sleep(1)
    return x * x

# Submit 100 tasks in parallel
results = ray.get([compute_task.remote(i) for i in range(100)])
print(f"Sum of squares: {sum(results)}")
```
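Because the tasks execute in parallel across the cluster's 8 CPUs, the 100 one-second tasks finish in roughly 13 seconds (100 tasks over 8 CPUs, rounded up) rather than 100 seconds sequentially.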
### Using Ray Data
```python
import ray

ray.init(address="ray://localhost:10001")

# Create a dataset; each row is a dict like {"id": 0}
ds = ray.data.range(1000)

# Process rows in parallel (Ray Data rows are dicts in Ray 2.x)
result = ds.map(lambda row: {"id": row["id"] * 2}).take(10)
print(result)
```
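For larger datasets, `map_batches` applies your function to whole batches at a time and is usually faster than per-row `map`.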
## Features
- **Distributed Computing**: Scale Python applications across multiple nodes
- **Auto-scaling**: Dynamic resource allocation
- **Ray Dashboard**: Web UI for monitoring and debugging
- **Ray Data**: Distributed data processing
- **Ray Train**: Distributed training for ML models
- **Ray Serve**: Model serving and deployment
- **Ray Tune**: Hyperparameter tuning
## Notes
- Workers automatically connect to the head node
- The cluster has 1 head node (4 CPU, 8GB RAM) and 2 workers (2 CPU, 4GB RAM each)
- Total cluster resources: 8 CPUs, 16GB RAM
- Add more workers by duplicating the worker service definition (see Scaling below)
- For GPU support, use the `rayproject/ray-ml` image and configure the NVIDIA runtime
- The GCS server listens on port 6379 (the port historically used by Redis) for cluster communication
## Scaling
To add more worker nodes, add new service definitions:
```yaml
ray-worker-3:
  <<: *defaults
  image: rayproject/ray:${RAY_VERSION:-2.42.1-py312}
  container_name: ray-worker-3
  command: ray start --address=ray-head:6379 --block
  depends_on:
    - ray-head
  environment:
    RAY_NUM_CPUS: ${RAY_WORKER_NUM_CPUS:-2}
    RAY_MEMORY: ${RAY_WORKER_MEMORY:-4294967296}
```
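Then run `docker-compose up -d` again; the new worker starts and joins the cluster automatically.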
## License
Ray is licensed under the Apache License 2.0.