feat: add more

src/vllm/.env.example (new file, 13 lines)
@@ -0,0 +1,13 @@
# vLLM version
VLLM_VERSION="v0.8.0"

# Model configuration
VLLM_MODEL="facebook/opt-125m"
VLLM_MAX_MODEL_LEN=2048
VLLM_GPU_MEMORY_UTIL=0.9

# Hugging Face token for model downloads
HF_TOKEN=""

# Port to bind to on the host machine
VLLM_PORT_OVERRIDE=8000

src/vllm/README.md (new file, 139 lines)
@@ -0,0 +1,139 @@
# vLLM

[English](./README.md) | [Chinese](./README.zh.md)

This service deploys vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.

## Services

- `vllm`: vLLM OpenAI-compatible API server

## Environment Variables

| Variable Name        | Description                            | Default Value        |
| -------------------- | -------------------------------------- | -------------------- |
| VLLM_VERSION         | vLLM image version                     | `v0.8.0`             |
| VLLM_MODEL           | Model name or path                     | `facebook/opt-125m`  |
| VLLM_MAX_MODEL_LEN   | Maximum context length                 | `2048`               |
| VLLM_GPU_MEMORY_UTIL | GPU memory utilization (0.0-1.0)       | `0.9`                |
| HF_TOKEN             | Hugging Face token for model downloads | `""`                 |
| VLLM_PORT_OVERRIDE   | Host port mapping                      | `8000`               |

Please modify the `.env` file as needed for your use case.
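
A minimal setup sketch, assuming you run `docker compose` from the `src/vllm` directory (Compose reads `.env` from the working directory automatically):

```bash
# Create a local .env from the provided example, then adjust values
cp .env.example .env
# Edit with your editor of choice, e.g.:
# nano .env
```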

## Volumes

- `vllm_models`: Cached model files from Hugging Face

## GPU Support

This service requires an NVIDIA GPU to run properly. Uncomment the GPU configuration in `docker-compose.yaml`:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
runtime: nvidia
```
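
To check that the container actually sees the GPU once it is running, a quick sketch (this assumes the NVIDIA Container Toolkit is installed on the host and that `nvidia-smi` is exposed inside the container by the NVIDIA runtime, as is usual):

```bash
# On the host: confirm the driver and GPU are visible
nvidia-smi

# Inside the running container: confirm the vllm service sees the GPU
docker compose exec vllm nvidia-smi
```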

## Usage

### Start vLLM

```bash
docker compose up -d
```
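
The first start can take a while because the model is downloaded; standard `docker compose` commands (nothing specific to this project) let you watch progress:

```bash
# Check container status
docker compose ps

# Follow the vLLM logs until the server reports it is listening
docker compose logs -f vllm
```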

### Access

- API Endpoint: <http://localhost:8000>
- OpenAI-compatible API: <http://localhost:8000/v1>
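
A quick way to confirm the server is up is to list the loaded models via the OpenAI-compatible `/v1/models` endpoint:

```bash
curl http://localhost:8000/v1/models
```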

### Test the API

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```

### Chat Completions

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
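
The chat endpoint also accepts the standard OpenAI `stream` flag if you want tokens back incrementally as server-sent events; a minimal sketch:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "stream": true
  }'
```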

## Supported Models

vLLM supports a wide range of models:

- **LLaMA**: LLaMA, LLaMA-2, LLaMA-3
- **Mistral**: Mistral, Mixtral
- **Qwen**: Qwen, Qwen2
- **Yi**: Yi, Yi-VL
- **Many others**: See [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html)

To use a different model, change the `VLLM_MODEL` environment variable:

```bash
# Example: Use Qwen2-7B-Instruct
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
```

## Performance Tuning

### GPU Memory

Adjust GPU memory utilization based on your model size and available VRAM:

```bash
VLLM_GPU_MEMORY_UTIL=0.85  # Use 85% of GPU memory
```

### Context Length

Set the maximum context length according to your needs:

```bash
VLLM_MAX_MODEL_LEN=4096  # Support up to 4K tokens
```

### Shared Memory

For larger models, increase the shared memory available to the container:

```yaml
shm_size: 8g  # Increase to 8GB
```

## Notes

- Requires an NVIDIA GPU with CUDA support
- Model downloads can be large (several GB to 100+ GB)
- The first startup may take a while because the model is downloaded
- Ensure sufficient GPU memory for the model you want to run
- The default model is small (125M parameters) and intended for testing
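
Because model downloads can run into tens of gigabytes, it is worth checking how much space the cache volume uses; a sketch with standard Docker commands (the volume is named `vllm_models` in `docker-compose.yaml`, though Compose usually prefixes it with the project name on disk):

```bash
# List volumes with their sizes; look for the vllm_models entry
docker system df -v
```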

## Security

- The API has no authentication by default
- Add an authentication layer (e.g., nginx with basic auth) for production deployments
- Restrict network access to trusted sources

## License

vLLM is licensed under the Apache License 2.0. See the [vLLM GitHub repository](https://github.com/vllm-project/vllm) for more information.

src/vllm/docker-compose.yaml (new file, 52 lines)
@@ -0,0 +1,52 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  vllm:
    <<: *default
    image: vllm/vllm-openai:${VLLM_VERSION:-v0.8.0}
    container_name: vllm
    ports:
      - "${VLLM_PORT_OVERRIDE:-8000}:8000"
    volumes:
      - *localtime
      - *timezone
      - vllm_models:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN:-}
    command:
      - --model
      - ${VLLM_MODEL:-facebook/opt-125m}
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --max-model-len
      - "${VLLM_MAX_MODEL_LEN:-2048}"
      - --gpu-memory-utilization
      - "${VLLM_GPU_MEMORY_UTIL:-0.9}"
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
          # Uncomment for GPU support
          # devices:
          #   - driver: nvidia
          #     count: 1
          #     capabilities: [gpu]
    # runtime: nvidia # Uncomment for GPU support
    shm_size: 4g

volumes:
  vllm_models: