feat: add more

Sun-ZhenXing
2025-10-06 21:48:39 +08:00
parent f330e00fa0
commit 3c609b5989
120 changed files with 7698 additions and 59 deletions

src/vllm/.env.example

@@ -0,0 +1,13 @@
# vLLM version
VLLM_VERSION="v0.8.0"
# Model configuration
VLLM_MODEL="facebook/opt-125m"
VLLM_MAX_MODEL_LEN=2048
VLLM_GPU_MEMORY_UTIL=0.9
# Hugging Face token for model downloads
HF_TOKEN=""
# Port to bind to on the host machine
VLLM_PORT_OVERRIDE=8000

src/vllm/README.md

@@ -0,0 +1,139 @@
# vLLM
[English](./README.md) | [中文](./README.zh.md)
This service deploys vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.
## Services
- `vllm`: vLLM OpenAI-compatible API server
## Environment Variables
| Variable Name | Description | Default Value |
| -------------------- | -------------------------------------- | ------------------- |
| VLLM_VERSION | vLLM image version | `v0.8.0` |
| VLLM_MODEL | Model name or path | `facebook/opt-125m` |
| VLLM_MAX_MODEL_LEN | Maximum context length | `2048` |
| VLLM_GPU_MEMORY_UTIL | GPU memory utilization (0.0-1.0) | `0.9` |
| HF_TOKEN | Hugging Face token for model downloads | `""` |
| VLLM_PORT_OVERRIDE | Host port mapping | `8000` |
Copy `.env.example` to `.env` and adjust the values as needed for your use case.
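For example, a minimal `.env` that switches to a larger model might look like this (the model ID is only illustrative; any model supported by vLLM works):

```bash
# .env — example override of the defaults from .env.example
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
VLLM_MAX_MODEL_LEN=8192
VLLM_GPU_MEMORY_UTIL=0.9
HF_TOKEN="hf_xxx"   # placeholder; only needed for gated or private models
```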
## Volumes
- `vllm_models`: Cached model files from Hugging Face
## GPU Support
This service requires an NVIDIA GPU to run properly. Uncomment the GPU configuration in `docker-compose.yaml`:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
runtime: nvidia
```
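To check that the GPU is actually visible both on the host and inside the container, something like the following works (assuming the NVIDIA driver and NVIDIA Container Toolkit are installed):

```bash
# On the host: the driver and GPU should be listed
nvidia-smi

# Inside the running container (after uncommenting the GPU block above)
docker compose exec vllm nvidia-smi
```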
## Usage
### Start vLLM
```bash
docker compose up -d
```
### Access
- API Endpoint: <http://localhost:8000>
- OpenAI-compatible API: <http://localhost:8000/v1>
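A quick way to verify the server is up is to list the models it is serving via the OpenAI-compatible `/v1/models` endpoint:

```bash
curl http://localhost:8000/v1/models
```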
### Test the API
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
### Chat Completions
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
## Supported Models
vLLM supports a wide range of models:
- **LLaMA**: LLaMA, LLaMA-2, LLaMA-3
- **Mistral**: Mistral, Mixtral
- **Qwen**: Qwen, Qwen2
- **Yi**: Yi, Yi-VL
- **Many others**: See [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html)
To use a different model, change the `VLLM_MODEL` environment variable:
```bash
# Example: Use Qwen2-7B-Instruct
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
```
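After editing `.env`, recreate the service so the new model is picked up; the weights are downloaded on first start and cached in the `vllm_models` volume:

```bash
docker compose up -d
```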
## Performance Tuning
### GPU Memory
Adjust GPU memory utilization based on your model size and available VRAM:
```bash
VLLM_GPU_MEMORY_UTIL=0.85 # Use 85% of GPU memory
```
### Context Length
Set maximum context length according to your needs:
```bash
VLLM_MAX_MODEL_LEN=4096 # Support up to 4K tokens
```
### Shared Memory
For larger models, increase shared memory:
```yaml
shm_size: 8g # Increase to 8GB
```
## Notes
- Requires an NVIDIA GPU with CUDA support
- Model downloads can be large (several GB to 100+ GB)
- First startup may take time as it downloads the model (see the log command after this list)
- Ensure sufficient GPU memory for the model you want to run
- Default model is small (125M parameters) for testing purposes
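To follow the model download and startup progress mentioned above, tail the container logs:

```bash
docker compose logs -f vllm
```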
## Security
- The API has no authentication by default
- Add an authentication layer (e.g., nginx with basic auth) for production
- Restrict network access to trusted sources (see the sketch after this list)
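One simple way to reduce exposure is to publish the port on the loopback interface only. A minimal sketch of the change in `docker-compose.yaml` (adapt to your own network setup):

```yaml
services:
  vllm:
    ports:
      # bind to localhost only so the API is not reachable from other machines
      - "127.0.0.1:${VLLM_PORT_OVERRIDE:-8000}:8000"
```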
## License
vLLM is licensed under Apache License 2.0. See [vLLM GitHub](https://github.com/vllm-project/vllm) for more information.

src/vllm/docker-compose.yaml

@@ -0,0 +1,52 @@
x-default: &default
  restart: unless-stopped
  volumes:
    - &localtime /etc/localtime:/etc/localtime:ro
    - &timezone /etc/timezone:/etc/timezone:ro
  logging:
    driver: json-file
    options:
      max-size: 100m

services:
  vllm:
    <<: *default
    image: vllm/vllm-openai:${VLLM_VERSION:-v0.8.0}
    container_name: vllm
    ports:
      - "${VLLM_PORT_OVERRIDE:-8000}:8000"
    volumes:
      - *localtime
      - *timezone
      - vllm_models:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN:-}
    command:
      - --model
      - ${VLLM_MODEL:-facebook/opt-125m}
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --max-model-len
      - "${VLLM_MAX_MODEL_LEN:-2048}"
      - --gpu-memory-utilization
      - "${VLLM_GPU_MEMORY_UTIL:-0.9}"
    deploy:
      resources:
        limits:
          cpus: '4.0'
          memory: 8G
        reservations:
          cpus: '2.0'
          memory: 4G
          # Uncomment for GPU support
          # devices:
          #   - driver: nvidia
          #     count: 1
          #     capabilities: [gpu]
    # runtime: nvidia  # Uncomment for GPU support
    shm_size: 4g

volumes:
  vllm_models: