# vLLM
This service deploys vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.
## Services
- `vllm`: vLLM OpenAI-compatible API server
## Environment Variables
| Variable Name | Description | Default Value |
|---|---|---|
| VLLM_VERSION | vLLM image version | v0.8.0 |
| VLLM_MODEL | Model name or path | facebook/opt-125m |
| VLLM_MAX_MODEL_LEN | Maximum context length | 2048 |
| VLLM_GPU_MEMORY_UTIL | GPU memory utilization (0.0-1.0) | 0.9 |
| HF_TOKEN | Hugging Face token for model downloads | "" |
| VLLM_PORT_OVERRIDE | Host port mapping | 8000 |
Adjust these values in the `.env` file as needed for your use case.
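For reference, a `.env` along these lines would serve a larger instruct model (the values are illustrative, not a recommendation):

```bash
# Illustrative example; adjust to your GPU and workload
VLLM_VERSION=v0.8.0
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
VLLM_MAX_MODEL_LEN=8192
VLLM_GPU_MEMORY_UTIL=0.9
HF_TOKEN=""
VLLM_PORT_OVERRIDE=8000
```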
## Volumes
- `vllm_models`: Cached model files from Hugging Face
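To force a clean re-download, the cached models can be removed together with the stack; note that `docker compose down -v` deletes all named volumes of the project, including this cache:

```bash
docker compose down -v   # also removes the vllm_models volume
```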
## GPU Support
This service requires an NVIDIA GPU to run properly. Uncomment the GPU configuration in `docker-compose.yaml`:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
runtime: nvidia
```
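To confirm that containers can see the GPU before starting vLLM, a quick check is to run `nvidia-smi` in a throwaway container (this assumes the NVIDIA Container Toolkit is installed; the CUDA image tag is only an example):

```bash
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```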
## Usage
### Start vLLM
```bash
docker compose up -d
```
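The first start downloads the model, which can take a while. You can follow progress in the logs and wait for the server to report ready; vLLM exposes a `/health` endpoint that returns 200 once the model is loaded:

```bash
docker compose logs -f vllm
curl -f http://localhost:8000/health
```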
### Access
- API Endpoint: http://localhost:8000
- OpenAI-compatible API: http://localhost:8000/v1
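To confirm which model is being served, query the standard OpenAI-compatible models endpoint:

```bash
curl http://localhost:8000/v1/models
```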
### Test the API
```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
### Chat Completions
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
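Responses can also be streamed token by token via the standard OpenAI `stream` flag:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```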
## Supported Models
vLLM supports a wide range of models:
- LLaMA: LLaMA, LLaMA-2, LLaMA-3
- Mistral: Mistral, Mixtral
- Qwen: Qwen, Qwen2
- Yi: Yi, Yi-VL
- Many others: see the supported models list in the vLLM documentation (https://docs.vllm.ai/en/latest/models/supported_models.html)
To use a different model, change the `VLLM_MODEL` environment variable:
```bash
# Example: use Qwen2-7B-Instruct
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
```
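Note that gated repositories on Hugging Face (for example, the Meta Llama models) additionally require a valid token:

```bash
VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
HF_TOKEN="hf_..."   # placeholder: a token with access to the gated repo
```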
## Performance Tuning
### GPU Memory
Adjust GPU memory utilization based on your model size and available VRAM:
```bash
VLLM_GPU_MEMORY_UTIL=0.85  # Use 85% of GPU memory
```
### Context Length
Set maximum context length according to your needs:
```bash
VLLM_MAX_MODEL_LEN=4096  # Support up to 4K tokens
```
### Shared Memory
For larger models, increase shared memory in `docker-compose.yaml`:
```yaml
shm_size: 8g  # Increase to 8GB
```
## Notes
- Requires an NVIDIA GPU with CUDA support
- Model downloads can be large (several GB to 100+ GB)
- The first startup may take a while because the model is downloaded
- Ensure sufficient GPU memory for the model you want to run
- The default model is deliberately small (125M parameters) and intended for testing only
## Security
- The API has no authentication by default
- Add an authentication layer (e.g., nginx with basic auth) for production use
- Restrict network access to trusted sources
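As a minimal first step, the published port can be bound to the loopback interface only, so the API is not reachable from other hosts (a sketch for `docker-compose.yaml`; adjust to your network layout):

```yaml
ports:
  - "127.0.0.1:8000:8000"   # or 127.0.0.1:${VLLM_PORT_OVERRIDE}:8000
```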
## License
vLLM is licensed under the Apache License 2.0. See the vLLM GitHub repository (https://github.com/vllm-project/vllm) for more information.