vLLM

English | 中文

This service deploys vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.

Services

  • vllm: vLLM OpenAI-compatible API server

Environment Variables

Variable Name         Description                               Default Value
VLLM_VERSION          vLLM image version                        v0.8.0
VLLM_MODEL            Model name or path                        facebook/opt-125m
VLLM_MAX_MODEL_LEN    Maximum context length in tokens          2048
VLLM_GPU_MEMORY_UTIL  GPU memory utilization (0.0-1.0)          0.9
HF_TOKEN              Hugging Face token for model downloads    ""
VLLM_PORT_OVERRIDE    Host port mapping                         8000

Please modify the .env file as needed for your use case.
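
For reference, a minimal .env using the default values from the table above might look like this:

VLLM_VERSION=v0.8.0
VLLM_MODEL=facebook/opt-125m
VLLM_MAX_MODEL_LEN=2048
VLLM_GPU_MEMORY_UTIL=0.9
HF_TOKEN=
VLLM_PORT_OVERRIDE=8000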

Volumes

  • vllm_models: Cached model files from Hugging Face
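
To inspect or reclaim the disk space used by cached models, standard Docker volume commands work. The exact volume name depends on your compose project name, so the prefix below is a placeholder:

# Show volumes and their sizes
docker system df -v

# Remove cached models (stop the service first)
docker compose down
docker volume rm <project>_vllm_models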

GPU Support

This service requires an NVIDIA GPU to run properly. Uncomment the GPU configuration in docker-compose.yaml:

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    runtime: nvidia
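
Before starting the service, you can verify that Docker can access the GPU. This assumes the NVIDIA Container Toolkit is installed on the host; the CUDA image tag below is just an example:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi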

Usage

Start vLLM

docker compose up -d
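
After starting, follow the logs to watch the model download and server startup, and poll vLLM's health endpoint to check readiness (adjust the port if you changed VLLM_PORT_OVERRIDE):

# Follow startup logs
docker compose logs -f vllm

# Returns HTTP 200 once the server is ready
curl -i http://localhost:8000/health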

Access

Test the API

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
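
You can also confirm which model the server has loaded by listing the available models:

curl http://localhost:8000/v1/models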

Chat Completions

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
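
The server also supports streamed responses in the same way as the OpenAI API: pass "stream": true in the request body, and use curl -N to disable output buffering so tokens arrive incrementally:

curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'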

Supported Models

vLLM supports a wide range of models:

  • LLaMA: LLaMA, LLaMA-2, LLaMA-3
  • Mistral: Mistral, Mixtral
  • Qwen: Qwen, Qwen2
  • Yi: Yi, Yi-VL
  • Many others: see the vLLM supported models list

To use a different model, change the VLLM_MODEL environment variable:

# Example: Use Qwen2-7B-Instruct
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"

Performance Tuning

GPU Memory

Adjust GPU memory utilization based on your model size and available VRAM:

VLLM_GPU_MEMORY_UTIL=0.85  # Use 85% of GPU memory
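
To pick a sensible value, you can first check how much VRAM the model will have to work with (requires the NVIDIA driver on the host):

nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv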

Context Length

Set maximum context length according to your needs:

VLLM_MAX_MODEL_LEN=4096  # Support up to 4K tokens

Shared Memory

For larger models, increase the shared memory available to the container by raising shm_size on the vllm service:

shm_size: 8g  # Increase to 8GB
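
Note that shm_size is a service-level key in docker-compose.yaml. To confirm the setting took effect, you can inspect the running container (the container name below is an assumption; check docker ps for the actual name):

# Prints the container's shared memory size in bytes
docker inspect --format '{{.HostConfig.ShmSize}}' vllm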

Notes

  • Requires an NVIDIA GPU with CUDA support
  • Model downloads can be large (several GB to 100+ GB)
  • The first startup may take a while as the model is downloaded
  • Ensure you have sufficient GPU memory for the model you want to run
  • The default model (facebook/opt-125m) is small (125M parameters) and intended for testing

Security

  • The API has no authentication by default
  • Add an authentication layer (e.g., nginx with basic auth) for production; see the sketch after this list
  • Restrict network access to trusted sources
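
As a minimal sketch: vLLM's OpenAI-compatible server supports an --api-key option. Assuming the compose file is extended to pass an API key through (this wiring is an assumption, not something this service configures by default), clients would then authenticate with a bearer token:

# Assumes the server was started with an API key; the key below is a placeholder
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer sk-example-key"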

License

vLLM is licensed under the Apache License 2.0. See the vLLM GitHub repository for more information.