llama-swap
llama-swap is a lightweight reverse proxy that provides reliable on-demand model swapping for any local OpenAI/Anthropic-compatible inference server (e.g., llama.cpp, vllm). Only one model is loaded at a time, and it is automatically swapped out when a different model is requested, making it easy to work with many models on a single machine.
See also: README.zh.md
Features
- On-demand model swapping: Automatically load/unload models based on API requests with zero manual intervention.
- OpenAI/Anthropic compatible: Drop-in replacement for any client that uses the OpenAI or Anthropic chat completion API.
- Multi-backend support: Works with llama.cpp (llama-server), vllm, and any OpenAI-compatible server.
- Real-time Web UI: Built-in interface for monitoring logs, inspecting requests, and manually managing models.
- TTL-based unloading: Models can be configured to unload automatically after a period of inactivity.
- HuggingFace model downloads: Reference HuggingFace models directly in config.yaml and they are downloaded on first use.
- Multi-GPU support: Works with NVIDIA CUDA, AMD ROCm/Vulkan, Intel, and CPU-only setups.
Quick Start
- Copy the example environment file:
    cp .env.example .env
- Edit config.yaml to add your models. The provided config.yaml includes a commented example for a local GGUF model and a HuggingFace download. See Configuration for details.
- Start the service (CPU-only by default):
    docker compose up -d
- For NVIDIA GPU support:
    docker compose --profile gpu up -d
- For AMD GPU support (Vulkan):
    docker compose --profile gpu-amd up -d
The API and Web UI are available at: http://localhost:9292
Services
| Service | Profile | Description |
|---|---|---|
| llama-swap | (default) | CPU-only inference |
| llama-swap-cuda | gpu | NVIDIA CUDA GPU inference |
| llama-swap-amd | gpu-amd | AMD GPU inference (Vulkan / ROCm) |
Note: Only start one service at a time. All three services bind to the same host port (LLAMA_SWAP_PORT_OVERRIDE).
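Because the three services share one host port, it can help to check whether something is already listening before starting a different profile. The following is a minimal Python sketch and not part of llama-swap; the helper name port_in_use is hypothetical, and the default of 9292 is the documented default of LLAMA_SWAP_PORT_OVERRIDE.

```python
import socket


def port_in_use(port: int = 9292, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0
```

If port_in_use() returns True, stop the running service before bringing up another profile.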
Configuration
config.yaml
The config.yaml file defines the models llama-swap manages. It is mounted read-only at /app/config.yaml inside the container. Edit the provided config.yaml to add your models.
Minimal example:
healthCheckTimeout: 300
models:
  my-model:
    cmd: /app/llama-server --port ${PORT} --model /root/.cache/llama.cpp/model.gguf --ctx-size 4096
    proxy: 'http://localhost:${PORT}'
    ttl: 900
- ${PORT} is automatically assigned by llama-swap.
- ttl (seconds): unload the model after this many seconds of inactivity.
- cmd: the command to start the inference server.
- proxy: the address llama-swap forwards requests to.
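To make the ${PORT} placeholder concrete, the Python sketch below substitutes a port into the cmd and proxy values from the minimal example. It only illustrates the effect of the substitution and makes no claim about how llama-swap implements it; the expand helper and the port 5800 are hypothetical choices for this example.

```python
from string import Template

# The cmd and proxy values from the minimal example.
cmd = "/app/llama-server --port ${PORT} --model /root/.cache/llama.cpp/model.gguf --ctx-size 4096"
proxy = "http://localhost:${PORT}"


def expand(value: str, port: int) -> str:
    """Substitute ${PORT} in a config value (illustration only)."""
    return Template(value).substitute(PORT=port)


# 5800 is a hypothetical port chosen for this example.
print(expand(cmd, 5800))
print(expand(proxy, 5800))
```

Both values receive the same port, which is why the inference server started by cmd and the address in proxy always agree.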
For downloading models from HuggingFace on first use:
models:
  Qwen2.5-7B:
    cmd: /app/llama-server --port ${PORT} -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --ctx-size 8192 --n-gpu-layers 99
    proxy: 'http://localhost:${PORT}'
See the official configuration documentation for all options including groups, hooks, macros, aliases, filters, and more.
Models Volume
The named volume llama_swap_models is mounted to /root/.cache/llama.cpp inside the container. To place local GGUF model files inside the volume, you can use:
# Copy a model into the named volume
docker run --rm -v llama_swap_models:/data -v /path/to/model.gguf:/src/model.gguf alpine cp /src/model.gguf /data/model.gguf
Alternatively, change the volume definition in docker-compose.yaml to use a host path:
volumes:
  llama_swap_models:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /path/to/your/models
Environment Variables
| Variable | Default | Description |
|---|---|---|
| TZ | UTC | Container timezone |
| GHCR_REGISTRY | ghcr.io/ | GitHub Container Registry prefix |
| LLAMA_SWAP_VERSION | cpu | Image tag for the default CPU service |
| LLAMA_SWAP_CUDA_VERSION | cuda | Image tag for the CUDA service |
| LLAMA_SWAP_AMD_VERSION | vulkan | Image tag for the AMD service (vulkan or rocm) |
| LLAMA_SWAP_PORT_OVERRIDE | 9292 | Host port for the API and Web UI |
| LLAMA_SWAP_GPU_COUNT | 1 | Number of NVIDIA GPUs to use (CUDA profile) |
| LLAMA_SWAP_CPU_LIMIT | 4.0 | CPU limit in cores |
| LLAMA_SWAP_CPU_RESERVATION | 2.0 | CPU reservation in cores |
| LLAMA_SWAP_MEMORY_LIMIT | 8G | Memory limit |
| LLAMA_SWAP_MEMORY_RESERVATION | 4G | Memory reservation |
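Client scripts that should follow a changed host port can derive their base URL from the same variable. The following is a small Python sketch; the base_url helper is hypothetical, and it assumes LLAMA_SWAP_PORT_OVERRIDE is exported in the client's own environment. A value set only in .env is read by Docker Compose and is not visible to the script unless it is exported separately.

```python
import os
from typing import Mapping, Optional


def base_url(env: Optional[Mapping[str, str]] = None) -> str:
    """Build the local base URL from LLAMA_SWAP_PORT_OVERRIDE (default 9292)."""
    env = os.environ if env is None else env
    port = env.get("LLAMA_SWAP_PORT_OVERRIDE", "9292")
    return f"http://localhost:{port}"
```

Passing a plain dict instead of os.environ keeps the helper easy to exercise without touching the real environment.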
Default Ports
| Port | Description |
|---|---|
| 9292 | OpenAI/Anthropic-compatible API and Web UI |
API Usage
llama-swap exposes an OpenAI-compatible API. Use any OpenAI client by pointing it to http://localhost:9292:
# List available models
curl http://localhost:9292/v1/models
# Chat completion (automatically loads the model if not running)
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
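The same two calls can be made from Python with only the standard library. This is a minimal sketch under a few assumptions: the helper names chat_request, model_ids, and send are hypothetical, the /v1/models response is assumed to follow the OpenAI list shape with a data array of objects carrying an id, and the generous timeout is a guess meant to cover the first request, which may have to wait for the model to load.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9292"


def chat_request(model: str, prompt: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request for /v1/chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def model_ids(payload: dict) -> list:
    """Extract model ids, assuming the OpenAI list shape {"data": [{"id": ...}]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def send(request: urllib.request.Request, timeout: float = 600.0) -> dict:
    """Send a request and decode the JSON response (needs a running service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the service running, send(chat_request("my-model", "Hello!")) returns the decoded completion, and model_ids(send(urllib.request.Request(BASE_URL + "/v1/models"))) lists the configured model names.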
The Web UI is available at http://localhost:9292 and provides real-time log streaming, request inspection, and manual model management.
NVIDIA GPU Setup
Requires the NVIDIA Container Toolkit.
docker compose --profile gpu up -d
For non-root security hardening, use the cuda-non-root image tag:
LLAMA_SWAP_CUDA_VERSION=cuda-non-root
AMD GPU Setup
Requires the /dev/dri device (and /dev/kfd when using ROCm) to be accessible on the host.
docker compose --profile gpu-amd up -d
Use rocm instead of vulkan for full ROCm support:
LLAMA_SWAP_AMD_VERSION=rocm docker compose --profile gpu-amd up -d
Security Notes
- By default, the container runs as root. Use the cuda-non-root or rocm-non-root image tags for improved security on GPU deployments.
- The config.yaml is mounted read-only (ro).
- Consider placing llama-swap behind a reverse proxy (e.g., Nginx, Caddy) when exposing it beyond localhost.
License
llama-swap is released under the MIT License. See the LICENSE file for details.