Files

Sun-ZhenXing fbd0c9b7f4 feat: add services

- Introduced Convex, an open-source reactive database, with README and environment variable configurations.
- Added Chinese translation for Convex documentation.
- Created docker-compose configuration for Convex services.
- Introduced llama-swap, a model swapping proxy for OpenAI/Anthropic compatible servers, with comprehensive README and example configuration.
- Added Chinese translation for llama-swap documentation.
- Included example environment file and docker-compose setup for llama-swap.
- Configured health checks and resource limits for both Convex and llama-swap services.

2026-03-09 09:27:06 +08:00

7.2 KiB

Raw Blame History

llama-swap

llama-swap 是一个轻量级反向代理，为任何本地 OpenAI/Anthropic 兼容的推理服务器（如 llama.cpp、vllm 等）提供可靠的按需模型切换功能。同一时间只加载一个模型，当收到对不同模型的请求时，llama-swap 会自动切换，让你可以在单台机器上轻松使用多个模型。

参见：README.md

功能特性

按需模型切换：根据 API 请求自动加载/卸载模型，无需手动干预。
兼容 OpenAI/Anthropic：可直接替代任何使用 OpenAI 或 Anthropic 聊天补全 API 的客户端。
多后端支持：适用于 llama.cpp（llama-server）、vllm 及任何 OpenAI 兼容服务器。
实时 Web UI：内置界面，可监控日志、检查请求、手动管理模型。
基于 TTL 的自动卸载：可配置模型在闲置一段时间后自动卸载。
HuggingFace 模型下载：在 config.yaml 中直接引用 HuggingFace 模型，首次使用时自动下载。
多 GPU 支持：支持 NVIDIA CUDA、AMD ROCm/Vulkan、Intel 及纯 CPU 部署。

快速开始

复制环境变量示例文件：
```
cp .env.example .env
```
编辑 config.yaml，添加你的模型配置。提供的 config.yaml 包含本地 GGUF 模型和 HuggingFace 下载的注释示例。详见配置说明。
启动服务（默认仅使用 CPU）：
```
docker compose up -d
```
启用 NVIDIA GPU 支持：
```
docker compose --profile gpu up -d
```
启用 AMD GPU 支持（Vulkan）：
```
docker compose --profile gpu-amd up -d
```

API 和 Web UI 地址：http://localhost:9292

服务说明

服务名称	Profile	说明
`llama-swap`	（默认）	纯 CPU 推理
`llama-swap-cuda`	`gpu`	NVIDIA CUDA GPU 推理
`llama-swap-amd`	`gpu-amd`	AMD GPU 推理（Vulkan / ROCm）

注意：每次只启动一个服务，三个服务均绑定到同一主机端口（LLAMA_SWAP_PORT_OVERRIDE）。

配置说明

`config.yaml`

config.yaml 文件定义了 llama-swap 管理的模型列表，以只读方式挂载到容器内的 /app/config.yaml。编辑提供的 config.yaml 即可添加你的模型。

最简示例：

healthCheckTimeout: 300

models:
  my-model:
    cmd: /app/llama-server --port ${PORT} --model /root/.cache/llama.cpp/model.gguf --ctx-size 4096
    proxy: 'http://localhost:${PORT}'
    ttl: 900

${PORT} 由 llama-swap 自动分配。
ttl（秒）：模型闲置超过该时长后自动卸载。
cmd：启动推理服务器的命令。
proxy：llama-swap 转发请求的地址。

直接使用 HuggingFace 模型（首次使用时自动下载）：

models:
  Qwen2.5-7B:
    cmd: /app/llama-server --port ${PORT} -hf bartowski/Qwen2.5-7B-Instruct-GGUF:Q4_K_M --ctx-size 8192 --n-gpu-layers 99
    proxy: 'http://localhost:${PORT}'

完整配置选项（包括 groups、hooks、macros、aliases、filters 等）请参阅官方配置文档。

模型卷

命名卷 llama_swap_models 挂载到容器内的 /root/.cache/llama.cpp。可以通过以下方式将本地 GGUF 模型文件放入卷中：

# 将模型文件复制到命名卷
docker run --rm -v llama_swap_models:/data -v /path/to/model.gguf:/src/model.gguf alpine cp /src/model.gguf /data/model.gguf

或者将 docker-compose.yaml 中的卷定义改为主机路径绑定：

volumes:
  llama_swap_models:
    driver: local
    driver_opts:
      type: none
      o: bind
      device: /path/to/your/models

环境变量

变量名	默认值	说明
`TZ`	`UTC`	容器时区
`GHCR_REGISTRY`	`ghcr.io/`	GitHub 容器镜像仓库前缀
`LLAMA_SWAP_VERSION`	`cpu`	默认 CPU 服务镜像标签
`LLAMA_SWAP_CUDA_VERSION`	`cuda`	CUDA 服务镜像标签
`LLAMA_SWAP_AMD_VERSION`	`vulkan`	AMD 服务镜像标签（`vulkan` 或 `rocm`）
`LLAMA_SWAP_PORT_OVERRIDE`	`9292`	API 和 Web UI 的主机端口
`LLAMA_SWAP_GPU_COUNT`	`1`	使用的 NVIDIA GPU 数量（gpu profile）
`LLAMA_SWAP_CPU_LIMIT`	`4.0`	CPU 上限（核心数）
`LLAMA_SWAP_CPU_RESERVATION`	`2.0`	CPU 预留（核心数）
`LLAMA_SWAP_MEMORY_LIMIT`	`8G`	内存上限
`LLAMA_SWAP_MEMORY_RESERVATION`	`4G`	内存预留

默认端口

端口	说明
`9292`	OpenAI/Anthropic 兼容 API 及 Web UI

API 使用示例

llama-swap 暴露 OpenAI 兼容 API。将任何 OpenAI 客户端指向 http://localhost:9292 即可使用：

# 列出可用模型
curl http://localhost:9292/v1/models

# 聊天补全（若模型未运行则自动加载）
curl http://localhost:9292/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "my-model",
    "messages": [{"role": "user", "content": "你好！"}]
  }'

Web UI 可通过 http://localhost:9292 访问，提供实时日志流、请求检查和手动模型管理功能。

NVIDIA GPU 配置

需要安装 NVIDIA Container Toolkit。

docker compose --profile gpu up -d

如需非 root 安全加固，可使用 cuda-non-root 镜像标签：

LLAMA_SWAP_CUDA_VERSION=cuda-non-root docker compose --profile gpu up -d

AMD GPU 配置

需要主机上 /dev/dri 和 /dev/kfd 设备可访问。

docker compose --profile gpu-amd up -d

如需完整 ROCm 支持，可使用 rocm 替代 vulkan：

LLAMA_SWAP_AMD_VERSION=rocm docker compose --profile gpu-amd up -d

安全说明

默认情况下容器以 root 用户运行。GPU 部署时建议使用 cuda-non-root 或 rocm-non-root 镜像标签提升安全性。
config.yaml 以只读方式（ro）挂载。
若需在 localhost 之外暴露服务，建议在 llama-swap 前部署反向代理（如 Nginx、Caddy）。

参考链接

许可证

llama-swap 使用 MIT 许可证发布。详情请参阅 LICENSE 文件。

7.2 KiB Raw Blame History Unescape Escape