feat: Add Chinese documentation and Docker Compose configurations for DeepTutor and llama.cpp

- Created README.zh.md for DeepTutor with comprehensive features, installation steps, and usage instructions in Chinese.
- Added docker-compose.yaml for DeepTutor to define services, environment variables, and resource limits.
- Introduced .env.example for llama.cpp with configuration options for server settings and resource management.
- Added README.md and README.zh.md for llama.cpp detailing features, prerequisites, quick start guides, and API documentation.
- Implemented docker-compose.yaml for llama.cpp to support various server configurations (CPU, CUDA, ROCm) and CLI usage.
Sun-ZhenXing
2026-02-01 16:08:44 +08:00
parent e2ac465417
commit 28ed2462af
10 changed files with 1470 additions and 0 deletions

src/llama.cpp/.env.example

@@ -0,0 +1,106 @@
# =============================================================================
# llama.cpp Configuration
# https://github.com/ggml-org/llama.cpp
# LLM inference in C/C++ with support for various hardware accelerators
# =============================================================================
# -----------------------------------------------------------------------------
# General Settings
# -----------------------------------------------------------------------------
# Timezone for the container (default: UTC)
TZ=UTC
# Global registry prefix (optional)
# Example: docker.io/, ghcr.io/, registry.example.com/
GHCR_REGISTRY=ghcr.io/
# -----------------------------------------------------------------------------
# Server Configuration
# -----------------------------------------------------------------------------
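# Server image variant (used by the `server` profile only; the `cuda` and
# `rocm` services in docker-compose.yaml pin their own image tags)
# Options: server (CPU), server-cuda (NVIDIA GPU), server-rocm (AMD GPU),
#          server-musa (Moore Threads GPU), server-intel (Intel GPU),
#          server-vulkan (Vulkan GPU)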
LLAMA_CPP_SERVER_VARIANT=server
# Server port override (default: 8080)
LLAMA_CPP_SERVER_PORT_OVERRIDE=8080
# Model path inside the container
# You need to mount your model file to this path
# Example: /models/llama-2-7b-chat.Q4_K_M.gguf
LLAMA_CPP_MODEL_PATH=/models/model.gguf
# Context size (number of tokens)
# Larger values allow for more context but require more memory
# Default: 512, Common values: 512, 2048, 4096, 8192, 16384, 32768
LLAMA_CPP_CONTEXT_SIZE=512
# Number of GPU layers to offload
# 0 = CPU only, 99 = all layers on GPU (for GPU variants)
# For CPU variant, keep this at 0
LLAMA_CPP_GPU_LAYERS=0
# Number of GPUs to use (for CUDA variant)
LLAMA_CPP_GPU_COUNT=1
# Server CPU limit (in cores)
LLAMA_CPP_SERVER_CPU_LIMIT=4.0
# Server CPU reservation (in cores)
LLAMA_CPP_SERVER_CPU_RESERVATION=2.0
# Server memory limit
LLAMA_CPP_SERVER_MEMORY_LIMIT=8G
# Server memory reservation
LLAMA_CPP_SERVER_MEMORY_RESERVATION=4G
# -----------------------------------------------------------------------------
# CLI Configuration (Light variant)
# -----------------------------------------------------------------------------
# CLI image variant
# Options: light (CPU), light-cuda (NVIDIA GPU), light-rocm (AMD GPU),
#          light-musa (Moore Threads GPU), light-intel (Intel GPU),
#          light-vulkan (Vulkan GPU)
LLAMA_CPP_CLI_VARIANT=light
# Default prompt for CLI mode
LLAMA_CPP_PROMPT=Hello, how are you?
# CLI CPU limit (in cores)
LLAMA_CPP_CLI_CPU_LIMIT=2.0
# CLI CPU reservation (in cores)
LLAMA_CPP_CLI_CPU_RESERVATION=1.0
# CLI memory limit
LLAMA_CPP_CLI_MEMORY_LIMIT=4G
# CLI memory reservation
LLAMA_CPP_CLI_MEMORY_RESERVATION=2G
# -----------------------------------------------------------------------------
# Full Toolkit Configuration
# -----------------------------------------------------------------------------
# Full image variant (includes model conversion tools)
# Options: full (CPU), full-cuda (NVIDIA GPU), full-rocm (AMD GPU),
#          full-musa (Moore Threads GPU), full-intel (Intel GPU),
#          full-vulkan (Vulkan GPU)
LLAMA_CPP_FULL_VARIANT=full
# Full CPU limit (in cores)
LLAMA_CPP_FULL_CPU_LIMIT=2.0
# Full CPU reservation (in cores)
LLAMA_CPP_FULL_CPU_RESERVATION=1.0
# Full memory limit
LLAMA_CPP_FULL_MEMORY_LIMIT=4G
# Full memory reservation
LLAMA_CPP_FULL_MEMORY_RESERVATION=2G

src/llama.cpp/README.md

@@ -0,0 +1,245 @@
# llama.cpp
[中文文档](README.zh.md)
[llama.cpp](https://github.com/ggml-org/llama.cpp) is a high-performance C/C++ implementation for LLM inference with support for various hardware accelerators.
## Features
- **Fast Inference**: Optimized C/C++ implementation for efficient LLM inference
- **Multiple Backends**: CPU, CUDA (NVIDIA), ROCm (AMD), MUSA (Moore Threads), Intel GPU, Vulkan
- **OpenAI-compatible API**: Server mode with OpenAI-compatible REST API
- **CLI Support**: Interactive command-line interface for quick testing
- **Model Conversion**: Full toolkit includes tools to convert and quantize models
- **GGUF Format**: Support for the efficient GGUF model format
- **Cross-platform**: Linux (x86-64, ARM64, s390x), Windows, macOS
## Prerequisites
- Docker and Docker Compose installed
- At least 4GB of RAM (8GB+ recommended)
- For GPU variants:
  - **CUDA**: NVIDIA GPU with [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)
  - **ROCm**: AMD GPU with proper ROCm drivers
  - **MUSA**: Moore Threads GPU with mt-container-toolkit
- GGUF format model file (e.g., from [Hugging Face](https://huggingface.co/models?library=gguf))
## Quick Start
### 1. Server Mode (CPU)
```bash
# Copy and configure environment
cp .env.example .env
# Edit .env and set your model path
# LLAMA_CPP_MODEL_PATH=/models/your-model.gguf
# Place your GGUF model in a directory, then edit docker-compose.yaml to
# bind-mount that directory in place of the default named volume, e.g.:
#   volumes:
#     - ./models:/models
# Start the server
docker compose --profile server up -d
# Test the server (OpenAI-compatible API)
curl http://localhost:8080/v1/models
# Chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
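If a script needs to wait until the model has finished loading before it sends requests, one option is to poll the `/health` endpoint. The helper below is a minimal sketch and not part of this project: `wait_for` is a hypothetical name, and the attempt count and delay in the usage line are arbitrary choices.

```bash
# Run a probe command repeatedly until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay_seconds> <command...>
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  while [ "$attempts" -gt 0 ]; do
    # Any command can serve as the probe; its output is discarded.
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  return 1
}

# Example (assumes the server from this section is published on port 8080):
# wait_for 60 2 curl -fsS http://localhost:8080/health && echo "server ready"
```

`curl -f` exits non-zero on connection failures and HTTP error responses, so the loop keeps retrying until the endpoint answers successfully.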
### 2. Server Mode with NVIDIA GPU
```bash
# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU
# Start GPU-accelerated server
docker compose --profile cuda up -d
# The server will automatically use NVIDIA GPU
```
### 3. Server Mode with AMD GPU
```bash
# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU
# Start GPU-accelerated server
docker compose --profile rocm up -d
# The server will automatically use AMD GPU
```
### 4. CLI Mode
```bash
# Edit .env and configure model path and prompt
# Run CLI
docker compose --profile cli up
# For interactive use, run the service with a TTY and your own arguments
# (they replace the default command):
docker compose run --rm llama-cpp-cli \
  -m /models/your-model.gguf \
  -p "Your prompt here" \
  -n 512
```
### 5. Full Toolkit (Model Conversion)
```bash
# Start the full container
docker compose --profile full up -d
# Execute commands inside the container
docker compose exec llama-cpp-full bash
# Inside container, you can use conversion tools
# Example: Convert a Hugging Face model
# python3 convert_hf_to_gguf.py /models/source-model --outfile /models/output.gguf
```
## Configuration
### Environment Variables
Key environment variables (see [.env.example](.env.example) for all options):
| Variable | Description | Default |
| -------------------------------- | ------------------------------------------------------------- | -------------------- |
| `LLAMA_CPP_SERVER_VARIANT` | Server image variant (server, server-cuda, server-rocm, etc.) | `server` |
| `LLAMA_CPP_MODEL_PATH` | Model file path inside container | `/models/model.gguf` |
| `LLAMA_CPP_CONTEXT_SIZE` | Context window size in tokens | `512` |
| `LLAMA_CPP_GPU_LAYERS` | Number of layers to offload to GPU (0=CPU only, 99=all) | `0` |
| `LLAMA_CPP_SERVER_PORT_OVERRIDE` | Server port on host | `8080` |
| `LLAMA_CPP_SERVER_MEMORY_LIMIT` | Memory limit for server | `8G` |
### Available Profiles
- `server`: CPU-only server
- `cuda`: NVIDIA GPU server (requires nvidia-container-toolkit)
- `rocm`: AMD GPU server (requires ROCm)
- `cli`: Command-line interface
- `full`: Full toolkit with model conversion tools
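- `gpu`: Generic GPU profile that starts both the `cuda` and `rocm` services; both publish the same host port, so prefer `cuda` or `rocm` on a single host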
### Image Variants
Images come in three types:
- **server**: Only `llama-server` executable (API server)
- **light**: Only `llama-cli` and `llama-completion` executables
- **full**: Complete toolkit including model conversion tools
Backend options (added as a suffix to the image type, e.g. `server-cuda`):
- Base (CPU)
- `-cuda` (NVIDIA GPU)
- `-rocm` (AMD GPU)
- `-musa` (Moore Threads GPU)
- `-intel` (Intel GPU with SYCL)
- `-vulkan` (Vulkan GPU)
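The image type and backend suffix combine into a single tag. As a small sketch of that naming scheme, the hypothetical helper `image_ref` below builds the full image reference the same way `docker-compose.yaml` does, falling back to `ghcr.io/` when `GHCR_REGISTRY` is unset. The expected outputs in the comments assume `GHCR_REGISTRY` is not set.

```bash
# Build an image reference from an image type (server, light, full)
# and an optional backend suffix (cuda, rocm, musa, intel, vulkan).
image_ref() {
  kind=$1
  backend=${2:-}
  registry=${GHCR_REGISTRY:-ghcr.io/}
  if [ -n "$backend" ]; then
    printf '%s\n' "${registry}ggml-org/llama.cpp:${kind}-${backend}"
  else
    printf '%s\n' "${registry}ggml-org/llama.cpp:${kind}"
  fi
}

image_ref server        # ghcr.io/ggml-org/llama.cpp:server
image_ref light cuda    # ghcr.io/ggml-org/llama.cpp:light-cuda
```

The resulting tag is what you would put in `LLAMA_CPP_SERVER_VARIANT`, `LLAMA_CPP_CLI_VARIANT`, or `LLAMA_CPP_FULL_VARIANT` (only the part after the colon).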
## Server API
The server provides an OpenAI-compatible API:
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion
- `POST /v1/embeddings` - Generate embeddings
See the [llama.cpp server documentation](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md) for full API details.
## Model Sources
Download GGUF models from:
- [Hugging Face GGUF Models](https://huggingface.co/models?library=gguf)
- [TheBloke's GGUF Collection](https://huggingface.co/TheBloke)
- Convert your own models using the full toolkit
Popular quantization formats:
- `Q4_K_M`: Good balance of quality and size (recommended)
- `Q5_K_M`: Higher quality, larger size
- `Q8_0`: Very high quality, large size
- `Q2_K`: Smallest size, lower quality
## Resource Requirements
Minimum requirements by model size:
| Model Size | RAM (CPU) | VRAM (GPU) | Context Size |
| ---------- | --------- | ---------- | ------------ |
| 7B Q4_K_M | 6GB | 4GB | 2048 |
| 13B Q4_K_M | 10GB | 8GB | 2048 |
| 34B Q4_K_M | 24GB | 20GB | 2048 |
| 70B Q4_K_M | 48GB | 40GB | 2048 |
Larger context sizes require proportionally more memory.
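As a rough illustration of why context size matters, the KV cache alone can be estimated with simple arithmetic. The sketch below assumes an f16 cache and the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128); `kv_cache_mib` is a hypothetical helper. Real usage also includes the model weights and compute buffers, so treat the result as the context-dependent part only, not a total.

```bash
# Rough KV-cache size in MiB for an f16 cache:
#   2 (K and V) * layers * kv_heads * head_dim * 2 bytes * context tokens
kv_cache_mib() {
  layers=$1
  kv_heads=$2
  head_dim=$3
  ctx=$4
  echo $(( 2 * layers * kv_heads * head_dim * 2 * ctx / 1024 / 1024 ))
}

# Assumed Llama-2-7B shape: 32 layers, 32 KV heads, head dimension 128.
kv_cache_mib 32 32 128 2048   # 1024
kv_cache_mib 32 32 128 8192   # 4096
```

Under these assumptions each token costs 0.5 MiB of cache, so quadrupling `LLAMA_CPP_CONTEXT_SIZE` from 2048 to 8192 adds about 3 GiB.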
## Performance Tuning
For CPU inference:
- Increase `LLAMA_CPP_SERVER_CPU_LIMIT` for more cores
- Tune the thread count with the `-t` flag (default: auto) by adding it to the service `command` in `docker-compose.yaml`
For GPU inference:
- Set `LLAMA_CPP_GPU_LAYERS=99` to offload all layers
- Increase context size for longer conversations
- Monitor GPU memory usage
## Security Notes
- The server binds to `0.0.0.0` by default - ensure proper network security
- No authentication is enabled by default; `llama-server` accepts an `--api-key` option if you add it to the service `command`
- Consider using a reverse proxy (nginx, Caddy) for production deployments
- Limit resource usage to prevent system exhaustion
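One way to keep the API off the network is to publish the port on loopback only. This is a sketch that assumes the port mapping in `docker-compose.yaml` stays `"${LLAMA_CPP_SERVER_PORT_OVERRIDE:-8080}:8080"`: Docker accepts a host IP in front of the host port, so prefixing the override with `127.0.0.1:` yields the mapping `127.0.0.1:8080:8080`.

```bash
# .env: publish the server on the loopback interface only
LLAMA_CPP_SERVER_PORT_OVERRIDE=127.0.0.1:8080
```

A reverse proxy running on the same host can then forward to `http://127.0.0.1:8080` and add TLS and authentication in front of it.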
## Troubleshooting
### Out of Memory
- Reduce `LLAMA_CPP_CONTEXT_SIZE`
- Use a smaller quantized model (e.g., Q4 instead of Q8)
- Reduce `LLAMA_CPP_GPU_LAYERS` if using GPU
### GPU Not Detected
**NVIDIA**: Verify nvidia-container-toolkit is installed:
```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
**AMD**: Ensure ROCm drivers and `/dev/kfd`, `/dev/dri` are accessible.
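A quick way to check the AMD device nodes on the Docker host is a small loop over the paths. The helper below is a minimal sketch; `check_devices` is a hypothetical name and only tests that the paths exist, not that the current user may open them.

```bash
# Report which of the given device nodes exist.
# Returns non-zero if at least one is missing.
check_devices() {
  missing=0
  for dev in "$@"; do
    if [ -e "$dev" ]; then
      echo "ok: $dev"
    else
      echo "missing: $dev"
      missing=1
    fi
  done
  return "$missing"
}

# On the Docker host:
# check_devices /dev/kfd /dev/dri
```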
### Slow Inference
- Check CPU/GPU utilization
- Increase resource limits in `.env`
- For GPU: Verify all layers are offloaded (`LLAMA_CPP_GPU_LAYERS=99`)
## Documentation
- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [Docker Documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md)
- [Server API Docs](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
## License
llama.cpp is released under the MIT License. See the [LICENSE](https://github.com/ggml-org/llama.cpp/blob/master/LICENSE) file for details.

src/llama.cpp/README.zh.md

@@ -0,0 +1,244 @@
# llama.cpp
[English Documentation](README.md)
[llama.cpp](https://github.com/ggml-org/llama.cpp) 是一个高性能的 C/C++ 实现的大语言模型推理引擎,支持多种硬件加速器。
## 功能特性
- **高速推理**:优化的 C/C++ 实现,提供高效的 LLM 推理
- **多种后端**:支持 CPU、CUDA(NVIDIA)、ROCm(AMD)、MUSA(摩尔线程)、Intel GPU、Vulkan
- **OpenAI 兼容 API**:服务器模式提供 OpenAI 兼容的 REST API
- **CLI 支持**:交互式命令行界面,方便快速测试
- **模型转换**:完整工具包包含模型转换和量化工具
- **GGUF 格式**:支持高效的 GGUF 模型格式
- **跨平台**:支持 Linux(x86-64、ARM64、s390x)、Windows、macOS
## 前置要求
- 已安装 Docker 和 Docker Compose
- 至少 4GB 内存(推荐 8GB 以上)
- GPU 版本需要:
  - **CUDA**:NVIDIA GPU 及 [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)
  - **ROCm**:AMD GPU 及相应的 ROCm 驱动
  - **MUSA**:摩尔线程 GPU 及 mt-container-toolkit
- GGUF 格式的模型文件(例如从 [Hugging Face](https://huggingface.co/models?library=gguf) 下载)
## 快速开始
### 1. 服务器模式(CPU)
```bash
# 复制并配置环境变量
cp .env.example .env
# 编辑 .env 并设置模型路径
# LLAMA_CPP_MODEL_PATH=/models/your-model.gguf
# 将 GGUF 模型放在目录中,然后更新 docker-compose.yaml 挂载,例如:
# volumes:
# - ./models:/models
# 启动服务器
docker compose --profile server up -d
# 测试服务器(OpenAI 兼容 API)
curl http://localhost:8080/v1/models
# 聊天补全请求
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "你好!"}
    ]
  }'
```
### 2. 服务器模式(NVIDIA GPU)
```bash
# 编辑 .env
# 设置 LLAMA_CPP_GPU_LAYERS=99 将所有层卸载到 GPU
# 启动 GPU 加速服务器
docker compose --profile cuda up -d
# 服务器将自动使用 NVIDIA GPU
```
### 3. 服务器模式(AMD GPU)
```bash
# 编辑 .env
# 设置 LLAMA_CPP_GPU_LAYERS=99 将所有层卸载到 GPU
# 启动 GPU 加速服务器
docker compose --profile rocm up -d
# 服务器将自动使用 AMD GPU
```
### 4. CLI 模式
```bash
# 编辑 .env 并配置模型路径和提示词
# 运行 CLI
docker compose --profile cli up
# 交互模式:
docker compose run --rm llama-cpp-cli \
  -m /models/your-model.gguf \
  -p "你的提示词" \
  -n 512
```
### 5. 完整工具包(模型转换)
```bash
# 启动完整容器
docker compose --profile full up -d
# 在容器内执行命令
docker compose exec llama-cpp-full bash
# 在容器内可以使用转换工具
# 示例:转换 Hugging Face 模型
# python3 convert_hf_to_gguf.py /models/source-model --outfile /models/output.gguf
```
## 配置说明
### 环境变量
主要环境变量(完整选项请查看 [.env.example](.env.example)):
| 变量                             | 说明                                                  | 默认值               |
| -------------------------------- | ----------------------------------------------------- | -------------------- |
| `LLAMA_CPP_SERVER_VARIANT`       | 服务器镜像变体(server、server-cuda、server-rocm 等) | `server`             |
| `LLAMA_CPP_MODEL_PATH`           | 容器内模型文件路径                                    | `/models/model.gguf` |
| `LLAMA_CPP_CONTEXT_SIZE`         | 上下文窗口大小(token 数)                            | `512`                |
| `LLAMA_CPP_GPU_LAYERS`           | 卸载到 GPU 的层数(0=仅 CPU,99=全部)                | `0`                  |
| `LLAMA_CPP_SERVER_PORT_OVERRIDE` | 主机端口                                              | `8080`               |
| `LLAMA_CPP_SERVER_MEMORY_LIMIT`  | 服务器内存限制                                        | `8G`                 |
### 可用配置文件
- `server`:仅 CPU 服务器
- `cuda`:NVIDIA GPU 服务器(需要 nvidia-container-toolkit)
- `rocm`:AMD GPU 服务器(需要 ROCm)
- `cli`:命令行界面
- `full`:包含模型转换工具的完整工具包
- `gpu`:通用 GPU 配置(包括 cuda 和 rocm)
### 镜像变体
每个变体都有多种类型:
- **server**:仅包含 `llama-server` 可执行文件(API 服务器)
- **light**:仅包含 `llama-cli` 和 `llama-completion` 可执行文件
- **full**:完整工具包,包括模型转换工具
后端选项:
- 基础版(CPU)
- `-cuda`(NVIDIA GPU)
- `-rocm`(AMD GPU)
- `-musa`(摩尔线程 GPU)
- `-intel`(Intel GPU,支持 SYCL)
- `-vulkan`(Vulkan GPU)
## 服务器 API
服务器提供 OpenAI 兼容的 API:
- `GET /health` - 健康检查
- `GET /v1/models` - 列出可用模型
- `POST /v1/chat/completions` - 聊天补全
- `POST /v1/completions` - 文本补全
- `POST /v1/embeddings` - 生成嵌入向量
完整 API 详情请参阅 [llama.cpp 服务器文档](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)。
## 模型来源
下载 GGUF 模型:
- [Hugging Face GGUF 模型](https://huggingface.co/models?library=gguf)
- [TheBloke 的 GGUF 合集](https://huggingface.co/TheBloke)
- 使用完整工具包转换您自己的模型
常用量化格式:
- `Q4_K_M`:质量和大小的良好平衡(推荐)
- `Q5_K_M`:更高质量,更大体积
- `Q8_0`:非常高的质量,大体积
- `Q2_K`:最小体积,较低质量
## 资源需求
按模型大小的最低要求:
| 模型大小   | 内存(CPU) | 显存(GPU) | 上下文大小 |
| ---------- | ----------- | ----------- | ---------- |
| 7B Q4_K_M | 6GB | 4GB | 2048 |
| 13B Q4_K_M | 10GB | 8GB | 2048 |
| 34B Q4_K_M | 24GB | 20GB | 2048 |
| 70B Q4_K_M | 48GB | 40GB | 2048 |
更大的上下文大小需要成比例的更多内存。
## 性能调优
CPU 推理:
- 增加 `LLAMA_CPP_SERVER_CPU_LIMIT` 以使用更多核心
- 使用 `-t` 参数优化线程数(默认:自动)
GPU 推理:
- 设置 `LLAMA_CPP_GPU_LAYERS=99` 卸载所有层
- 增加上下文大小以支持更长对话
- 监控 GPU 内存使用
## 安全注意事项
- 服务器默认绑定到 `0.0.0.0` - 请确保网络安全
- 默认未启用身份验证
- 生产环境建议使用反向代理(nginx、Caddy)
- 限制资源使用以防止系统资源耗尽
## 故障排除
### 内存不足
- 减小 `LLAMA_CPP_CONTEXT_SIZE`
- 使用更小的量化模型(例如 Q4 而不是 Q8)
- 减少 `LLAMA_CPP_GPU_LAYERS`(如果使用 GPU)
### GPU 未检测到
**NVIDIA**:验证 nvidia-container-toolkit 是否已安装:
```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```
**AMD**:确保 ROCm 驱动已安装且 `/dev/kfd` 和 `/dev/dri` 可访问。
### 推理速度慢
- 检查 CPU/GPU 利用率
- 增加 `.env` 中的资源限制
- GPU:验证所有层都已卸载(`LLAMA_CPP_GPU_LAYERS=99`)
## 文档
- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [Docker 文档](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md)
- [服务器 API 文档](https://github.com/ggml-org/llama.cpp/blob/master/tools/server/README.md)
## 许可证
llama.cpp 使用 MIT 许可证发布。详情请参阅 [LICENSE](https://github.com/ggml-org/llama.cpp/blob/master/LICENSE) 文件。


@@ -0,0 +1,210 @@
# Docker Compose configuration for llama.cpp
# https://github.com/ggml-org/llama.cpp
# LLM inference in C/C++ with support for various hardware accelerators
x-defaults: &defaults
  restart: unless-stopped
  logging:
    driver: json-file
    options:
      max-size: 100m
      max-file: "3"
services:
  # llama.cpp server - OpenAI-compatible API server
  # Variant: server (CPU), server-cuda (NVIDIA GPU), server-rocm (AMD GPU)
  llama-cpp-server:
    <<: *defaults
    image: ${GHCR_REGISTRY:-ghcr.io/}ggml-org/llama.cpp:${LLAMA_CPP_SERVER_VARIANT:-server}
    ports:
      - "${LLAMA_CPP_SERVER_PORT_OVERRIDE:-8080}:8080"
    volumes:
      - llama_cpp_models:/models
    command:
      - "-m"
      - "${LLAMA_CPP_MODEL_PATH:-/models/model.gguf}"
      - "--port"
      - "8080"
      - "--host"
      - "0.0.0.0"
      # -c/--ctx-size sets the context window; -n would cap generated tokens instead
      - "-c"
      - "${LLAMA_CPP_CONTEXT_SIZE:-512}"
      - "--n-gpu-layers"
      - "${LLAMA_CPP_GPU_LAYERS:-0}"
    environment:
      - TZ=${TZ:-UTC}
    healthcheck:
      # curl rather than wget: the upstream server image's own HEALTHCHECK uses curl
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: ${LLAMA_CPP_SERVER_CPU_LIMIT:-4.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_LIMIT:-8G}
        reservations:
          cpus: ${LLAMA_CPP_SERVER_CPU_RESERVATION:-2.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_RESERVATION:-4G}
    profiles:
      - server
  # llama.cpp server with NVIDIA GPU support
  llama-cpp-server-cuda:
    <<: *defaults
    image: ${GHCR_REGISTRY:-ghcr.io/}ggml-org/llama.cpp:server-cuda
    ports:
      - "${LLAMA_CPP_SERVER_PORT_OVERRIDE:-8080}:8080"
    volumes:
      - llama_cpp_models:/models
    command:
      - "-m"
      - "${LLAMA_CPP_MODEL_PATH:-/models/model.gguf}"
      - "--port"
      - "8080"
      - "--host"
      - "0.0.0.0"
      # -c/--ctx-size sets the context window; -n would cap generated tokens instead
      - "-c"
      - "${LLAMA_CPP_CONTEXT_SIZE:-512}"
      - "--n-gpu-layers"
      - "${LLAMA_CPP_GPU_LAYERS:-99}"
    environment:
      - TZ=${TZ:-UTC}
    healthcheck:
      # curl rather than wget: the upstream server image's own HEALTHCHECK uses curl
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: ${LLAMA_CPP_SERVER_CPU_LIMIT:-4.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_LIMIT:-8G}
        reservations:
          cpus: ${LLAMA_CPP_SERVER_CPU_RESERVATION:-2.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_RESERVATION:-4G}
          devices:
            - driver: nvidia
              count: ${LLAMA_CPP_GPU_COUNT:-1}
              capabilities: [gpu]
    profiles:
      - gpu
      - cuda
  # llama.cpp server with AMD ROCm GPU support
  llama-cpp-server-rocm:
    <<: *defaults
    image: ${GHCR_REGISTRY:-ghcr.io/}ggml-org/llama.cpp:server-rocm
    ports:
      - "${LLAMA_CPP_SERVER_PORT_OVERRIDE:-8080}:8080"
    volumes:
      - llama_cpp_models:/models
    devices:
      - /dev/kfd
      - /dev/dri
    command:
      - "-m"
      - "${LLAMA_CPP_MODEL_PATH:-/models/model.gguf}"
      - "--port"
      - "8080"
      - "--host"
      - "0.0.0.0"
      # -c/--ctx-size sets the context window; -n would cap generated tokens instead
      - "-c"
      - "${LLAMA_CPP_CONTEXT_SIZE:-512}"
      - "--n-gpu-layers"
      - "${LLAMA_CPP_GPU_LAYERS:-99}"
    environment:
      - TZ=${TZ:-UTC}
    healthcheck:
      # curl rather than wget: the upstream server image's own HEALTHCHECK uses curl
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
    deploy:
      resources:
        limits:
          cpus: ${LLAMA_CPP_SERVER_CPU_LIMIT:-4.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_LIMIT:-8G}
        reservations:
          cpus: ${LLAMA_CPP_SERVER_CPU_RESERVATION:-2.0}
          memory: ${LLAMA_CPP_SERVER_MEMORY_RESERVATION:-4G}
    profiles:
      - gpu
      - rocm
  # llama.cpp CLI (light) - Interactive command-line interface
  llama-cpp-cli:
    <<: *defaults
    image: ${GHCR_REGISTRY:-ghcr.io/}ggml-org/llama.cpp:${LLAMA_CPP_CLI_VARIANT:-light}
    volumes:
      - llama_cpp_models:/models
    entrypoint: /app/llama-cli
    command:
      - "-m"
      - "${LLAMA_CPP_MODEL_PATH:-/models/model.gguf}"
      - "-p"
      - "${LLAMA_CPP_PROMPT:-Hello, how are you?}"
      # Note: -n caps the number of generated tokens; it reuses
      # LLAMA_CPP_CONTEXT_SIZE here rather than setting the context window
      - "-n"
      - "${LLAMA_CPP_CONTEXT_SIZE:-512}"
    environment:
      - TZ=${TZ:-UTC}
    deploy:
      resources:
        limits:
          cpus: ${LLAMA_CPP_CLI_CPU_LIMIT:-2.0}
          memory: ${LLAMA_CPP_CLI_MEMORY_LIMIT:-4G}
        reservations:
          cpus: ${LLAMA_CPP_CLI_CPU_RESERVATION:-1.0}
          memory: ${LLAMA_CPP_CLI_MEMORY_RESERVATION:-2G}
    profiles:
      - cli
  # llama.cpp full - Complete toolkit including model conversion tools
  llama-cpp-full:
    <<: *defaults
    image: ${GHCR_REGISTRY:-ghcr.io/}ggml-org/llama.cpp:${LLAMA_CPP_FULL_VARIANT:-full}
    volumes:
      - llama_cpp_models:/models
    command: ["sleep", "infinity"]
    environment:
      - TZ=${TZ:-UTC}
    deploy:
      resources:
        limits:
          cpus: ${LLAMA_CPP_FULL_CPU_LIMIT:-2.0}
          memory: ${LLAMA_CPP_FULL_MEMORY_LIMIT:-4G}
        reservations:
          cpus: ${LLAMA_CPP_FULL_CPU_RESERVATION:-1.0}
          memory: ${LLAMA_CPP_FULL_MEMORY_RESERVATION:-2G}
    profiles:
      - full
volumes:
  llama_cpp_models: