
llama.cpp

Chinese documentation: README.zh.md

llama.cpp is a high-performance C/C++ implementation for LLM inference with support for various hardware accelerators.

Features

  • Fast Inference: Optimized C/C++ implementation for efficient LLM inference
  • Multiple Backends: CPU, CUDA (NVIDIA), ROCm (AMD), MUSA (Moore Threads), Intel GPU, Vulkan
  • OpenAI-compatible API: Server mode with OpenAI-compatible REST API
  • CLI Support: Interactive command-line interface for quick testing
  • Model Conversion: the full image variant includes tools to convert and quantize models
  • GGUF Format: Support for the efficient GGUF model format
  • Cross-platform: Linux (x86-64, ARM64, s390x), Windows, macOS

Prerequisites

  • Docker and Docker Compose installed
  • At least 4GB of RAM (8GB+ recommended)
  • For GPU variants:
    • CUDA: NVIDIA GPU with nvidia-container-toolkit
    • ROCm: AMD GPU with proper ROCm drivers
    • MUSA: Moore Threads GPU with mt-container-toolkit
  • GGUF format model file (e.g., from Hugging Face)

Quick Start

1. Server Mode (CPU)

# Copy and configure environment
cp .env.example .env

# Edit .env and set your model path
# LLAMA_CPP_MODEL_PATH=/models/your-model.gguf

# Place your GGUF model in a directory, then update docker-compose.yaml
# to mount it, e.g.:
# volumes:
#   - ./models:/models

# Start the server
docker compose --profile server up -d

# Test the server (OpenAI-compatible API)
curl http://localhost:8080/v1/models

# Chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
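For longer requests it can help to keep the JSON body in a file and validate it locally before sending. The fields below (system message, temperature, max_tokens) are standard OpenAI-style parameters, and the file path is illustrative:

```shell
# Write the request body to a file and check that it parses as JSON.
cat > /tmp/chat-request.json <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "temperature": 0.7,
  "max_tokens": 128
}
EOF
python3 -m json.tool /tmp/chat-request.json > /dev/null && echo "request body OK"

# Then send it to the running server:
# curl http://localhost:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d @/tmp/chat-request.json
```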

2. Server Mode with NVIDIA GPU

# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU

# Start GPU-accelerated server
docker compose --profile cuda up -d

# The server will automatically use NVIDIA GPU
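For reference, NVIDIA-accelerated Compose services need a GPU device reservation. The repository's docker-compose.yaml should already wire this up for the cuda profile; a minimal sketch of the block involved:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]
```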

3. Server Mode with AMD GPU

# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU

# Start GPU-accelerated server
docker compose --profile rocm up -d

# The server will automatically use AMD GPU
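ROCm containers typically need the AMD kernel driver and render device nodes passed through from the host. The rocm profile in this repo's docker-compose.yaml should already include an equivalent of the following; shown here as a reference sketch:

```yaml
devices:
  - /dev/kfd   # AMD kernel fusion driver
  - /dev/dri   # GPU render nodes
group_add:
  - video
```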

4. CLI Mode

# Edit .env and configure model path and prompt

# Run CLI
docker compose --profile cli up

# To run with custom arguments (add -i for interactive chat), use:
docker compose run --rm llama-cpp-cli \
  -m /models/your-model.gguf \
  -p "Your prompt here" \
  -n 512

5. Full Toolkit (Model Conversion)

# Start the full container
docker compose --profile full up -d

# Execute commands inside the container
docker compose exec llama-cpp-full bash

# Inside container, you can use conversion tools
# Example: Convert a Hugging Face model
# python3 convert_hf_to_gguf.py /models/source-model --outfile /models/output.gguf
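The usual flow is to convert first, then quantize with llama-quantize. A dry-run sketch of the two steps, with illustrative paths and Q4_K_M as the example target:

```shell
# Illustrative paths; run the commented commands inside the llama-cpp-full container.
SRC=/models/source-model
F16=/models/model-f16.gguf
OUT=/models/model-Q4_K_M.gguf

# 1. Convert the Hugging Face checkpoint to GGUF at f16:
#      python3 convert_hf_to_gguf.py "$SRC" --outfile "$F16"
# 2. Quantize down to Q4_K_M:
#      ./llama-quantize "$F16" "$OUT" Q4_K_M
echo "plan: $SRC -> $F16 -> $OUT"
```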

Configuration

Environment Variables

Key environment variables (see .env.example for all options):

| Variable | Description | Default |
|----------|-------------|---------|
| LLAMA_CPP_SERVER_VARIANT | Server image variant (server, server-cuda, server-rocm, etc.) | server |
| LLAMA_CPP_MODEL_PATH | Model file path inside the container | /models/model.gguf |
| LLAMA_CPP_CONTEXT_SIZE | Context window size in tokens | 512 |
| LLAMA_CPP_GPU_LAYERS | Number of layers to offload to GPU (0 = CPU only, 99 = all) | 0 |
| LLAMA_CPP_SERVER_PORT_OVERRIDE | Server port on the host | 8080 |
| LLAMA_CPP_SERVER_MEMORY_LIMIT | Memory limit for the server | 8G |
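As a concrete illustration, a .env for a CUDA server with all layers offloaded might combine these variables as follows (the model filename and values are placeholders; see .env.example for the authoritative list):

```env
LLAMA_CPP_SERVER_VARIANT=server-cuda
LLAMA_CPP_MODEL_PATH=/models/llama-7b-Q4_K_M.gguf
LLAMA_CPP_CONTEXT_SIZE=4096
LLAMA_CPP_GPU_LAYERS=99
LLAMA_CPP_SERVER_PORT_OVERRIDE=8080
LLAMA_CPP_SERVER_MEMORY_LIMIT=8G
```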

Available Profiles

  • server: CPU-only server
  • cuda: NVIDIA GPU server (requires nvidia-container-toolkit)
  • rocm: AMD GPU server (requires ROCm)
  • cli: Command-line interface
  • full: Full toolkit with model conversion tools
  • gpu: Generic GPU profile (includes cuda and rocm)

Image Variants

Images come in three flavors, each available for every backend listed below:

  • server: Only llama-server executable (API server)
  • light: Only llama-cli and llama-completion executables
  • full: Complete toolkit including model conversion tools

Backend options:

  • Base (CPU)
  • -cuda (NVIDIA GPU)
  • -rocm (AMD GPU)
  • -musa (Moore Threads GPU)
  • -intel (Intel GPU with SYCL)
  • -vulkan (Vulkan GPU)

Server API

The server provides an OpenAI-compatible API:

  • GET /health - Health check
  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion
  • POST /v1/completions - Text completion
  • POST /v1/embeddings - Generate embeddings

See the llama.cpp server documentation for full API details.

Model Sources

Download GGUF models from sources such as Hugging Face.

Popular quantization formats:

  • Q4_K_M: Good balance of quality and size (recommended)
  • Q5_K_M: Higher quality, larger size
  • Q8_0: Very high quality, large size
  • Q2_K: Smallest size, lower quality
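A rough way to sanity-check download sizes: file size is approximately parameter count times bits per weight, divided by 8. Q4_K_M averages roughly 4.8 bits per weight (an approximation; exact sizes vary by architecture):

```shell
PARAMS=7000000000   # 7B parameters
BITS_X10=48         # ~4.8 bits/weight, scaled by 10 to stay in integer math
SIZE_MIB=$((PARAMS * BITS_X10 / 10 / 8 / 1024 / 1024))
echo "~${SIZE_MIB} MiB for a 7B model at Q4_K_M"
```

This lands around 4 GiB, which matches the common size of 7B Q4_K_M files.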

Resource Requirements

Minimum requirements by model size:

| Model Size | RAM (CPU) | VRAM (GPU) | Context Size |
|------------|-----------|------------|--------------|
| 7B Q4_K_M  | 6GB       | 4GB        | 2048 |
| 13B Q4_K_M | 10GB      | 8GB        | 2048 |
| 34B Q4_K_M | 24GB      | 20GB       | 2048 |
| 70B Q4_K_M | 48GB      | 40GB       | 2048 |

Larger context sizes require proportionally more memory.
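The context-size cost is mostly KV cache, which can be estimated as 2 (K and V) times layers times hidden dimension times bytes per element, per token. A sketch for a typical 7B-class model (32 layers, 4096 hidden dim, f16 cache; all figures approximate, and models with grouped-query attention need less):

```shell
LAYERS=32
DIM=4096
BYTES=2      # f16 cache entries
CTX=2048
PER_TOKEN=$((2 * LAYERS * DIM * BYTES))          # K + V bytes per token
TOTAL_MIB=$((PER_TOKEN * CTX / 1024 / 1024))
echo "KV cache: ${PER_TOKEN} bytes/token, ~${TOTAL_MIB} MiB at context ${CTX}"
```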

Performance Tuning

For CPU inference:

  • Increase LLAMA_CPP_SERVER_CPU_LIMIT for more cores
  • Optimize threads with -t flag (default: auto)

For GPU inference:

  • Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers
  • Increase context size for longer conversations
  • Monitor GPU memory usage

Security Notes

  • The server binds to 0.0.0.0 by default - ensure proper network security
  • No authentication is enabled by default
  • Consider using a reverse proxy (nginx, Caddy) for production deployments
  • Limit resource usage to prevent system exhaustion
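If the server must be reachable beyond localhost, putting it behind a reverse proxy with basic auth is a common pattern. A minimal illustrative nginx block (the hostname, TLS setup, and htpasswd path are placeholders):

```nginx
server {
    listen 443 ssl;
    server_name llama.example.com;   # placeholder hostname

    location / {
        auth_basic           "Restricted";
        auth_basic_user_file /etc/nginx/.htpasswd;
        proxy_pass           http://127.0.0.1:8080;
        proxy_set_header     Host $host;
    }
}
```

Alternatively, bind the published port to loopback only in docker-compose.yaml (for example "127.0.0.1:8080:8080") so only local processes can reach the server.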

Troubleshooting

Out of Memory

  • Reduce LLAMA_CPP_CONTEXT_SIZE
  • Use a smaller quantized model (e.g., Q4 instead of Q8)
  • Reduce LLAMA_CPP_GPU_LAYERS if using GPU

GPU Not Detected

NVIDIA: Verify nvidia-container-toolkit is installed:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

AMD: Ensure ROCm drivers and /dev/kfd, /dev/dri are accessible.
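A quick host-side check for the AMD device nodes (runs on any Linux host; "missing" simply means the node is absent):

```shell
for dev in /dev/kfd /dev/dri; do
  if [ -e "$dev" ]; then
    echo "$dev: present"
  else
    echo "$dev: missing"
  fi
done
```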

Slow Inference

  • Check CPU/GPU utilization
  • Increase resource limits in .env
  • For GPU: Verify all layers are offloaded (LLAMA_CPP_GPU_LAYERS=99)

Documentation

See the upstream llama.cpp repository for complete server and tool documentation.

License

llama.cpp is released under the MIT License. See the LICENSE file for details.