# llama.cpp

[中文文档](README.zh.md)

[llama.cpp](https://github.com/ggml-org/llama.cpp) is a high-performance C/C++ implementation for LLM inference with support for various hardware accelerators.

## Features

- **Fast Inference**: Optimized C/C++ implementation for efficient LLM inference
- **Multiple Backends**: CPU, CUDA (NVIDIA), ROCm (AMD), MUSA (Moore Threads), Intel GPU, Vulkan
- **OpenAI-compatible API**: Server mode with an OpenAI-compatible REST API
- **CLI Support**: Interactive command-line interface for quick testing
- **Model Conversion**: Full toolkit includes tools to convert and quantize models
- **GGUF Format**: Support for the efficient GGUF model format
- **Cross-platform**: Linux (x86-64, ARM64, s390x), Windows, macOS

## Prerequisites

- Docker and Docker Compose installed
- At least 4GB of RAM (8GB+ recommended)
- For GPU variants:
  - **CUDA**: NVIDIA GPU with [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit)
  - **ROCm**: AMD GPU with proper ROCm drivers
  - **MUSA**: Moore Threads GPU with mt-container-toolkit
- GGUF format model file (e.g., from [Hugging Face](https://huggingface.co/models?library=gguf))

## Quick Start

### 1. Server Mode (CPU)

```bash
# Copy and configure environment
cp .env.example .env

# Edit .env and set your model path
# LLAMA_CPP_MODEL_PATH=/models/your-model.gguf

# Place your GGUF model in a directory, then update docker-compose.yaml
# to mount it, e.g.:
# volumes:
#   - ./models:/models

# Start the server
docker compose --profile server up -d

# Test the server (OpenAI-compatible API)
curl http://localhost:8080/v1/models

# Chat completion request
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

### 2. Server Mode with NVIDIA GPU

```bash
# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU

# Start GPU-accelerated server
docker compose --profile cuda up -d

# The server will automatically use the NVIDIA GPU
```

### 3. Server Mode with AMD GPU

```bash
# Edit .env
# Set LLAMA_CPP_GPU_LAYERS=99 to offload all layers to GPU

# Start GPU-accelerated server
docker compose --profile rocm up -d

# The server will automatically use the AMD GPU
```

### 4. CLI Mode

```bash
# Edit .env and configure model path and prompt

# Run CLI
docker compose --profile cli up

# For interactive mode, use:
docker compose run --rm llama-cpp-cli \
  -m /models/your-model.gguf \
  -p "Your prompt here" \
  -n 512
```

### 5. Full Toolkit (Model Conversion)

```bash
# Start the full container
docker compose --profile full up -d

# Execute commands inside the container
docker compose exec llama-cpp-full bash

# Inside the container, you can use the conversion tools
# Example: Convert a Hugging Face model
# python3 convert_hf_to_gguf.py /models/source-model --outfile /models/output.gguf
```

## Configuration

### Environment Variables

Key environment variables (see [.env.example](.env.example) for all options):

| Variable                         | Description                                                    | Default              |
| -------------------------------- | -------------------------------------------------------------- | -------------------- |
| `LLAMA_CPP_SERVER_VARIANT`       | Server image variant (server, server-cuda, server-rocm, etc.)  | `server`             |
| `LLAMA_CPP_MODEL_PATH`           | Model file path inside container                                | `/models/model.gguf` |
| `LLAMA_CPP_CONTEXT_SIZE`         | Context window size in tokens                                   | `512`                |
| `LLAMA_CPP_GPU_LAYERS`           | Number of layers to offload to GPU (0=CPU only, 99=all)         | `0`                  |
| `LLAMA_CPP_SERVER_PORT_OVERRIDE` | Server port on host                                             | `8080`               |
| `LLAMA_CPP_SERVER_MEMORY_LIMIT`  | Memory limit for server                                         | `8G`                 |
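Putting these together, a `.env` for a CUDA-accelerated server might look like the sketch below. The values are illustrative only; adjust them to your model and hardware, and treat [.env.example](.env.example) as the authoritative reference.

```bash
# Illustrative .env sketch (values are examples, not recommendations)

# Which server image to run: server (CPU), server-cuda, server-rocm, ...
LLAMA_CPP_SERVER_VARIANT=server-cuda

# Path of the GGUF model as seen inside the container
LLAMA_CPP_MODEL_PATH=/models/model.gguf

# Context window in tokens; larger values need more RAM/VRAM
LLAMA_CPP_CONTEXT_SIZE=4096

# 0 = CPU only, 99 = offload all layers to the GPU
LLAMA_CPP_GPU_LAYERS=99

# Host port and memory limit for the server container
LLAMA_CPP_SERVER_PORT_OVERRIDE=8080
LLAMA_CPP_SERVER_MEMORY_LIMIT=8G
```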
### Available Profiles

- `server`: CPU-only server
- `cuda`: NVIDIA GPU server (requires nvidia-container-toolkit)
- `rocm`: AMD GPU server (requires ROCm)
- `cli`: Command-line interface
- `full`: Full toolkit with model conversion tools
- `gpu`: Generic GPU profile (includes cuda and rocm)

### Image Variants

Each variant comes in multiple flavors:

- **server**: Only the `llama-server` executable (API server)
- **light**: Only the `llama-cli` and `llama-completion` executables
- **full**: Complete toolkit including model conversion tools

Backend options:

- Base (CPU)
- `-cuda` (NVIDIA GPU)
- `-rocm` (AMD GPU)
- `-musa` (Moore Threads GPU)
- `-intel` (Intel GPU with SYCL)
- `-vulkan` (Vulkan GPU)

## Server API

The server provides an OpenAI-compatible API:

- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion
- `POST /v1/embeddings` - Generate embeddings

See the [llama.cpp server documentation](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md) for full API details.
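Beyond the chat request shown in the Quick Start, the remaining endpoints can be exercised with plain `curl`. The request bodies below follow the OpenAI-style shapes the server advertises; the exact fields accepted can differ between llama.cpp versions, so treat this as a sketch rather than a complete reference:

```bash
# Health check (returns a small JSON status object)
curl http://localhost:8080/health

# Text completion (OpenAI-style request body)
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "The capital of France is",
    "max_tokens": 16
  }'

# Embeddings (the server may need to be started with embeddings enabled)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Hello, world!"
  }'
```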
## Model Sources

Download GGUF models from:

- [Hugging Face GGUF Models](https://huggingface.co/models?library=gguf)
- [TheBloke's GGUF Collection](https://huggingface.co/TheBloke)
- Convert your own models using the full toolkit

Popular quantization formats:

- `Q4_K_M`: Good balance of quality and size (recommended)
- `Q5_K_M`: Higher quality, larger size
- `Q8_0`: Very high quality, large size
- `Q2_K`: Smallest size, lower quality

## Resource Requirements

Minimum requirements by model size:

| Model Size | RAM (CPU) | VRAM (GPU) | Context Size |
| ---------- | --------- | ---------- | ------------ |
| 7B Q4_K_M  | 6GB       | 4GB        | 2048         |
| 13B Q4_K_M | 10GB      | 8GB        | 2048         |
| 34B Q4_K_M | 24GB      | 20GB       | 2048         |
| 70B Q4_K_M | 48GB      | 40GB       | 2048         |

Larger context sizes require proportionally more memory.

## Performance Tuning

For CPU inference:

- Increase `LLAMA_CPP_SERVER_CPU_LIMIT` for more cores
- Optimize threads with the `-t` flag (default: auto)

For GPU inference:

- Set `LLAMA_CPP_GPU_LAYERS=99` to offload all layers
- Increase context size for longer conversations
- Monitor GPU memory usage

## Security Notes

- The server binds to `0.0.0.0` by default - ensure proper network security
- No authentication is enabled by default
- Consider using a reverse proxy (nginx, Caddy) for production deployments
- Limit resource usage to prevent system exhaustion

## Troubleshooting

### Out of Memory

- Reduce `LLAMA_CPP_CONTEXT_SIZE`
- Use a smaller quantized model (e.g., Q4 instead of Q8)
- Reduce `LLAMA_CPP_GPU_LAYERS` if using GPU

### GPU Not Detected

**NVIDIA**: Verify nvidia-container-toolkit is installed:

```bash
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
```

**AMD**: Ensure the ROCm drivers are installed and that `/dev/kfd` and `/dev/dri` are accessible.

### Slow Inference

- Check CPU/GPU utilization
- Increase resource limits in `.env`
- For GPU: Verify all layers are offloaded (`LLAMA_CPP_GPU_LAYERS=99`)

## Documentation

- [llama.cpp GitHub](https://github.com/ggml-org/llama.cpp)
- [Docker Documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/docker.md)
- [Server API Docs](https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md)

## License

llama.cpp is released under the MIT License. See the [LICENSE](https://github.com/ggml-org/llama.cpp/blob/master/LICENSE) file for details.