# Nexa SDK

Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for various model types including LLM, VLM (Vision Language Models), TTS (Text-to-Speech), ASR (Automatic Speech Recognition), and more. Built with performance in mind, it supports both CPU and GPU acceleration.

## Features

- **Multi-Model Support**: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
- **OpenAI-Compatible API**: Provides standard OpenAI API endpoints for easy integration
- **GPU Acceleration**: Optional GPU support via NVIDIA CUDA for faster inference
- **Resource Management**: Configurable CPU/memory limits and GPU layer offloading
- **Model Caching**: Persistent model storage for faster startup
- **Profile Support**: Easy switching between CPU-only and GPU-accelerated modes

## Quick Start

### Prerequisites

- Docker and Docker Compose
- For GPU support: NVIDIA Docker runtime and a compatible GPU

### Basic Usage (CPU)

```bash
# Copy environment file
cp .env.example .env

# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct

# Start the service with the CPU profile
docker compose --profile cpu up -d
```

### GPU-Accelerated Usage

```bash
# Copy environment file
cp .env.example .env

# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1   # -1 means all layers on GPU

# Start the service with the GPU profile
docker compose --profile gpu up -d
```

## Configuration

### Environment Variables

| Variable                 | Default               | Description                                            |
| ------------------------ | --------------------- | ------------------------------------------------------ |
| `NEXA_SDK_VERSION`       | `latest`              | Nexa SDK Docker image version                          |
| `NEXA_SDK_PORT_OVERRIDE` | `8080`                | Host port for API access                               |
| `NEXA_MODEL`             | `gemma-2-2b-instruct` | Model to load (e.g., qwen3-4b, llama-3-8b, mistral-7b) |
| `NEXA_HOST`              | `0.0.0.0:8080`        | Server bind address                                    |
| `NEXA_KEEPALIVE`         | `300`                 | Model keepalive timeout in seconds                     |
| `NEXA_ORIGINS`           | `*`                   | CORS allowed origins                                   |
| `NEXA_HFTOKEN`           | -                     | HuggingFace token for private models                   |
| `NEXA_LOG`               | `none`                | Logging level (none, debug, info, warn, error)         |
| `NEXA_GPU_LAYERS`        | `-1`                  | GPU layers to offload (-1 = all, 0 = CPU only)         |
| `NEXA_SHM_SIZE`          | `2g`                  | Shared memory size                                     |
| `TZ`                     | `UTC`                 | Container timezone                                     |

### Resource Limits

| Variable                      | Default | Description        |
| ----------------------------- | ------- | ------------------ |
| `NEXA_SDK_CPU_LIMIT`          | `4.0`   | Maximum CPU cores  |
| `NEXA_SDK_MEMORY_LIMIT`       | `8G`    | Maximum memory     |
| `NEXA_SDK_CPU_RESERVATION`    | `2.0`   | Reserved CPU cores |
| `NEXA_SDK_MEMORY_RESERVATION` | `4G`    | Reserved memory    |

### Profiles

- `cpu`: Run with CPU-only inference (no profile is active by default, so one must always be passed explicitly)
- `gpu`: Run with GPU acceleration (requires an NVIDIA GPU)

## Usage Examples

### Test the API

```bash
# Check available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```

### Using Different Models

Edit `.env` to change the model:

```bash
# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b

# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b
```

### GPU Configuration

For GPU acceleration, adjust the number of offloaded layers:

```bash
# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1

# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30

# CPU only
NEXA_GPU_LAYERS=0
```

## Model Management

Models are automatically downloaded on first run and cached in the `nexa_models` volume. The default cache location inside the container is `/root/.cache/nexa`.

To use a different model:

1. Update `NEXA_MODEL` in `.env`
2. Apply the change by recreating the service: `docker compose --profile cpu up -d` (use `--profile gpu` if you run the GPU profile)

## API Endpoints

Nexa SDK provides OpenAI-compatible API endpoints:

- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings (see the example below)
- `GET /health` - Health check
- `GET /docs` - API documentation (Swagger UI)
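Because the endpoints follow the OpenAI request schema, an embeddings call looks like the sketch below. The model name is a placeholder: the default chat model may not serve embeddings, so you may first need to point `NEXA_MODEL` at an embedding-capable model.

```bash
# OpenAI-style embeddings request. "your-embedding-model" is a
# placeholder; substitute an embedding-capable model you have loaded.
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-embedding-model",
    "input": "Nexa SDK runs models locally."
  }'
```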
## Troubleshooting

### Out of Memory

Increase memory limits or use a smaller model:

```bash
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G

# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct
```

### GPU Not Detected

Ensure the NVIDIA Docker runtime is installed:

```bash
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
```

### Model Download Issues

Set a HuggingFace token if accessing private models:

```bash
NEXA_HFTOKEN=your_hf_token_here
```

### Slow Performance

- Use the GPU profile for better performance
- Increase `NEXA_GPU_LAYERS` to offload more computation to the GPU
- Allocate more resources or use a smaller model

## Advanced Configuration

### Custom Model Path

If you want to use local model files, mount them as a volume:

```yaml
volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa
```

Then reference the model by its path in the command.

### HTTPS Configuration

Set environment variables for HTTPS:

```bash
NEXA_ENABLEHTTPS=true
```

Mount certificate files:

```yaml
volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
```

## Health Check

The service includes a health check that verifies the API is responding:

```bash
curl http://localhost:8080/v1/models
```

For scripts that need to block until the service is up, see the appendix at the end of this document.

## License

Nexa SDK is developed by Nexa AI. Please refer to the [official repository](https://github.com/NexaAI/nexa-sdk) for license information.

## Links

- [Official Repository](https://github.com/NexaAI/nexa-sdk)
- [Nexa AI Website](https://nexa.ai)
- [Documentation](https://docs.nexa.ai)
- [Model Hub](https://sdk.nexa.ai)
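## Appendix: Scripted Readiness Check

The health check above can be wrapped in a small polling loop for scripts that must wait for the model to finish loading. This is a minimal sketch: the endpoint comes from this README, while the 60-second budget and 2-second interval are arbitrary assumptions, not SDK defaults.

```bash
#!/usr/bin/env bash
# Poll the models endpoint until it responds, for up to ~60 seconds.
# The timeout and interval are arbitrary choices, not SDK defaults.
for _ in $(seq 1 30); do
  if curl -fsS http://localhost:8080/v1/models >/dev/null 2>&1; then
    echo "Nexa SDK is ready."
    exit 0
  fi
  sleep 2
done
echo "Timed out waiting for Nexa SDK." >&2
exit 1
```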