# Nexa SDK
Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for a range of model types, including LLMs, VLMs (vision language models), TTS (text-to-speech), ASR (automatic speech recognition), and more. Built with performance in mind, it supports both CPU inference and GPU acceleration.
## Features
- Multi-Model Support: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
- OpenAI-Compatible API: Provides standard OpenAI API endpoints for easy integration
- GPU Acceleration: Optional GPU support via NVIDIA CUDA for faster inference
- Resource Management: Configurable CPU/memory limits and GPU layer offloading
- Model Caching: Persistent model storage for faster startup
- Profile Support: Easy switching between CPU-only and GPU-accelerated modes
## Quick Start
### Prerequisites
- Docker and Docker Compose
- For GPU support: NVIDIA Docker runtime and compatible GPU
### Basic Usage (CPU)

```bash
# Copy environment file
cp .env.example .env

# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct

# Start the service with the CPU profile
docker compose --profile cpu up -d
```
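To confirm the service came up cleanly, check its status and tail the logs. The service name `nexa-sdk` below is an assumption; substitute the name from your `docker-compose.yml`:

```bash
# Check container status and follow startup logs
# (service name "nexa-sdk" is assumed; adjust to your compose file)
docker compose --profile cpu ps
docker compose --profile cpu logs -f nexa-sdk
```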
### GPU-Accelerated Usage

```bash
# Copy environment file
cp .env.example .env

# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1  # -1 means all layers on GPU

# Start the service with the GPU profile
docker compose --profile gpu up -d
```
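Before relying on GPU inference, it is worth confirming the GPU is actually visible from inside the running container. A minimal check, again assuming the service is named `nexa-sdk`:

```bash
# Verify the NVIDIA GPU is visible inside the container
# (service name "nexa-sdk" is assumed)
docker compose --profile gpu exec nexa-sdk nvidia-smi
```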
## Configuration
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `NEXA_SDK_VERSION` | `latest` | Nexa SDK Docker image version |
| `NEXA_SDK_PORT_OVERRIDE` | `8080` | Host port for API access |
| `NEXA_MODEL` | `gemma-2-2b-instruct` | Model to load (e.g., `qwen3-4b`, `llama-3-8b`, `mistral-7b`) |
| `NEXA_HOST` | `0.0.0.0:8080` | Server bind address |
| `NEXA_KEEPALIVE` | `300` | Model keepalive timeout in seconds |
| `NEXA_ORIGINS` | `*` | CORS allowed origins |
| `NEXA_HFTOKEN` | - | HuggingFace token for private models |
| `NEXA_LOG` | `none` | Logging level (`none`, `debug`, `info`, `warn`, `error`) |
| `NEXA_GPU_LAYERS` | `-1` | GPU layers to offload (`-1` = all, `0` = CPU only) |
| `NEXA_SHM_SIZE` | `2g` | Shared memory size |
| `TZ` | `UTC` | Container timezone |
### Resource Limits
| Variable | Default | Description |
|---|---|---|
| `NEXA_SDK_CPU_LIMIT` | `4.0` | Maximum CPU cores |
| `NEXA_SDK_MEMORY_LIMIT` | `8G` | Maximum memory |
| `NEXA_SDK_CPU_RESERVATION` | `2.0` | Reserved CPU cores |
| `NEXA_SDK_MEMORY_RESERVATION` | `4G` | Reserved memory |
### Profiles
- `cpu`: Run with CPU-only inference (a profile must always be specified; there is no default)
- `gpu`: Run with GPU acceleration (requires an NVIDIA GPU)
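Since every `docker compose` invocation must name a profile, switching modes means bringing one profile down and the other up:

```bash
# Switch from CPU-only to GPU-accelerated mode
docker compose --profile cpu down
docker compose --profile gpu up -d
```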
## Usage Examples
### Test the API
```bash
# Check available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
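Most OpenAI-compatible servers also honor the standard `stream` flag for incremental token delivery; assuming Nexa SDK does too, a streaming request looks like this:

```bash
# Streaming chat completion (server-sent events); assumes the standard
# OpenAI "stream" flag is supported by the server
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```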
### Using Different Models

Edit `.env` to change the model:
```bash
# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b

# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b
```
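If you switch models often, a one-liner can update `.env` and recreate the service in one step. A sketch assuming GNU `sed` and the CPU profile:

```bash
# Swap the model in .env, then recreate the container so it takes effect
# (assumes GNU sed; use whichever profile you run)
sed -i 's/^NEXA_MODEL=.*/NEXA_MODEL=qwen3-4b/' .env
docker compose --profile cpu up -d
```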
### GPU Configuration

For GPU acceleration, adjust the number of layers offloaded to the GPU:
```bash
# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1

# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30

# CPU only
NEXA_GPU_LAYERS=0
```
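When tuning a hybrid split, watching VRAM usage while the model loads shows how many layers actually fit:

```bash
# Poll GPU memory every 2 seconds to find a layer count that fits
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2
```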
## Model Management
Models are automatically downloaded on first run and cached in the `nexa_models` volume. The default cache location inside the container is `/root/.cache/nexa`.
To use a different model:

- Update `NEXA_MODEL` in `.env`
- Recreate the service so the new value takes effect: `docker compose --profile <cpu|gpu> up -d`
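To see what is already cached and how much disk it uses, you can look inside the volume. The service name `nexa-sdk` is an assumption, and the volume may carry your compose project name as a prefix:

```bash
# List cached models and their sizes (service name "nexa-sdk" is assumed)
docker compose --profile cpu exec nexa-sdk du -sh /root/.cache/nexa/*

# Inspect the backing volume (may be prefixed with the project name)
docker volume inspect nexa_models
```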
## API Endpoints
Nexa SDK provides OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /health` - Health check
- `GET /docs` - API documentation (Swagger UI)
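As a quick check of the non-chat endpoints, here is an embeddings request in the standard OpenAI shape; the model name is a placeholder, and the endpoint needs an embedding-capable model loaded:

```bash
# Text embeddings request (model name is a placeholder; an
# embedding-capable model must be loaded)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "input": "Hello, world!"
  }'
```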
## Troubleshooting
### Out of Memory
Increase memory limits or use a smaller model:
```bash
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G

# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct
```
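To see whether the container is actually hitting its limit, `docker stats` shows live memory usage against the configured cap:

```bash
# One-shot snapshot of memory usage vs. the configured limit
docker stats --no-stream
```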
### GPU Not Detected
Ensure the NVIDIA Docker runtime is installed:

```bash
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
```
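You can also confirm that Docker has the NVIDIA runtime registered at all:

```bash
# The output should list "nvidia" among the available runtimes
docker info | grep -i runtime
```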
### Model Download Issues
Set a HuggingFace token if accessing private models:

```bash
NEXA_HFTOKEN=your_hf_token_here
```
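To verify the token itself is valid before digging further, you can query HuggingFace's whoami endpoint directly:

```bash
# A valid token returns your account details; an invalid one returns 401
curl -s -H "Authorization: Bearer $NEXA_HFTOKEN" \
  https://huggingface.co/api/whoami-v2
```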
### Slow Performance
- Use the GPU profile for better performance
- Increase `NEXA_GPU_LAYERS` to offload more computation to the GPU
- Allocate more resources or use a smaller model
## Advanced Configuration
### Custom Model Path

If you want to use local model files, mount them as a volume:

```yaml
volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa
```

Then reference the model by its container path in the service command.
### HTTPS Configuration

Set environment variables for HTTPS:

```bash
NEXA_ENABLEHTTPS=true
```

Mount certificate files:

```yaml
volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
```
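With HTTPS enabled and a self-signed certificate, `curl` needs `-k` (or your CA bundle) to connect:

```bash
# -k skips certificate verification for self-signed certs
curl -k https://localhost:8080/v1/models
```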
## Health Check

The service includes a health check that verifies the API is responding:

```bash
curl http://localhost:8080/v1/models
```
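In scripts, a small wait loop against the same endpoint is a handy way to block until the model has finished loading:

```bash
# Block until the API responds, e.g. before running integration tests
until curl -sf http://localhost:8080/v1/models > /dev/null; do
  echo "waiting for Nexa SDK..."
  sleep 2
done
echo "API is up"
```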
## License
Nexa SDK is developed by Nexa AI. Please refer to the official repository for license information.