# Nexa SDK
Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for a range of model types, including LLMs, VLMs (vision-language models), TTS (text-to-speech), ASR (automatic speech recognition), and more, with support for both CPU inference and GPU acceleration.
## Features
- **Multi-Model Support**: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
- **OpenAI-Compatible API**: Provides standard OpenAI API endpoints for easy integration
- **GPU Acceleration**: Optional GPU support via NVIDIA CUDA for faster inference
- **Resource Management**: Configurable CPU/memory limits and GPU layer offloading
- **Model Caching**: Persistent model storage for faster startup
- **Profile Support**: Easy switching between CPU-only and GPU-accelerated modes
## Quick Start
### Prerequisites
- Docker and Docker Compose
- For GPU support: NVIDIA Docker runtime and compatible GPU
### Basic Usage (CPU)
```bash
# Copy environment file
cp .env.example .env
# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct
# Start the service with CPU profile
docker compose --profile cpu up -d
```
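On the first start, the model is downloaded into the cache volume, which can take a while. You can follow progress in the logs and then confirm the API is up:
```bash
# Follow startup logs (the first run downloads the model)
docker compose --profile cpu logs -f

# Once the server is listening, the models endpoint should respond
curl http://localhost:8080/v1/models
```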
### GPU-Accelerated Usage
```bash
# Copy environment file
cp .env.example .env
# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1 # -1 means all layers on GPU
# Start the service with GPU profile
docker compose --profile gpu up -d
```
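To verify that the container actually sees the GPU, run `nvidia-smi` inside it (the service name `nexa-sdk` is an assumption here; substitute the name from your compose file):
```bash
# List GPUs visible inside the running container (service name assumed)
docker compose --profile gpu exec nexa-sdk nvidia-smi
```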
## Configuration
### Environment Variables
| Variable | Default | Description |
| ------------------------ | --------------------- | ------------------------------------------------------ |
| `NEXA_SDK_VERSION` | `latest` | Nexa SDK Docker image version |
| `NEXA_SDK_PORT_OVERRIDE` | `8080` | Host port for API access |
| `NEXA_MODEL` | `gemma-2-2b-instruct` | Model to load (e.g., qwen3-4b, llama-3-8b, mistral-7b) |
| `NEXA_HOST` | `0.0.0.0:8080` | Server bind address |
| `NEXA_KEEPALIVE` | `300` | Model keepalive timeout in seconds |
| `NEXA_ORIGINS` | `*` | CORS allowed origins |
| `NEXA_HFTOKEN` | - | HuggingFace token for private models |
| `NEXA_LOG` | `none` | Logging level (none, debug, info, warn, error) |
| `NEXA_GPU_LAYERS` | `-1` | GPU layers to offload (-1 = all, 0 = CPU only) |
| `NEXA_SHM_SIZE` | `2g` | Shared memory size |
| `TZ` | `UTC` | Container timezone |
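As a starting point, a minimal `.env` combining the variables above might look like this (values are illustrative):
```bash
NEXA_SDK_VERSION=latest
NEXA_SDK_PORT_OVERRIDE=8080
NEXA_MODEL=gemma-2-2b-instruct
NEXA_KEEPALIVE=300
NEXA_LOG=info
NEXA_GPU_LAYERS=-1   # only relevant when running the gpu profile
TZ=UTC
```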
### Resource Limits
| Variable | Default | Description |
| ----------------------------- | ------- | ------------------ |
| `NEXA_SDK_CPU_LIMIT` | `4.0` | Maximum CPU cores |
| `NEXA_SDK_MEMORY_LIMIT` | `8G` | Maximum memory |
| `NEXA_SDK_CPU_RESERVATION` | `2.0` | Reserved CPU cores |
| `NEXA_SDK_MEMORY_RESERVATION` | `4G` | Reserved memory |
### Profiles
- `cpu`: CPU-only inference (note that a profile must always be specified; there is no default)
- `gpu`: GPU-accelerated inference (requires an NVIDIA GPU); see the switching example below
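Switching between the two is a matter of stopping one profile and starting the other:
```bash
# Move from CPU-only to GPU-accelerated inference
docker compose --profile cpu down
docker compose --profile gpu up -d
```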
## Usage Examples
### Test the API
```bash
# Check available models
curl http://localhost:8080/v1/models
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
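Because the API follows OpenAI conventions, streaming responses should work with the standard `stream` flag; this is a sketch based on that assumption rather than a verified Nexa-specific feature:
```bash
# Stream the response token-by-token (OpenAI-style streaming, assumed supported)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about containers."}],
    "stream": true
  }'
```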
### Using Different Models
Edit `.env` to change the model:
```bash
# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b
# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b
```
### GPU Configuration
For GPU acceleration, adjust the number of layers:
```bash
# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1
# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30
# CPU only
NEXA_GPU_LAYERS=0
```
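While experimenting with layer counts, it is useful to watch VRAM usage on the host as the model loads:
```bash
# Refresh GPU memory usage every second; raise NEXA_GPU_LAYERS until VRAM is nearly full
watch -n 1 nvidia-smi
```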
## Model Management
Models are automatically downloaded on first run and cached in the `nexa_models` volume. The default cache location inside the container is `/root/.cache/nexa`.
To use a different model:
1. Update `NEXA_MODEL` in `.env`
2. Recreate the service so the change is picked up: `docker compose --profile <cpu|gpu> up -d` (a plain `restart` does not re-read `.env`)
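For example, to switch to `qwen3-4b` on the GPU profile:
```bash
# After setting NEXA_MODEL=qwen3-4b in .env, recreate the container
docker compose --profile gpu up -d

# The new model is downloaded into the nexa_models volume on first use
curl http://localhost:8080/v1/models
```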
## API Endpoints
Nexa SDK provides OpenAI-compatible API endpoints; a couple of quick checks follow the list:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /health` - Health check
- `GET /docs` - API documentation (Swagger UI)
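For instance, using the default model:
```bash
# Liveness probe
curl http://localhost:8080/health

# Plain text completion (OpenAI-style request shape)
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-instruct", "prompt": "Once upon a time", "max_tokens": 32}'
```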
## Troubleshooting
### Out of Memory
Increase memory limits or use a smaller model:
```bash
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G
# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct
```
### GPU Not Detected
Ensure NVIDIA Docker runtime is installed:
```bash
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
```
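If that check fails, the NVIDIA Container Toolkit is usually the missing piece. On most systemd-based distributions, registering the runtime looks roughly like this (package installation itself varies by distribution; see NVIDIA's documentation):
```bash
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```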
### Model Download Issues
Set HuggingFace token if accessing private models:
```bash
NEXA_HFTOKEN=your_hf_token_here
```
### Slow Performance
- Use GPU profile for better performance
- Increase `NEXA_GPU_LAYERS` to offload more computation to GPU
- Allocate more resources or use a smaller model (see the example `.env` below)
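As a concrete example, a `.env` tuned for a GPU host with headroom might combine these settings (values are illustrative):
```bash
NEXA_GPU_LAYERS=-1        # offload all layers to the GPU
NEXA_SDK_CPU_LIMIT=8.0    # more cores for tokenization and request handling
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SHM_SIZE=4g
```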
## Advanced Configuration
### Custom Model Path
If you want to use local model files, mount them as a volume:
```yaml
volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa
```
Then reference the model by its mounted path (under `/models`) in the container's command; refer to the official documentation for the exact syntax.
### HTTPS Configuration
Set environment variables for HTTPS:
```bash
NEXA_ENABLEHTTPS=true
```
Mount certificate files:
```yaml
volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
```
## Health Check
The service includes a health check that verifies the API is responding:
```bash
curl http://localhost:8080/v1/models
```
## License
Nexa SDK is developed by Nexa AI. Please refer to the [official repository](https://github.com/NexaAI/nexa-sdk) for license information.
## Links
- [Official Repository](https://github.com/NexaAI/nexa-sdk)
- [Nexa AI Website](https://nexa.ai)
- [Documentation](https://docs.nexa.ai)
- [Model Hub](https://sdk.nexa.ai)