
Nexa SDK

Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for various model types including LLM, VLM (Vision Language Models), TTS (Text-to-Speech), ASR (Automatic Speech Recognition), and more. Built with performance in mind, it supports both CPU and GPU acceleration.

Features

  • Multi-Model Support: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
  • OpenAI-Compatible API: Provides standard OpenAI API endpoints for easy integration
  • GPU Acceleration: Optional GPU support via NVIDIA CUDA for faster inference
  • Resource Management: Configurable CPU/memory limits and GPU layer offloading
  • Model Caching: Persistent model storage for faster startup
  • Profile Support: Easy switching between CPU-only and GPU-accelerated modes

Quick Start

Prerequisites

  • Docker and Docker Compose
  • For GPU support: NVIDIA Docker runtime and compatible GPU

Basic Usage (CPU)

# Copy environment file
cp .env.example .env

# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct

# Start the service with CPU profile
docker compose --profile cpu up -d
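
On first start, Nexa downloads the model, which can take several minutes. To watch progress, follow the container logs (the service name nexa-sdk is an assumption here; check docker compose ps for the actual name):

# Follow logs while the model downloads
docker compose --profile cpu logs -f nexa-sdk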

GPU-Accelerated Usage

# Copy environment file
cp .env.example .env

# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1  # -1 means all layers on GPU

# Start the service with GPU profile
docker compose --profile gpu up -d
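
To confirm the container actually sees the GPU (again assuming the service is named nexa-sdk, and that the NVIDIA runtime makes nvidia-smi available inside the container):

# List GPUs visible from inside the running container
docker compose --profile gpu exec nexa-sdk nvidia-smi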

Configuration

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| NEXA_SDK_VERSION | latest | Nexa SDK Docker image version |
| NEXA_SDK_PORT_OVERRIDE | 8080 | Host port for API access |
| NEXA_MODEL | gemma-2-2b-instruct | Model to load (e.g., qwen3-4b, llama-3-8b, mistral-7b) |
| NEXA_HOST | 0.0.0.0:8080 | Server bind address |
| NEXA_KEEPALIVE | 300 | Model keep-alive timeout in seconds |
| NEXA_ORIGINS | * | CORS allowed origins |
| NEXA_HFTOKEN | - | HuggingFace token for private models |
| NEXA_LOG | none | Logging level (none, debug, info, warn, error) |
| NEXA_GPU_LAYERS | -1 | GPU layers to offload (-1 = all, 0 = CPU only) |
| NEXA_SHM_SIZE | 2g | Shared memory size |
| TZ | UTC | Container timezone |
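
For reference, a minimal .env built from the variables above might look like this (values are illustrative):

# Example .env: small model, verbose logging, all layers on GPU
NEXA_MODEL=qwen3-4b
NEXA_SDK_PORT_OVERRIDE=8080
NEXA_KEEPALIVE=300
NEXA_LOG=info
NEXA_GPU_LAYERS=-1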

Resource Limits

| Variable | Default | Description |
|----------|---------|-------------|
| NEXA_SDK_CPU_LIMIT | 4.0 | Maximum CPU cores |
| NEXA_SDK_MEMORY_LIMIT | 8G | Maximum memory |
| NEXA_SDK_CPU_RESERVATION | 2.0 | Reserved CPU cores |
| NEXA_SDK_MEMORY_RESERVATION | 4G | Reserved memory |

Profiles

  • cpu: Run with CPU-only inference (no profile is active by default, so one must always be specified)
  • gpu: Run with GPU acceleration (requires NVIDIA GPU)
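
Run only one profile at a time; to switch modes, bring the current profile down before starting the other:

# Switch from CPU-only to GPU-accelerated mode
docker compose --profile cpu down
docker compose --profile gpu up -d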

Usage Examples

Test the API

# Check available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
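
For token-by-token output, OpenAI-compatible servers generally accept a stream flag; assuming Nexa SDK honors it, the same request can be streamed:

# Stream the response as server-sent events (assumes "stream" is supported)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'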

Using Different Models

Edit .env to change the model:

# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b

# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b

GPU Configuration

For GPU acceleration, adjust the number of layers:

# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1

# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30

# CPU only
NEXA_GPU_LAYERS=0

Model Management

Models are automatically downloaded on first run and cached in the nexa_models volume. The default cache location inside the container is /root/.cache/nexa.

To use a different model:

  1. Update NEXA_MODEL in .env
  2. Recreate the service so the change takes effect: docker compose --profile <cpu|gpu> up -d (a plain restart does not pick up .env changes), as in the example below
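
For example, switching to qwen3-4b on the CPU profile (GNU sed shown; edit .env by hand on macOS):

# Update the model in .env, then recreate the container
sed -i 's/^NEXA_MODEL=.*/NEXA_MODEL=qwen3-4b/' .env
docker compose --profile cpu up -d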

API Endpoints

Nexa SDK provides OpenAI-compatible API endpoints:

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completions
  • POST /v1/completions - Text completions
  • POST /v1/embeddings - Text embeddings
  • GET /health - Health check
  • GET /docs - API documentation (Swagger UI)
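
As a quick smoke test of the embeddings endpoint (the model name here is illustrative; use an embedding-capable model you have loaded):

# Request an embedding vector for one input string
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-instruct", "input": "Hello, world"}'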

Troubleshooting

Out of Memory

Increase memory limits or use a smaller model:

NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G
# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct

GPU Not Detected

Ensure NVIDIA Docker runtime is installed:

# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
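
If that test fails, installing and registering the NVIDIA Container Toolkit usually resolves it (Debian/Ubuntu sketch, assuming NVIDIA's apt repository is already configured):

# Install the toolkit, wire it into Docker, and restart the daemon
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker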

Model Download Issues

Set HuggingFace token if accessing private models:

NEXA_HFTOKEN=your_hf_token_here

Slow Performance

  • Use the GPU profile for faster inference
  • Increase NEXA_GPU_LAYERS to offload more computation to the GPU
  • Allocate more resources or switch to a smaller model

Advanced Configuration

Custom Model Path

If you want to use local model files, mount them as a volume:

volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa

Then reference the model by its in-container path (via NEXA_MODEL or the container command) instead of a hub name.
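
The exact path format Nexa expects for local files is an assumption here; check the upstream documentation:

# .env — load a mounted local file instead of downloading (path format assumed)
NEXA_MODEL=/models/my-model.gguf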

HTTPS Configuration

Set environment variables for HTTPS:

NEXA_ENABLEHTTPS=true

Mount certificate files:

volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
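
Then verify the API over TLS (use -k to skip verification for self-signed certificates):

# Check the API over HTTPS
curl -k https://localhost:8080/v1/models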

Health Check

The service includes a health check that verifies the API is responding:

curl http://localhost:8080/v1/models

License

Nexa SDK is developed by Nexa AI. Please refer to the official repository for license information.