# Nexa SDK
Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for a range of model types, including LLMs, VLMs (vision language models), TTS (text-to-speech), ASR (automatic speech recognition), and more. Built with performance in mind, it supports both CPU inference and GPU acceleration.
## Features
- Multi-Model Support: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
- OpenAI-Compatible API: Provides standard OpenAI API endpoints for easy integration
- GPU Acceleration: Optional GPU support via NVIDIA CUDA for faster inference
- Resource Management: Configurable CPU/memory limits and GPU layer offloading
- Model Caching: Persistent model storage for faster startup
- Profile Support: Easy switching between CPU-only and GPU-accelerated modes
## Quick Start
### Prerequisites
- Docker and Docker Compose
- For GPU support: NVIDIA Docker runtime and compatible GPU
### Basic Usage (CPU)

```bash
# Copy environment file
cp .env.example .env

# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct

# Start the service with the CPU profile
docker compose --profile cpu up -d
```
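To confirm the service came up cleanly, check its status and tail the logs. The service name `nexa-sdk` below is an assumption; substitute the name from your `docker-compose.yml`:

```bash
# Check container status and follow startup logs
# (service name "nexa-sdk" is assumed; adjust to your compose file)
docker compose --profile cpu ps
docker compose --profile cpu logs -f nexa-sdk
```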
### GPU-Accelerated Usage

```bash
# Copy environment file
cp .env.example .env

# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1  # -1 means all layers on GPU

# Start the service with the GPU profile
docker compose --profile gpu up -d
```
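Before relying on GPU inference, it is worth confirming the GPU is actually visible from inside the running container. A minimal check, again assuming the service is named `nexa-sdk`:

```bash
# Verify the NVIDIA GPU is visible inside the container
# (service name "nexa-sdk" is assumed)
docker compose --profile gpu exec nexa-sdk nvidia-smi
```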
## Configuration
### Environment Variables
| Variable | Default | Description |
|---|---|---|
| `NEXA_SDK_VERSION` | `latest` | Nexa SDK Docker image version |
| `NEXA_SDK_PORT_OVERRIDE` | `8080` | Host port for API access |
| `NEXA_MODEL` | `gemma-2-2b-instruct` | Model to load (e.g., `qwen3-4b`, `llama-3-8b`, `mistral-7b`) |
| `NEXA_HOST` | `0.0.0.0:8080` | Server bind address |
| `NEXA_KEEPALIVE` | `300` | Model keepalive timeout in seconds |
| `NEXA_ORIGINS` | `*` | CORS allowed origins |
| `NEXA_HFTOKEN` | - | HuggingFace token for private models |
| `NEXA_LOG` | `none` | Logging level (`none`, `debug`, `info`, `warn`, `error`) |
| `NEXA_GPU_LAYERS` | `-1` | GPU layers to offload (`-1` = all, `0` = CPU only) |
| `NEXA_SHM_SIZE` | `2g` | Shared memory size |
| `TZ` | `UTC` | Container timezone |
### Resource Limits
| Variable | Default | Description |
|---|---|---|
| `NEXA_SDK_CPU_LIMIT` | `4.0` | Maximum CPU cores |
| `NEXA_SDK_MEMORY_LIMIT` | `8G` | Maximum memory |
| `NEXA_SDK_CPU_RESERVATION` | `2.0` | Reserved CPU cores |
| `NEXA_SDK_MEMORY_RESERVATION` | `4G` | Reserved memory |
### Profiles
- `cpu`: Run with CPU-only inference (a profile must always be specified; there is no default)
- `gpu`: Run with GPU acceleration (requires an NVIDIA GPU)
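Since every `docker compose` invocation must name a profile, switching modes means bringing one profile down and the other up:

```bash
# Switch from CPU-only to GPU-accelerated mode
docker compose --profile cpu down
docker compose --profile gpu up -d
```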
## Usage Examples
### Test the API
```bash
# Check available models
curl http://localhost:8080/v1/models

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
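Most OpenAI-compatible servers also honor the standard `stream` flag for incremental token delivery; assuming Nexa SDK does too, a streaming request looks like this:

```bash
# Streaming chat completion (server-sent events); assumes the standard
# OpenAI "stream" flag is supported by the server
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'
```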
### Using Different Models

Edit `.env` to change the model:
```bash
# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b

# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b
```
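If you switch models often, a one-liner can update `.env` and recreate the service in one step. A sketch assuming GNU `sed` and the CPU profile:

```bash
# Swap the model in .env, then recreate the container so it takes effect
# (assumes GNU sed; use whichever profile you run)
sed -i 's/^NEXA_MODEL=.*/NEXA_MODEL=qwen3-4b/' .env
docker compose --profile cpu up -d
```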
### GPU Configuration

For GPU acceleration, adjust the number of layers offloaded to the GPU:
```bash
# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1

# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30

# CPU only
NEXA_GPU_LAYERS=0
```
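When tuning a hybrid split, watching VRAM usage while the model loads shows how many layers actually fit:

```bash
# Poll GPU memory every 2 seconds to find a layer count that fits
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2
```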
## Model Management
Models are automatically downloaded on first run and cached in the `nexa_models` volume. The default cache location inside the container is `/root/.cache/nexa`.
To use a different model:

- Update `NEXA_MODEL` in `.env`
- Recreate the service so the new value takes effect: `docker compose --profile <cpu|gpu> up -d`
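To see what is already cached and how much disk it uses, you can look inside the volume. The service name `nexa-sdk` is an assumption, and the volume may carry your compose project name as a prefix:

```bash
# List cached models and their sizes (service name "nexa-sdk" is assumed)
docker compose --profile cpu exec nexa-sdk du -sh /root/.cache/nexa/*

# Inspect the backing volume (may be prefixed with the project name)
docker volume inspect nexa_models
```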
## API Endpoints
Nexa SDK provides OpenAI-compatible API endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /health` - Health check
- `GET /docs` - API documentation (Swagger UI)
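As a quick check of the non-chat endpoints, here is an embeddings request in the standard OpenAI shape; the model name is a placeholder, and the endpoint needs an embedding-capable model loaded:

```bash
# Text embeddings request (model name is a placeholder; an
# embedding-capable model must be loaded)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "input": "Hello, world!"
  }'
```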
## Troubleshooting
### Out of Memory
Increase memory limits or use a smaller model:
```bash
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G

# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct
```
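To see whether the container is actually hitting its limit, `docker stats` shows live memory usage against the configured cap:

```bash
# One-shot snapshot of memory usage vs. the configured limit
docker stats --no-stream
```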
### GPU Not Detected
Ensure the NVIDIA Docker runtime is installed:

```bash
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
```
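You can also confirm that Docker has the NVIDIA runtime registered at all:

```bash
# The output should list "nvidia" among the available runtimes
docker info | grep -i runtime
```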
### Model Download Issues
Set a HuggingFace token if accessing private models:

```bash
NEXA_HFTOKEN=your_hf_token_here
```
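To verify the token itself is valid before digging further, you can query HuggingFace's whoami endpoint directly:

```bash
# A valid token returns your account details; an invalid one returns 401
curl -s -H "Authorization: Bearer $NEXA_HFTOKEN" \
  https://huggingface.co/api/whoami-v2
```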
### Slow Performance
- Use the GPU profile for better performance
- Increase `NEXA_GPU_LAYERS` to offload more computation to the GPU
- Allocate more resources or use a smaller model
## Advanced Configuration
### Custom Model Path

If you want to use local model files, mount them as a volume:

```yaml
volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa
```

Then reference the model by its container path in the service command.
### HTTPS Configuration

Set environment variables for HTTPS:

```bash
NEXA_ENABLEHTTPS=true
```

Mount certificate files:

```yaml
volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
```
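With HTTPS enabled and a self-signed certificate, `curl` needs `-k` (or your CA bundle) to connect:

```bash
# -k skips certificate verification for self-signed certs
curl -k https://localhost:8080/v1/models
```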
## Health Check

The service includes a health check that verifies the API is responding:

```bash
curl http://localhost:8080/v1/models
```
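In scripts, a small wait loop against the same endpoint is a handy way to block until the model has finished loading:

```bash
# Block until the API responds, e.g. before running integration tests
until curl -sf http://localhost:8080/v1/models > /dev/null; do
  echo "waiting for Nexa SDK..."
  sleep 2
done
echo "API is up"
```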
## License
Nexa SDK is developed by Nexa AI. Please refer to the official repository for license information.