# Nexa SDK
Nexa SDK is a comprehensive toolkit for running AI models locally. It provides inference for a range of model types, including LLMs, VLMs (vision-language models), TTS (text-to-speech), ASR (automatic speech recognition), and more, with support for both CPU inference and GPU acceleration.
## Features
- **Multi-Model Support**: Run LLM, VLM, TTS, ASR, embedding, reranking, and image generation models
- **OpenAI-Compatible API**: Provides standard OpenAI API endpoints for easy integration
- **GPU Acceleration**: Optional GPU support via NVIDIA CUDA for faster inference
- **Resource Management**: Configurable CPU/memory limits and GPU layer offloading
- **Model Caching**: Persistent model storage for faster startup
- **Profile Support**: Easy switching between CPU-only and GPU-accelerated modes
## Quick Start
### Prerequisites
- Docker and Docker Compose
- For GPU support: NVIDIA Docker runtime and compatible GPU
### Basic Usage (CPU)
```bash
# Copy environment file
cp .env.example .env
# Edit .env to configure your model and settings
# NEXA_MODEL=gemma-2-2b-instruct
# Start the service with CPU profile
docker compose --profile cpu up -d
```
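On the first start, the model is downloaded into the cache volume, which can take a while. You can follow progress in the logs and then confirm the API is up:
```bash
# Follow startup logs (the first run downloads the model)
docker compose --profile cpu logs -f

# Once the server is listening, the models endpoint should respond
curl http://localhost:8080/v1/models
```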
### GPU-Accelerated Usage
```bash
# Copy environment file
cp .env.example .env
# Configure for GPU usage
# NEXA_MODEL=gemma-2-2b-instruct
# NEXA_GPU_LAYERS=-1 # -1 means all layers on GPU
# Start the service with GPU profile
docker compose --profile gpu up -d
```
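To verify that the container actually sees the GPU, run `nvidia-smi` inside it (the service name `nexa-sdk` is an assumption here; substitute the name from your compose file):
```bash
# List GPUs visible inside the running container (service name assumed)
docker compose --profile gpu exec nexa-sdk nvidia-smi
```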
## Configuration
### Environment Variables
| Variable | Default | Description |
| ------------------------ | --------------------- | ------------------------------------------------------ |
| `NEXA_SDK_VERSION` | `latest` | Nexa SDK Docker image version |
| `NEXA_SDK_PORT_OVERRIDE` | `8080` | Host port for API access |
| `NEXA_MODEL` | `gemma-2-2b-instruct` | Model to load (e.g., qwen3-4b, llama-3-8b, mistral-7b) |
| `NEXA_HOST` | `0.0.0.0:8080` | Server bind address |
| `NEXA_KEEPALIVE` | `300` | Model keepalive timeout in seconds |
| `NEXA_ORIGINS` | `*` | CORS allowed origins |
| `NEXA_HFTOKEN` | - | HuggingFace token for private models |
| `NEXA_LOG` | `none` | Logging level (none, debug, info, warn, error) |
| `NEXA_GPU_LAYERS` | `-1` | GPU layers to offload (-1 = all, 0 = CPU only) |
| `NEXA_SHM_SIZE` | `2g` | Shared memory size |
| `TZ` | `UTC` | Container timezone |
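As a starting point, a minimal `.env` combining the variables above might look like this (values are illustrative):
```bash
NEXA_SDK_VERSION=latest
NEXA_SDK_PORT_OVERRIDE=8080
NEXA_MODEL=gemma-2-2b-instruct
NEXA_KEEPALIVE=300
NEXA_LOG=info
NEXA_GPU_LAYERS=-1   # only relevant when running the gpu profile
TZ=UTC
```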
### Resource Limits
| Variable | Default | Description |
| ----------------------------- | ------- | ------------------ |
| `NEXA_SDK_CPU_LIMIT` | `4.0` | Maximum CPU cores |
| `NEXA_SDK_MEMORY_LIMIT` | `8G` | Maximum memory |
| `NEXA_SDK_CPU_RESERVATION` | `2.0` | Reserved CPU cores |
| `NEXA_SDK_MEMORY_RESERVATION` | `4G` | Reserved memory |
### Profiles
- `cpu`: CPU-only inference (note that a profile must always be specified; there is no default)
- `gpu`: GPU-accelerated inference (requires an NVIDIA GPU); see the switching example below
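Switching between the two is a matter of stopping one profile and starting the other:
```bash
# Move from CPU-only to GPU-accelerated inference
docker compose --profile cpu down
docker compose --profile gpu up -d
```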
## Usage Examples
### Test the API
```bash
# Check available models
curl http://localhost:8080/v1/models
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [
      {"role": "user", "content": "Hello!"}
    ]
  }'
```
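Because the API follows OpenAI conventions, streaming responses should work with the standard `stream` flag; this is a sketch based on that assumption rather than a verified Nexa-specific feature:
```bash
# Stream the response token-by-token (OpenAI-style streaming, assumed supported)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-2-2b-instruct",
    "messages": [{"role": "user", "content": "Write a haiku about containers."}],
    "stream": true
  }'
```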
### Using Different Models
Edit `.env` to change the model:
```bash
# Small models for limited resources
NEXA_MODEL=gemma-2-2b-instruct
# or
NEXA_MODEL=qwen3-4b
# Larger models for better quality
NEXA_MODEL=llama-3-8b
# or
NEXA_MODEL=mistral-7b
```
### GPU Configuration
For GPU acceleration, adjust the number of layers:
```bash
# Offload all layers to GPU (fastest)
NEXA_GPU_LAYERS=-1
# Offload 30 layers (hybrid mode)
NEXA_GPU_LAYERS=30
# CPU only
NEXA_GPU_LAYERS=0
```
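While experimenting with layer counts, it is useful to watch VRAM usage on the host as the model loads:
```bash
# Refresh GPU memory usage every second; raise NEXA_GPU_LAYERS until VRAM is nearly full
watch -n 1 nvidia-smi
```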
## Model Management
Models are automatically downloaded on first run and cached in the `nexa_models` volume. The default cache location inside the container is `/root/.cache/nexa`.
To use a different model:
1. Update `NEXA_MODEL` in `.env`
2. Recreate the service so the change is picked up: `docker compose --profile <cpu|gpu> up -d` (a plain `restart` does not re-read `.env`)
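For example, to switch to `qwen3-4b` on the GPU profile:
```bash
# After setting NEXA_MODEL=qwen3-4b in .env, recreate the container
docker compose --profile gpu up -d

# The new model is downloaded into the nexa_models volume on first use
curl http://localhost:8080/v1/models
```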
## API Endpoints
Nexa SDK provides OpenAI-compatible API endpoints; a couple of quick checks follow the list:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions
- `POST /v1/completions` - Text completions
- `POST /v1/embeddings` - Text embeddings
- `GET /health` - Health check
- `GET /docs` - API documentation (Swagger UI)
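For instance, using the default model:
```bash
# Liveness probe
curl http://localhost:8080/health

# Plain text completion (OpenAI-style request shape)
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma-2-2b-instruct", "prompt": "Once upon a time", "max_tokens": 32}'
```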
## Troubleshooting
### Out of Memory
Increase memory limits or use a smaller model:
```bash
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SDK_MEMORY_RESERVATION=8G
# Or switch to a smaller model
NEXA_MODEL=gemma-2-2b-instruct
```
### GPU Not Detected
Ensure NVIDIA Docker runtime is installed:
```bash
# Check GPU availability
docker run --rm --gpus all nvidia/cuda:12.8.1-base-ubuntu22.04 nvidia-smi
```
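If that check fails, the NVIDIA Container Toolkit is usually the missing piece. On most systemd-based distributions, registering the runtime looks roughly like this (package installation itself varies by distribution; see NVIDIA's documentation):
```bash
# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```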
### Model Download Issues
Set HuggingFace token if accessing private models:
```bash
NEXA_HFTOKEN=your_hf_token_here
```
### Slow Performance
- Use GPU profile for better performance
- Increase `NEXA_GPU_LAYERS` to offload more computation to GPU
- Allocate more resources or use a smaller model (see the example `.env` below)
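As a concrete example, a `.env` tuned for a GPU host with headroom might combine these settings (values are illustrative):
```bash
NEXA_GPU_LAYERS=-1        # offload all layers to the GPU
NEXA_SDK_CPU_LIMIT=8.0    # more cores for tokenization and request handling
NEXA_SDK_MEMORY_LIMIT=16G
NEXA_SHM_SIZE=4g
```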
## Advanced Configuration
### Custom Model Path
If you want to use local model files, mount them as a volume:
```yaml
volumes:
  - ./models:/models
  - nexa_models:/root/.cache/nexa
```
Then reference the model by its mounted path (under `/models`) in the container's command; refer to the official documentation for the exact syntax.
### HTTPS Configuration
Set environment variables for HTTPS:
```bash
NEXA_ENABLEHTTPS=true
```
Mount certificate files:
```yaml
volumes:
  - ./certs/cert.pem:/app/cert.pem:ro
  - ./certs/key.pem:/app/key.pem:ro
```
## Health Check
The service includes a health check that verifies the API is responding:
```bash
curl http://localhost:8080/v1/models
```
## License
Nexa SDK is developed by Nexa AI. Please refer to the [official repository](https://github.com/NexaAI/nexa-sdk) for license information.
## Links
- [Official Repository](https://github.com/NexaAI/nexa-sdk)
- [Nexa AI Website](https://nexa.ai)
- [Documentation](https://docs.nexa.ai)
- [Model Hub](https://sdk.nexa.ai)