# NexaSDK

This service deploys NexaSDK in Docker to run AI models behind an OpenAI-compatible REST API. It supports LLM, VLM, embedding, reranking, computer-vision (CV), and ASR models.
## Features
- OpenAI-compatible API: Drop-in replacement for OpenAI API endpoints
- Multiple Model Types: LLM, VLM, Embeddings, Reranking, CV, ASR
- GPU Acceleration: CUDA support for NVIDIA GPUs
- NPU Support: Optimized for Qualcomm NPU on ARM64
## Supported Models
| Modality | Models |
|---|---|
| LLM | NexaAI/LFM2-1.2B-npu, NexaAI/Granite-4.0-h-350M-NPU |
| VLM | NexaAI/OmniNeural-4B |
| Embedding | NexaAI/embeddinggemma-300m-npu, NexaAI/EmbedNeural |
| Rerank | NexaAI/jina-v2-rerank-npu |
| CV | NexaAI/yolov12-npu, NexaAI/convnext-tiny-npu-IoT |
| ASR | NexaAI/parakeet-tdt-0.6b-v3-npu |
## Usage

### CPU Mode

```shell
docker compose --profile cpu up -d
```

### GPU Mode (CUDA)

```shell
docker compose --profile gpu up -d nexa-sdk-cuda
```

### Pull a Model

```shell
docker exec -it nexa-sdk nexa pull NexaAI/Granite-4.0-h-350M-NPU
```

### Interactive CLI

```shell
docker exec -it nexa-sdk nexa infer NexaAI/Granite-4.0-h-350M-NPU
```
## API Examples

- Chat completions:

  ```shell
  curl -X POST http://localhost:18181/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NexaAI/Granite-4.0-h-350M-NPU",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
  ```

- Embeddings:

  ```shell
  curl -X POST http://localhost:18181/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NexaAI/EmbedNeural",
      "input": "Hello, world!"
    }'
  ```

- Swagger UI: visit http://localhost:18181/docs/ui
## Services

- `nexa-sdk`: CPU-based NexaSDK service (default)
- `nexa-sdk-cuda`: GPU-accelerated service with CUDA support (profile: `gpu`)
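The two services map onto Compose profiles, which is why the `--profile` flag selects which one starts. A rough sketch of the shape such a compose file could take (the image name, tag suffix, and container data path below are assumptions for illustration, not taken from this repo's actual compose file):

```yaml
services:
  nexa-sdk:
    image: nexaai/nexa-sdk:${NEXA_SDK_VERSION:-v0.2.65}   # image name is an assumption
    profiles: ["cpu"]
    ports:
      - "${NEXA_SDK_PORT_OVERRIDE:-18181}:18181"
    environment:
      - NEXA_TOKEN=${NEXA_TOKEN}
      - TZ=${TZ:-UTC}
    volumes:
      - nexa_data:/data                                   # container path is an assumption

  nexa-sdk-cuda:
    image: nexaai/nexa-sdk:${NEXA_SDK_VERSION:-v0.2.65}   # CUDA variant tag is an assumption
    profiles: ["gpu"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  nexa_data:
```

Because each service carries a profile, `docker compose up -d` with no profile starts neither; you must opt in with `--profile cpu` or `--profile gpu` as shown above.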
## Configuration

| Variable | Description | Default |
|---|---|---|
| `NEXA_SDK_VERSION` | NexaSDK image version | `v0.2.65` |
| `NEXA_SDK_PORT_OVERRIDE` | Host port for the REST API | `18181` |
| `NEXA_TOKEN` | Nexa API token (required) | - |
| `TZ` | Timezone | `UTC` |
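A minimal `.env` covering the variables above might look like this (the token value is a placeholder; the others show the defaults):

```shell
# .env — NEXA_TOKEN value is a placeholder, replace with your own
NEXA_SDK_VERSION=v0.2.65
NEXA_SDK_PORT_OVERRIDE=18181
NEXA_TOKEN=your-token-here
TZ=UTC
```

Docker Compose reads this file automatically when it sits next to the compose file.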
## Volumes

- `nexa_data`: stores downloaded models and data
## Getting a Token

- Create an account at sdk.nexa.ai
- Go to Deployment → Create Token
- Copy the token into your `.env` file