# NexaSDK

This service deploys NexaSDK in Docker to run AI models behind an OpenAI-compatible REST API. It supports LLM, VLM, embedding, reranking, computer-vision (CV), and ASR models.
## Features
- OpenAI-compatible API: Drop-in replacement for OpenAI API endpoints
- Multiple Model Types: LLM, VLM, Embeddings, Reranking, CV, ASR
- GPU Acceleration: CUDA support for NVIDIA GPUs
- NPU Support: Optimized for Qualcomm NPU on ARM64
## Supported Models
| Modality | Models |
|---|---|
| LLM | NexaAI/LFM2-1.2B-npu, NexaAI/Granite-4.0-h-350M-NPU |
| VLM | NexaAI/OmniNeural-4B |
| Embedding | NexaAI/embeddinggemma-300m-npu, NexaAI/EmbedNeural |
| Rerank | NexaAI/jina-v2-rerank-npu |
| CV | NexaAI/yolov12-npu, NexaAI/convnext-tiny-npu-IoT |
| ASR | NexaAI/parakeet-tdt-0.6b-v3-npu |
## Usage

### CPU Mode

```shell
docker compose --profile cpu up -d
```

### GPU Mode (CUDA)

```shell
docker compose --profile gpu up -d nexa-sdk-cuda
```

### Pull a Model

```shell
docker exec -it nexa-sdk nexa pull NexaAI/Granite-4.0-h-350M-NPU
```

### Interactive CLI

```shell
docker exec -it nexa-sdk nexa infer NexaAI/Granite-4.0-h-350M-NPU
```
## API Examples

- Chat completions:

  ```shell
  curl -X POST http://localhost:18181/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NexaAI/Granite-4.0-h-350M-NPU",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
  ```

- Embeddings:

  ```shell
  curl -X POST http://localhost:18181/v1/embeddings \
    -H "Content-Type: application/json" \
    -d '{
      "model": "NexaAI/EmbedNeural",
      "input": "Hello, world!"
    }'
  ```

- Swagger UI: visit http://localhost:18181/docs/ui
## Services

- `nexa-sdk`: CPU-based NexaSDK service (default)
- `nexa-sdk-cuda`: GPU-accelerated service with CUDA support (profile: `gpu`)
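The two services map onto Compose profiles, which is why the `--profile` flag selects which one starts. A rough sketch of the shape such a compose file could take (the image name, tag suffix, and container data path below are assumptions for illustration, not taken from this repo's actual compose file):

```yaml
services:
  nexa-sdk:
    image: nexaai/nexa-sdk:${NEXA_SDK_VERSION:-v0.2.65}   # image name is an assumption
    profiles: ["cpu"]
    ports:
      - "${NEXA_SDK_PORT_OVERRIDE:-18181}:18181"
    environment:
      - NEXA_TOKEN=${NEXA_TOKEN}
      - TZ=${TZ:-UTC}
    volumes:
      - nexa_data:/data                                   # container path is an assumption

  nexa-sdk-cuda:
    image: nexaai/nexa-sdk:${NEXA_SDK_VERSION:-v0.2.65}   # CUDA variant tag is an assumption
    profiles: ["gpu"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  nexa_data:
```

Because each service carries a profile, `docker compose up -d` with no profile starts neither; you must opt in with `--profile cpu` or `--profile gpu` as shown above.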
## Configuration

| Variable | Description | Default |
|---|---|---|
| `NEXA_SDK_VERSION` | NexaSDK image version | `v0.2.65` |
| `NEXA_SDK_PORT_OVERRIDE` | Host port for the REST API | `18181` |
| `NEXA_TOKEN` | Nexa API token (required) | - |
| `TZ` | Timezone | `UTC` |
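A minimal `.env` covering the variables above might look like this (the token value is a placeholder; the others show the defaults):

```shell
# .env — NEXA_TOKEN value is a placeholder, replace with your own
NEXA_SDK_VERSION=v0.2.65
NEXA_SDK_PORT_OVERRIDE=18181
NEXA_TOKEN=your-token-here
TZ=UTC
```

Docker Compose reads this file automatically when it sits next to the compose file.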
## Volumes

- `nexa_data`: stores downloaded models and data
## Getting a Token

- Create an account at sdk.nexa.ai
- Go to Deployment → Create Token
- Copy the token into your `.env` file