# vLLM

[English](./README.md) | [中文](./README.zh.md)

This service deploys vLLM, a high-throughput and memory-efficient inference and serving engine for LLMs.

## Services

- `vllm`: vLLM OpenAI-compatible API server
## Environment Variables

| Variable Name        | Description                            | Default Value       |
| -------------------- | -------------------------------------- | ------------------- |
| VLLM_VERSION         | vLLM image version                     | `v0.8.0`            |
| VLLM_MODEL           | Model name or path                     | `facebook/opt-125m` |
| VLLM_MAX_MODEL_LEN   | Maximum context length (in tokens)     | `2048`              |
| VLLM_GPU_MEMORY_UTIL | GPU memory utilization (0.0-1.0)       | `0.9`               |
| HF_TOKEN             | Hugging Face token for model downloads | `""`                |
| VLLM_PORT_OVERRIDE   | Host port mapping                      | `8000`              |

Modify the `.env` file as needed for your use case.
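
For example, a minimal `.env` that simply mirrors the defaults from the table above might look like this (illustrative values, not tuned recommendations):

```bash
# Example .env using the default values from the table above.
VLLM_VERSION=v0.8.0
VLLM_MODEL=facebook/opt-125m
VLLM_MAX_MODEL_LEN=2048
VLLM_GPU_MEMORY_UTIL=0.9
HF_TOKEN=
VLLM_PORT_OVERRIDE=8000
```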
## Volumes

- `vllm_models`: Cached model files from Hugging Face

## GPU Support

This service requires an NVIDIA GPU to run properly. Uncomment the GPU configuration in `docker-compose.yaml`:

```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: 1
          capabilities: [gpu]
runtime: nvidia
```
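
Before starting the service, you can check that containers can actually see the GPU. This assumes the NVIDIA Container Toolkit is installed on the host; the CUDA image tag below is only an example:

```bash
# Sanity check: run nvidia-smi inside a throwaway CUDA container.
# The image tag is illustrative; any CUDA base image your host supports will do.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```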
## Usage

### Start vLLM

```bash
docker compose up -d
```
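
The first start downloads the model, which can take a while. You can follow progress in the container logs (`vllm` is the service name defined above):

```bash
# Follow the vLLM container logs; the API is ready once the server reports it is listening.
docker compose logs -f vllm
```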
### Access

- API Endpoint: <http://localhost:8000>
- OpenAI-compatible API: <http://localhost:8000/v1>
### Test the API

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "prompt": "San Francisco is a",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```
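
The `model` field must match the model the server loaded. You can check what is currently being served via the OpenAI-compatible `/v1/models` endpoint:

```bash
# List the served models; the "id" returned here is the value to use in the "model" field.
curl http://localhost:8000/v1/models
```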
### Chat Completions

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
```
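
The chat endpoint also supports token streaming in the usual OpenAI style; setting `"stream": true` returns the response incrementally as server-sent events:

```bash
# Stream the response as server-sent events instead of waiting for the full completion.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [{"role": "user", "content": "Tell me a short joke."}],
    "stream": true
  }'
```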
## Supported Models

vLLM supports a wide range of models:

- **LLaMA**: LLaMA, LLaMA-2, LLaMA-3
- **Mistral**: Mistral, Mixtral
- **Qwen**: Qwen, Qwen2
- **Yi**: Yi, Yi-VL
- **Many others**: See [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html)
To use a different model, change the `VLLM_MODEL` environment variable:

```bash
# Example: Use Qwen2-7B-Instruct
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
```
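
For gated repositories (for example the Llama family), also set `HF_TOKEN` so the model can be downloaded. The model name below is only an illustration:

```bash
# Gated models require a Hugging Face token with access to the repository.
# The model name is illustrative; use any gated model your token can access.
VLLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
HF_TOKEN="hf_xxxxxxxxxxxxxxxxxxxxxxxx"
```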
## Performance Tuning

### GPU Memory

Adjust GPU memory utilization based on your model size and available VRAM:

```bash
VLLM_GPU_MEMORY_UTIL=0.85  # Use 85% of GPU memory
```
### Context Length

Set the maximum context length according to your needs:

```bash
VLLM_MAX_MODEL_LEN=4096  # Support up to 4K tokens
```
### Shared Memory

For larger models, increase shared memory in `docker-compose.yaml`:

```yaml
shm_size: 8g  # Increase to 8GB
```
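
Putting these together, a tuning sketch in `.env` for a 7B-class model might look like the following. The numbers are purely illustrative and depend on your GPU; they are not measured recommendations:

```bash
# Illustrative settings for a 7B-class model on a single ~24 GB GPU (assumption).
# Adjust to your hardware and workload.
VLLM_MODEL="Qwen/Qwen2-7B-Instruct"
VLLM_MAX_MODEL_LEN=4096
VLLM_GPU_MEMORY_UTIL=0.9
```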
## Notes

- Requires an NVIDIA GPU with CUDA support
- Model downloads can be large (several GB to 100+ GB)
- The first startup may take some time while the model is downloaded
- Ensure sufficient GPU memory for the model you want to run
- The default model (`facebook/opt-125m`, 125M parameters) is small and intended for testing
## Security

- The API has no authentication by default
- Add an authentication layer (e.g., nginx with basic auth) for production
- Restrict network access to trusted sources, for example by binding the published port to localhost (see the sketch below)
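
One way to keep the API off the network is to publish the port only on the loopback interface. This is a sketch, e.g. for a `docker-compose.override.yaml`, and assumes the service publishes container port 8000 with a standard `ports` mapping driven by `VLLM_PORT_OVERRIDE`:

```yaml
# Sketch: publish the API only on localhost instead of all interfaces.
# Assumes the vllm service maps container port 8000 via VLLM_PORT_OVERRIDE.
services:
  vllm:
    ports:
      - "127.0.0.1:${VLLM_PORT_OVERRIDE:-8000}:8000"
```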
## License

vLLM is licensed under Apache License 2.0. See [vLLM GitHub](https://github.com/vllm-project/vllm) for more information.