vLLM
vLLM is a high-throughput serving engine for language models that optimizes inference performance through advanced memory management and batching techniques. It provides easy integration with popular model architectures while maximizing GPU utilization for production deployments.
The deployments in this document are for Trinity-Nano-6B; however, they work exactly the same for all Arcee AI models. To deploy a different model, simply change the model name to the one you'd like to deploy.
Docker Container for vLLM
Prerequisites
Sufficient VRAM (refer to Hardware Prerequisites)
A Hugging Face account
Docker and NVIDIA Container Toolkit installed on your instance
If you need assistance, see Install Docker Engine and Installing the NVIDIA Container Toolkit
Deployment
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model arcee-ai/trinity-nano-thinking \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--served-model-name afm \
--model-impl transformers \
--trust-remote-code
Manual Install using vLLM
Prerequisites
Sufficient VRAM (refer to Hardware Prerequisites)
A Hugging Face account
Deployment
Ensure your NVIDIA Driver is configured.
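A quick way to check is to query the GPU with nvidia-smi:
nvidia-smi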
If information about your GPU is returned, skip this step. If not, run the following commands.
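As a minimal sketch, assuming an Ubuntu instance (package names and tooling vary by distribution), the driver can be installed with the ubuntu-drivers utility followed by a reboot:
# Install the driver-detection utility and let it pick a suitable driver
sudo apt-get update
sudo apt-get install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall
# Reboot so the new driver is loaded
sudo reboot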
Install necessary dev tools.
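As an example, assuming an Ubuntu instance, this typically means a compiler toolchain plus Python headers; adjust the package list for your distribution:
sudo apt-get update
sudo apt-get install -y build-essential python3-dev python3-venv git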
Set up a Python virtual environment. In this guide, we'll use uv.
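For example, install uv with its official installer, then create and activate an environment (.venv is uv's default location):
# Install uv and create a virtual environment in .venv
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv
source .venv/bin/activate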
Install the necessary dev tools, vLLM, and the Hugging Face Hub library.
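A minimal sketch of the Python package install inside the environment, using the vllm and huggingface_hub packages from PyPI:
uv pip install vllm huggingface_hub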
Log in to your Hugging Face account using an HF access token.
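For example, with the Hugging Face CLI that ships with huggingface_hub; it will prompt for your access token:
huggingface-cli login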
Host the model.
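A sketch that mirrors the flags from the Docker deployment above; the served model name afm and the 8192-token context length are carried over from that example and can be adjusted:
vllm serve arcee-ai/trinity-nano-thinking \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--served-model-name afm \
--model-impl transformers \
--trust-remote-code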
For max-model-len, you can specify a context length of up to 65536. For additional configuration options, see vLLM Configurations.
Run inference using the Chat Completions endpoint.
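For example, with curl against the OpenAI-compatible server started above (the model field matches the --served-model-name value, afm):
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
  "model": "afm",
  "messages": [
    {"role": "user", "content": "Give me a one-sentence summary of what vLLM does."}
  ]
}'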