vLLM

vLLM is a high-throughput serving engine for language models. It improves inference performance through PagedAttention-based KV-cache memory management and continuous batching, and it integrates with popular model architectures while keeping GPU utilization high in production deployments.

The commands in this document deploy Trinity-Mini; they work the same way for all Arcee AI models. To deploy a different model, replace the model name in the commands with the name of the model you'd like to deploy.

Docker Container for vLLM

Prerequisites

Sufficient VRAM (refer to Hardware Prerequisites)
A Hugging Face account
Docker and NVIDIA Container Toolkit installed on your instance
If you need assistance, see the Install Docker Engine and Installing the NVIDIA Container Toolkit guides.

Deployment

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model arcee-ai/Trinity-Mini \
    --dtype bfloat16 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_r1 \
    --port 8000 \
    --tool-call-parser hermes

Replace your_hf_token_here with your Hugging Face token.
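After the container starts, the model weights still have to download and load before requests succeed. The following is a minimal sketch of a readiness check, assuming the server is reachable on localhost:8000 and that the /v1/models route of vLLM's OpenAI-compatible server is an acceptable readiness signal. The helper name wait_for_vllm and its default retry counts are illustrative assumptions, not part of vLLM.

```shell
# wait_for_vllm BASE_URL [MAX_ATTEMPTS] [DELAY_SECONDS]
# Hypothetical helper: polls GET /v1/models until it succeeds or attempts run out.
wait_for_vllm() {
  base_url="$1"
  max_attempts="${2:-60}"
  delay="${3:-5}"
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # -f fails on HTTP errors, -s silences progress, --max-time bounds each request.
    if curl -fs --max-time 5 "${base_url}/v1/models" > /dev/null 2>&1; then
      echo "vLLM is ready at ${base_url}"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  echo "vLLM did not become ready at ${base_url}" >&2
  return 1
}
```

For example, wait_for_vllm http://localhost:8000 polls every 5 seconds for up to 60 attempts; raise the attempt count if the first start has to download the weights over a slow connection.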

Manual Install using vLLM

Prerequisites

Sufficient VRAM (refer to Hardware Prerequisites)
A Hugging Face account

These commands are for an instance running Ubuntu. They will need to be modified for other operating systems.

Deployment

Ensure your NVIDIA Driver is configured.

nvidia-smi

If this prints information about your GPU, the driver is already configured and you can skip the driver installation below. If not, run the following commands.

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
sudo reboot

# Once you reconnect, check for correct driver configuration
nvidia-smi
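The check-then-install decision above can also be scripted. This is a minimal sketch, assuming that a successful nvidia-smi call is enough evidence of a working driver; the function name has_nvidia_driver is hypothetical.

```shell
# has_nvidia_driver: succeeds only if nvidia-smi exists and can talk to the driver.
has_nvidia_driver() {
  command -v nvidia-smi > /dev/null 2>&1 && nvidia-smi > /dev/null 2>&1
}

if has_nvidia_driver; then
  echo "NVIDIA driver detected; skipping driver installation."
else
  echo "No working NVIDIA driver found; install the driver and reboot."
fi
```

The script only reports the result. The installation commands stay manual because they end with a reboot.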

Install necessary dev tools.

sudo apt install -y build-essential python3.12-dev

Set up a Python virtual environment. In this guide, we'll use uv.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv --python 3.12 --seed
source .venv/bin/activate

Install vLLM, the Hugging Face CLI, and Git LFS, then log in to Hugging Face.

uv pip install vllm --torch-backend=auto
uv pip install -U "transformers<4.55"
uv pip install --upgrade "huggingface_hub[cli]"
sudo apt-get install -y git-lfs
git lfs install

hf auth login

Host the model.

vllm serve arcee-ai/Trinity-Mini \
  --dtype bfloat16 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_r1 \
  --port 8000 \
  --tool-call-parser hermes

To set the context length, pass --max-model-len with a value of up to 65536 tokens.
For additional configuration options, see vLLM Configurations.
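As an illustration of the context-length option, the sketch below wraps the serve command in a small shell function that takes the context length as an argument. It assumes the same model name and flags used in the Docker command in this guide; the function name serve_trinity and the default of 65536 are illustrative choices, not vLLM features.

```shell
# serve_trinity [MAX_MODEL_LEN]
# Hypothetical wrapper around the serve command; defaults to the
# 65536-token upper bound mentioned in this guide.
serve_trinity() {
  max_model_len="${1:-65536}"
  vllm serve arcee-ai/Trinity-Mini \
    --dtype bfloat16 \
    --max-model-len "$max_model_len" \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_r1 \
    --port 8000 \
    --tool-call-parser hermes
}
```

Calling serve_trinity 32768 starts the server with a 32768-token context. A smaller value reduces the KV-cache memory the server has to reserve per sequence, which can help on GPUs with less VRAM.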

Run Inference using the Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "arcee-ai/Trinity-Mini",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1
      }'

Replace Your.IP.Address with the IP address of the instance hosting the model.
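For repeated requests, it can be easier to keep the request body in a file and pass it to curl. The sketch below assumes the model is served under the name arcee-ai/Trinity-Mini (vLLM uses the served model path as the name unless configured otherwise); the file name chat_request.json and the helper send_chat are hypothetical.

```shell
# Write the request body to a file so it can be edited and reused.
cat > chat_request.json <<'EOF'
{
  "model": "arcee-ai/Trinity-Mini",
  "messages": [
    { "role": "user", "content": "What are the benefits of model merging?" }
  ],
  "temperature": 0.7
}
EOF

# send_chat BASE_URL [PAYLOAD_FILE]
# Hypothetical helper: posts the payload file to the Chat Completions endpoint.
send_chat() {
  curl -sS "${1}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d @"${2:-chat_request.json}"
}
```

For example, send_chat http://Your.IP.Address:8000 sends the saved request and prints the JSON response.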
