Page cover
For the complete documentation index, see llms.txt. This page is also available as Markdown.

vLLM

vLLM is a high-throughput serving engine for language models that optimizes inference performance through advanced memory management and batching techniques. It provides easy integration with popular model architectures while maximizing GPU utilization for production deployments.

Docker Container for vLLM

Prerequisite

  1. Sufficient VRAM (refer to Hardware Prerequisites)

  2. A Hugging Face account

  3. Docker and NVIDIA Container Toolkit installed on your instance

Deployment

docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model arcee-ai/Trinity-Mini \
    --dtype bfloat16 \
    --enable-auto-tool-choice \
    --reasoning-parser deepseek_r1 \
    --port 8000 \
    --tool-call-parser hermes

Replace your_hf_token_here with your Hugging Face token

Manual Install using vLLM

Prerequisites

  1. Sufficient VRAM (refer to Hardware Prerequisites)

  2. A Hugging Face account

These commands are for an instance running Ubuntu. They will need to be modified for other operating systems.

Deployment

  1. Ensure your NVIDIA Driver is configured.

  1. If information about your GPU is returned, skip this step. If not, run the following commands.

  1. Install necessary dev tools.

  1. Setup a python virtual environment. In this guide, we'll use uv .

  1. Install necessary dev tools, vLLM, and Hugging Face.

  1. Login to your Hugging Face Account using a HF Access Token.

  1. Host the model.

  • For max-model-len you can specify a context length of up to 65536

  • For additional configuration options, see vLLM Configurations.

Trinity-Large-Thinking note (multi-turn agents): If you are deploying Trinity-Large-Thinking for tool-calling agents, preserve assistant reasoning across turns. In some vLLM versions, input reasoning_content may be ignored while reasoning is honored. For best compatibility, map SDK output reasoning_content to assistant input reasoning, and avoid content: null on assistant tool-call turns (use ""). See Reasoning Traces for full Python/TypeScript examples and troubleshooting.

Run Inference using the Chat Completions endpoint.

Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on

Last updated