# vLLM

vLLM is a high-throughput serving engine for language models that optimizes inference performance through advanced memory management and batching techniques. It provides easy integration with popular model architectures while maximizing GPU utilization for production deployments.

{% hint style="warning" %}
The deployments in this document are for Trinity-Nano-6B; however, they work exactly the same way for all Arcee AI models. To deploy a different model, simply replace the model name with the one you'd like to deploy.
{% endhint %}

### Docker Container for vLLM

**Prerequisite**

1. Sufficient VRAM (refer to [Hardware Prerequisites](https://docs.arcee.ai/~/revisions/UOfL3qIelQCFUdc2TpQu/quick-deploys/hardware-prerequisites))
2. A Hugging Face account
3. Docker and NVIDIA Container Toolkit installed on your instance
   1. If you need assistance, see [Install Docker Engine](https://docs.docker.com/engine/install/) and [Installing the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)

**Deployment**

```bash
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model arcee-ai/trinity-nano-thinking \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len 8192 \
    --served-model-name afm \
    --model-impl transformers \
    --trust-remote-code
```

{% hint style="info" %}
Replace `your_hf_token_here` with your Hugging Face token
{% endhint %}
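Once the container logs show the server has finished starting, you can verify it's reachable. This check assumes you run it on the same instance the container is hosted on.

```shell
# List the models the server exposes; the response should include "afm",
# the name set by --served-model-name in the run command above.
curl http://localhost:8000/v1/models
```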

### Manual Install using vLLM

**Prerequisite**

1. Sufficient VRAM (refer to [Hardware Prerequisites](https://docs.arcee.ai/~/revisions/UOfL3qIelQCFUdc2TpQu/quick-deploys/hardware-prerequisites))
2. A Hugging Face account

{% hint style="info" %}
These commands are for an instance running Ubuntu. They will need to be modified for other operating systems.
{% endhint %}

**Deployment**

1. Ensure your NVIDIA driver is configured.

```bash
nvidia-smi
```

2. If information about your GPU is returned, skip this step. If not, run the following commands.

```bash
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
sudo reboot

# Once you reconnect, check for correct driver configuration
nvidia-smi
```

3. Install necessary dev tools.

```bash
sudo apt install -y build-essential python3.12-dev
```

4. Set up a Python virtual environment. In this guide, we'll use `uv`.

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv --python 3.12 --seed
source .venv/bin/activate
```

5. Install vLLM, the Hugging Face libraries, and Git LFS.

```bash
uv pip install vllm --torch-backend=auto
uv pip install -U "transformers<4.55"
uv pip install --upgrade "huggingface_hub[cli]"
sudo apt-get install git-lfs
git lfs install
```

6. Log in to your Hugging Face account using a [HF Access Token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
hf auth login
```
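If you'd prefer a non-interactive login (for example, in a provisioning script), you can export your token as an environment variable instead; `huggingface_hub` reads `HF_TOKEN` automatically. Replace the placeholder with your actual token.

```shell
# Non-interactive alternative: huggingface_hub picks up HF_TOKEN automatically.
export HF_TOKEN=your_hf_token_here
```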

7. Host the model.

```bash
vllm serve arcee-ai/trinity-nano-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --served-model-name afm \
  --model-impl transformers \
  --trust-remote-code
```

* For `max-model-len`, you can specify a context length of up to 65536 tokens
* For additional configuration options, see [vLLM Configurations](https://docs.vllm.ai/en/stable/api/vllm/config.html).

### Run Inference using the Chat Completions Endpoint

```bash
curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trinity",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repeat_penalty": 1.1
      }'
```

{% hint style="info" %}
Ensure you replace `Your.IP.Address` with the IP address of the instance you're hosting the model on
{% endhint %}
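The same request can be made from Python. The sketch below uses only the standard library and mirrors the `curl` call above; the host placeholder, port, and served model name (`afm`, set by `--served-model-name` in the serve commands) are assumptions based on this guide.

```python
import json
import urllib.request

# Replace Your.IP.Address with the IP address of your instance, as in the curl example.
VLLM_URL = "http://Your.IP.Address:8000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "afm") -> dict:
    """Build a Chat Completions request body matching the curl example."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "top_k": 50,                # vLLM sampling extension to the OpenAI schema
        "repetition_penalty": 1.1,  # vLLM sampling extension to the OpenAI schema
    }

def extract_reply(response_json: dict) -> str:
    """Pull the assistant's message text out of a Chat Completions response."""
    return response_json["choices"][0]["message"]["content"]

def ask(prompt: str) -> str:
    """POST the request to the vLLM server and return the model's reply."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `ask("What are the benefits of model merging")` returns the reply text as a string.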
