vLLM
vLLM is a high-throughput serving engine for language models that optimizes inference performance through advanced memory management and batching techniques. It provides easy integration with popular model architectures while maximizing GPU utilization for production deployments.
Docker Container for vLLM
Prerequisite
GPU Instance with > 9 GB VRAM (if running the model in bf16)
A Hugging Face account with access to arcee-ai/AFM-4.5B
Docker and NVIDIA Container Toolkit installed on your instance
If you need assistance, see Install Docker Engine and Installing the NVIDIA Container Toolkit
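The > 9 GB figure is consistent with holding the weights alone in bf16: roughly 4.5 billion parameters at 2 bytes each is about 9 GB, before any KV cache or activations. As a minimal sketch, the check below compares the memory that nvidia-smi reports against that threshold. The helper names has_enough_vram and check_gpus are hypothetical, and the 9216 MiB (9 GiB) default is an assumed reading of the prerequisite, not a value published by vLLM or Arcee.

```shell
# Hypothetical helper: succeeds when the given VRAM (in MiB) exceeds the threshold.
# The 9216 MiB (9 GiB) default is an assumed reading of the "> 9 GB" prerequisite.
has_enough_vram() {
  local total_mib="$1"
  local required_mib="${2:-9216}"
  [ "$total_mib" -gt "$required_mib" ]
}

# Check every GPU that nvidia-smi reports (memory.total is printed in MiB).
check_gpus() {
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits |
    while read -r mib; do
      if has_enough_vram "$mib"; then
        echo "OK: ${mib} MiB"
      else
        echo "Too small: ${mib} MiB"
      fi
    done
}
```

Once the NVIDIA driver is installed, running check_gpus prints one line per GPU. Treat the result as a rough guide, since longer context lengths need additional memory beyond the weights.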
Deployment
docker run --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
-p 8000:8000 \
--ipc=host \
vllm/vllm-openai:latest \
--model arcee-ai/AFM-4.5B \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--served-model-name afm \
--model_impl transformers \
--trust-remote-code
Manual Install using vLLM
Prerequisite
An NVIDIA GPU Instance with > 9 GB VRAM (if running the model in bf16)
A Hugging Face account with access to arcee-ai/AFM-4.5B
Deployment
Ensure your NVIDIA Driver is configured.
nvidia-smi
If information about your GPU is returned, skip this step. If not, run the following commands.
sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
sudo reboot
# Once you reconnect, check for correct driver configuration
nvidia-smi
Install necessary dev tools.
sudo apt install -y build-essential python3.12-dev
Set up a Python virtual environment. In this guide, we'll use uv.
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv --python 3.12 --seed
source .venv/bin/activate
Install vLLM, the Hugging Face libraries, and Git LFS.
uv pip install vllm --torch-backend=auto
uv pip install -U "transformers<4.55"
uv pip install --upgrade "huggingface_hub[cli]"
sudo apt-get install git-lfs
git lfs install
Log in to your Hugging Face account using an HF access token.
hf auth login
Host AFM-4.5B.
vllm serve arcee-ai/AFM-4.5B \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 8192 \
--served-model-name afm \
--model_impl transformers \
--trust-remote-code
For max-model-len, you can specify a context length of up to 65536.
For additional configuration options, see vLLM Configurations.
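If the context length is passed in from a script or an environment variable, it may be worth validating it against the 65536 limit before launching. The wrapper below is only a sketch: valid_context_len and serve_afm are hypothetical helper names, and the remaining flags are copied from the serve command in this guide. Keep in mind that a longer context generally requires more GPU memory for the KV cache.

```shell
# Upper bound taken from this guide; the helper names are hypothetical.
MAX_CONTEXT=65536

# Succeeds only for an integer between 1 and MAX_CONTEXT.
valid_context_len() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 1 ] && [ "$1" -le "$MAX_CONTEXT" ]
}

# Launch the server with a validated context length (defaults to 8192).
serve_afm() {
  local len="${1:-8192}"
  if ! valid_context_len "$len"; then
    echo "max-model-len must be an integer between 1 and $MAX_CONTEXT" >&2
    return 1
  fi
  vllm serve arcee-ai/AFM-4.5B \
    --host 0.0.0.0 \
    --port 8000 \
    --max-model-len "$len" \
    --served-model-name afm \
    --model_impl transformers \
    --trust-remote-code
}
```

For example, serve_afm 32768 starts the server with a 32K context, while serve_afm 100000 exits with an error before vLLM is invoked.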
Run Inference on AFM-4.5B using the Chat Completions endpoint.
curl http://Your.IP.Address:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "afm",
"messages": [
{ "role": "user", "content": "What are the benefits of model merging" }
],
"temperature": 0.7,
"top_k": 50,
"repetition_penalty": 1.1
}'
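If the request above fails with a connection error, the server may still be downloading or loading the model. One way to handle this is to poll the server until it responds before sending requests. The sketch below assumes the vLLM OpenAI-compatible server exposes a /health route on the same host and port as the serve command; wait_until and server_ready are hypothetical helper names.

```shell
# Hypothetical helper: re-run a probe command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  local attempts="$1" delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Probe the server's health route; host and port are assumed to match the serve command.
server_ready() {
  curl -sf "http://${1:-localhost}:${2:-8000}/health"
}
```

For example, wait_until 60 5 server_ready Your.IP.Address 8000 && echo "server is up" retries every 5 seconds for up to 5 minutes. Passing the probe as a command keeps the retry loop reusable for other checks.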

