
AFM-4.5B GPU Quick Deploy

AFM-4.5B is designed to run efficiently on low-VRAM GPUs and on CPUs. For GPU deployments, we recommend using vLLM or SGLang. You will need at least 9 GB of VRAM to load AFM-4.5B in bf16.

In this guide, we'll walk you through deploying the instruct version of AFM-4.5B on an NVIDIA GPU instance. The steps work for any hardware or virtual machine with NVIDIA GPUs, whether in the cloud or on premises.
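
The 9 GB figure comes straight from the parameter count: in bf16, each parameter occupies 2 bytes, so the weights alone take roughly 4.5B × 2 bytes ≈ 9 GB, with the KV cache and activations needing headroom on top. A quick back-of-the-envelope check in Python:

# Rough VRAM estimate for the AFM-4.5B weights in bf16.
# Weights only; the KV cache and activations need extra headroom.
params = 4.5e9        # AFM-4.5B parameter count
bytes_per_param = 2   # bf16 = 16 bits = 2 bytes
print(f"~{params * bytes_per_param / 1e9:.1f} GB")  # ~9.0 GB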

Prerequisites

  1. An NVIDIA GPU instance with more than 9 GB of VRAM (if running the model in bf16)

  2. A Hugging Face account with access to arcee-ai/AFM-4.5B

These commands are for an instance running Ubuntu. They will need to be modified for other operating systems.

Deployment Steps

  1. Ensure your NVIDIA driver is configured.

nvidia-smi
If information about your GPU is returned, skip to the next step. If not, run the following commands.

sudo apt update
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers install
sudo reboot

# Once you reconnect, check for correct driver configuration
nvidia-smi
  2. Install necessary dev tools.

sudo apt install -y build-essential python3.12-dev
  3. Set up a Python virtual environment. In this guide, we'll use uv.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv --python 3.12 --seed
source .venv/bin/activate
  4. Install vLLM, Transformers, and the Hugging Face CLI.

uv pip install vllm --torch-backend=auto
uv pip install -U "transformers<4.55"
uv pip install --upgrade "huggingface_hub[cli]"
sudo apt-get install -y git-lfs
git lfs install
  5. Log in to your Hugging Face account using an HF access token.

hf auth login
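
If you want to confirm that the login took effect and that your account can see the gated repo, a minimal check with the huggingface_hub Python API (installed in the previous step) looks like this; it should print your username and the model ID rather than raise an access error:

# Sanity check: verify the token is active and the gated repo is visible.
from huggingface_hub import HfApi

api = HfApi()
print(api.whoami()["name"])                    # your HF username
print(api.model_info("arcee-ai/AFM-4.5B").id)  # errors if access was not granted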
  6. Host AFM-4.5B.

vllm serve arcee-ai/AFM-4.5B \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 8192 \
  --served-model-name afm \
  --model-impl transformers \
  --trust-remote-code
  • For --max-model-len, you can specify a context length of up to 65536.

  • For additional configuration options, see vLLM Configurations.
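
Before sending requests, you can confirm the server came up cleanly by listing the served models. This sketch assumes the openai Python client is installed (uv pip install openai); vLLM doesn't check the API key by default, so any placeholder value works:

# List models exposed by the vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
for model in client.models.list():
    print(model.id)  # expect: afm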

  7. Run inference on AFM-4.5B using the Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repeat_penalty": 1.1
      }'
  • Ensure you replace Your.IP.Address with the IP address of the instance hosting the model.
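
The same request can be sent from Python with the OpenAI client (again assuming uv pip install openai). vLLM accepts sampling parameters beyond the OpenAI spec, such as top_k and repetition_penalty, through extra_body:

# Chat completion against the vLLM server, mirroring the curl example above.
from openai import OpenAI

client = OpenAI(base_url="http://Your.IP.Address:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="afm",
    messages=[{"role": "user", "content": "What are the benefits of model merging?"}],
    temperature=0.7,
    extra_body={"top_k": 50, "repetition_penalty": 1.1},  # vLLM-specific knobs
)
print(response.choices[0].message.content)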
