AFM-4.5B CPU Quick Deploy

AFM-4.5B is designed to run efficiently on low-VRAM GPUs and CPUs. For CPU deployments, we recommend using llama.cpp. You will need at least 9 GB of RAM to load AFM-4.5B in bf16. For a breakdown of AFM-4.5B performance on Intel Sapphire Rapids, AWS Graviton4, and Qualcomm Z1E-80-100 processors, read Is Running Language Models on CPU Really Viable?

In this guide, we'll walk you through how to deploy the instruct version of AFM-4.5B on a CPU.

Prerequisites

  1. A computer or instance with more than 9 GB of RAM (if running the model in bf16)

  2. A Hugging Face account with access to arcee-ai/AFM-4.5B

Deployment Steps

  1. Set up a Python virtual environment. In this guide, we'll use uv.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv
source .venv/bin/activate
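
To confirm the virtual environment is active, a quick sanity check is to verify that the python on your PATH now resolves inside .venv:

# Should point to the interpreter inside .venv
which python
python --version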
  2. Clone the llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
  3. Build and Install Dependencies

cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
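
If the build succeeds, the compiled binaries are placed in the repo's bin/ directory. A quick check that the server binary exists:

# Confirm the llama-server binary was built
ls -lh bin/llama-server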
  4. Install the Hugging Face CLI and Log In

uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
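
To verify the login worked, you can ask the CLI which account your stored token belongs to:

# Prints the username tied to your saved token
hf auth whoami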
  5. Download the model size you want to run

    1. The larger the model, the more memory it will require and the slower it will run

# Create a directory to store the model(s)
mkdir afm

# bf16
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-bf16.gguf --repo-type model --local-dir ./afm

# Q8_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q8_0.gguf --repo-type model --local-dir ./afm

# Q4_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q4_0.gguf --repo-type model --local-dir ./afm
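
After downloading, it's worth confirming the file arrived intact by checking its size; the bf16 file should be roughly 9 GB, with Q8_0 and Q4_0 coming in at roughly half and a quarter of that, respectively:

# List the downloaded GGUF file(s) and their sizes
ls -lh ./afm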
  6. Host AFM-4.5B.

    1. If you downloaded a different model size, update the model filename in the command below accordingly

bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --ctx-size 8192
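
Once the server starts, llama.cpp exposes a /health endpoint you can poll (from the host itself) to confirm the model has finished loading before sending requests:

# Returns {"status":"ok"} once the model is loaded and ready
curl http://localhost:8000/health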
  7. Run Inference on AFM-4.5B using the Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repeat_penalty": 1.1
      }'
  • Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on
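
Since llama.cpp's server implements the OpenAI-compatible Chat Completions API, you can also stream tokens back as they're generated by adding "stream": true to the same request; the response then arrives as a series of server-sent events instead of a single JSON object:

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "stream": true
      }'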
