llama.cpp

llama.cpp is a C/C++ implementation focused on running large language models efficiently on consumer hardware with minimal dependencies. It emphasizes optimized CPU inference and quantization to enable local model execution across diverse platforms, including mobile and edge devices.

Prerequisite

  1. A computer or instance with more than 9 GB of RAM if running the model in bf16 (quantized variants need less; see the sizing sketch below)

  2. A Hugging Face account with access to arcee-ai/AFM-4.5B

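The RAM figure above follows from simple arithmetic: weight memory scales with parameter count times bytes per weight. Below is a rough sizing sketch for a 4.5B-parameter model; the bytes-per-weight values for the quantized formats are approximations, and real GGUF files add a little metadata overhead.

# Approximate weight memory: parameter count x bytes per weight
# bf16 = 2 bytes/weight, Q8_0 ~ 1.06 bytes/weight, Q4_0 ~ 0.56 bytes/weight
python3 -c "
for name, bpw in [('bf16', 2.0), ('Q8_0', 1.06), ('Q4_0', 0.56)]:
    print(f'{name}: ~{4.5e9 * bpw / 1e9:.1f} GB')
"

Leave some headroom beyond the weights for the KV cache and activation buffers.
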
Deployment

  1. Set up a Python virtual environment. In this guide, we'll use uv.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv
source .venv/bin/activate
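To confirm the environment is active, check which Python the shell now resolves; it should point inside the new .venv:

# Should print a path ending in .venv/bin/python
which python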
  2. Clone the llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
  3. Build llama.cpp and install the Python dependencies

cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
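Before moving on, it's worth confirming the build produced the server binary. With the in-source build above, binaries land in bin/; recent llama.cpp builds also accept a --version flag:

# Sanity-check the build output
ls bin/ | grep llama-server
bin/llama-server --version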
  4. Install the Hugging Face CLI and log in

uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
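If you want to double-check that the login took effect (and that your token can see the gated repo), recent huggingface_hub releases ship a whoami subcommand:

# Prints the account the CLI is authenticated as (recent huggingface_hub versions)
hf auth whoami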
  5. Download the model size you want to run

    1. The larger the model, the more memory it will require and the slower it will run

# Create a directory to store the model(s)
mkdir afm

# bf16
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-bf16.gguf --repo-type model --local-dir ./afm

# Q8_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q8_0.gguf --repo-type model --local-dir ./afm

# Q4_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q4_0.gguf --repo-type model --local-dir ./afm
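Once the download finishes, a quick listing confirms the file landed in ./afm; expect sizes on the order of ~9 GB (bf16), ~4.8 GB (Q8_0), or ~2.5 GB (Q4_0), give or take metadata overhead:

# Verify the downloaded file and its size
ls -lh ./afm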
  6. Host AFM-4.5B.

    1. If you downloaded a different model size, update the filename in the command below to match

bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --ctx-size 8192
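Before sending requests, you can poll the server's readiness; llama-server exposes a /health endpoint that reports ok once the model has finished loading:

# From the host itself; returns {"status":"ok"} when the model is loaded and ready
curl http://localhost:8000/health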
  7. Run inference on AFM-4.5B using the OpenAI-compatible Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repeat_penalty": 1.1
      }'
  • Ensure you replace Your.IP.Address with the IP address of the instance hosting the model (or use localhost when querying from the same machine)

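The endpoint returns an OpenAI-style JSON object. To pull out just the assistant's reply, you can pipe the response through jq (assuming jq is installed on the client machine):

curl -s http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ]
      }' | jq -r '.choices[0].message.content'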