llama.cpp
llama.cpp is a C++ implementation focused on running transformer models efficiently on consumer hardware with minimal dependencies. It emphasizes CPU inference optimization and quantization techniques to enable local model execution across diverse platforms including mobile and edge devices.
Prerequisites
A computer or instance with more than 9 GB of RAM (if running the model in bf16)
A Hugging Face account with access to arcee-ai/AFM-4.5B
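You can check how much memory is available before downloading a model (a quick sanity check; the commands below assume Linux and macOS respectively):
# Linux
free -h
# macOS
sysctl hw.memsize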
Deployment
Set up a Python virtual environment. In this guide, we'll use uv.
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv
source .venv/bin/activate
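Optionally, confirm the environment is active before proceeding; the interpreter should resolve inside .venv:
which python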
Clone the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build and Install Dependencies
cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
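If your instance has an NVIDIA GPU, you can optionally build with the CUDA backend instead. This is a sketch assuming a recent llama.cpp checkout where CUDA is toggled with the GGML_CUDA CMake option (the flag name has changed across versions); note that binaries then land in build/bin/ rather than bin/:
# optional: out-of-source build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j8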
Install the Hugging Face CLI and Log In
uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
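To confirm the login worked, you can ask the CLI which account you are authenticated as (assuming a recent huggingface_hub release that ships the hf command):
hf auth whoami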
Download the model size you want to run
The larger the model, the more memory it requires and the slower it will run.
# Create a directory to store the model(s)
mkdir afm
# bf16
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-bf16.gguf --repo-type model --local-dir ./afm
# Q8_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q8_0.gguf --repo-type model --local-dir ./afm
# Q4_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q4_0.gguf --repo-type model --local-dir ./afm
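Once a download finishes, confirm the GGUF file is in place and check its size:
ls -lh ./afm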
Host AFM-4.5B.
If you downloaded a different model size, make sure the model filename in the command matches the file you downloaded.
bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
--host 0.0.0.0 \
--port 8000 \
--jinja \
--ctx-size 8192
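Before sending prompts, you can verify the server is up from the instance itself; llama-server exposes a health endpoint and an OpenAI-compatible model listing:
curl http://localhost:8000/health
curl http://localhost:8000/v1/models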
Run Inference on AFM-4.5B using the Chat Completions endpoint.
curl http://Your.IP.Address:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "afm",
"messages": [
{ "role": "user", "content": "What are the benefits of model merging" }
],
"temperature": 0.7,
"top_k": 50,
"repeat_penalty": 1.1
}'
Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on.
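The endpoint also supports streaming. A minimal variant of the request above (same host and port assumed) sets "stream": true to receive tokens incrementally as server-sent events:
curl http://Your.IP.Address:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "afm",
"messages": [
{ "role": "user", "content": "Explain quantization in one paragraph" }
],
"stream": true
}'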