llama.cpp
llama.cpp is a C++ implementation focused on running transformer models efficiently on consumer hardware with minimal dependencies. It emphasizes CPU inference optimization and quantization techniques to enable local model execution across diverse platforms including mobile and edge devices.
Prerequisites
A computer or instance with more than 9 GB of RAM (if running the model in bf16)
A Hugging Face account with access to arcee-ai/AFM-4.5B
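You can check how much memory is available before downloading a model (a quick sanity check; the commands below assume Linux and macOS respectively):
# Linux
free -h
# macOS
sysctl hw.memsize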
Deployment
Set up a Python virtual environment. In this guide, we'll use uv.
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv
source .venv/bin/activate
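Optionally, confirm the environment is active before proceeding; the interpreter should resolve inside .venv:
which python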
Clone the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build and Install Dependencies
cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
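If your instance has an NVIDIA GPU, you can optionally build with the CUDA backend instead. This is a sketch assuming a recent llama.cpp checkout where CUDA is toggled with the GGML_CUDA CMake option (the flag name has changed across versions); note that binaries then land in build/bin/ rather than bin/:
# optional: out-of-source build with CUDA support
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j8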
Install the Hugging Face CLI and Log In
uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
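To confirm the login worked, you can ask the CLI which account you are authenticated as (assuming a recent huggingface_hub release that ships the hf command):
hf auth whoami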
Download the model size you want to run
The larger the model, the more memory it requires and the slower it will run.
# Create a directory to store the model(s)
mkdir afm
# bf16
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-bf16.gguf --repo-type model --local-dir ./afm
# Q8_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q8_0.gguf --repo-type model --local-dir ./afm
# Q4_0
hf download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-Q4_0.gguf --repo-type model --local-dir ./afm
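Once a download finishes, confirm the GGUF file is in place and check its size:
ls -lh ./afm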
Host AFM-4.5B.
If you downloaded a different model size, make sure the model filename in the command matches the file you downloaded.
bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
--host 0.0.0.0 \
--port 8000 \
--jinja \
--ctx-size 8192
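Before sending prompts, you can verify the server is up from the instance itself; llama-server exposes a health endpoint and an OpenAI-compatible model listing:
curl http://localhost:8000/health
curl http://localhost:8000/v1/models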
Run Inference on AFM-4.5B using the Chat Completions endpoint.
curl http://Your.IP.Address:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "afm",
"messages": [
{ "role": "user", "content": "What are the benefits of model merging" }
],
"temperature": 0.7,
"top_k": 50,
"repeat_penalty": 1.1
}'
Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on.
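The endpoint also supports streaming. A minimal variant of the request above (same host and port assumed) sets "stream": true to receive tokens incrementally as server-sent events:
curl http://Your.IP.Address:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "afm",
"messages": [
{ "role": "user", "content": "Explain quantization in one paragraph" }
],
"stream": true
}'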