
llama.cpp

llama.cpp is a C/C++ implementation focused on running transformer models efficiently on consumer hardware with minimal dependencies. It emphasizes CPU inference optimizations and quantization techniques that enable local model execution across diverse platforms, including mobile and edge devices.
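
For reference, the Q4_K_M model variant used later in this guide is one such quantized format. As a minimal sketch (the file names below are placeholders), a full-precision GGUF file can be reduced with the llama-quantize tool that is built alongside the server in the steps below:

# Quantize an FP16 GGUF down to 4-bit (Q4_K_M); input/output names are illustrative
llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M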

Prerequisites

  1. Sufficient RAM (refer to Hardware Prerequisites)

  2. A Hugging Face account

Deployment

  1. Set up a Python virtual environment. In this guide, we'll use uv.

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv
source .venv/bin/activate
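
If the environment activated correctly, python should now resolve to the interpreter inside .venv:

# Should print a path ending in .venv/bin/python
which python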
  2. Clone the llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
  3. Build llama.cpp and install the Python dependencies

cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
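
As a quick sanity check, you can confirm the server binary was built; with the default generator used above, binaries land under bin/ in the build directory (the exact path may differ depending on your CMake setup):

# Print version and build info to confirm the build succeeded
./bin/llama-server --version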
  4. Install the Hugging Face Hub CLI and log in

uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
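
To confirm the token was stored correctly, the CLI can report which account you are logged in as:

# Should print your Hugging Face username if login succeeded
hf auth whoami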
  5. Host the model

llama-server -hf arcee-ai/Trinity-Mini-GGUF:q4_k_m \
  --host 0.0.0.0 \
  --port 8000 \
  --temp 0.15 \
  --top-k 50 \
  --top-p 0.75 \
  --min-p 0.06
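
Once the server starts, you can check readiness from another terminal; llama-server exposes a /health endpoint (shown here against localhost with the port used above):

# Returns {"status":"ok"} once the model has finished loading
curl http://localhost:8000/health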
  6. Run inference using the Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trinity",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ]
      }'

Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on.
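
The endpoint also supports streaming. As a sketch using the same host, port, and request as above, adding "stream": true returns the reply incrementally as server-sent events:

curl -N http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trinity",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "stream": true
      }'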
