llama.cpp

llama.cpp is a C++ implementation focused on running transformer models efficiently on consumer hardware with minimal dependencies. It emphasizes CPU inference optimization and quantization techniques to enable local model execution across diverse platforms including mobile and edge devices.

The deployments in this document are for deploying Trinity-Nano-6B; however, they work the exact same for all Arcee AI models. To deploy a different model, simply change the model name to the model you'd like to deploy.

Prerequisites

Sufficient RAM (refer to Hardware Prerequisites)
A Hugging Face account

Deployment

Setup a python virtual environment. In this guide, we'll use uv .

curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env

uv venv
source .venv/bin/activate

Clone the llama.cpp repo

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Build and Install Dependencies

cmake .
make -j8
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match

Install Hugging Face and Login

uv pip install --upgrade huggingface_hub[cli]
hf auth login

Download the model size you want to run
1. The larger the model, the more memory it will require and the slower it will run
2. model_name:The exact name of the model you want to deploy, like trinity-nano-6b. This tells the system which model to download from Hugging Face.
3. model_quant:Indicates the quantization format of the model, such as bf16, q4_0, or q8_0. Choose based on your hardware; lower-bit formats run faster and use less memory but may reduce accuracy slightly.

# Create a directory to store the model(s)
mkdir afm

# bf16
hf download arcee-ai/model_name model_quant.gguf --repo-type model --local-dir ./afm

Host the model

bin/llama-server -m ./afm/{model_name}.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --ctx-size 8192

Run Inference using the Chat Completions endpoint.

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trinity",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repeat_penalty": 1.1
      }'

Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on

PreviousSGLang NextDeprecation Policy

Last updated 20 days ago