llama.cpp
llama.cpp is a C/C++ inference engine focused on running transformer models efficiently on consumer hardware with minimal dependencies. It emphasizes CPU optimization and quantization to enable local model execution across diverse platforms, including mobile and edge devices.
This guide deploys Trinity-Nano-6B; however, the same steps work for all Arcee AI models. To deploy a different model, simply change the model name to the one you'd like to deploy.
Prerequisites
Sufficient RAM (refer to Hardware Prerequisites)
A Hugging Face account
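As a rough guide (an estimate, not an official sizing), a ~6B-parameter model quantized to Q4_K_M occupies on the order of 4 GB of memory, plus additional room for the KV cache and the operating system. On Linux you can check how much memory is available with:
# Show total, used, and available system memory in human-readable units
free -h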
Deployment
Set up a Python virtual environment. In this guide, we'll use uv.
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
uv venv
source .venv/bin/activate
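To confirm the virtual environment is active before continuing, you can check which Python interpreter is on your PATH (the path below assumes the environment was created in the current directory, as above):
# Should print a path ending in .venv/bin/python
which python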
Clone the llama.cpp repo
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
Build and Install Dependencies
cmake .
make -j8
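The commands above produce a CPU-only build. If you have a supported GPU, llama.cpp can optionally be built with GPU offload enabled. The sketch below assumes a recent checkout and an installed CUDA toolkit; the flag name has changed across releases, and other backends such as Metal or Vulkan use different flags:
# Optional: reconfigure and rebuild with the CUDA backend enabled
cmake . -DGGML_CUDA=ON
make -j8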
uv pip install -r requirements.txt --prerelease=allow --index-strategy unsafe-best-match
Install Hugging Face and Login
uv pip install --upgrade "huggingface_hub[cli]"
hf auth login
Host the model
llama-server -hf arcee-ai/Trinity-Mini-GGUF:q4_k_m \
    --host 0.0.0.0 \
    --port 8000 \
    --temp 0.15 \
    --top-k 50 \
    --top-p 0.75 \
    --min-p 0.06
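Before sending requests, you can confirm the server has finished loading the model. llama-server exposes a health endpoint; the host and port below match the flags used above:
# Returns {"status":"ok"} once the model is loaded and the server is ready
curl http://localhost:8000/health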
Run Inference using the Chat Completions endpoint.
curl http://Your.IP.Address:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "trinity",
        "messages": [
            { "role": "user", "content": "What are the benefits of model merging" }
        ]
    }'
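The server implements the OpenAI Chat Completions schema, so the same request can also be streamed as server-sent events by adding a stream field (all other values unchanged from the example above):
curl http://Your.IP.Address:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "trinity",
        "stream": true,
        "messages": [
            { "role": "user", "content": "What are the benefits of model merging" }
        ]
    }'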


