Deepgram
Deepgram Voice Agents is a flexible conversational AI stack that includes speech‑to‑text (STT), text‑to‑speech (TTS), and pluggable language models, designed to power real‑time, multi‑turn voice applications.
This tutorial guides you through integrating Arcee AI models as the LLM backbone for your Deepgram voice agent. The first section shows how to use our models through Together.ai, and the second covers a self-hosted option.
Using Arcee Models with Deepgram Voice Agents (via Together.ai)
Step 1: Create a Together AI API Key
In your Together AI dashboard, click Create API Key
Copy the key and store it securely; you’ll need it to authorize model requests
Step 2: Configure the Deepgram Agent to Use AFM‑4.5B
Navigate to the Deepgram Voice Agent section of the Deepgram playground
Scroll to the Model section
Under Select a Large Language Model, choose:
Other – Custom model
Fill in the following fields (replacing the API key in the authorization header with your Together API key from Step 1):
Custom Model Name
arcee-ai/AFM-4.5B (or any Arcee model)
Custom Model URL
https://api.together.xyz/v1/chat/completions
Custom Model API Format
OpenAI
Authorization Header
Authorization → Bearer YOUR_TOGETHER_API_KEY
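Before wiring these values into the playground, it can help to send the same request yourself. The sketch below builds the OpenAI-format chat request Deepgram will issue to Together.ai, using the field values above; `YOUR_TOGETHER_API_KEY` is the placeholder for your key from Step 1, and the helper name is ours, not part of either API.

```python
import json
import urllib.request

TOGETHER_API_KEY = "YOUR_TOGETHER_API_KEY"  # placeholder: paste your real key

def build_chat_request(user_text):
    # Same endpoint, model name, and auth header as the playground fields above.
    payload = {
        "model": "arcee-ai/AFM-4.5B",
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        "https://api.together.xyz/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOGETHER_API_KEY}",
        },
        method="POST",
    )

req = build_chat_request("Say hello in one short sentence.")
# With a real key, send it with: urllib.request.urlopen(req)
print(req.full_url)
```

If the request succeeds outside the playground but your agent still fails, the problem is in the Deepgram configuration rather than the key or endpoint.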
Step 3: Test Your Agent
Scroll down and click Talk to your Agent
Speak to your agent or type a message
Open the Developer Console to view the underlying API calls and verify responses.
You’ll see the full conversation log in real time, and hear your model’s response played back using Deepgram’s TTS engine.
Using AFM‑4.5B with Deepgram Voice Agents (Self-Hosted)
This guide explains how to integrate a self-hosted Arcee model as the LLM backbone for your Deepgram voice agent. We will use AFM‑4.5B in this example.
Step 1: Deploy AFM‑4.5B with an OpenAI-Compatible Server
AFM‑4.5B can be deployed using any inference server that supports the OpenAI /v1/chat/completions format.
Some popular options include llama.cpp, vLLM, and Ollama.
Hardware Notes:
For 4-bit quantized: 3–4 GB RAM
For bf16 inference: ≥ 9 GB RAM
Context window: 8192 tokens recommended
Example using llama.cpp:
./bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
--host 0.0.0.0 \
--port 8000 \
--ctx-size 8192 \
--jinja
Make sure --jinja is included to enable the OpenAI-compatible API.
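Whichever server you pick, Deepgram only needs it to speak the standard OpenAI chat-completions shapes. As a quick reference, this sketch shows the response field your server must return and Deepgram will read; the sample JSON is illustrative, not a real server reply.

```python
import json

# Illustrative OpenAI-style chat completion response (shape only, trimmed
# to the fields that matter for a voice agent).
sample_response = json.loads("""
{
  "choices": [
    {"message": {"role": "assistant", "content": "Hello!"},
     "finish_reason": "stop"}
  ]
}
""")

# The assistant's reply lives at choices[0].message.content.
reply = sample_response["choices"][0]["message"]["content"]
print(reply)  # Hello!
```

If your server returns text anywhere other than `choices[0].message.content`, it is not OpenAI-compatible and the agent will have nothing to speak.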
Step 2: Expose the AFM Server via ngrok
Deepgram needs a public HTTPS endpoint to reach your model.
Use ngrok or any tunneling tool:
ngrok http 8000
This will forward traffic to your local server and give you a public URL like:
https://your-subdomain.ngrok-free.dev → http://localhost:8000
Keep this tunnel active while your Deepgram agent is running.
Step 3: Configure the Deepgram Agent to Use AFM‑4.5B
Navigate to the Deepgram Voice Agent section of the Deepgram playground
Scroll to the Model section
Under Select a Large Language Model, choose:
Other – Custom model
Fill in the following fields:
Custom Model Name
AFM
Custom Model URL
https://your-subdomain.ngrok-free.dev/v1/chat/completions
Custom Model API Format
OpenAI
Authorization Header
Authorization → Bearer None
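As with the Together.ai setup, you can replay the request Deepgram will send before testing in the playground. This sketch uses the placeholder ngrok URL from Step 2 (substitute your real subdomain); since a self-hosted server needs no real key, the token is the literal string "None", matching the field above.

```python
import json
import urllib.request

# Placeholder public URL from Step 2; replace with your actual ngrok subdomain.
url = "https://your-subdomain.ngrok-free.dev/v1/chat/completions"

payload = {
    "model": "AFM",
    "messages": [{"role": "user", "content": "Say hello."}],
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        # No auth is required for the self-hosted server, so the bearer
        # token is just the literal string "None".
        "Authorization": "Bearer None",
    },
    method="POST",
)
# With the tunnel and server running, send it with:
#   urllib.request.urlopen(req)
print(req.get_header("Authorization"))
```

A successful reply here confirms both the tunnel and the server before you involve Deepgram at all.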
Step 4: Test Your Agent
Scroll down and click Talk to your Agent
Speak to your agent
Open the Developer Console to view the underlying API calls and verify responses.
You’ll see the full conversation log in real time, and hear your model's response played back using Deepgram’s TTS engine.