
Deepgram

Deepgram Voice Agents is a flexible conversational AI stack that includes speech‑to‑text (STT), text‑to‑speech (TTS), and pluggable language models, designed to power real‑time, multi‑turn voice applications.

This tutorial will guide you through integrating Arcee AI models as the LLM backbone for your Deepgram voice agent. The first section shows how to use our models through Together AI, and the second shows a self-hosted option.


Using Arcee Models with Deepgram Voice Agents (via Together AI)

Step 1: Create a Together AI API Key

  1. Log in to your Together AI account and open the API Keys page

  2. Click Create API Key

  3. Copy the key and store it securely; you’ll need it to authorize model requests (a sketch for storing it follows)
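
One common way to keep the key out of scripts and shell history is an environment variable. A minimal sketch (the variable name is our choice, not something Together AI or Deepgram requires):

```bash
# Store the Together AI key in an environment variable for later use.
export TOGETHER_API_KEY="paste-your-key-here"
```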

Step 2: Configure the Deepgram Agent to Use AFM‑4.5B

  1. Navigate to the Deepgram Voice Agent section of the Deepgram playground

  2. Scroll to the Model section

  3. Under Select a Large Language Model, choose:

    Other – Custom model
  4. Fill in the following fields (replace the placeholder under Authorization Header with your Together AI API key from Step 1); a quick connectivity test follows the table

| Field | Value |
| --- | --- |
| Custom Model Name | arcee-ai/AFM-4.5B (or any Arcee model) |
| Custom Model URL | https://api.together.xyz/v1/chat/completions |
| Custom Model API Format | OpenAI |
| Authorization Header | Bearer YOUR_TOGETHER_API_KEY |
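
Before wiring the model into Deepgram, it can help to sanity-check the Together AI endpoint directly. A minimal sketch with curl, assuming the key from Step 1 is stored in TOGETHER_API_KEY:

```bash
# Send a one-off OpenAI-format chat completion request to Together AI.
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "arcee-ai/AFM-4.5B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

A JSON response containing a choices array confirms the model name and key are valid.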

Step 3: Test Your Agent

  1. Scroll down and click Talk to your Agent

  2. Speak to your agent or type a message

  3. Open the Developer Console to view the underlying API calls and verify responses.

  4. You’ll see the full conversation log in real time, and hear your model’s response played back using Deepgram’s TTS engine.


Using AFM‑4.5B with Deepgram Voice Agents (Self-Hosted)

This guide explains how to integrate a self-hosted Arcee model as the LLM backbone for your Deepgram voice agent. We will use AFM‑4.5B for this example.

Step 1: Deploy AFM‑4.5B with an OpenAI-Compatible Server

AFM‑4.5B can be deployed using any inference server that supports the OpenAI /v1/chat/completions format.

Popular options include llama.cpp (used in the example below) and vLLM, both of which expose an OpenAI-compatible endpoint.

Hardware Notes:

  • For 4-bit quantized: 3–4 GB RAM

  • For bf16 inference: ≥ 9 GB RAM

  • Context window: 8192 tokens recommended

Example using llama.cpp:

```bash
./bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --jinja
```

Make sure --jinja is included so llama.cpp applies the model’s chat template to incoming OpenAI-style chat requests.
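
Once the server is running, you can smoke-test the endpoint locally before exposing it. A minimal sketch, assuming the port from the command above (llama.cpp serves the single loaded model, so the model field here is largely informational):

```bash
# Verify the OpenAI-compatible endpoint on the local llama.cpp server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm-4.5b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```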

Step 2: Expose the AFM Server via ngrok

Deepgram needs a public HTTPS endpoint to reach your model.

Use ngrok or any tunneling tool:

```bash
ngrok http 8000
```

This will forward to your local server and give you a public URL like:

https://your-subdomain.ngrok-free.dev → http://localhost:8000

Keep this tunnel active while your Deepgram agent is running.
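
To confirm the tunnel is reachable before configuring Deepgram, you can repeat the same smoke test against the public URL (replace the subdomain with the one ngrok printed):

```bash
# Same request as before, now routed through the public ngrok endpoint.
curl https://your-subdomain.ngrok-free.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "afm-4.5b", "messages": [{"role": "user", "content": "Hello!"}]}'
```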

Step 3: Configure the Deepgram Agent to Use AFM‑4.5B

  1. Navigate to the Deepgram Voice Agent section of the Deepgram playground

  2. Scroll to the Model section

  3. Under Select a Large Language Model, choose: Other – Custom model

  4. Fill in the following fields:

| Field | Value |
| --- | --- |
| Custom Model Name | AFM |
| Custom Model URL | https://your-subdomain.ngrok-free.dev/v1/chat/completions |
| Custom Model API Format | OpenAI |
| Authorization Header | Bearer None |
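
The Bearer None placeholder reflects that the llama.cpp server in this example runs without authentication. If you would rather not leave the public tunnel open, llama-server can require an API key; a hedged sketch (the key value here is arbitrary):

```bash
# Restart the server requiring a bearer token (llama.cpp's --api-key flag),
# then enter "Bearer my-secret-key" in Deepgram's Authorization Header field.
./bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 --port 8000 --ctx-size 8192 --jinja \
  --api-key my-secret-key
```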

Step 4: Test Your Agent

  1. Scroll down and click "Talk to your Agent"

  2. Speak to your agent

  3. Open the Developer Console to examine the underlying API calls

You’ll see the full conversation log in real time, and hear your model's response played back using Deepgram’s TTS engine.
