
Deepgram

Deepgram Voice Agents is a flexible conversational AI stack that includes speech‑to‑text (STT), text‑to‑speech (TTS), and pluggable language models, designed to power real‑time, multi‑turn voice applications.

This tutorial will guide you through integrating Arcee AI models as the LLM backbone for your Deepgram voice agent. The first section shows how to use our models through Together AI, and the second shows a self-hosted option.


Using Arcee Models with Deepgram Voice Agents (via Together AI)

Step 1: Create a Together AI API Key

  1. Log in to your Together AI account and open the API Keys page

  2. Click Create API Key

  3. Copy the key and store it securely; you’ll need it to authorize model requests (a sketch for storing it follows)
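
One common way to keep the key out of scripts and shell history is an environment variable. A minimal sketch (the variable name is our choice, not something Together AI or Deepgram requires):

```bash
# Store the Together AI key in an environment variable for later use.
export TOGETHER_API_KEY="paste-your-key-here"
```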

Step 2: Configure the Deepgram Agent to Use AFM‑4.5B

  1. Navigate to the Deepgram Voice Agent section of the Deepgram playground

  2. Scroll to the Model section

  3. Under Select a Large Language Model, choose:

    Other – Custom model
  4. Fill in the following fields (replace the placeholder under Authorization Header with your Together AI API key from Step 1); a quick connectivity test follows the table

| Field | Value |
| --- | --- |
| Custom Model Name | arcee-ai/AFM-4.5B (or any Arcee model) |
| Custom Model URL | https://api.together.xyz/v1/chat/completions |
| Custom Model API Format | OpenAI |
| Authorization Header | Bearer YOUR_TOGETHER_API_KEY |
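
Before wiring the model into Deepgram, it can help to sanity-check the Together AI endpoint directly. A minimal sketch with curl, assuming the key from Step 1 is stored in TOGETHER_API_KEY:

```bash
# Send a one-off OpenAI-format chat completion request to Together AI.
curl https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "arcee-ai/AFM-4.5B",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```

A JSON response containing a choices array confirms the model name and key are valid.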

Step 3: Test Your Agent

  1. Scroll down and click Talk to your Agent

  2. Speak to your agent or type a message

  3. Open the Developer Console to view the underlying API calls and verify responses.

  4. You’ll see the full conversation log in real time, and hear your model’s response played back using Deepgram’s TTS engine.


Using AFM‑4.5B with Deepgram Voice Agents (Self-Hosted)

This guide explains how to integrate a self-hosted Arcee model as the LLM backbone for your Deepgram voice agent. We will use AFM‑4.5B for this example.

Step 1: Deploy AFM‑4.5B with an OpenAI-Compatible Server

AFM‑4.5B can be deployed using any inference server that supports the OpenAI /v1/chat/completions format.

Popular options include llama.cpp (used in the example below) and vLLM, both of which expose an OpenAI-compatible endpoint.

Hardware Notes:

  • For 4-bit quantized: 3–4 GB RAM

  • For bf16 inference: ≥ 9 GB RAM

  • Context window: 8192 tokens recommended

Example using llama.cpp:

```bash
./bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --ctx-size 8192 \
  --jinja
```

Make sure --jinja is included so llama.cpp applies the model’s chat template to incoming OpenAI-style chat requests.
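
Once the server is running, you can smoke-test the endpoint locally before exposing it. A minimal sketch, assuming the port from the command above (llama.cpp serves the single loaded model, so the model field here is largely informational):

```bash
# Verify the OpenAI-compatible endpoint on the local llama.cpp server.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "afm-4.5b",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```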

Step 2: Expose the AFM Server via ngrok

Deepgram needs a public HTTPS endpoint to reach your model.

Use ngrok or any tunneling tool:

```bash
ngrok http 8000
```

This will forward to your local server and give you a public URL like:

https://your-subdomain.ngrok-free.dev → http://localhost:8000

Keep this tunnel active while your Deepgram agent is running.
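
To confirm the tunnel is reachable before configuring Deepgram, you can repeat the same smoke test against the public URL (replace the subdomain with the one ngrok printed):

```bash
# Same request as before, now routed through the public ngrok endpoint.
curl https://your-subdomain.ngrok-free.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "afm-4.5b", "messages": [{"role": "user", "content": "Hello!"}]}'
```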

Step 3: Configure the Deepgram Agent to Use AFM‑4.5B

  1. Navigate to the Deepgram Voice Agent section of the Deepgram playground

  2. Scroll to the Model section

  3. Under Select a Large Language Model, choose: Other – Custom model

  4. Fill in the following fields:

| Field | Value |
| --- | --- |
| Custom Model Name | AFM |
| Custom Model URL | https://your-subdomain.ngrok-free.dev/v1/chat/completions |
| Custom Model API Format | OpenAI |
| Authorization Header | Bearer None |
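
The Bearer None placeholder reflects that the llama.cpp server in this example runs without authentication. If you would rather not leave the public tunnel open, llama-server can require an API key; a hedged sketch (the key value here is arbitrary):

```bash
# Restart the server requiring a bearer token (llama.cpp's --api-key flag),
# then enter "Bearer my-secret-key" in Deepgram's Authorization Header field.
./bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 --port 8000 --ctx-size 8192 --jinja \
  --api-key my-secret-key
```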

Step 4: Test Your Agent

  1. Scroll down and click "Talk to your Agent"

  2. Speak to your agent

  3. Open the Developer Console to examine the underlying API calls

You’ll see the full conversation log in real time, and hear your model's response played back using Deepgram’s TTS engine.
