
ElevenLabs

ElevenLabs Agents is a conversational voice agent platform that combines automatic speech recognition (ASR), a pluggable language model, human-like text-to-speech (TTS), and a turn-taking engine into a complete voice stack.

This tutorial guides you through integrating Arcee AI models as the language model for your ElevenLabs agent. The first section shows how to use our models through Together.ai; the second covers a self-hosted option.


Using Arcee Models with ElevenLabs Agents (via Together.ai)

Step 1: Create a Together.ai API Key

  1. In your Together.ai account settings, click “Create API Key”

  2. Copy the key and store it securely
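Optionally, you can confirm the key works before adding it to ElevenLabs. The snippet below is a quick sanity check against Together's OpenAI-compatible API; it simply lists the models your key can access:

# Replace with the key you just created
export TOGETHER_API_KEY="your-together-api-key"
curl -s https://api.together.xyz/v1/models \
  -H "Authorization: Bearer $TOGETHER_API_KEY"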

Step 2: Connect an Arcee model to Your ElevenLabs Agent

  1. In the ElevenLabs dashboard, go to Settings → Workspace Secrets

  2. Click “Add a Secret”

    • Name: together-ai-api-key

    • Value: Paste your Together AI API key

  3. Click “Add a Secret” to save it to your workspace

  4. Go to the Agents tab from the left pane

  5. Select your existing agent or create a new one

  6. Scroll to the LLM section

  7. Beside "Select which provider and model to use for the LLM", select “Custom LLM”

  8. Fill in the following fields (example values shown; you can verify them with the curl check after this list):

    • Server URL: https://api.together.xyz/v1

    • Model ID: the Arcee model you want to use, e.g. arcee-ai/AFM-4.5B

    • API Key: select the together-ai-api-key secret you created earlier

  9. Click Save to apply the agent configuration
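After saving, you can sanity-check the configuration by calling Together's OpenAI-compatible endpoint directly with the same values. The model name below is an example; substitute whichever Arcee model you configured:

# Uses the TOGETHER_API_KEY exported in Step 1
curl -s https://api.together.xyz/v1/chat/completions \
  -H "Authorization: Bearer $TOGETHER_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "arcee-ai/AFM-4.5B",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'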

Step 3: Test the Agent

Click "Test AI Agent" in the ElevenLabs dashboard to chat with the model.


Using Arcee Models with ElevenLabs Agents (Self-Hosted Example)

This section explains how to use one of our models as the LLM backbone for your agent by self-hosting it on your own infrastructure. We will use AFM-4.5B in this example.

Step 1: Deploy the model

Our models can be deployed using any framework that exposes an OpenAI-compatible endpoint. Choose one of the following depending on your environment:

  • vLLM: GPU servers with high throughput needs

  • SGLang: GPU, fast routing, OpenAI-style APIs

  • llama.cpp: CPU or edge devices, quantized inference

  • ollama: lightweight local deployments with a simple CLI

Hardware Notes:

  • AFM‑4.5B can run on as little as 3 GB RAM when quantized to 4-bit

  • For bf16 inference, allocate at least 9 GB RAM

For deployment guides for each engine, see our documentation.

In this example, we will self-host AFM‑4.5B using llama.cpp. After you've followed the steps in the llama.cpp guide to download the model, complete the following:
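If you haven't downloaded the model yet, one option is to fetch the GGUF from Hugging Face. The repository and filename below are assumptions chosen to match the server command in Step 2; check the llama.cpp guide for the exact names:

# Download the bf16 GGUF into ./afm (repo and filename assumed; see the guide)
huggingface-cli download arcee-ai/AFM-4.5B-GGUF AFM-4.5B-bf16.gguf --local-dir ./afm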

Step 2: Launch the OpenAI-Compatible Server

Start llama-server with the correct model and context size. This will expose an OpenAI-compatible /v1/chat/completions endpoint:

bin/llama-server -m ./afm/AFM-4.5B-bf16.gguf \
  --host 0.0.0.0 \
  --port 8000 \
  --jinja \
  --ctx-size 8192

Make sure the --jinja flag is included: it enables the model's Jinja chat template, which the OpenAI-compatible /v1/chat/completions endpoint needs to format prompts correctly.
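Before exposing the server, you can confirm it is responding, for example by listing the loaded model:

curl -s http://localhost:8000/v1/models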

Step 3: Expose the Server with ngrok (Required)

To make your server accessible, create a public URL using a tunneling tool like ngrok:

ngrok http 8000

This will generate a public HTTPS URL like:

https://your-subdomain.ngrok-free.dev → http://localhost:8000

Keep this ngrok tunnel open while the agent is active.
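To verify the tunnel end to end, send a chat completion through the public URL. Replace the subdomain with the one ngrok printed; the model field can be any placeholder, since llama-server answers with whichever model it loaded:

curl -s https://your-subdomain.ngrok-free.dev/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm-4.5b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'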

Step 4: Configure ElevenLabs Agent to Use Your Self-Hosted Model

Configure your agent

  1. Go to the Agents tab and open your agent

  2. In the Model Configuration section, fill in the following:

    • Server URL: your ngrok URL with /v1 appended, e.g. https://your-subdomain.ngrok-free.dev/v1

    • Model ID: any placeholder string (llama-server serves whichever model it was launched with)

    • API Key: select “None”

Step 5: Test the Agent

Click "Test AI Agent" in the ElevenLabs dashboard to chat with the model.
