Inference Engines
Arcee models can be deployed across several popular inference engines depending on your hardware, performance goals, and integration needs. Each engine offers different strengths, from high-throughput GPU serving to lightweight local CPU inference. The table below summarizes the recommended environments and use cases for each option to help you choose the best deployment path for your application.
| Engine | Recommended environments and use cases |
| --- | --- |
| vLLM | GPU servers with high-throughput needs; predictable prompts, batch processing, and structured workflows |
| SGLang | Dynamic, multi-turn GPU workloads such as chat applications and assistants |
| llama.cpp | CPU or edge devices, quantized inference, and environments where you need efficient inference without a GPU |
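As a quick illustration of the high-throughput batch use case, the sketch below uses vLLM's offline Python API to generate completions for a list of prompts. The model ID, prompts, and sampling settings are placeholders, not specific recommendations; substitute the Arcee model and parameters appropriate for your deployment.

```python
# Minimal sketch: offline batch generation with vLLM's Python API.
from vllm import LLM, SamplingParams

# Placeholder model ID; replace with the Arcee model you are deploying.
llm = LLM(model="arcee-ai/your-model")

# Example sampling settings; tune these for your workload.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of batch inference.",
    "Explain what a quantized model is in one sentence.",
]

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

For interactive, multi-turn workloads, SGLang's server mode is typically the better fit, and for GPU-free environments llama.cpp with a quantized GGUF file keeps memory and compute requirements low.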
To learn more about supported hardware and recommended setups, visit Hardware Prerequisites.