Inference Engines
Arcee models can be deployed across several popular inference engines depending on your hardware, performance goals, and integration needs. Each engine offers different strengths, from high-throughput GPU serving to lightweight local CPU inference. The table below summarizes the recommended environments and use cases for each option to help you choose the best deployment path for your application.
| Engine | Recommended environments and use cases |
| --- | --- |
| vLLM | GPU servers with high-throughput needs; predictable prompts, batch processing, and structured workflows |
| SGLang | Dynamic, multi-turn GPU workloads such as chat applications and assistants |
| llama.cpp | CPU or edge devices, quantized inference, and environments where you need efficient inference without a GPU |
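As a quick illustration of the high-throughput batch use case, the sketch below uses vLLM's offline Python API to generate completions for a list of prompts. The model ID, prompts, and sampling settings are placeholders, not specific recommendations; substitute the Arcee model and parameters appropriate for your deployment.

```python
# Minimal sketch: offline batch generation with vLLM's Python API.
from vllm import LLM, SamplingParams

# Placeholder model ID; replace with the Arcee model you are deploying.
llm = LLM(model="arcee-ai/your-model")

# Example sampling settings; tune these for your workload.
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [
    "Summarize the benefits of batch inference.",
    "Explain what a quantized model is in one sentence.",
]

# Generate completions for all prompts in a single batched call.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```

For interactive, multi-turn workloads, SGLang's server mode is typically the better fit, and for GPU-free environments llama.cpp with a quantized GGUF file keeps memory and compute requirements low.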
To learn more about supported hardware and recommended setups, visit Hardware Prerequisites.