# Inference Engines

Arcee models can be deployed across several popular inference engines depending on your hardware, performance goals, and integration needs. Each engine offers different strengths, from high-throughput GPU serving to lightweight local CPU inference. The table below summarizes the recommended environments and use cases for each option to help you choose the best deployment path for your application.

| Inference Engine | Recommended For                                                                                            |
| ---------------- | ---------------------------------------------------------------------------------------------------------- |
| **vLLM**         | GPU servers with high-throughput needs; predictable prompts, batch processing, and structured workflows    |
| **SGLang**       | Dynamic, multi-turn GPU workloads such as chat applications and assistants                                 |
| **llama.cpp**    | CPU or edge devices, quantized models, and environments that need efficient inference without a GPU        |

To learn more about supported hardware and recommended setups, visit [Hardware Prerequisites](/quick-deploys/hardware-prerequisites.md).
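
As a rough illustration of what a GPU deployment looks like in practice, the sketch below sends a chat request to a locally running vLLM server through its OpenAI-compatible API. It assumes you have already started the server on port 8000 (for example with vLLM's `vllm serve` entrypoint); the model id shown is a placeholder, not a specific Arcee release.

```python
# Minimal sketch: query a vLLM server through its OpenAI-compatible API.
# Assumes the server is already running locally on port 8000, e.g.:
#   vllm serve <your-arcee-model>
# The model id below is a placeholder, not a specific Arcee release.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible endpoint
    api_key="EMPTY",                      # vLLM does not require a real key by default
)

response = client.chat.completions.create(
    model="arcee-model",  # placeholder model id
    messages=[{"role": "user", "content": "Summarize what an inference engine does."}],
    max_tokens=128,
)

print(response.choices[0].message.content)
```

The same OpenAI-compatible client code works against an SGLang server as well, so switching engines typically only requires changing the base URL and model id.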


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.arcee.ai/quick-deploys/inference-engines.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present on the current page, when you need clarification or additional context, or when you want to retrieve related documentation sections.
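
For example, an agent can issue the request above with any standard HTTP client. The sketch below uses Python's `requests` library; the question text is illustrative, and the response is handled as plain text because no structured schema is documented here.

```python
# Minimal sketch: query this documentation page with the `ask` parameter.
# The question below is illustrative; the response is treated as plain text
# because no structured response schema is documented here.
import requests

url = "https://docs.arcee.ai/quick-deploys/inference-engines.md"
params = {"ask": "Which inference engine should I use for CPU-only deployments?"}

resp = requests.get(url, params=params, timeout=30)
resp.raise_for_status()

print(resp.text)  # direct answer plus relevant excerpts and sources
```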
