# SGLang

SGLang is a fast serving framework for language models. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.

{% hint style="warning" %}
The examples in this document deploy Trinity-Nano-6B, but they work the same way for all Arcee AI models. To deploy a different model, change the model name to the one you want to deploy.
{% endhint %}

### Docker Container for SGLang

**Prerequisites**

1. Sufficient VRAM (refer to [Hardware Prerequisites](/~/revisions/UOfL3qIelQCFUdc2TpQu/quick-deploys/hardware-prerequisites.md))
2. A Hugging Face account
3. Docker and the NVIDIA Container Toolkit installed on your instance
   1. If you need assistance, see [Install Docker Engine](https://docs.docker.com/engine/install/) and [Installing the NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
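
Before deploying, it can help to confirm that the required tools are on your `PATH`. The following is a minimal sketch, not an official check: `check_commands` is a hypothetical helper name, and the sketch assumes `docker`, `nvidia-smi` (installed with the NVIDIA driver), and `nvidia-ctk` (the NVIDIA Container Toolkit CLI) are the commands worth looking for on your instance.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_commands() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call (assumed command names, adjust for your setup):
# check_commands docker nvidia-smi nvidia-ctk
```

A zero exit status only shows that the commands exist; it does not prove that the GPU is visible from inside a container.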

**Deployment**

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
  -p 8000:8000 \
  --ipc=host \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path arcee-ai/trinity-nano-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --max-total-tokens 8192 \
  --served-model-name trinity \
  --trust-remote-code
```

{% hint style="info" %}
Replace `your_hf_token_here` with your Hugging Face token.
{% endhint %}
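
The server may take a while to download and load the model before it accepts requests. The sketch below polls a URL until it responds successfully. `wait_for_server` is a hypothetical helper, and the example call assumes the server exposes a `/health` route on the port published above; if your SGLang version differs, any route that returns HTTP 200 once the server is up (such as `/v1/models`) could be used instead.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a URL until it returns a successful HTTP status
# or the attempt budget runs out.
# Usage: wait_for_server <url> [max_attempts] [delay_seconds]
wait_for_server() {
  local url=$1 max_attempts=${2:-60} delay=${3:-5} attempt
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    # -s: silent, -f: treat HTTP errors as failure, -o: discard the body
    if curl -sf --max-time 5 -o /dev/null "$url"; then
      echo "server ready after ${attempt} attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "server not ready after ${max_attempts} attempts" >&2
  return 1
}

# Example call (assumes the container is running on this machine):
# wait_for_server "http://localhost:8000/health"
```

With the defaults, the helper gives up after roughly five minutes; larger models may need a higher attempt count.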

### Run Inference Using the Chat Completions Endpoint

```bash
curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "trinity",
        "messages": [
          { "role": "user", "content": "What are the benefits of model merging?" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1
      }'
```

{% hint style="info" %}
Replace `Your.IP.Address` with the IP address of the instance hosting the model.
{% endhint %}
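
For repeated requests, the call above can be wrapped in a small script so the prompt is passed as an argument. This is a rough sketch under stated assumptions: `json_escape`, `build_payload`, and `chat` are hypothetical helper names, `SGLANG_HOST` and `SGLANG_PORT` are hypothetical environment variables, and the escaping covers only backslashes, double quotes, newlines, and tabs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: escape a string for use inside a JSON string literal.
# Handles backslashes, double quotes, newlines, and tabs only.
json_escape() {
  local s=$1
  s=${s//\\/\\\\}   # backslashes first, so later escapes are not doubled
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build a Chat Completions request body for a single user message.
build_payload() {
  local model=$1 prompt=$2
  printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}], "temperature": 0.7}' \
    "$(json_escape "$model")" "$(json_escape "$prompt")"
}

# POST the payload and print the raw JSON response.
# SGLANG_HOST and SGLANG_PORT are assumed names, defaulting to localhost:8000.
chat() {
  local model=$1 prompt=$2
  build_payload "$model" "$prompt" |
    curl -s "http://${SGLANG_HOST:-localhost}:${SGLANG_PORT:-8000}/v1/chat/completions" \
      -H "Content-Type: application/json" \
      -d @-
}

# Example call (model name taken from the request above):
# SGLANG_HOST=Your.IP.Address chat "trinity" "What are the benefits of model merging?"
```

In the OpenAI-compatible response format, the reply text sits at `choices[0].message.content`, so the output can be piped through a JSON tool of your choice to extract it.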


