SGLang

SGLang is a fast serving framework for large language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and the frontend language.

Docker Container for SGLang

Prerequisites

  1. Sufficient VRAM (refer to Hardware Prerequisites)

  2. A Hugging Face account

  3. Docker and the NVIDIA Container Toolkit installed on your instance (see the quick check after this list)
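A quick way to confirm that Docker can reach the GPU is to run nvidia-smi inside a CUDA base container. This is a minimal sketch; the nvidia/cuda image tag here is only an example, and any recent tag works:

# Should print your GPU table if the NVIDIA Container Toolkit is set up correctly
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi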

Deployment

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
  -p 8000:8000 \
  --ipc=host \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path arcee-ai/trinity-nano-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --max-total-tokens 8192 \
  --served-model-name afm \
  --trust-remote-code

Replace your_hf_token_here with your Hugging Face access token.
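Once the container is running, you can confirm the server is ready before sending requests. The endpoints below are SGLang's standard health check and the OpenAI-compatible model list; run them from the instance itself:

# Returns HTTP 200 once the server is ready
curl http://localhost:8000/health

# Should list the served model name configured above (afm)
curl http://localhost:8000/v1/models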

Run inference using the Chat Completions endpoint.

Ensure you replace Your.IP.Address with the IP address of the instance you're hosting the model on.
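A minimal request sketch, assuming the launch command above (port 8000, served model name afm); the prompt and max_tokens value are only examples:

curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm",
    "messages": [
      {"role": "user", "content": "Who are you?"}
    ],
    "max_tokens": 256
  }'

The response is a standard OpenAI-style chat completion object; the generated text is in choices[0].message.content.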