SGLang

SGLang is a fast serving framework for large language models. By co-designing the backend runtime and the frontend language, it makes interaction with models faster and more controllable.

Docker Container for SGLang

Prerequisite

  1. Sufficient VRAM (refer to Hardware Prerequisites)

  2. A Hugging Face account

  3. Docker and NVIDIA Container Toolkit installed on your instance

Deployment

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
  -p 8000:8000 \
  --ipc=host \
  lmsysorg/sglang:latest \
  python -m sglang.launch_server \
  --model-path arcee-ai/trinity-nano-thinking \
  --host 0.0.0.0 \
  --port 8000 \
  --max-total-tokens 8192 \
  --served-model-name afm \
  --trust-remote-code
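Once the container is up, you can sanity-check the server before sending any prompts. The commands below are a minimal sketch run from the host machine itself, assuming the default port mapping above; SGLang's server exposes a health endpoint alongside its OpenAI-compatible API.

```shell
# Check that the server process is up and responding (assumes port 8000 as mapped above)
curl http://localhost:8000/health

# List the served model; the name should match --served-model-name (afm)
curl http://localhost:8000/v1/models
```

If the health check fails, inspect the container logs with `docker logs <container_id>` — model download and loading can take several minutes on first launch.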

Replace your_hf_token_here with your Hugging Face token.

Run inference using the OpenAI-compatible Chat Completions endpoint.


Ensure you replace Your.IP.Address with the IP address of the instance hosting the model.
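A minimal request against the Chat Completions endpoint can be sent with curl. This is a sketch assuming the deployment command above: the model is addressed by its served name (afm), the server listens on port 8000, and no API key is required by default.

```shell
# Send a chat completion request to the SGLang server
# (replace Your.IP.Address with your instance's IP address)
curl http://Your.IP.Address:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "afm",
    "messages": [
      {"role": "user", "content": "What is SGLang?"}
    ],
    "max_tokens": 128
  }'
```

Because the endpoint follows the OpenAI API schema, any OpenAI-compatible client (for example, the `openai` Python SDK with `base_url` set to `http://Your.IP.Address:8000/v1`) can also be pointed at this server.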
