SGLang
SGLang is a fast serving framework for large language models. It makes your interaction with models faster and more controllable by co-designing the backend runtime and the frontend language.
Docker Container for SGLang
Prerequisites
A GPU instance with more than 9 GB of VRAM if running the model in bf16 (the 4.5B parameters occupy about 9 GB at 2 bytes per parameter)
A Hugging Face account with access to arcee-ai/AFM-4.5B
Docker and the NVIDIA Container Toolkit installed on your instance (see the quick check below)
If you need assistance, see Install Docker Engine and Installing the NVIDIA Container Toolkit.
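Before launching the container, it is worth confirming that Docker can reach the GPU. A minimal sanity check; the CUDA base image tag here is only an example, and any recent nvidia/cuda tag works:

docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

If this prints the usual nvidia-smi table, the NVIDIA Container Toolkit is set up correctly.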
Deployment
docker run --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=your_hf_token_here" \
    -p 8000:8000 \
    --ipc=host \
    lmsysorg/sglang:latest \
    python -m sglang.launch_server \
        --model-path arcee-ai/AFM-4.5B \
        --host 0.0.0.0 \
        --port 8000 \
        --max-total-tokens 8192 \
        --served-model-name afm \
        --trust-remote-code
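The first launch downloads the model weights from Hugging Face, so allow a few minutes before the server is ready. Once it is up, the OpenAI-compatible API can confirm the deployment (localhost assumes you are querying from the instance itself):

curl http://localhost:8000/v1/models

The response should list a single model with the served name afm.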
Run inference on AFM-4.5B using the Chat Completions endpoint. Replace Your.IP.Address with your instance's public IP address, or use localhost when querying from the instance itself.
curl http://Your.IP.Address:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "afm",
        "messages": [
            { "role": "user", "content": "What are the benefits of model merging?" }
        ],
        "temperature": 0.7,
        "top_k": 50,
        "repetition_penalty": 1.1
    }'
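The same endpoint supports token streaming, which is useful for interactive clients. A sketch of the request above with streaming enabled; the server then returns the answer incrementally as Server-Sent Events chunks:

curl http://Your.IP.Address:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "afm",
        "messages": [
            { "role": "user", "content": "What are the benefits of model merging?" }
        ],
        "temperature": 0.7,
        "stream": true
    }'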