Commit 8d1095d

[Docs] Improve documentations (#1368)
1 parent 743007e commit 8d1095d

6 files changed: +474 −124 lines changed

.gitignore (+3)

@@ -166,6 +166,9 @@ cython_debug/
 # Vim
 *.swp
 
+# Documentation
+docs/en/_build
+
 # SGL
 benchmark/mmlu/data
 benchmark/mmlu/data.tar

README.md (+12 −7)

@@ -15,10 +15,12 @@
 
 SGLang is a fast serving framework for large language models and vision language models.
 It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
-
 The core features include:
-- **Fast Backend Runtime**: Efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, and quantization (AWQ/FP8/GPTQ/Marlin).
-- **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
+
+- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (INT4/FP8/AWQ/GPTQ).
+- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
+- **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
+- **Active Community**: SGLang is open-source and backed by an active community with industry adoption, welcoming contributions to improve LLM and VLM serving.
 
 ## News
 - [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
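
The **Flexible Frontend Language** mentioned in the feature list above is driven from Python. A minimal sketch of a chained, multi-turn generation call — illustrative only, assuming the `sglang` package is installed and a local server is running on port 30000 as described later on this page:

```python
import sglang as sgl

# Sketch of the frontend language: a two-turn chat program with chained generation calls.
# Assumes an SGLang server is already running locally on port 30000.
@sgl.function
def multi_turn_question(s, question_1, question_2):
    s += sgl.system("You are a helpful assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=64))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = multi_turn_question.run(
    question_1="What is the capital of France?",
    question_2="Name one museum there.",
)
print(state["answer_1"])
print(state["answer_2"])
```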

@@ -44,6 +46,8 @@ The core features include:
 
 ## Install
 
+You can install SGLang using any of the methods below.
+
 ### Method 1: With pip
 ```
 pip install --upgrade pip

@@ -67,7 +71,7 @@ pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```
 
 ### Method 3: Using docker
-The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](docker).
+The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
 Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
 
 ```bash

@@ -218,6 +222,10 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --chunked-prefill-size 4096
 ```
+- To enable torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
+- To enable fp8 weight quantization, you can add `--quantization fp8` on an fp16 checkpoint or directly load an fp8 checkpoint without specifying any arguments.
+- To enable fp8 KV cache quantization, you can add `--kv-cache-dtype fp8_e5m2`.
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
 - Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0
@@ -226,9 +234,6 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 # Node 1
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
 ```
-- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
-- To enable experimental torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
-- To enable fp8 quantization, you can add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 
 ### Supported Models
 

docs/en/backend.md (new file, +171 lines)

## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.

### Quick Start
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```
Learn more about the argument format [here](docs/en/sampling_params.md).
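
The same request can be issued from Python. A minimal sketch using the `requests` library (an assumption of this example, not something the server requires) against the `/generate` endpoint shown above:

```python
import requests

# Query the native /generate endpoint of a local SGLang server.
# Assumes the server was launched on port 30000 as in the quick start above.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json())
```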

### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```

It supports streaming, vision, and most features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
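
For example, streaming responses work through the standard OpenAI client interface. A minimal sketch (the prompt and token limit are arbitrary, illustrative values):

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Stream a chat completion and print tokens as they arrive.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    temperature=0,
    max_tokens=64,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```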

### Additional Server Arguments
- Add `--tp 2` to enable multi-GPU tensor parallelism. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --tp 2
```
- Add `--dp 2` to enable multi-GPU data parallelism. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
```
- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --chunked-prefill-size 4096
```
- To enable torch.compile support, you can add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable fp8 weight quantization, you can add `--quantization fp8` on an fp16 checkpoint or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 KV cache quantization, you can add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

### Supported Models

**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE
- DeepSeek / DeepSeek 2
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py) and the sketch after this list.
- LLaVA 1.5 / 1.6 / NeXT
  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
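
The vision-language servers above (LLaVA-OneVision, LLaVA-NeXT) are queried through the OpenAI Vision API message format. A minimal sketch, assuming such a server is already running on port 30000 and using a placeholder image URL:

```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Send one image plus a text question using the OpenAI Vision API message format.
response = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=64,
)
print(response.choices[0].message.content)
```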

**Embedding Models**

- e5-mistral
- gte-Qwen2
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`

Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).

#### Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) Server
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

</details>

#### Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph

## on the second node, replace `172.16.4.52:20000` with the same first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
```

</details>

### Benchmark Performance

- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
  A real server truncates the prefill into several batches, while this unit test does not. For accurate large-batch testing, please use `sglang.bench_serving` instead.
```
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
- Benchmark online serving. Launch a server first and run the following command.
```
python3 -m sglang.bench_serving --backend sglang --num-prompt 10
```
