Update doc for server arguments #2742
# Server Arguments

- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) for guidance on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently. A combined example is shown after this list.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 KV cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
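For illustration, here is how the compilation and quantization flags above can be combined; this is a sketch rather than a tuned or officially recommended configuration, and each flag can also be used on its own.
```
# torch.compile acceleration with torchao int4 weight-only quantization
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile --torchao-config int4wo-128

# fp8 weight quantization together with an fp8 KV cache
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --kv-cache-dtype fp8_e5m2
```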
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
## Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch the [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) server:
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

Or start it with Docker:
```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/modelscope:/root/.cache/modelscope \
    --env "SGLANG_USE_MODELSCOPE=true" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```

</details>
## Example: Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace `172.16.4.52:20000` with the IP address and port of your own first node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0

## on the second node, replace `172.16.4.52:20000` with the IP address and port of your own first node
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
```

</details>
## Model and tokenizer
* `model_path`: The model we want to serve. We can get the path from the corresponding Hugging Face repository.

  > Reviewer: For SGLang official docs, we want to keep the contents concise. So I think you can move this detailed version into your own learning note? Thanks!

* `tokenizer_path`: The path to the tokenizer. Defaults to the `model_path` if not provided.

  > Reviewer: The path to the tokenizer defaults to the

* `tokenizer_mode`: By default `auto`. If set to `slow`, this disables the fast version of the Hugging Face tokenizer; see [here](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).

* `load_format`: The format in which the weights are loaded. Defaults to `*.safetensors`/`*.bin`. See [python/sglang/srt/model_loader/loader.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_loader/loader.py).

* `trust_remote_code`: Needed to load the config of Hugging Face models that ship custom code.

* `dtype`: The dtype we use the model in. Note that `bfloat16` requires compute capability `sm80` or above.

* `kv_cache_dtype`: Dtype of the KV cache. Defaults to the dtype of the model; see [python/sglang/srt/model_executor/model_runner.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py).

* `quantization`: For a list of supported quantizations, see [python/sglang/srt/configs/model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py). Defaults to `None`.

  > Reviewer: We will have docs on quantization; you can leave the link blank for now.

* `context_length`: The number of tokens the model can process, *including the input*. By default this is derived from the HF model. Be aware that setting this inappropriately (i.e. exceeding the default context length) might lead to strange behavior.

  > Reviewer: Really good, but could you make this concise? I love the explanation here!

* `device`: The device we put the model on. Defaults to `cuda`.

  > Reviewer: The device to put the model, defaults to

* `served_model_name`: We might serve the same model multiple times; this parameter lets us distinguish between them.

  > Reviewer: This is indeed an invalid parameter, just want to keep the same as OpenAI API. This argument does not need to be set.

* `chat_template`: The chat template we use. See [python/sglang/lang/chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py). Be aware that the wrong chat template might lead to unexpected behavior. By default this is chosen for us.

  > Reviewer: Yeah. We are also making a learning material for

* `is_embedding`: Set to true if we want to perform embedding extraction.

* `revision`: If we want to choose a specific revision of the model.

  > Reviewer: I am not sure about this. Could you double check this?

* `skip_tokenizer_init`: Set to true if you want to provide tokenized input (token ids) instead of text (see [test/srt/test_skip_tokenizer_init.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_skip_tokenizer_init.py) for usage).

* `return_token_ids`: Set to true if we don't want to decode the model output.

  > Reviewer: This argument will be deleted.
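Most of the arguments above correspond to dashed command-line flags of `sglang.launch_server`. A minimal sketch, assuming the usual underscore-to-dash mapping for each flag and using `llama3-8b` as a purely illustrative served model name:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dtype bfloat16 --context-length 8192 --served-model-name llama3-8b
```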
## Port for the HTTP server

* Use `host` and `port` to set up the address of the HTTP server. By default, `host: str = "127.0.0.1"` and `port: int = 30000`.
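For example, to expose the server on all network interfaces on the default port (the flags match those used in the Docker example above):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30000
```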
## Memory and scheduling
* `mem_fraction_static`: Fraction of the GPU used for static memory like model weights and the KV cache.

  > Reviewer: If you meet OOM while serving, try to decrease this argument.

  > Reviewer: Fraction of the left GPU used for static memory like model weights and KV cache.

* `max_running_requests`: The maximum number of requests to run concurrently.
* `max_total_tokens`: Global capacity of tokens that can be stored in the KV cache.

  > Reviewer: Double-check this.

* `chunked_prefill_size`: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases the time taken to complete decoding of other ongoing requests.

  > Reviewer: Larger chunk size speeds up the prefill phase but increases the VRAM consumption.

* `max_prefill_tokens`: The maximum number of tokens we can prefill.

  > Reviewer: Double check this. What's the difference between

* `schedule_policy`: The policy that controls the order in which waiting prefill requests are processed. See [python/sglang/srt/managers/schedule_policy.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/schedule_policy.py).

* `schedule_conservativeness`: Decreasing this parameter from 1 towards 0 makes the server less conservative about taking new requests; similarly, increasing it above 1 makes it more conservative. A lower value indicates we suspect `max_total_tokens` is set to a value that is too large. See [python/sglang/srt/managers/scheduler.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/scheduler.py).

  > Reviewer: Double check this. I think this parameter decides the conservativeness of the scheduler. If too conservative, the scheduler takes requests as first in, first served, which makes it slow. If not conservative, the scheduler takes an aggressive order, making it quick but some requests may be starved.

* `cpu_offload_gb`: TODO

  > Reviewer: Yeah. I also do not know this 😂

* `prefill_only_one_req`: When this flag is turned on, we prefill only one request at a time.
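As a sketch of how these knobs can be combined on a memory-constrained GPU (the values are illustrative rather than recommendations, and the flag spellings assume the usual underscore-to-dash mapping):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7 --max-running-requests 32 --chunked-prefill-size 4096
```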
## Other runtime options
* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute the model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* and not at *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept you may look, for example, [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).

  > Reviewer: Nice explanation. But keep this concise, thanks!

* `stream_interval`: If we stream the output to the user, this parameter determines the interval at which streaming is performed. The interval length is measured in tokens.

  > Reviewer: I am not so sure. Could you double check this and make it more clear.

* `random_seed`: Can be used to enforce deterministic behavior.

  > Reviewer: more deterministic behavior

* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs or multiple spaces.

  > Reviewer: I think we can create a

* `watchdog_timeout`: With this flag we can adjust the timeout the watchdog thread in the Scheduler uses to kill the server if a batch generation takes too much time.

* `download_dir`: By default, the model weights are downloaded to the Hugging Face cache directory. This parameter can be used to change that location.

* `base_gpu_id`: This parameter sets the first GPU from which we start distributing the model onto the available GPUs.

  > Reviewer (on lines +46 to +50): Cool. Be concise.
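A hedged sketch combining several of these runtime options (the values and the download path are arbitrary illustrations, and the flag spellings assume the usual underscore-to-dash mapping):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2 --stream-interval 4 --random-seed 42 --watchdog-timeout 600 --download-dir /data/hf-models --base-gpu-id 2
```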
## Logging

TODO

## API related

TODO
## Data parallelism
* `dp_size`: In the case of data parallelism, we distribute the weights onto multiple GPUs and divide the batch across them. Note that this can also be combined with tensor parallelism. For example, if we have 4 GPUs and our model doesn't fit onto a single GPU but fits on two, we might choose `tp_size=2` and `dp_size=2`. This means we have 2 full copies of the model, each sharded onto two GPUs; we can then feed half of the batch to the first copy of the model and the other half to the second copy. If memory allows, you should prefer data parallelism over tensor parallelism, as it doesn't require the overhead of inter-GPU communication. Keep in mind that if `N` is the number of GPUs, we must choose `dp_size * tp_size = N` in order to leverage the full compute. An example launch is sketched after this list.

  > Reviewer: Really nice explanation!

* `load_balance_method`: TODO
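For instance, on a node with 8 GPUs, setting `--dp 2 --tp 4` gives two model replicas, each sharded across 4 GPUs, so that `dp_size * tp_size = 8`. This is a sketch reusing the `--dp`/`--tp` shorthand flags shown earlier in this document:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 4
```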
## Expert parallelism
* `ep_size`: This can be used for Mixture-of-Experts (MoE) models like `neuralmagic/DeepSeek-Coder-V2-Instruct-FP8`. With this flag, each expert layer is distributed according to this value, which should match `tp_size`. For example, with a model that has 4 experts and 4 GPUs, `tp_size=4` and `ep_size=4` will result in the usual sharding for all but the expert layers; the expert layers are then sharded such that each GPU processes one expert. A detailed performance analysis was performed [in the PR that implemented this technique](https://github.com/sgl-project/sglang/pull/2203).

  > Reviewer: Cool. But for the docs, always keep one first-order title `#` and several second-order titles `##`; do not use fourth-order titles `####`.
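A hedged sketch of an expert-parallel launch, assuming `ep_size` is exposed on the command line as `--ep-size` and matching it to the tensor-parallel degree as described above; check the server arguments of your SGLang version for the exact flag name and any additional MoE-related switches.
```
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --ep-size 8
```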