Update doc for server arguments #2742

Draft: wants to merge 3 commits into base: main. Changes from 2 commits.
docs/backend/server_arguments.md (68 additions, 77 deletions)

# Server Arguments

- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` for an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
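- For example, the two fp8 options above can be combined in one launch command. This is only a sketch using the flags already described; whether it works depends on the model and GPU support.
```
# fp8 weights plus fp8 (e5m2) KV cache on an fp16 checkpoint
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --kv-cache-dtype fp8_e5m2
```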
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).

- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` an available port; then you can use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

## Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) Server
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

Or start it with Docker.
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```

</details>

## Example: Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0

## on the second node, replace the `172.16.4.52:20000` with your own first node ip address and port
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
```

</details>

## Model and tokenizer
Collaborator: Cool. But for the docs, always keep one first-order title `#` and several second-order titles `##`; do not use fourth-order titles `####`.


* `model_path`: The model we want to serve. We can get the path from the corresponding Hugging Face repo.
Collaborator: For the official SGLang docs, we want to keep the contents concise, so I think you can move this detailed version into your own learning notes? Thanks!

* `tokenizer_path`: The path to the tokenizer. If not provided, it defaults to `model_path`.
Collaborator: The path to the tokenizer defaults to the `model_path`.

* `tokenizer_mode`: By default `auto`. If set to `slow`, this disables the fast Hugging Face tokenizer; see [here](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
Collaborator: `tokenizer_mode`: By default `auto`, refer to here.

* `load_format`: The format in which the weights are loaded. Defaults to `*.safetensors`/`*.bin`. See [python/sglang/srt/model_loader/loader.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_loader/loader.py)

* `trust_remote_code`: Whether to trust custom code shipped with the Hugging Face model repo when loading its config and weights.
Collaborator: `trust_remote_code`: If `True`, will use locally cached config files; otherwise, use remote configs from Hugging Face.
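A minimal sketch of passing this flag at launch. The flag name follows the argument name above; the Llama 3 checkpoint is used purely for illustration and may not itself require remote code.
```
# --trust-remote-code permits loading custom model/config code shipped in the HF repo
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --trust-remote-code
```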

* `dtype`: The dtype to run the model in. Note that `bfloat16` requires compute capability `sm80` or higher.
Collaborator: `dtype`: The dtype we use our model in, defaults to `bfloat16`.

* `kv_cache_dtype`: Dtype of the KV cache. By default it is set to the dtype of the model; see [python/sglang/srt/model_executor/model_runner.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py)
Collaborator: `kv_cache_dtype`: Dtype of the kv cache, defaults to the model `dtype`.

* `quantization`: For a list of supported quantization methods, see [python/sglang/srt/configs/model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py). Defaults to `None`.
Collaborator: We will have docs on quantization; you can leave the link blank for now.

* `context_length`: The number of tokens the model can process, *including the input*. By default this is derived from the HF model config. Be aware that setting this inappropriately (e.g. exceeding the model's default context length) may lead to strange behavior.
Collaborator: Really good, but could you make this concise? I love the explanation here!
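As a hedged example (assuming the usual underscore-to-hyphen mapping for the CLI flag, i.e. `--context-length`), the context window can be capped explicitly at launch:
```
# Cap the context window at 8192 tokens instead of the value derived from the HF config
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --context-length 8192
```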

* `device`: The device we put the model on. Defaults to `cuda`.
Collaborator: The device to put the model, defaults to cuda.

* `served_model_name`: We might serve the same model multiple times; this parameter lets us distinguish between them.
Collaborator: This is indeed an invalid parameter; we just want to keep it the same as the OpenAI API. This argument does not need to be set.

* `chat_template`: The chat template we use. See [python/sglang/lang/chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py). Be aware that the wrong chat template might lead to unexpected behavior. By default, this is chosen automatically.
Collaborator: Yeah. We are also making learning material for `chat_template`; it could be linked to later. But keep this concise right now 😂
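A hedged sketch of overriding the template at launch (the template name below is illustrative; check `python/sglang/lang/chat_template.py` for the registered names, or pass a path to a custom template file as described in the custom chat template doc):
```
# Explicitly select a registered chat template (name shown here is illustrative)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chat-template llama-3-instruct
```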

* `is_embedding`: Set to true if we want to perform embedding extraction.
Collaborator: `is_embedding`: Set to true if we want to perform an embedding task.

* `revision`: Choose a specific revision (e.g. a branch name or commit hash) of the model.
Collaborator: I am not sure about this. Could you double check this?

* `skip_tokenizer_init`: Set to true if you want to provide tokenized input (token IDs) instead of raw text (see [test/srt/test_skip_tokenizer_init.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_skip_tokenizer_init.py) for usage).
Collaborator: `skip_tokenizer_init`: Set to true if you want to provide the tokens to the engine and get the output tokens directly.

* `return_token_ids`: Set to true if we don't want to decode the model output.
Collaborator: This argument will be deleted.


## Port for the HTTP server

* Use `port` and `host` to set up the host and port for your HTTP server. By default, `host: str = "127.0.0.1"` and `port: int = 30000`.
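* For example, to expose the server on all interfaces and a non-default port (both flags also appear in the examples earlier on this page):
```
# Listen on all interfaces, port 30010
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30010
```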

## Memory and scheduling

* `mem_fraction_static`: Fraction of the GPU used for static memory like model weights and KV cache.
Collaborator: If you meet OOM while serving, try decreasing this argument.

Collaborator: Fraction of the remaining GPU memory used for static memory like model weights and KV cache.

* `max_running_requests`: The maximum number of requests to run concurrently.
* `max_total_tokens`: Global capacity of tokens that can be stored in the KV cache.
Collaborator: Double-check this.

* `chunked_prefill_size`: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases the time taken to complete decoding of other ongoing requests.
Collaborator: Larger chunk size speeds up the prefill phase but increases the VRAM consumption.
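A hedged example combining these memory knobs for a memory-constrained GPU (the CLI flag names are assumed to follow the argument names above with hyphens):
```
# Smaller KV cache pool, bounded concurrency, and smaller prefill chunks
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.8 --max-running-requests 32 --chunked-prefill-size 2048
```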

* `max_prefill_tokens`: The maximum number of tokens we can prefill.
Collaborator: Double-check this. What's the difference from `context_length`? Maybe explain it here.

* `schedule_policy`: The policy that controls the order in which waiting prefill requests are processed. See [python/sglang/srt/managers/schedule_policy.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/schedule_policy.py)

* `schedule_conservativeness`: Decreasing this parameter from 1 toward 0 makes the server less conservative about taking new requests; similarly, increasing it above 1 makes the server more conservative. A lower value indicates we suspect `max_total_tokens` is set too large. See [python/sglang/srt/managers/scheduler.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/scheduler.py)
Collaborator: Double-check this. I think this parameter decides the conservativeness of the scheduler. If too conservative, the scheduler takes requests first in, first served, which makes it slow. If not conservative enough, the scheduler takes an aggressive order, making it quick, but some requests may be starved.
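As a sketch only (the flag names are assumed from the argument names, and the valid policy values should be confirmed in `schedule_policy.py`), the scheduler behavior could be adjusted like this:
```
# Use a first-come, first-served policy and admit new requests more conservatively
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --schedule-policy fcfs --schedule-conservativeness 1.3
```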

* `cpu_offload_gb`: TODO
Collaborator: Yeah. I also do not know this 😂

* `prefill_only_one_req`: When this flag is turned on, we prefill only one request at a time.
Collaborator: Use `.` at the end. 😂


## Other runtime options

* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute the model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* rather than *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept, see for example [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).
Collaborator: Nice explanation. But keep this concise, thanks!


* `stream_interval`: If we stream output to the user, this parameter determines the interval at which streaming updates are sent. The interval length is measured in tokens.
Collaborator: I am not so sure. Could you double-check this and make it clearer?


* `random_seed`: Can be used to enforce deterministic behavior.
Collaborator: more deterministic behavior
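A hedged example combining the two options above (flag names assumed from the argument names):
```
# Fix the random seed and stream output to the client every 4 tokens
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --random-seed 42 --stream-interval 4
```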


* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs, or multiple spaces.
Collaborator: I think we can create a `##` section for constrained decoding parameters.


* `watchdog_timeout`: With this flag we can adjust the timeout the watchdog thread in the Scheduler uses to kill the server if a batch generation takes too much time.

* `download_dir`: By default, the model weights are downloaded into the Hugging Face cache directory; this parameter can be used to change that location.

* `base_gpu_id`: This parameter sets the first GPU index from which the model is distributed onto the available GPUs.
Collaborator (on lines +46 to +50): Cool. Be concise.
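A hedged sketch combining the runtime options above (flag names assumed from the argument names; the cache directory is just an illustrative path):
```
# Download weights to a custom directory and place the model on GPUs 2 and 3 (base GPU 2, tp 2)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --download-dir /data/hf-cache --base-gpu-id 2 --tp 2
```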


## Logging

TODO

## API related

TODO

## Data parallelism

* `dp_size`: In the case of data parallelism we distribute the weights onto multiple GPUs and divide the batch over them. Note that this can also be combined with tensor parallelism. For example, if we have 4 GPUs and our model doesn't fit on a single GPU but does fit on two, we might choose `tp_size=2` and `dp_size=2`. This means we have 2 full copies of the model, each sharded onto two GPUs. We can then feed half of the batch to the first copy of the model and the other half to the second copy. If memory allows, you should prefer data parallelism to tensor parallelism, as it doesn't require the overhead of inter-GPU communication. Keep in mind that if `N` is the number of GPUs, we must choose `dp_size * tp_size = N` to leverage the full compute.
Collaborator: Really nice explanation!


* `load_balance_method`: TODO

## Expert parallelism

* `ep_size`: This can be used for MoE (Mixture of Experts) models like `neuralmagic/DeepSeek-Coder-V2-Instruct-FP8`. With this flag, each expert layer is distributed across `ep_size` GPUs; the flag should match `tp_size`. For example, for a model with 4 experts and 4 GPUs, `tp_size=4` and `ep_size=4` result in the usual sharding for all but the expert layers, which are then sharded such that each GPU processes one expert. A detailed performance analysis was performed [in the PR that implemented this technique](https://github.com/sgl-project/sglang/pull/2203).
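A hedged example for an 8-GPU node (assuming the CLI flag is `--ep-size`, mirroring the argument name, and that the fp8 checkpoint fits on a single node):
```
# Distribute the expert layers across 8 GPUs; ep_size matches tp_size
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --ep-size 8
```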