
Update doc for server arguments #2742

Draft
simveit wants to merge 3 commits into main from feature/server-arguments-docs

Conversation

@simveit commented on Jan 5, 2025

Motivation

As explained here, the current documentation of the backend needs an update, which we implement in this PR.

Checklist

  • Update documentation as needed, including docstrings or example tutorials.

@simveit force-pushed the feature/server-arguments-docs branch 2 times, most recently from fbc1a63 to abb44cf on January 6, 2025 19:13
@simveit force-pushed the feature/server-arguments-docs branch from abb44cf to 0a288d7 on January 6, 2025 19:18

@zhaochenyang20 (Collaborator) left a comment

I love the detailed and educational docs for the parameters. Two suggestions:

  1. We are documenting the official usage, so we can move the educational part to other unofficial repos, like my ML sys tutorial. 😂
  2. Keep things concise. If we want to explain a concept, one sentence of educational explanation plus a link to the details would be better.

## Model and tokenizer

@zhaochenyang20 (Collaborator):

Cool. But for the docs, always keep one first-order title # and several second-order titles ##; do not use fourth-order titles ####.

* `model_path`: The model we want to serve. We can get the path from the corresponding Hugging Face repo.

@zhaochenyang20 (Collaborator):

For SGLang official docs, we want to keep the contents concise. So I think you can move this detailed version into your own learning note? Thanks!

* `tokenizer_path`: The path to the tokenizer. If not provided, it defaults to the `model_path`.

@zhaochenyang20 (Collaborator):

The path to the tokenizer defaults to the model_path.


* `tokenizer_mode`: By default `auto`. If set to `slow`, this disables the fast version of the Hugging Face tokenizer; see [here](https://huggingface.co/docs/transformers/en/main_classes/tokenizer).

@zhaochenyang20 (Collaborator):

  • tokenizer_mode: By default auto, refer to here.

* `load_format`: The format the weights are loaded in. Defaults to the `*.safetensors`/`*.bin` format. See [python/sglang/srt/model_loader/loader.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_loader/loader.py). (An example launch command combining these options is sketched below.)
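
Purely as illustration, a minimal sketch of how these model and tokenizer arguments might look on the command line. It assumes the standard `python -m sglang.launch_server` entry point and that each argument maps to the corresponding kebab-case flag; the model name and port are placeholders.

```bash
# Minimal sketch (placeholder model and port): serve a Hugging Face model with
# the default fast tokenizer and the default safetensors/bin weight loading.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tokenizer-mode auto \
  --load-format auto \
  --port 30000
```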


* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute our model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* and not at *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept you may look, for example, [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).

* `stream_interval`: If we stream the output to the user, this parameter determines at which interval we perform streaming. The interval length is measured in tokens.

@zhaochenyang20 (Collaborator):

I am not so sure. Could you double-check this and make it clearer?



* `random_seed`: Can be used to enforce deterministic behavior.

@zhaochenyang20 (Collaborator):

more deterministic behavior
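
For illustration, a hedged sketch combining the two arguments discussed above, assuming they are exposed as the `--stream-interval` and `--random-seed` flags of the launch command; the values are arbitrary.

```bash
# Sketch: emit streamed output roughly every 4 decoded tokens and fix the seed
# to make runs more reproducible (not fully deterministic).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --stream-interval 4 \
  --random-seed 42
```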



* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs or multiple spaces.

@zhaochenyang20 (Collaborator):

I think we can create a ## for constrained decoding parameters.
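
As a sketch of what such a constrained-decoding subsection could show, assuming the parameter is exposed as `--constrained-json-whitespace-pattern` and the grammar backend is selected via a `--grammar-backend` flag (both flag names and the regex below are assumptions for illustration):

```bash
# Sketch: select the Outlines grammar backend and allow newlines, tabs and
# repeated spaces between JSON tokens in constrained output.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --grammar-backend outlines \
  --constrained-json-whitespace-pattern "[\n\t ]*"
```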

Comment on lines +46 to +50
* `watchdog_timeout`: With this flag we can adjust the timeout the watchdog thread in the Scheduler uses to kill the server if a batch generation takes too much time.

* `download_dir`: By default the model weights are downloaded to the Hugging Face cache directory. This parameter can be used to adjust this behavior.

* `base_gpu_id`: This parameter specifies the first GPU from which we start to distribute the model onto the available GPUs.

@zhaochenyang20 (Collaborator):

Cool. Be concise.
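
A hedged sketch of the three arguments above on a single launch command, assuming kebab-case flag names, a watchdog timeout given in seconds, and placeholder paths and IDs:

```bash
# Sketch: allow the watchdog 300 seconds per batch, cache downloaded weights
# under /data/models, and start GPU placement at device 2 instead of 0.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --watchdog-timeout 300 \
  --download-dir /data/models \
  --base-gpu-id 2
```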


## Data parallelism

* `dp_size`: In the case of data parallelism we replicate the model weights on multiple GPUs and divide the batch across them. Note that this can also be combined with tensor parallelism. For example, if we had 4 GPUs and our model doesn't fit on a single GPU but fits on two, we might choose `tp_size=2` and `dp_size=2`. This means we have 2 full copies of the model, each sharded onto two GPUs. We can then feed half of the batch to the first copy of the model and the other half to the second copy. If memory allows, you should prefer data parallelism to tensor parallelism, as it doesn't require the overhead of inter-GPU communication. Keep in mind that if `N` is the number of GPUs, in order to leverage full compute we must choose `dp_size * tp_size = N`. (An example launch command is sketched below.)

@zhaochenyang20 (Collaborator):

Really nice explanation!
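
To make the 4-GPU example from the paragraph above concrete, a hedged sketch assuming the arguments are exposed as `--tp-size` and `--dp-size` (the model name is a placeholder):

```bash
# Sketch: 4 GPUs, model fits on 2 of them -> two data-parallel replicas, each
# sharded with tensor parallelism over 2 GPUs (dp_size * tp_size = 4).
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --tp-size 2 \
  --dp-size 2
```

Each replica serves half of the incoming batch, trading the extra memory of two copies for less inter-GPU communication per request.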
