Update doc for server arguments #2742

Draft: wants to merge 3 commits into base: main. Changes from 2 commits.
docs/backend/server_arguments.md (68 additions, 77 deletions)

# Server Arguments

- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
- To enable fp8 weight quantization, add `--quantization fp8` for an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
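- For example, the two fp8 options above can be combined in one launch command. This is only a sketch using the flags already described; whether it works depends on the model and GPU support.
```
# fp8 weights plus fp8 (e5m2) KV cache on an fp16 checkpoint
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --kv-cache-dtype fp8_e5m2
```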
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).

- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` an available port; then you can use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```

## Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable SGLANG_USE_MODELSCOPE.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) Server
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

Or start it with Docker.
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```

</details>

## Example: Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0

## on the second node, replace the `172.16.4.52:20000` with your own first node ip address and port
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1
```

</details>

## Model and tokenizer
Collaborator: Cool. But for the docs, always keep one first-order title `#` and several second-order titles `##`; do not use fourth-order titles `####`.


* `model_path`: The model we want to serve. We can get the path from the corresponding Hugging Face repo.
Collaborator: For the official SGLang docs, we want to keep the contents concise, so I think you can move this detailed version into your own learning notes? Thanks!

* `tokenizer_path`: The path to the tokenizer. If not provided, it defaults to `model_path`.
Collaborator: The path to the tokenizer defaults to the `model_path`.

* `tokenizer_mode`: By default `auto`. If set to `slow`, this disables the fast Hugging Face tokenizer; see [here](https://huggingface.co/docs/transformers/en/main_classes/tokenizer)
Collaborator: `tokenizer_mode`: By default `auto`, refer to here.

* `load_format`: The format in which the weights are loaded. Defaults to `*.safetensors`/`*.bin`. See [python/sglang/srt/model_loader/loader.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_loader/loader.py)

* `trust_remote_code`: Whether to trust custom code shipped with the Hugging Face model repo when loading its config and weights.
Collaborator: `trust_remote_code`: If `True`, will use locally cached config files; otherwise, use remote configs from Hugging Face.
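A minimal sketch of passing this flag at launch. The flag name follows the argument name above; the Llama 3 checkpoint is used purely for illustration and may not itself require remote code.
```
# --trust-remote-code permits loading custom model/config code shipped in the HF repo
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --trust-remote-code
```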

* `dtype`: The dtype to run the model in. Note that `bfloat16` requires compute capability `sm80` or higher.
Collaborator: `dtype`: The dtype we use our model in, defaults to `bfloat16`.

* `kv_cache_dtype`: Dtype of the KV cache. By default it is set to the dtype of the model; see [python/sglang/srt/model_executor/model_runner.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/model_executor/model_runner.py)
Collaborator: `kv_cache_dtype`: Dtype of the kv cache, defaults to the model `dtype`.

* `quantization`: For a list of supported quantization methods, see [python/sglang/srt/configs/model_config.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/configs/model_config.py). Defaults to `None`.
Collaborator: We will have docs on quantization; you can leave the link blank for now.

* `context_length`: The number of tokens the model can process, *including the input*. By default this is derived from the HF model config. Be aware that setting this inappropriately (e.g. exceeding the model's default context length) may lead to strange behavior.
Collaborator: Really good, but could you make this concise? I love the explanation here!
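As a hedged example (assuming the usual underscore-to-hyphen mapping for the CLI flag, i.e. `--context-length`), the context window can be capped explicitly at launch:
```
# Cap the context window at 8192 tokens instead of the value derived from the HF config
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --context-length 8192
```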

* `device`: The device we put the model on. Defaults to `cuda`.
Collaborator: The device to put the model, defaults to cuda.

* `served_model_name`: We might serve the same model multiple times; this parameter lets us distinguish between them.
Collaborator: This is indeed an invalid parameter; we just want to keep it the same as the OpenAI API. This argument does not need to be set.

* `chat_template`: The chat template we use. See [python/sglang/lang/chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py). Be aware that the wrong chat template might lead to unexpected behavior. By default, this is chosen automatically.
Collaborator: Yeah. We are also making learning material for `chat_template`; it could be linked to later. But keep this concise right now 😂
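A hedged sketch of overriding the template at launch (the template name below is illustrative; check `python/sglang/lang/chat_template.py` for the registered names, or pass a path to a custom template file as described in the custom chat template doc):
```
# Explicitly select a registered chat template (name shown here is illustrative)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chat-template llama-3-instruct
```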

* `is_embedding`: Set to true if we want to perform embedding extraction.
Collaborator: `is_embedding`: Set to true if we want to perform an embedding task.

* `revision`: Choose a specific revision (e.g. a branch name or commit hash) of the model.
Collaborator: I am not sure about this. Could you double check this?

* `skip_tokenizer_init`: Set to true if you want to provide tokenized input (token IDs) instead of raw text (see [test/srt/test_skip_tokenizer_init.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_skip_tokenizer_init.py) for usage).
Collaborator: `skip_tokenizer_init`: Set to true if you want to provide the tokens to the engine and get the output tokens directly.

* `return_token_ids`: Set to true if we don't want to decode the model output.
Collaborator: This argument will be deleted.


## Port for the HTTP server

* Use `port` and `host` to set up the host and port for your HTTP server. By default, `host: str = "127.0.0.1"` and `port: int = 30000`.
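* For example, to expose the server on all interfaces and a non-default port (both flags also appear in the examples earlier on this page):
```
# Listen on all interfaces, port 30010
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 30010
```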

## Memory and scheduling

* `mem_fraction_static`: Fraction of the GPU used for static memory like model weights and KV cache.
Collaborator: If you meet OOM while serving, try decreasing this argument.

Collaborator: Fraction of the remaining GPU memory used for static memory like model weights and KV cache.

* `max_running_requests`: The maximum number of requests to run concurrently.
* `max_total_tokens`: Global capacity of tokens that can be stored in the KV cache.
Collaborator: Double-check this.

* `chunked_prefill_size`: Perform the prefill in chunks of this size. A larger chunk size speeds up the prefill phase but increases the time taken to complete decoding of other ongoing requests.
Collaborator: Larger chunk size speeds up the prefill phase but increases the VRAM consumption.
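A hedged example combining these memory knobs for a memory-constrained GPU (the CLI flag names are assumed to follow the argument names above with hyphens):
```
# Smaller KV cache pool, bounded concurrency, and smaller prefill chunks
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.8 --max-running-requests 32 --chunked-prefill-size 2048
```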

* `max_prefill_tokens`: The maximum number of tokens we can prefill.
Collaborator: Double-check this. What's the difference from `context_length`? Maybe explain it here.

* `schedule_policy`: The policy that controls the order in which waiting prefill requests are processed. See [python/sglang/srt/managers/schedule_policy.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/schedule_policy.py)

* `schedule_conservativeness`: Decreasing this parameter from 1 toward 0 makes the server less conservative about taking new requests; similarly, increasing it above 1 makes the server more conservative. A lower value indicates we suspect `max_total_tokens` is set too large. See [python/sglang/srt/managers/scheduler.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/scheduler.py)
Collaborator: Double-check this. I think this parameter decides the conservativeness of the scheduler. If too conservative, the scheduler takes requests first in, first served, which makes it slow. If not conservative enough, the scheduler takes an aggressive order, making it quick, but some requests may be starved.
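As a sketch only (the flag names are assumed from the argument names, and the valid policy values should be confirmed in `schedule_policy.py`), the scheduler behavior could be adjusted like this:
```
# Use a first-come, first-served policy and admit new requests more conservatively
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --schedule-policy fcfs --schedule-conservativeness 1.3
```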

* `cpu_offload_gb`: TODO
Collaborator: Yeah. I also do not know this 😂

* `prefill_only_one_req`: When this flag is turned on, we prefill only one request at a time.
Collaborator: Use `.` at the end. 😂


## Other runtime options

* `tp_size`: This parameter is important if we have multiple GPUs and our model doesn't fit on a single GPU. *Tensor parallelism* means we distribute the model weights over multiple GPUs. Note that this technique is mainly aimed at *memory efficiency* rather than *higher throughput*, as inter-GPU communication is needed to obtain the final output of each layer. For a better understanding of the concept, see for example [here](https://pytorch.org/tutorials/intermediate/TP_tutorial.html#how-tensor-parallel-works).
Collaborator: Nice explanation. But keep this concise, thanks!


* `stream_interval`: If we stream output to the user, this parameter determines the interval at which streaming updates are sent. The interval length is measured in tokens.
Collaborator: I am not so sure. Could you double-check this and make it clearer?


* `random_seed`: Can be used to enforce deterministic behavior.
Collaborator: more deterministic behavior
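A hedged example combining the two options above (flag names assumed from the argument names):
```
# Fix the random seed and stream output to the client every 4 tokens
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --random-seed 42 --stream-interval 4
```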


* `constrained_json_whitespace_pattern`: When using the `Outlines` grammar backend, we can use this to allow JSON with syntactic newlines, tabs, or multiple spaces.
Collaborator: I think we can create a `##` section for constrained decoding parameters.


* `watchdog_timeout`: With this flag we can adjust the timeout the watchdog thread in the Scheduler uses to kill the server if a batch generation takes too much time.

* `download_dir`: By default, the model weights are downloaded into the Hugging Face cache directory; this parameter can be used to change that location.

* `base_gpu_id`: This parameter sets the first GPU index from which the model is distributed onto the available GPUs.
Collaborator (on lines +46 to +50): Cool. Be concise.
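A hedged sketch combining the runtime options above (flag names assumed from the argument names; the cache directory is just an illustrative path):
```
# Download weights to a custom directory and place the model on GPUs 2 and 3 (base GPU 2, tp 2)
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --download-dir /data/hf-cache --base-gpu-id 2 --tp 2
```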


## Logging

TODO

## API related

TODO

## Data parallelism

* `dp_size`: In the case of data parallelism we distribute the weights onto multiple GPUs and divide the batch over them. Note that this can also be combined with tensor parallelism. For example, if we have 4 GPUs and our model doesn't fit on a single GPU but does fit on two, we might choose `tp_size=2` and `dp_size=2`. This means we have 2 full copies of the model, each sharded onto two GPUs. We can then feed half of the batch to the first copy of the model and the other half to the second copy. If memory allows, you should prefer data parallelism to tensor parallelism, as it doesn't require the overhead of inter-GPU communication. Keep in mind that if `N` is the number of GPUs, we must choose `dp_size * tp_size = N` to leverage the full compute.
Collaborator: Really nice explanation!


* `load_balance_method`: TODO

## Expert parallelism

* `ep_size`: This can be used for MoE (Mixture of Experts) models like `neuralmagic/DeepSeek-Coder-V2-Instruct-FP8`. With this flag, each expert layer is distributed across `ep_size` GPUs; the flag should match `tp_size`. For example, for a model with 4 experts and 4 GPUs, `tp_size=4` and `ep_size=4` result in the usual sharding for all but the expert layers, which are then sharded such that each GPU processes one expert. A detailed performance analysis was performed [in the PR that implemented this technique](https://github.com/sgl-project/sglang/pull/2203).
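A hedged example for an 8-GPU node (assuming the CLI flag is `--ep-size`, mirroring the argument name, and that the fp8 checkpoint fits on a single node):
```
# Distribute the expert layers across 8 GPUs; ep_size matches tp_size
python -m sglang.launch_server --model-path neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 --tp 8 --ep-size 8
```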