From 9f5eeb7b8553b9483ec5fbf009307b75d2fe767a Mon Sep 17 00:00:00 2001
From: Guoxiang Zu
Date: Thu, 7 Aug 2025 18:09:54 +0800
Subject: [PATCH] Update deploy_guidance.md

Add the setting --gpu-memory-utilization 0.85 for Data Parallelism + Tensor Parallelism (serving on 8xH20).

Without this setting, the following error occurs when deploying with Data Parallelism + Tensor Parallelism on an 8 * H20 96G node:

(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683]     raise RuntimeError(
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683] RuntimeError: CUDA out of memory occurred when warming up sampler with 1024 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
---
 docs/deploy_guidance.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/deploy_guidance.md b/docs/deploy_guidance.md
index eceeb72..244e244 100644
--- a/docs/deploy_guidance.md
+++ b/docs/deploy_guidance.md
@@ -86,6 +86,7 @@ vllm serve /path/to/step3-fp8 \
     --reasoning-parser step3 \
     --enable-auto-tool-choice \
     --tool-call-parser step3 \
+    --gpu-memory-utilization 0.85 \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
 ```
@@ -223,4 +224,4 @@ print("Chat response:", chat_response)
 ```
 
 
-Note: In our image preprocessing pipeline, we implement a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system will automatically apply image cropping logic to get patches of the image.
\ No newline at end of file
+Note: In our image preprocessing pipeline, we implement a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system will automatically apply image cropping logic to get patches of the image.
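
For reference, below is a minimal sketch of how the amended serving command could look once this patch is applied. Only the flags visible in the hunk context and the new `--gpu-memory-utilization 0.85` come from the diff; the parallelism flags (`--tensor-parallel-size 4`, `--data-parallel-size 2`) are assumptions chosen to span an 8xH20 96G node and are not part of this change.

```bash
# Sketch of the full Step-3 FP8 serving command with the new flag applied.
# Assumption: TP=4 x DP=2 covers the 8 GPUs of an H20 96G node; these two
# parallelism flags are illustrative and not taken from the diff hunk.
# --gpu-memory-utilization 0.85 reserves headroom so the sampler warmup with
# 1024 dummy requests no longer hits the CUDA OOM quoted in the commit message.
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 4 \
    --data-parallel-size 2 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```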