From 9f5eeb7b8553b9483ec5fbf009307b75d2fe767a Mon Sep 17 00:00:00 2001
From: Guoxiang Zu
Date: Thu, 7 Aug 2025 18:09:54 +0800
Subject: [PATCH] Update deploy_guidance.md

Add the setting --gpu-memory-utilization 0.85 for Data Parallelism + Tensor Parallelism (serving on 8xH20).

Without this setting, the following error occurs when deploying with Data Parallelism + Tensor Parallelism on an 8 * H20 96G node:

(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683]     raise RuntimeError(
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683] RuntimeError: CUDA out of memory occurred when warming up sampler with 1024 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
---
 docs/deploy_guidance.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/docs/deploy_guidance.md b/docs/deploy_guidance.md
index eceeb72..244e244 100644
--- a/docs/deploy_guidance.md
+++ b/docs/deploy_guidance.md
@@ -86,6 +86,7 @@ vllm serve /path/to/step3-fp8 \
     --reasoning-parser step3 \
     --enable-auto-tool-choice \
     --tool-call-parser step3 \
+    --gpu-memory-utilization 0.85 \
     --max-num-batched-tokens 4096 \
     --trust-remote-code \
 ```
@@ -223,4 +224,4 @@ print("Chat response:", chat_response)
 ```
 
 
-Note: In our image preprocessing pipeline, we implement a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system will automatically apply image cropping logic to get patches of the image.
\ No newline at end of file
+Note: In our image preprocessing pipeline, we implement a multi-patch mechanism to handle large images. If the input image exceeds 728x728 pixels, the system will automatically apply image cropping logic to get patches of the image.
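
For reference, below is a minimal sketch of how the amended serving command could look once this patch is applied. Only the flags visible in the hunk context and the new `--gpu-memory-utilization 0.85` come from the diff; the parallelism flags (`--tensor-parallel-size 4`, `--data-parallel-size 2`) are assumptions chosen to span an 8xH20 96G node and are not part of this change.

```bash
# Sketch of the full Step-3 FP8 serving command with the new flag applied.
# Assumption: TP=4 x DP=2 covers the 8 GPUs of an H20 96G node; these two
# parallelism flags are illustrative and not taken from the diff hunk.
# --gpu-memory-utilization 0.85 reserves headroom so the sampler warmup with
# 1024 dummy requests no longer hits the CUDA OOM quoted in the commit message.
vllm serve /path/to/step3-fp8 \
    --tensor-parallel-size 4 \
    --data-parallel-size 2 \
    --reasoning-parser step3 \
    --enable-auto-tool-choice \
    --tool-call-parser step3 \
    --gpu-memory-utilization 0.85 \
    --max-num-batched-tokens 4096 \
    --trust-remote-code
```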