Update deploy_guidance.md #14

GuoxiangZu · 2025-08-07T10:11:34Z

Add the `--gpu-memory-utilization 0.85` setting to the Data Parallelism + Tensor Parallelism section (Serving on 8xH20). Without this setting, deploying with Data Parallelism + Tensor Parallelism on an 8 x H20 96G node fails with the following error:
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683] raise RuntimeError(
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683] RuntimeError: CUDA out of memory occurred when warming up sampler with 1024 dummy requests. Please try lowering max_num_seqs or gpu_memory_utilization when initializing the engine.
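
As a rough illustration of why lowering the value helps, the sketch below computes how much memory per GPU is left outside the engine's budget at two utilization settings. It assumes the comparison point is vLLM's default `gpu_memory_utilization` of 0.9 and treats the H20 as having exactly 96 GB usable, which is an approximation. The helper name `headroom_gb` is hypothetical and is not part of vLLM.

```python
def headroom_gb(total_gb: float, gpu_memory_utilization: float) -> float:
    """Return the per-GPU memory (in GB) left outside the engine's budget.

    Hypothetical helper for illustration only; it is not a vLLM API.
    """
    if not 0.0 < gpu_memory_utilization <= 1.0:
        raise ValueError("gpu_memory_utilization must be in (0, 1]")
    return round(total_gb * (1.0 - gpu_memory_utilization), 2)


# Assumption: 96 GB nominal capacity per H20, as stated in this PR.
H20_MEMORY_GB = 96.0

for util in (0.90, 0.85):
    print(f"utilization={util:.2f} -> headroom={headroom_gb(H20_MEMORY_GB, util)} GB")
```

Under these assumptions, moving from 0.90 to 0.85 leaves roughly 4.8 GB more per GPU outside the engine's budget, which is in line with the error message's suggestion to lower `gpu_memory_utilization`.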
