GuoxiangZu
Add the setting `--gpu-memory-utilization 0.85` for Data Parallelism + Tensor Parallelism (serving on 8xH20).
Without this setting, the following error occurs when deploying with Data Parallelism + Tensor Parallelism on an 8 x H20 96G node:
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683]     raise RuntimeError(
(EngineCore_2 pid=28306) ERROR 08-06 14:37:29 [core.py:683] RuntimeError: CUDA out of memory occurred when warming up sampler with 1024 dummy requests. Please try lowering `max_num_seqs` or `gpu_memory_utilization` when initializing the engine.
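For reviewers, a rough sketch of what the lower value buys per GPU. It assumes `gpu_memory_utilization` acts as a simple fraction of the 96G on each H20, and that the previous value was vLLM's default of 0.9. The helper name `memory_budget_gb` is hypothetical and exists only for this illustration; real headroom depends on the model, the parallel layout, and other processes on the GPU.

```python
def memory_budget_gb(total_gb: float, utilization: float) -> tuple[float, float]:
    """Return (memory the engine may claim, memory left outside that budget) in GB.

    Hypothetical helper for illustration; not part of vLLM.
    """
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    budget = total_gb * utilization
    return budget, total_gb - budget


TOTAL_GB = 96.0  # nominal per-GPU memory on the H20 node from the description

# 0.90 is assumed to be the value in effect before this change.
for util in (0.90, 0.85):
    budget, headroom = memory_budget_gb(TOTAL_GB, util)
    print(f"utilization={util:.2f}: budget={budget:.1f} GB, headroom={headroom:.1f} GB")
```

Under these assumptions, dropping from 0.90 to 0.85 leaves roughly 4.8 GB more per GPU outside the engine's budget. As the error message itself suggests, lowering `max_num_seqs` is the other knob.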
