[rollout][trtllm] chore: adopt SleepConfig for fine-grained sleeping #5527
hchings wants to merge 3 commits into verl-project:main
Code Review
This pull request introduces changes for 235B MoE models and their configuration, primarily focusing on trtllm rollout. A new run script for qwen3-235b is added, and there are updates to the trtllm async server and rollout logic. My review focuses on ensuring the new configuration script is clear and maintainable. I've pointed out some redundant and conflicting settings in the new shell script that should be addressed to avoid confusion.
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.enforce_eager=True \
There are redundant and conflicting settings for actor_rollout_ref.rollout in this script.
actor_rollout_ref.rollout.name is set to vllm on line 102, but overridden to trtllm on line 153.
actor_rollout_ref.rollout.enforce_eager is set to True on line 103, but overridden to False on line 149.
This makes the script confusing and prone to errors if someone modifies only the first occurrence. To improve clarity and maintainability, please remove these initial, overridden settings.
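The kind of duplication flagged above can be caught mechanically. The following is a minimal sketch, not part of this PR: the helper name `find_repeated_overrides` and the sample text are hypothetical, and it assumes, as the review implies, that the later occurrence of a key is the one that takes effect.

```python
from collections import defaultdict


def find_repeated_overrides(script_text):
    """Return {key: [(line_no, value), ...]} for dotted keys assigned more than once."""
    seen = defaultdict(list)
    for line_no, raw in enumerate(script_text.splitlines(), start=1):
        # Drop the trailing line-continuation backslash and surrounding spaces.
        token = raw.strip().rstrip("\\").strip()
        if token.startswith("#") or "=" not in token:
            continue
        key, value = token.split("=", 1)
        # Keep only dotted config keys; plain shell variables have no dot.
        if "." not in key or " " in key:
            continue
        seen[key].append((line_no, value))
    return {key: hits for key, hits in seen.items() if len(hits) > 1}


# Hypothetical excerpt; line numbers here are relative to this sample only.
SAMPLE = """\
actor_rollout_ref.rollout.name=vllm \\
actor_rollout_ref.rollout.enforce_eager=True \\
actor_rollout_ref.rollout.enforce_eager=False \\
actor_rollout_ref.rollout.name=trtllm
"""

repeats = find_repeated_overrides(SAMPLE)
for key, hits in repeats.items():
    # Each entry lists every assignment in file order; the last one is assumed to win.
    print(key, hits)
```

Running a check like this over the run script before review would list each key with all of its assignments, so the earlier, overridden ones can be deleted.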
Force-pushed from 3ff2ec5 to ceac890.
Signed-off-by: Erin Ho <[email protected]>
cleanup
Signed-off-by: Erin Ho <[email protected]>
Force-pushed from ceac890 to 1885689.
Closed in favor of #5841.
What does this PR do?
WIP on config tuning. Might move config to verl-recipe.
The rest of the changes adopt Yuan's NVIDIA/TensorRT-LLM#11889 to avoid CPU OOM on 8 nodes for a 235B model.
This MR should be merged after NVIDIA/TensorRT-LLM#11889 and #5444.
Checklist Before Starting
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
List multiple modules like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
For a breaking change, add [BREAKING] to the beginning of the title.
Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
If your PR is related to the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.