[rollout] chore: bump up trtllm version to 1.3.0rc10 #5841
Superjomn wants to merge 17 commits into verl-project:main
Update DeepEP branch from v1.2.1 to hybrid-ep (removing the now-unnecessary patch) and add CCCL CPATH for build compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
cleanup
Signed-off-by: Erin Ho <14718778+hchings@users.noreply.github.com>
Code Review
This pull request updates the TRT-LLM rollout implementation and Docker environment, including upgrading Megatron-LM to v0.16.0 and transitioning to SleepConfig for server management. It also removes the single-node restriction for TRT-LLM replicas. Review feedback points out that using a branch name for the DeepEP dependency in the Dockerfile compromises build reproducibility and identifies a potential IndexError in the placement group indexing logic that requires a bounds check.
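The bounds check requested in the review could look roughly like the sketch below. It is a minimal illustration only: the function name `select_placement_group` and its arguments are hypothetical and do not come from the verl code base, which may index its placement groups differently.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_placement_group(placement_groups: Sequence[T], index: int) -> T:
    """Return placement_groups[index], failing with a descriptive error.

    Hypothetical helper: it replaces a bare `placement_groups[index]`,
    which would raise an uninformative IndexError (or silently wrap
    around for a negative index).
    """
    if not 0 <= index < len(placement_groups):
        raise ValueError(
            f"placement group index {index} is out of range; "
            f"only {len(placement_groups)} group(s) are available"
        )
    return placement_groups[index]
```

Raising a `ValueError` with the index and the group count makes a misconfigured replica layout easier to diagnose than the default `IndexError` message.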
Should we also bump the CI image?
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d import Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
ExecutorMemoryType is not yet available in trtllm v1.3.0rc10. Use try/except fallback so sleep mode gracefully degrades on older versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
hchings left a comment
Please address the cherry-pick comment. Other parts LGTM.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The job was renamed on main from e2e_grpo_trainer_fsdp-vlm to e2e_grpo_trainer_megatron-vlm. Align our branch to avoid merge validation error in cleanup job's needs list. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The TRT-LLM 1.3.0rc10 base image ships with a newer CUDA toolkit that requires a newer NVIDIA driver than the CI runners have, causing "No CUDA GPUs are available" errors. Adding the cuda-compat package and setting LD_LIBRARY_PATH enables the container to run on hosts with older drivers (>= R535). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
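The ordering of `LD_LIBRARY_PATH` entries matters here, because the dynamic loader searches them left to right. The sketch below only illustrates that ordering; the actual change lives in the Dockerfile. The helper name is hypothetical, and `/usr/local/cuda/compat` is an assumed install location for the cuda-compat libraries, not a path taken from this PR.

```python
from typing import Optional


def prepend_library_path(compat_dir: str, current: Optional[str] = None) -> str:
    """Build an LD_LIBRARY_PATH value with compat_dir as the first entry.

    Hypothetical helper: empty entries are dropped and an existing
    occurrence of compat_dir is moved to the front instead of duplicated.
    """
    parts = [p for p in (current or "").split(":") if p and p != compat_dir]
    return ":".join([compat_dir, *parts])


# Assumed location of the compat libraries (illustrative only).
COMPAT_DIR = "/usr/local/cuda/compat"
new_value = prepend_library_path(COMPAT_DIR, "/usr/lib/x86_64-linux-gnu")
```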
440c5f3 to 029b394
"model_extra",
"executor_extra",
"model",
"model_weights",
@Superjomn Is this section for backward compatibility (for older trtllm versions, before we had the fine-grained labels)? If yes, then it should be exactly the same as the old tags at https://github.com/verl-project/verl/pull/5841/changes#diff-4d19b99d5dc8054a16c391ce00301671727c4c3549ecb6d904d33c2aa1f552beL263 (i.e., without model_weights and draft_model_weights). Otherwise I think it'll error out.
Sure, let me update it.
…allback _WEIGHTS_TAGS The fallback else branch (for older trtllm without ExecutorMemoryType) should match the original hard-coded list, which never included these tags. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
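The fallback pattern discussed in this thread can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the PR: the helper `resolve_weight_tags` and the `LEGACY_TAGS` tuple are hypothetical, the module path is passed as a parameter because the real import location of `ExecutorMemoryType` is not shown here, and treating the enum's member values as tag strings is also an assumption.

```python
import importlib
from typing import List, Tuple

# Illustrative legacy list: only the tag names visible in the review
# snippet, without the fine-grained weight tags. The real hard-coded
# list is in the linked diff and may contain more entries.
LEGACY_TAGS: Tuple[str, ...] = ("model_extra", "executor_extra", "model")


def resolve_weight_tags(
    module_name: str, enum_name: str = "ExecutorMemoryType"
) -> Tuple[List[str], bool]:
    """Return (tags, fine_grained).

    If the enum can be imported, use its member values (assumed to be
    tag strings). Otherwise fall back to the legacy list unchanged, so
    an older install never sees tags it does not know.
    """
    try:
        module = importlib.import_module(module_name)
        enum_cls = getattr(module, enum_name)
    except (ImportError, AttributeError):
        return list(LEGACY_TAGS), False
    return [member.value for member in enum_cls], True
```

Keeping the fallback list identical to the pre-existing one is the point of the review comment: the `else` path should behave exactly as the code did before the enum existed.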
What does this PR do?
This PR bumps the TRT-LLM Docker image to v1.3.0rc10.
Checklist Before Starting
[{modules}] {type}: {description} (This will be checked by the CI)
{modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off, like [megatron, fsdp, doc]
{type} is in feat, fix, refactor, chore, test
[BREAKING] to the beginning of the title.
[BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.