[rollout] chore: bump up trtllm image version to 1.3.0rc10#5841
Superjomn wants to merge 3 commits into verl-project:main
Code Review
This pull request updates the TRT-LLM rollout implementation and Docker environment, including upgrading Megatron-LM to v0.16.0 and transitioning to SleepConfig for server management. It also removes the single-node restriction for TRT-LLM replicas. Review feedback points out that using a branch name for the DeepEP dependency in the Dockerfile compromises build reproducibility and identifies a potential IndexError in the placement group indexing logic that requires a bounds check.
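The bounds check requested in the review could look roughly like the sketch below. The helper name `select_placement_group` and its arguments are hypothetical and not taken from the PR; the point is only that an out-of-range index fails with a descriptive error instead of a bare IndexError.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def select_placement_group(placement_groups: Sequence[T], index: int) -> T:
    """Return placement_groups[index], failing clearly when the index is out of range.

    Hypothetical helper: the name and signature are illustrative, not verl's API.
    """
    # Reject negative indices too, so a bad offset cannot silently wrap around.
    if not 0 <= index < len(placement_groups):
        raise ValueError(
            f"placement group index {index} is out of range "
            f"for {len(placement_groups)} placement group(s)"
        )
    return placement_groups[index]
```

Raising ValueError with the offending index and the list length makes a misconfigured replica layout easier to diagnose than the default IndexError message.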
Should we also bump the CI image?
hchings left a comment
Please address the cherry-pick comment. Other parts LGTM.
440c5f3 to 029b394
    "model_extra",
    "executor_extra",
    "model",
    "model_weights",
@Superjomn Is this section for backward compatibility (for older TRT-LLM versions that predate the fine-grained labels)? If so, it should match the old tags exactly, as at https://github.com/verl-project/verl/pull/5841/changes#diff-4d19b99d5dc8054a16c391ce00301671727c4c3549ecb6d904d33c2aa1f552beL263 (i.e., without model_weights and draft_model_weights). Otherwise I think it will error out.
Sure, let me update it.
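A minimal sketch of the change discussed in this thread, assuming the fallback list is a plain sequence of tag strings. The helper `legacy_weights_tags` is hypothetical, and the candidate list contains only the four tags visible in the diff excerpt; the actual list in the PR may be longer.

```python
# Fine-grained labels that, per the review comment, older TRT-LLM versions do not know.
FINE_GRAINED_TAGS = frozenset({"model_weights", "draft_model_weights"})


def legacy_weights_tags(tags):
    """Drop the fine-grained labels, preserving the order of the remaining tags.

    Hypothetical helper for illustration; not part of verl or TRT-LLM.
    """
    return [tag for tag in tags if tag not in FINE_GRAINED_TAGS]


# Only the tags shown in the diff excerpt; treat this list as illustrative.
candidate_tags = ["model_extra", "executor_extra", "model", "model_weights"]
fallback_tags = legacy_weights_tags(candidate_tags)
```

Filtering rather than hand-editing the list keeps the fallback aligned with the old tags even if more fine-grained labels are added later.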
bc9205b to cc8943f
….0rc10
- Upgrade Megatron-LM to core_v0.16.0, switch DeepEP branch from v1.2.1 to hybrid-ep (removing now-unnecessary patch), add CCCL CPATH for build compat
- Bump TRT-LLM base image from 1.3.0rc4 to 1.3.0rc10
- Pin trl==0.27.0 to fix AutoModelForCausalLMWithValueHead import
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
- Fix ExecutorMemoryType import path change in 1.3.0rc10
- Remove model_weights/draft_model_weights from fallback _WEIGHTS_TAGS
- Defer ServerAdapter import to avoid FlashInfer crash on CPU orchestrator
- Resolve CUTLASS SM 9.0 PTX failures on L20 (SM 8.9) GPUs via sm89 target
- Disable PDL, fix MoE backend case, guard FlashInfer import
- Update trtllm_worker.rst docs for new API
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
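The deferred and guarded imports mentioned in this commit follow a common Python pattern, sketched below with the standard library only. The helper names `optional_import` and `load_class` are hypothetical, and the sketch assumes the failure surfaces as an ImportError; the actual FlashInfer crash on a CPU-only node may need a different guard.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it cannot be imported.

    Guarded import: callers check for None instead of crashing at import time.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def load_class(module_name, class_name):
    """Import the module only when the class is actually requested.

    Deferred import: nothing is loaded until this function is called, so a
    process that never needs the class never triggers the import.
    """
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

A CPU-only orchestrator process would then call `load_class` only on the code path that actually needs the GPU-dependent class.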
- Bump CI image tag to 1.3.0rc10
- Restore TORCH_CUDA_ARCH_LIST as runtime fallback for Ray workers
Co-authored-by: Claude Sonnet 4.6 <[email protected]>
3b286a0 to 687adb0
What does this PR do?
This PR bumps the TRT-LLM Docker image to v1.3.0rc10.
Checklist Before Starting
- Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, veomni, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward, fully_async, one_step_off
- Separate multiple modules with commas, like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- For a breaking change, add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching
Test
API and Usage Example
# Add code snippet or script demonstrating how to use this
Design & Code Changes
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Run the pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
- Request CI in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)
- If the PR touches the recipe submodule, please also update the reference to the submodule commit via git submodule update --remote or cd recipe && git pull origin main.