TensorRT-LLM Release 0.20.0
Key Features and Enhancements
- Model Support
- Added Qwen3 support. Refer to the “Qwen3” section in examples/models/core/qwen/README.md.
- Added HyperCLOVAX-SEED-Vision support in PyTorch flow. Refer to examples/models/contrib/hyperclovax/README.md.
- Added Dynasor-CoT in scaffolding examples. Refer to examples/scaffolding/contrib/Dynasor/README.md.
- Added Mistral Small 3.1 24B VLM support in TRT workflow
- Added Gemma3-1b-it support in PyTorch workflow
- Added Nemotron-H model support
- Added Eagle-3 support for LLAMA4
- PyTorch workflow
- Added LoRA support
- Added return logits support
- Adopt new logprob definition in PyTorch flow
- Enabled per-request stats with PyTorch backend
- Enabled LogitsProcessor in PyTorch backend
- Benchmark:
- Added beam width support to the low-latency benchmark
- Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors.
- Remove deprecated Python runtime benchmark
- Add benchmark support for scaffolding
- Multimodal models
- Added support in trtllm-serve (see the client sketch after this list)
- Added support in trtllm-bench; the support is limited to images only for now
- Supported DeepSeek-R1 W4A8 on Hopper
- Added single-GPU support for RTX Pro 6000
- Integrated Llama4 input processor
- Added CGA reduction FMHA kernels on Blackwell
- Enabled chunked context for FlashInfer
- Supported KV cache reuse for MLA
- Added Piecewise CUDA Graph support
- Supported multiple LoRA adapters and TP
- Added KV cache-aware router for disaggregated serving
- Unfused attention for native support
- Added group_rms_norm kernel to normalize multiple inputs in a single operator
- Added smart router for the MoE module
- Added head size 72 support for QKV preprocessing kernel
- Added MNNVL MoE A2A support
- Optimized large embedding tables in multimodal models
- Supported Top-K logprobs and prompt_logprobs in LLMAPI (see the logprobs sketch after this list)
- Enabled overlap scheduler in TRT workflow via executor API
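As referenced in the multimodal items above, trtllm-serve now accepts multimodal requests through its OpenAI-compatible endpoint. The snippet below is a minimal client sketch, not an official example: it assumes a multimodal model has already been launched locally with trtllm-serve on the default port, and the model name, port, and image URL are placeholders.

```python
# Minimal client sketch for a multimodal request to a locally running
# trtllm-serve instance (OpenAI-compatible /v1/chat/completions endpoint).
# The model name, port, and image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",  # placeholder: the model the server was started with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```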
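The Top-K logprobs and prompt_logprobs support mentioned above is exposed through the LLM API's sampling parameters. The snippet below is a minimal sketch under assumptions: the SamplingParams field names (logprobs, prompt_logprobs) and the result attributes are taken from the LLM API examples and may differ slightly in this release; consult the 0.20.0 API reference to confirm.

```python
# Minimal sketch of requesting Top-K logprobs via the LLM API.
# Field names and result attributes are assumptions; the model is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model

params = SamplingParams(
    max_tokens=32,
    logprobs=3,         # return the top-3 logprobs for each generated token
    prompt_logprobs=3,  # also return top-3 logprobs for the prompt tokens
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token top-k logprob entries
```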
Infrastructure Changes
- The TensorRT-LLM team now formally releases Docker images on NGC.
- The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI
- The dependent TensorRT version is updated to 10.10.0
- The dependent CUDA version is updated to 12.9.0
- The dependent public PyTorch version is updated to 2.7.0
- The dependent NVIDIA ModelOpt version is updated to 0.29.0
- The dependent NCCL version is maintained at 2.25.1
- Open-sourced XQA kernels
- Dependent datasets version was upgraded to 3.1.0
- Migrated the Triton backend into the TensorRT-LLM repo as a submodule
- Downgrade gcc toolset version from 13 to 11
API Changes
- [Breaking Change]: Enable scheduling overlap by default
- Remove deprecated GptSession/V1 from TRT workflow
- Set _AutoDeployLlmArgs as primary config object
- Allow overriding CLI arguments with a YAML file in trtllm-serve (see the sketch after this list)
- Introduced multimodal embedding field in LlmRequest
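The YAML-override change above can be exercised by pointing trtllm-serve at an extra-options file whose values take precedence over the CLI defaults. The sketch below is hypothetical Python glue around the CLI: the --extra_llm_api_options flag and the kv_cache_config key are assumptions based on the LLM API options, so verify the exact flag and schema with trtllm-serve --help for this release.

```python
# Hypothetical sketch: write an extra-options YAML file and launch trtllm-serve
# with it so the YAML values override the corresponding CLI defaults.
# Flag name and YAML keys are assumptions; confirm with `trtllm-serve --help`.
import subprocess
from pathlib import Path

Path("extra_options.yaml").write_text(
    "kv_cache_config:\n"
    "  free_gpu_memory_fraction: 0.85\n"
)

subprocess.run(
    ["trtllm-serve", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
     "--extra_llm_api_options", "extra_options.yaml"],
    check=True,
)
```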
Fixed Issues
- Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
- Fix C++ decoder synchronization in PyTorch (#3106)
- Fix bug related to creating CUDA stream as default parameter, which will be initialized during importing (#3764)
- Fix attention DP bug on Qwen3 MoE model (#4141)
- Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
- Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227)
Known Issues
- There is a known issue with multi-GPU model support on RTX Pro 6000
What's Changed
- Refine doc by @juney-nvidia in #4420
- Refine doc by @juney-nvidia in #4421
- refine doc by @juney-nvidia in #4422
- Remove vila test by @Tabrizian in #4376
- [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
- tests: add qa test mentioned in docs by @crazydemo in #4357
- [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
- tests: Add test cases for rcca cases by @crazydemo in #4347
- chore: cleanup perf_evaluator code by @Superjomn in #3833
- feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
- fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
- Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
- fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
- [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
- feat: NIXL interface integration by @Shixiaowei02 in #3934
- Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
- Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
- fix: temp disable the problem test by @Shixiaowei02 in #4445
- Add llama4 disagg accuracy tests by @Tabrizian in #4336
- [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
- [Docs] - Reapply #4220 by @chzblych in #4434
- [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
- [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
- test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
- fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
- feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
- test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
- [AutoDeploy] HF factory improvements by @lucaslie in #4371
- chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
- doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
- chore: bump version to 0.20.0 by @ZhanruiSunCh in #4469
- fix: replace the image links in the blog by @Shixiaowei02 in #4490
- fix: cleanup process tree for disaggregated test by @tongyuantongyu in #4116
- Cherry pick #4508 by @QiJune in #4512
- Cherry pick #4447 by @yuxianq in #4517
- chore: Remove unused script by @kaiyux in #4485
- chore: Deprecate autopp. by @yuxianq in #4471
- fix: Fix trtllm sampler beam width bug by @dcampora in #4507
- tests: update api change from decoder to sampler in test by @crazydemo in #4479
- docs: Add KV Cache Management documentation by @Funatiq in #3908
- test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4528
- Add tritonrelease container by @Tabrizian in #4544
- fix: [TRTLLM-325]WAR against security vulnerabilities in Python packages by @MartinMarciniszyn in #4539
- [5141290][5273694][5260696] fix: Fix mrope argument missing issue in the summary tasks for Qwen model. by @hyukn in #4432
- test: waive hanging cases for perf test by @ruodil in #4563
- [nvbugs/5274894] fix: Moving finished context requests to generation by @Funatiq in #4576
- [5234029][5226211] chore: Unwaive multimodal tests for Qwen model. by @hyukn in #4519
- test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) by @venkywonka in #4407
- test: fix for perf sanity test and skip fp8 deepseek blackwell cases by @ruodil in #4598
- [5180961] chore: Unwaive test for Qwen model. by @hyukn in #4524
- [https://nvbugs/5289907][fix] Restore per-channel pre-quant by @Barry-Delaney in #4545
- Update internal cutlass kernels commit id by @Barry-Delaney in #4619
- ci: waive testcase [NVBUG 5297821] by @stnie in #4616
- [nvbugs/5274894] fix: Sort requests for functional correctness and performance by @Funatiq in #4608
- [CI] Waive known errors with test TestDeepSeekV3Lite::test_fp8_block_scales_4gpus by @SimengLiu-nv in #4627
- [TRTLLM-4618][feat] Add remaining NVFP4 Nemotron Super 49B test on RTX6000 Pro (SM120) by @farazkh80 in #4548
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3_3-Nemotron-Super-49B-v1 and LLM API FP8 variant by @moraxu in #4375
- [fix] Incorrect mocker argument for a CLI accuracy test in Llama-3.3-70B-Instruct by @moraxu in #4604
- Add missing rcca folder by @Tabrizian in #4591
- [fix] Fix Llama4 allgather error due to None tensor by @jinyangyuan-nvidia in #4511
- [TRTLLM-4932] Add QA accuracy tests for NIM-prioritized models by @moraxu in #4242
- Update the description for NGC docker images by @MartinMarciniszyn in #4671
- [Test] - Correct waive the Slurm test stage by @chzblych in #4680
- fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4529
- tests: waive and unwaive QA test cases by @crazydemo in #4644
- [TRTLLM-5326] - Fix test coverage report generation by @yiqingy0 in #4691
- fix: [nvbugs/5289912][nvbugs/5232406] use thread pool for multi-thread weight loading in fused moe. by @yuxianq in #4699
- fix: [nvbug5300494] Use runtime total gpu memory to calculate kv cache memory and log more memory information by @HuiGao-NV in #4660
- fix: Mistral Small vision encoder with BS>1 by @brb-nv in #4713
- [cherry-pick] test(perf): Add remaining Phi-4-mini-instruct perf tests (#4443) by @venkywonka in #4589
- [cherry-pick] test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) (#4446) by @venkywonka in #4590
- [cherry-pick] test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) (#4499) by @venkywonka in #4588
- fix:https://nvbugs/5305692 update invalid links in doc. by @nv-guomingz in #4698
- fix: Fix AutoTuner warmup request generating. by @hyukn in #4670
- fix: [https://nvbugspro.nvidia.com/bug/5286795] Unwaive tests for bug-5286795. by @bobboli in #4724
- Remove V1 batching tests by @Tabrizian in #4703
- fix:https://nvbugs/5214239 by @nv-guomingz in #4718
- [https://nvbugspro.nvidia.com/bug/5236935][Fix] Fix document of using Draft-Target-Model (DTM) speculative decoding in Triton Server by @wili-65535 in #4731
- test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4720
- [fix] Add back RTX6000Pro post-merge tests by @yuanjingx87 in #4744
- test: remove large bs as it will oom by @StanleySun639 in #4726
- [https://nvbugs/5295389][fix]fix moe fp4 on sm120 by @pamelap-nvidia in #4624
- [nvbugs/5302709] fix: Use HF vision tower for llava-next on A100 by @amukkara in #4747
- [nvbugs/5297821] Fix llama4 disaggregated serving accuracy tests by @Tabrizian in #4743
- tests: fix 5250460 by @xinhe-nv in #4751
- tests: waive failed case by @crazydemo in #4785
- Waive l0 tests by @yiqingy0 in #4795
- [Docs] - Add date and commit info (#4448) by @chzblych in #4752
- Remove disaggregated cuda graph waived test by @Tabrizian in #4707
- fix: llmapi-launch add trtllm-bench test with engine building (#4… by @Superjomn in #4550
- [https://nvbugs/5303634] skip evaluating empty batch_input_ids in summarize.py by @QiJune in #4676
- fix: Skip dummy medusa/eagle tests when WORLD_SIZE env variable is missing by @brb-nv in #4786
- fix: Fix queued req stats for release/0.20 by @pcastonguay in #4806
- [NVBUG-5291971] JIT path for XQA by @farazkh80 in #4675
- [TRTLLM-4932] Remove moe- related arguments from Llama-3_1-Nemotron-Ultra-253B-v1 CLI accuracy test by @moraxu in #4808
- test: remove invalid triton integration test cases by @StanleySun639 in #4801
- test: shorten reqs in con:1 cases and add streaming cases, add l2 perf test by @ruodil in #4796
- [Infra] - Better utilize multi-GPU CI resources by @chzblych in #4850
- Cherry-pick #4536 by @lfr-0531 in #4834
- Cherry-pick #4379 by @lfr-0531 in #4833
- fix: [nvbugs/5298600] fix illegal memory access on mrope_position_deltas by @yechank-nvidia in #4830
- Waive L0 test by @yiqingy0 in #4857
- Waive L0 test by @yiqingy0 in #4862
- test: fix rss increasement test case issue by @StanleySun639 in #4868
- fix: [nvbugs/5312750] Keep embed_tokens for last pp rank if tie_word_embeddings. by @yuxianq in #4902
- Fix: max_num_sequences calculation with overlap scheduling into release/0.20 by @dcampora in #4889
- [TRTLLM-5340] fix: remove the accuracy assert on run_majority_vote_ai… by @WeiHaocheng in #4907
- test: fix potential teardown error by @StanleySun639 in #4908
- Fix DeepGEMM NVCC Path by @lucifer1004 in #4886
- fix: cache-aware router related test fix by @zhengd-nv in #4911
- Downgrade NCCL version from 2.26.5 to 2.25.1 by @yiqingy0 in #4931
- fix: [nvbug 5321627] handle cases when TRT backend return more logits than output tokens by @hchings in #4921
- [5310329] fix: Fix warmup phase batch size out of range. by @hyukn in #4912
- [https://nvbugspro.nvidia.com/bug/5323820] Fix chunking equation for disabled case. by @FrankD412 in #4964
- [https://nvbugs/5238105] fix: ModelRunnerCpp num_return_sequences by @Funatiq in #3951
- [fix] Fix illegal mem access and possible accuracy lose by @liji-nv in #4943
- Cherry-pick #5004 by @lfr-0531 in #5005
- ci: waive testcase [NVBUG 5247271] by @stnie in #4992
- [Infra] - Update JNLP container config by @chzblych in #5009
- [5310329] chore: Unwaive test_e2e.py::test_openai_reasoning. by @hyukn in #4981
- doc: Minor fixes and clarification by @kaiyux in #4975
- [5289904] chore: Unwaive test for Qwen model. by @hyukn in #4657
- fix: [nvbugs/5324954, nvbugs/5304229] fix Qwen2-VL video and Qwen2.5-VL image test case by @yechank-nvidia in #4976
- tests: fix some typo and limitation on test cases by @crazydemo in #4989
- [https://nvbugs/5277592][fix] fix cuda graph padding for spec decoding (only for 0.20) by @lfr-0531 in #5058
- [TRTLLM-5516] perf: replicate dummy request for cuda graph padding (cherry-pick #4729) by @kaiyux in #5190
- test: add deepseek rcca cases by @ruodil in #5195
- doc:add release notes for v0.20.0 by @nv-guomingz in #5150
- test: add deepseek_v3_lite rcca cases by @ruodil in #5225
- test: Deprecate gpt_model_type "v1" static batching from triton_backend L0_backend_trtllm by @yinggeh in #5229
- [doc] Update Perf-Overview.MD with V0.20 Release Data (cherry-pick #5176) by @kaiyux in #5324
New Contributors
- @AdamzNV made their first contribution in #4425
- @stnie made their first contribution in #4616
- @yinggeh made their first contribution in #5229
Full Changelog: v0.20.0rc3...v0.20.0