TensorRT-LLM Release 0.20.0
Key Features and Enhancements
- Model Support
- Added Qwen3 support. Refer to the “Qwen3” section in examples/models/core/qwen/README.md.
- Added HyperCLOVAX-SEED-Vision support in PyTorch flow. Refer to examples/models/contrib/hyperclovax/README.md.
- Added Dynasor-CoT in scaffolding examples. Refer to examples/scaffolding/contrib/Dynasor/README.md.
- Added Mistral Small 3.1 24B VLM support in TRT workflow
- Added Gemma3-1b-it support in PyTorch workflow
- Added Nemotron-H model support
- Added Eagle-3 support for LLAMA4
- PyTorch workflow
- Added LoRA support
- Added return logits support
- Adopt new logprob definition in PyTorch flow
- Enabled per-request stats with PyTorch backend
- Enabled LogitsProcessor in PyTorch backend
- Benchmark:
- Added beam width support to the low-latency benchmark
- Fix trtllm-bench iter_stats and cuda_graph_batch_sizes errors.
- Remove deprecated Python runtime benchmark
- Add benchmark support for scaffolding
- Multimodal models
- Added support in trtllm-serve (see the client sketch after this list)
- Added support in trtllm-bench; the support is limited to images only for now
- Supported DeepSeek-R1 W4A8 on Hopper
- Added single-GPU support for RTX Pro 6000
- Integrated Llama4 input processor
- Added CGA reduction FMHA kernels on Blackwell
- Enabled chunked context for FlashInfer
- Supported KV cache reuse for MLA
- Added Piecewise CUDA Graph support
- Supported multiple LoRA adapters and TP
- Added KV cache-aware router for disaggregated serving
- Unfused attention for native support
- Added group_rms_norm kernel to normalize multiple inputs in a single operator
- Added smart router for the MoE module
- Added head size 72 support for QKV preprocessing kernel
- Added MNNVL MoE A2A support
- Optimized large embedding tables in multimodal models
- Supported Top-K logprobs and prompt_logprobs in LLMAPI (see the logprobs sketch after this list)
- Enabled overlap scheduler in TRT workflow via executor API
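As referenced in the multimodal items above, trtllm-serve now accepts multimodal requests through its OpenAI-compatible endpoint. The snippet below is a minimal client sketch, not an official example: it assumes a multimodal model has already been launched locally with trtllm-serve on the default port, and the model name, port, and image URL are placeholders.

```python
# Minimal client sketch for a multimodal request to a locally running
# trtllm-serve instance (OpenAI-compatible /v1/chat/completions endpoint).
# The model name, port, and image URL below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

response = client.chat.completions.create(
    model="Qwen2.5-VL-7B-Instruct",  # placeholder: the model the server was started with
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.png"}},
        ],
    }],
    max_tokens=64,
)
print(response.choices[0].message.content)
```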
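The Top-K logprobs and prompt_logprobs support mentioned above is exposed through the LLM API's sampling parameters. The snippet below is a minimal sketch under assumptions: the SamplingParams field names (logprobs, prompt_logprobs) and the result attributes are taken from the LLM API examples and may differ slightly in this release; consult the 0.20.0 API reference to confirm.

```python
# Minimal sketch of requesting Top-K logprobs via the LLM API.
# Field names and result attributes are assumptions; the model is a placeholder.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model

params = SamplingParams(
    max_tokens=32,
    logprobs=3,         # return the top-3 logprobs for each generated token
    prompt_logprobs=3,  # also return top-3 logprobs for the prompt tokens
)

for output in llm.generate(["The capital of France is"], params):
    completion = output.outputs[0]
    print(completion.text)
    print(completion.logprobs)  # per-token top-k logprob entries
```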
Infrastructure Changes
- The TensorRT-LLM team now formally releases Docker images on NGC.
- The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI
- The dependent TensorRT version is updated to 10.10.0
- The dependent CUDA version is updated to 12.9.0
- The dependent public PyTorch version is updated to 2.7.0
- The dependent NVIDIA ModelOpt version is updated to 0.29.0
- The dependent NCCL version is maintained at 2.25.1
- Open-sourced XQA kernels
- Dependent datasets version was upgraded to 3.1.0
- Migrated the Triton backend into the TensorRT-LLM repo as a submodule
- Downgrade gcc toolset version from 13 to 11
API Changes
- [Breaking Change]: Enable scheduling overlap by default
- Remove deprecated GptSession/V1 from TRT workflow
- Set _AutoDeployLlmArgs as primary config object
- Allow overriding CLI arguments with a YAML file in trtllm-serve (see the sketch after this list)
- Introduced multimodal embedding field in LlmRequest
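The YAML-override change above can be exercised by pointing trtllm-serve at an extra-options file whose values take precedence over the CLI defaults. The sketch below is hypothetical Python glue around the CLI: the --extra_llm_api_options flag and the kv_cache_config key are assumptions based on the LLM API options, so verify the exact flag and schema with trtllm-serve --help for this release.

```python
# Hypothetical sketch: write an extra-options YAML file and launch trtllm-serve
# with it so the YAML values override the corresponding CLI defaults.
# Flag name and YAML keys are assumptions; confirm with `trtllm-serve --help`.
import subprocess
from pathlib import Path

Path("extra_options.yaml").write_text(
    "kv_cache_config:\n"
    "  free_gpu_memory_fraction: 0.85\n"
)

subprocess.run(
    ["trtllm-serve", "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
     "--extra_llm_api_options", "extra_options.yaml"],
    check=True,
)
```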
Fixed Issues
- Fix hang bug when context server doesn't have enough capacity for KV Cache (#3095)
- Fix C++ decoder synchronization in PyTorch (#3106)
- Fix bug related to creating CUDA stream as default parameter, which will be initialized during importing (#3764)
- Fix attention DP bug on Qwen3 MoE model (#4141)
- Fix illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
- Reset planned states to avoid memory leak in TrtllmAttentionWrapper (#4227)
Known Issues
- There is a known issue with multi-GPU model support on RTX Pro 6000
What's Changed
- Refine doc by @juney-nvidia in #4420
- Refine doc by @juney-nvidia in #4421
- refine doc by @juney-nvidia in #4422
- Remove vila test by @Tabrizian in #4376
- [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
- tests: add qa test mentioned in docs by @crazydemo in #4357
- [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
- tests: Add test cases for rcca cases by @crazydemo in #4347
- chore: cleanup perf_evaluator code by @Superjomn in #3833
- feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
- fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
- Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
- fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
- [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
- feat: NIXL interface integration by @Shixiaowei02 in #3934
- Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
- Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
- fix: temp disable the problem test by @Shixiaowei02 in #4445
- Add llama4 disagg accuracy tests by @Tabrizian in #4336
- [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
- [Docs] - Reapply #4220 by @chzblych in #4434
- [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
- [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
- test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
- fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
- feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
- test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
- test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
- [AutoDeploy] HF factory improvements by @lucaslie in #4371
- chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
- doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
- chore: bump version to 0.20.0 by @ZhanruiSunCh in #4469
- fix: replace the image links in the blog by @Shixiaowei02 in #4490
- fix: cleanup process tree for disaggregated test by @tongyuantongyu in #4116
- Cherry pick #4508 by @QiJune in #4512
- Cherry pick #4447 by @yuxianq in #4517
- chore: Remove unused script by @kaiyux in #4485
- chore: Deprecate autopp. by @yuxianq in #4471
- fix: Fix trtllm sampler beam width bug by @dcampora in #4507
- tests: update api change from decoder to sampler in test by @crazydemo in #4479
- docs: Add KV Cache Management documentation by @Funatiq in #3908
- test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4528
- Add tritonrelease container by @Tabrizian in #4544
- fix: [TRTLLM-325]WAR against security vulnerabilities in Python packages by @MartinMarciniszyn in #4539
- [5141290][5273694][5260696] fix: Fix mrope argument missing issue in the summary tasks for Qwen model. by @hyukn in #4432
- test: waive hanging cases for perf test by @ruodil in #4563
- [nvbugs/5274894] fix: Moving finished context requests to generation by @Funatiq in #4576
- [5234029][5226211] chore: Unwaive multimodal tests for Qwen model. by @hyukn in #4519
- test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) by @venkywonka in #4407
- test: fix for perf sanity test and skip fp8 deepseek blackwell cases by @ruodil in #4598
- [5180961] chore: Unwaive test for Qwen model. by @hyukn in #4524
- [https://nvbugs/5289907][fix] Restore per-channel pre-quant by @Barry-Delaney in #4545
- Update internal cutlass kernels commit id by @Barry-Delaney in #4619
- ci: waive testcase [NVBUG 5297821] by @stnie in #4616
- [nvbugs/5274894] fix: Sort requests for functional correctness and performance by @Funatiq in #4608
- [CI] Waive known errors with test TestDeepSeekV3Lite::test_fp8_block_scales_4gpus by @SimengLiu-nv in #4627
- [TRTLLM-4618][feat] Add remaining NVFP4 Nemotron Super 49B test on RTX6000 Pro (SM120) by @farazkh80 in #4548
- [TRTLLM-4932] Add CLI accuracy tests for Llama-3_3-Nemotron-Super-49B-v1 and LLM API FP8 variant by @moraxu in #4375
- [fix] Incorrect mocker argument for a CLI accuracy test in Llama-3.3-70B-Instruct by @moraxu in #4604
- Add missing rcca folder by @Tabrizian in #4591
- [fix] Fix Llama4 allgather error due to None tensor by @jinyangyuan-nvidia in #4511
- [TRTLLM-4932] Add QA accuracy tests for NIM-prioritized models by @moraxu in #4242
- Update the description for NGC docker images by @MartinMarciniszyn in #4671
- [Test] - Correct waive the Slurm test stage by @chzblych in #4680
- fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4529
- tests: waive and unwaive QA test cases by @crazydemo in #4644
- [TRTLLM-5326] - Fix test coverage report generation by @yiqingy0 in #4691
- fix: [nvbugs/5289912][nvbugs/5232406] use thread pool for multi-thread weight loading in fused moe. by @yuxianq in #4699
- fix: [nvbug5300494] Use runtime total gpu memory to calculate kv cache memory and log more memory information by @HuiGao-NV in #4660
- fix: Mistral Small vision encoder with BS>1 by @brb-nv in #4713
- [cherry-pick] test(perf): Add remaining Phi-4-mini-instruct perf tests (#4443) by @venkywonka in #4589
- [cherry-pick] test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) (#4446) by @venkywonka in #4590
- [cherry-pick] test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) (#4499) by @venkywonka in #4588
- fix:https://nvbugs/5305692 update invalid links in doc. by @nv-guomingz in #4698
- fix: Fix AutoTuner warmup request generating. by @hyukn in #4670
- fix: [https://nvbugspro.nvidia.com/bug/5286795] Unwaive tests for bug-5286795. by @bobboli in #4724
- Remove V1 batching tests by @Tabrizian in #4703
- fix:https://nvbugs/5214239 by @nv-guomingz in #4718
- [https://nvbugspro.nvidia.com/bug/5236935][Fix] Fix document of using Draft-Target-Model (DTM) speculative decoding in Triton Server by @wili-65535 in #4731
- test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4720
- [fix] Add back RTX6000Pro post-merge tests by @yuanjingx87 in #4744
- test: remove large bs as it will oom by @StanleySun639 in #4726
- [https://nvbugs/5295389][fix]fix moe fp4 on sm120 by @pamelap-nvidia in #4624
- [nvbugs/5302709] fix: Use HF vision tower for llava-next on A100 by @amukkara in #4747
- [nvbugs/5297821] Fix llama4 disaggregated serving accuracy tests by @Tabrizian in #4743
- tests: fix 5250460 by @xinhe-nv in #4751
- tests: waive failed case by @crazydemo in #4785
- Waive l0 tests by @yiqingy0 in #4795
- [Docs] - Add date and commit info (#4448) by @chzblych in #4752
- Remove disaggregated cuda graph waived test by @Tabrizian in #4707
- fix: llmapi-launch add trtllm-bench test with engine building (#4… by @Superjomn in #4550
- [https://nvbugs/5303634] skip evaluating empty batch_input_ids in summarize.py by @QiJune in #4676
- fix: Skip dummy medusa/eagle tests when WORLD_SIZE env variable is missing by @brb-nv in #4786
- fix: Fix queued req stats for release/0.20 by @pcastonguay in #4806
- [NVBUG-5291971] JIT path for XQA by @farazkh80 in #4675
- [TRTLLM-4932] Remove moe- related arguments from Llama-3_1-Nemotron-Ultra-253B-v1 CLI accuracy test by @moraxu in #4808
- test: remove invalid triton integration test cases by @StanleySun639 in #4801
- test: shorten reqs in con:1 cases and add streaming cases, add l2 perf test by @ruodil in #4796
- [Infra] - Better utilize multi-GPU CI resources by @chzblych in #4850
- Cherry-pick #4536 by @lfr-0531 in #4834
- Cherry-pick #4379 by @lfr-0531 in #4833
- fix: [nvbugs/5298600] fix illegal memory access on mrope_position_deltas by @yechank-nvidia in #4830
- Waive L0 test by @yiqingy0 in #4857
- Waive L0 test by @yiqingy0 in #4862
- test: fix rss increasement test case issue by @StanleySun639 in #4868
- fix: [nvbugs/5312750] Keep embed_tokens for last pp rank if tie_word_embeddings. by @yuxianq in #4902
- Fix: max_num_sequences calculation with overlap scheduling into release/0.20 by @dcampora in #4889
- [TRTLLM-5340] fix: remove the accuracy assert on run_majority_vote_ai… by @WeiHaocheng in #4907
- test: fix potential teardown error by @StanleySun639 in #4908
- Fix DeepGEMM NVCC Path by @lucifer1004 in #4886
- fix: cache-aware router related test fix by @zhengd-nv in #4911
- Downgrade NCCL version from 2.26.5 to 2.25.1 by @yiqingy0 in #4931
- fix: [nvbug 5321627] handle cases when TRT backend return more logits than output tokens by @hchings in #4921
- [5310329] fix: Fix warmup phase batch size out of range. by @hyukn in #4912
- [https://nvbugspro.nvidia.com/bug/5323820] Fix chunking equation for disabled case. by @FrankD412 in #4964
- [https://nvbugs/5238105] fix: ModelRunnerCpp num_return_sequences by @Funatiq in #3951
- [fix] Fix illegal mem access and possible accuracy lose by @liji-nv in #4943
- Cherry-pick #5004 by @lfr-0531 in #5005
- ci: waive testcase [NVBUG 5247271] by @stnie in #4992
- [Infra] - Update JNLP container config by @chzblych in #5009
- [5310329] chore: Unwaive test_e2e.py::test_openai_reasoning. by @hyukn in #4981
- doc: Minor fixes and clarification by @kaiyux in #4975
- [5289904] chore: Unwaive test for Qwen model. by @hyukn in #4657
- fix: [nvbugs/5324954, nvbugs/5304229] fix Qwen2-VL video and Qwen2.5-VL image test case by @yechank-nvidia in #4976
- tests: fix some typo and limitation on test cases by @crazydemo in #4989
- [https://nvbugs/5277592][fix] fix cuda graph padding for spec decoding (only for 0.20) by @lfr-0531 in #5058
- [TRTLLM-5516] perf: replicate dummy request for cuda graph padding (cherry-pick #4729) by @kaiyux in #5190
- test: add deepseek rcca cases by @ruodil in #5195
- doc:add release notes for v0.20.0 by @nv-guomingz in #5150
- test: add deepseek_v3_lite rcca cases by @ruodil in #5225
- test: Deprecate gpt_model_type "v1" static batching from triton_backend L0_backend_trtllm by @yinggeh in #5229
- [doc] Update Perf-Overview.MD with V0.20 Release Data (cherry-pick #5176) by @kaiyux in #5324
New Contributors
- @AdamzNV made their first contribution in #4425
- @stnie made their first contribution in #4616
- @yinggeh made their first contribution in #5229
Full Changelog: v0.20.0rc3...v0.20.0