v0.20.0

Released by @nv-guomingz on 19 Jun 04:19

TensorRT-LLM Release 0.20.0

Key Features and Enhancements

  • Model Support
    • Added Qwen3 support. Refer to the “Qwen3” section in examples/models/core/qwen/README.md.
    • Added HyperCLOVAX-SEED-Vision support in the PyTorch flow. Refer to examples/models/contrib/hyperclovax/README.md.
    • Added Dynasor-CoT to the scaffolding examples. Refer to examples/scaffolding/contrib/Dynasor/README.md.
    • Added Mistral Small 3.1 24B VLM support in the TRT workflow
    • Added Gemma3-1b-it support in the PyTorch workflow
    • Added Nemotron-H model support
    • Added Eagle-3 support for LLAMA4
  • PyTorch workflow
    • Added LoRA support
    • Added return logits support
    • Adopted the new logprob definition in the PyTorch flow
    • Enabled per-request stats with the PyTorch backend
    • Enabled LogitsProcessor in the PyTorch backend
  • Benchmark
    • Added beam width support to the low-latency benchmark
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
    • Removed the deprecated Python runtime benchmark
    • Added benchmark support for scaffolding
  • Multimodal models
    • Added support in trtllm-serve (a client sketch follows this list)
    • Added support in trtllm-bench; currently limited to image inputs only
  • Supported DeepSeek-R1 W4A8 on Hopper
  • Added RTX Pro 6000 support on a single GPU
  • Integrated Llama4 input processor
  • Added CGA reduction FMHA kernels on Blackwell
  • Enabled chunked context for FlashInfer
  • Supported KV cache reuse for MLA
  • Added Piecewise CUDA Graph support
  • Supported multiple LoRA adapters and TP
  • Added KV cache-aware router for disaggregated serving
  • Added unfused attention for native support
  • Added group_rms_norm kernel to normalize multiple inputs in a single operator
  • Added smart router for the MoE module
  • Added head size 72 support for QKV preprocessing kernel
  • Added MNNVL MoE A2A support
  • Optimized large embedding tables in multimodal models
  • Supported top-K logprobs and prompt_logprobs in the LLM API (a usage sketch follows this list)
  • Enabled the overlap scheduler in the TRT workflow via the executor API
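
For the new multimodal support in trtllm-serve, requests go through the server's OpenAI-compatible chat completions endpoint. The following is a minimal client-side sketch, assuming a multimodal-capable model is already being served locally on port 8000; the model name, port, and image URL are placeholders rather than values from this release.

    from openai import OpenAI

    # trtllm-serve exposes an OpenAI-compatible API; host/port and model name
    # below are placeholders for whatever the server was launched with.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

    response = client.chat.completions.create(
        model="Qwen2-VL-7B-Instruct",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/sample.png"}},
            ],
        }],
        max_tokens=64,
    )
    print(response.choices[0].message.content)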
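
For the top-K logprobs and prompt_logprobs support in the LLM API, a minimal sketch is shown below. The logprobs and prompt_logprobs fields on SamplingParams and the outputs[0].logprobs attribute on the result are assumed names; check the LLM API reference for this release for the exact spelling.

    from tensorrt_llm import LLM, SamplingParams

    llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model

    sampling_params = SamplingParams(
        max_tokens=32,
        logprobs=5,          # top-5 logprobs per generated token (assumed field name)
        prompt_logprobs=1,   # also return logprobs for prompt tokens (assumed field name)
    )

    for output in llm.generate(["The capital of France is"], sampling_params):
        completion = output.outputs[0]
        print(completion.text)
        print(completion.logprobs)  # per-token top-K logprob entries (assumed attribute)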

Infrastructure Changes

  • The TRT-LLM team now formally releases a Docker image on NGC.
  • The pre-built TensorRT-LLM wheel on PyPI is now linked against PyTorch 2.7.0, which uses the CXX11 ABI
  • The dependent TensorRT version is updated to 10.10.0
  • The dependent CUDA version is updated to 12.9.0
  • The dependent public PyTorch version is updated to 2.7.0
  • The dependent NVIDIA ModelOpt version is updated to 0.29.0
  • The dependent NCCL version is maintained at 2.25.1
  • Open-sourced XQA kernels
  • The dependent datasets version is updated to 3.1.0
  • Migrated the Triton backend into the TensorRT-LLM repository as a submodule
  • Downgraded the GCC toolset version from 13 to 11

API Changes

  • [Breaking Change]: Enabled scheduling overlap by default
  • Removed the deprecated GptSession/V1 path from the TRT workflow
  • Set _AutoDeployLlmArgs as the primary config object
  • Allowed overriding CLI arguments with a YAML file in trtllm-serve (a sketch follows this list)
  • Introduced a multimodal embedding field in LlmRequest
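
For the YAML-based override of CLI arguments in trtllm-serve, the sketch below writes a small options file and launches the server with it. The --extra_llm_api_options flag name and the kv_cache_config.free_gpu_memory_fraction key are assumptions about this release rather than quoted from these notes; confirm with trtllm-serve --help and the LLM API documentation.

    import subprocess

    # Write an extra-options YAML; the keys mirror LLM API configuration and
    # are assumed for illustration.
    yaml_text = "kv_cache_config:\n  free_gpu_memory_fraction: 0.8\n"
    with open("extra_llm_options.yaml", "w") as f:
        f.write(yaml_text)

    # --extra_llm_api_options is assumed to be the trtllm-serve flag for YAML
    # overrides; the model name is a placeholder.
    subprocess.run([
        "trtllm-serve", "meta-llama/Llama-3.1-8B-Instruct",
        "--extra_llm_api_options", "extra_llm_options.yaml",
    ])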

Fixed Issues

  • Fixed a hang when the context server does not have enough KV cache capacity (#3095)
  • Fixed C++ decoder synchronization in PyTorch (#3106)
  • Fixed a bug where a CUDA stream created as a default argument was initialized at import time (#3764)
  • Fixed an attention DP bug in the Qwen3 MoE model (#4141)
  • Fixed an illegal memory access when running LLaMA 4 with CUDA Graph enabled (#4101)
  • Reset planned states to avoid a memory leak in TrtllmAttentionWrapper (#4227)

Known Issues

  • Multi-GPU model support on RTX Pro 6000 is a known limitation in this release

What's Changed

  • Refine doc by @juney-nvidia in #4420
  • Refine doc by @juney-nvidia in #4421
  • refine doc by @juney-nvidia in #4422
  • Remove vila test by @Tabrizian in #4376
  • [TRTLLM-4618][feat] Add Nemotron Super 49B FP8 test on RTX6000 Pro (SM120) by @farazkh80 in #4363
  • tests: add qa test mentioned in docs by @crazydemo in #4357
  • [Infra] - Always push the release images in the post-merge job by @chzblych in #4426
  • tests: Add test cases for rcca cases by @crazydemo in #4347
  • chore: cleanup perf_evaluator code by @Superjomn in #3833
  • feat: Add pp support for hybrid attn/mamba model by @yuxianq in #4358
  • fix: wrong argument name enable_overlap_scheduler by @kaiyux in #4433
  • Update "Roadmap" link under README.md to the issues with Roadmap label by @AdamzNV in #4425
  • fix potential issues in allreduce fusion kernel and ut by @yilin-void in #4226
  • [TRTLLM-4638] feat(scaffolding): update Reward Controller to PRM specific controller with step split by @dc3671 in #4337
  • feat: NIXL interface integration by @Shixiaowei02 in #3934
  • Downgrade the logger level for fallback tactic warning. by @hyukn in #4440
  • Test: Improve model re-use in C++ DGX tests for CI stability by @DomBrown in #4263
  • fix: temp disable the problem test by @Shixiaowei02 in #4445
  • Add llama4 disagg accuracy tests by @Tabrizian in #4336
  • [https://nvbugs/5123103][fix] Fix torch compile for DeepSeekV3 by @liji-nv in #3952
  • [Docs] - Reapply #4220 by @chzblych in #4434
  • [TRTLLM-4618][feat] Fix cutlass MoE GEMM fallback failure on FP8 + add e2e test for Mixtral 8x7B FP8 on RTX6000 Pro (SM120) by @farazkh80 in #4335
  • [Feat] add chunked-attention kernels on Hopper (for llama4) by @PerkzZheng in #4291
  • test(perf): Add some Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (TRT flow, trtllm-bench) by @venkywonka in #4128
  • fix: [nvbugs/5287097] Align PP layer distribution between pytorch and TRT flow. by @yuxianq in #4399
  • feat: Low Precision Allreduce for PCIe based GPU by @kanghui0204 in #4344
  • test: [CI] Add failed cases into waives.txt by @xinhe-nv in #4429
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3.3-70B-Instruct and LLM API BF16 variant by @moraxu in #4362
  • test: update test filter in perf test yml file to select cases by gpu name and add cases for RTX 6000 pro by @ruodil in #4282
  • [AutoDeploy] HF factory improvements by @lucaslie in #4371
  • chore: bump version to 0.21.0rc0 by @ZhanruiSunCh in #4465
  • doc: [TRTLLM-325]Integrate the NGC image in Makefile automation and document by @MartinMarciniszyn in #4400
  • chore: bump version to 0.20.0 by @ZhanruiSunCh in #4469
  • fix: replace the image links in the blog by @Shixiaowei02 in #4490
  • fix: cleanup process tree for disaggregated test by @tongyuantongyu in #4116
  • Cherry pick #4508 by @QiJune in #4512
  • Cherry pick #4447 by @yuxianq in #4517
  • chore: Remove unused script by @kaiyux in #4485
  • chore: Deprecate autopp. by @yuxianq in #4471
  • fix: Fix trtllm sampler beam width bug by @dcampora in #4507
  • tests: update api change from decoder to sampler in test by @crazydemo in #4479
  • docs: Add KV Cache Management documentation by @Funatiq in #3908
  • test: add failed case in waive list and fix some test script issue for perf test by @ruodil in #4528
  • Add tritonrelease container by @Tabrizian in #4544
  • fix: [TRTLLM-325]WAR against security vulnerabilities in Python packages by @MartinMarciniszyn in #4539
  • [5141290][5273694][5260696] fix: Fix mrope argument missing issue in the summary tasks for Qwen model. by @hyukn in #4432
  • test: waive hanging cases for perf test by @ruodil in #4563
  • [nvbugs/5274894] fix: Moving finished context requests to generation by @Funatiq in #4576
  • [5234029][5226211] chore: Unwaive multimodal tests for Qwen model. by @hyukn in #4519
  • test(perf): Extend the Llama-Nemotron-Nano-8B perf-integration-tests (pyt) by @venkywonka in #4407
  • test: fix for perf sanity test and skip fp8 deepseek blackwell cases by @ruodil in #4598
  • [5180961] chore: Unwaive test for Qwen model. by @hyukn in #4524
  • [https://nvbugs/5289907][fix] Restore per-channel pre-quant by @Barry-Delaney in #4545
  • Update internal cutlass kernels commit id by @Barry-Delaney in #4619
  • ci: waive testcase [NVBUG 5297821] by @stnie in #4616
  • [nvbugs/5274894] fix: Sort requests for functional correctness and performance by @Funatiq in #4608
  • [CI] Waive known errors with test TestDeepSeekV3Lite::test_fp8_block_scales_4gpus by @SimengLiu-nv in #4627
  • [TRTLLM-4618][feat] Add remaining NVFP4 Nemotron Super 49B test on RTX6000 Pro (SM120) by @farazkh80 in #4548
  • [TRTLLM-4932] Add CLI accuracy tests for Llama-3_3-Nemotron-Super-49B-v1 and LLM API FP8 variant by @moraxu in #4375
  • [fix] Incorrect mocker argument for a CLI accuracy test in Llama-3.3-70B-Instruct by @moraxu in #4604
  • Add missing rcca folder by @Tabrizian in #4591
  • [fix] Fix Llama4 allgather error due to None tensor by @jinyangyuan-nvidia in #4511
  • [TRTLLM-4932] Add QA accuracy tests for NIM-prioritized models by @moraxu in #4242
  • Update the description for NGC docker images by @MartinMarciniszyn in #4671
  • [Test] - Correct waive the Slurm test stage by @chzblych in #4680
  • fix[nvbug/5286515]: trtllm-llmapi-launch on single node single gpu by @Superjomn in #4529
  • tests: waive and unwaive QA test cases by @crazydemo in #4644
  • [TRTLLM-5326] - Fix test coverage report generation by @yiqingy0 in #4691
  • fix: [nvbugs/5289912][nvbugs/5232406] use thread pool for multi-thread weight loading in fused moe. by @yuxianq in #4699
  • fix: [nvbug5300494] Use runtime total gpu memory to calculate kv cache memory and log more memory information by @HuiGao-NV in #4660
  • fix: Mistral Small vision encoder with BS>1 by @brb-nv in #4713
  • [cherry-pick] test(perf): Add remaining Phi-4-mini-instruct perf tests (#4443) by @venkywonka in #4589
  • [cherry-pick] test(perf): Add Llama-3_1-Nemotron-Ultra-253B-v1 perf tests (cpp) (#4446) by @venkywonka in #4590
  • [cherry-pick] test(perf): Pt.2 Add Llama-3_3-Nemotron-Super-49B-v1 integration-perf-tests (cpp) (#4499) by @venkywonka in #4588
  • fix:https://nvbugs/5305692 update invalid links in doc. by @nv-guomingz in #4698
  • fix: Fix AutoTuner warmup request generating. by @hyukn in #4670
  • fix: [https://nvbugspro.nvidia.com/bug/5286795] Unwaive tests for bug-5286795. by @bobboli in #4724
  • Remove V1 batching tests by @Tabrizian in #4703
  • fix:https://nvbugs/5214239 by @nv-guomingz in #4718
  • [https://nvbugspro.nvidia.com/bug/5236935][Fix] Fix document of using Draft-Target-Model (DTM) speculative decoding in Triton Server by @wili-65535 in #4731
  • test: remove perf test l40s/l20 oom test cases and unwaive tests by @ruodil in #4720
  • [fix] Add back RTX6000Pro post-merge tests by @yuanjingx87 in #4744
  • test: remove large bs as it will oom by @StanleySun639 in #4726
  • [https://nvbugs/5295389][fix]fix moe fp4 on sm120 by @pamelap-nvidia in #4624
  • [nvbugs/5302709] fix: Use HF vision tower for llava-next on A100 by @amukkara in #4747
  • [nvbugs/5297821] Fix llama4 disaggregated serving accuracy tests by @Tabrizian in #4743
  • tests: fix 5250460 by @xinhe-nv in #4751
  • tests: waive failed case by @crazydemo in #4785
  • Waive l0 tests by @yiqingy0 in #4795
  • [Docs] - Add date and commit info (#4448) by @chzblych in #4752
  • Remove disaggregated cuda graph waived test by @Tabrizian in #4707
  • fix: llmapi-launch add trtllm-bench test with engine building (#4… by @Superjomn in #4550
  • [https://nvbugs/5303634] skip evaluating empty batch_input_ids in summarize.py by @QiJune in #4676
  • fix: Skip dummy medusa/eagle tests when WORLD_SIZE env variable is missing by @brb-nv in #4786
  • fix: Fix queued req stats for release/0.20 by @pcastonguay in #4806
  • [NVBUG-5291971] JIT path for XQA by @farazkh80 in #4675
  • [TRTLLM-4932] Remove moe- related arguments from Llama-3_1-Nemotron-Ultra-253B-v1 CLI accuracy test by @moraxu in #4808
  • test: remove invalid triton integration test cases by @StanleySun639 in #4801
  • test: shorten reqs in con:1 cases and add streaming cases, add l2 perf test by @ruodil in #4796
  • [Infra] - Better utilize multi-GPU CI resources by @chzblych in #4850
  • Cherry-pick #4536 by @lfr-0531 in #4834
  • Cherry-pick #4379 by @lfr-0531 in #4833
  • fix: [nvbugs/5298600] fix illegal memory access on mrope_position_deltas by @yechank-nvidia in #4830
  • Waive L0 test by @yiqingy0 in #4857
  • Waive L0 test by @yiqingy0 in #4862
  • test: fix rss increasement test case issue by @StanleySun639 in #4868
  • fix: [nvbugs/5312750] Keep embed_tokens for last pp rank if tie_word_embeddings. by @yuxianq in #4902
  • Fix: max_num_sequences calculation with overlap scheduling into release/0.20 by @dcampora in #4889
  • [TRTLLM-5340] fix: remove the accuracy assert on run_majority_vote_ai… by @WeiHaocheng in #4907
  • test: fix potential teardown error by @StanleySun639 in #4908
  • Fix DeepGEMM NVCC Path by @lucifer1004 in #4886
  • fix: cache-aware router related test fix by @zhengd-nv in #4911
  • Downgrade NCCL version from 2.26.5 to 2.25.1 by @yiqingy0 in #4931
  • fix: [nvbug 5321627] handle cases when TRT backend return more logits than output tokens by @hchings in #4921
  • [5310329] fix: Fix warmup phase batch size out of range. by @hyukn in #4912
  • [https://nvbugspro.nvidia.com/bug/5323820] Fix chunking equation for disabled case. by @FrankD412 in #4964
  • [https://nvbugs/5238105] fix: ModelRunnerCpp num_return_sequences by @Funatiq in #3951
  • [fix] Fix illegal mem access and possible accuracy lose by @liji-nv in #4943
  • Cherry-pick #5004 by @lfr-0531 in #5005
  • ci: waive testcase [NVBUG 5247271] by @stnie in #4992
  • [Infra] - Update JNLP container config by @chzblych in #5009
  • [5310329] chore: Unwaive test_e2e.py::test_openai_reasoning. by @hyukn in #4981
  • doc: Minor fixes and clarification by @kaiyux in #4975
  • [5289904] chore: Unwaive test for Qwen model. by @hyukn in #4657
  • fix: [nvbugs/5324954, nvbugs/5304229] fix Qwen2-VL video and Qwen2.5-VL image test case by @yechank-nvidia in #4976
  • tests: fix some typo and limitation on test cases by @crazydemo in #4989
  • [https://nvbugs/5277592][fix] fix cuda graph padding for spec decoding (only for 0.20) by @lfr-0531 in #5058
  • [TRTLLM-5516] perf: replicate dummy request for cuda graph padding (cherry-pick #4729) by @kaiyux in #5190
  • test: add deepseek rcca cases by @ruodil in #5195
  • doc:add release notes for v0.20.0 by @nv-guomingz in #5150
  • test: add deepseek_v3_lite rcca cases by @ruodil in #5225
  • test: Deprecate gpt_model_type "v1" static batching from triton_backend L0_backend_trtllm by @yinggeh in #5229
  • [doc] Update Perf-Overview.MD with V0.20 Release Data (cherry-pick #5176) by @kaiyux in #5324

New Contributors

Full Changelog: v0.20.0rc3...v0.20.0