[pull] main from NVIDIA:main #73


Merged
merged 125 commits into LarryXFly:main on May 7, 2025

Conversation

pull[bot]

@pull pull[bot] commented Apr 27, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

byshiue and others added 3 commits April 27, 2025 09:10
* Remove results.xml when no cases ran

Signed-off-by: qqiao <[email protected]>

* Change some test config to verify

Signed-off-by: qqiao <[email protected]>

* Update for quotes

Signed-off-by: qqiao <[email protected]>

* Move the removal of results.xml into the catch section

Signed-off-by: qqiao <[email protected]>

* Add missing path

Signed-off-by: qqiao <[email protected]>

* Change back the test stage setting

Signed-off-by: qqiao <[email protected]>

---------

Signed-off-by: qqiao <[email protected]>
* Update num_of_ctx_tokens in iteration stats
* Revert unnecessary change to module import
@pull pull bot added the ⤵️ pull label Apr 27, 2025
chuangz0 and others added 26 commits April 27, 2025 11:48
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <[email protected]>

* fix args

Signed-off-by: Chuang Zhu <[email protected]>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <[email protected]>

* format

Signed-off-by: Chuang Zhu <[email protected]>

---------

Signed-off-by: Chuang Zhu <[email protected]>
…ng wa… (#3852)

* add warmup flag into py_executor to prevent enabling the profiler during warmup

Signed-off-by: bhsueh <[email protected]>

* fix bug of pre-commit

Signed-off-by: bhsueh <[email protected]>

* change setting warmup to all ranks

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
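The warmup gating described here is simple: every rank sets a warmup flag before the dry-run iterations, and the profiling hooks check it. A hypothetical sketch of the pattern (not the actual py_executor code):

```python
class PyExecutorSketch:
    """Illustrative stand-in for an executor with an optional profiler."""

    def __init__(self, profiler=None):
        self.profiler = profiler
        self.is_warmup = False  # set True on all ranks during warmup

    def step(self, batch):
        # Skip profiling while warming up so autotuning/JIT noise
        # does not end up in the captured profile.
        profiling = self.profiler is not None and not self.is_warmup
        if profiling:
            self.profiler.start()
        result = [2 * x for x in batch]  # placeholder for a forward pass
        if profiling:
            self.profiler.stop()
        return result
```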
* add submit_sync to RemoteMpiSessionClient

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

add barrier

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

fix comment

Signed-off-by: Superjomn <[email protected]>

disable test

Signed-off-by: Superjomn <[email protected]>

* fix

Signed-off-by: Superjomn <[email protected]>

---------

Signed-off-by: Superjomn <[email protected]>
* infra: install Triton in the base image

Signed-off-by: Iman Tabrizian <[email protected]>

* install Triton from the base image

Signed-off-by: Iman Tabrizian <[email protected]>

* update base image

Signed-off-by: Iman Tabrizian <[email protected]>

* Address review comments

Signed-off-by: Iman Tabrizian <[email protected]>

* update base image

Signed-off-by: Iman Tabrizian <[email protected]>

* waive test

Signed-off-by: Iman Tabrizian <[email protected]>

---------

Signed-off-by: Iman Tabrizian <[email protected]>
#3764)

* fix bug where a CUDA stream created as a default parameter is initialized at import time

Signed-off-by: bhsueh <[email protected]>

* add torch.cuda.Stream() for the leader node

Signed-off-by: bhsueh <[email protected]>

* fix pre-commit issue

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
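The bug fixed here is an instance of a general Python pitfall: default argument values are evaluated once, when the `def` statement runs, so a signature like `def f(stream=torch.cuda.Stream())` would create a CUDA stream the moment the module is imported. A CUDA-free illustration of the pitfall and the fix:

```python
import time

def make_resource():
    # Stand-in for an expensive handle such as torch.cuda.Stream().
    return {"created_at": time.monotonic()}

# Pitfall: the default is evaluated exactly once, when the def statement
# runs -- i.e. at import time for a module-level function.
def bad(resource=make_resource()):
    return resource

# Fix: use a None sentinel and create the resource lazily at call time.
def good(resource=None):
    if resource is None:
        resource = make_resource()
    return resource
```

`bad()` hands back the same shared object on every call, while `good()` builds a fresh one per call, after any device setup has happened.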
* update waive list

Signed-off-by: xinhe-nv <[email protected]>

* update waives

Signed-off-by: xinhe-nv <[email protected]>

---------

Signed-off-by: xinhe-nv <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
Signed-off-by: taoli <[email protected]>
Co-authored-by: taoli <[email protected]>
* Add docs about DeepSeek-R1 long context support

Signed-off-by: Xianjie <[email protected]>

* update docs

Signed-off-by: Xianjie <[email protected]>

* reformat

Signed-off-by: Xianjie <[email protected]>

---------

Signed-off-by: Xianjie <[email protected]>
* test: add deepseek v3 & r1 cases

Signed-off-by: Xiwen Yu <[email protected]>
…or reproducibility in attention tests (#3919)

Signed-off-by: qixiang-99 <[email protected]>
Signed-off-by: fredw (generated by with_the_same_user script) <[email protected]>
Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>
Co-authored-by: Hao Lu <[email protected]@users.noreply.github.com>
* update cubins

Signed-off-by: Perkz Zheng <[email protected]>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <[email protected]>

* address the comments

Signed-off-by: Perkz Zheng <[email protected]>

---------

Signed-off-by: Perkz Zheng <[email protected]>
* Update gen tps calculation.

Signed-off-by: Frank Di Natale <[email protected]>

* Add back output speed for comparison.

Signed-off-by: Frank Di Natale <[email protected]>

* Fix issue with f-string.

Signed-off-by: Frank Di Natale <[email protected]>

* Fix some spacing.

Signed-off-by: Frank Di Natale <[email protected]>

* Replace output speed with per-request genphase tput.

Signed-off-by: Frank Di Natale <[email protected]>

* Add gen TPS breakdown.

Signed-off-by: Frank Di Natale <[email protected]>

* Update some tagging.

Signed-off-by: Frank Di Natale <[email protected]>

---------

Signed-off-by: Frank Di Natale <[email protected]>
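One common way to define the per-request generation-phase throughput referenced above (the PR's exact formula may differ) is tokens produced after the first token, divided by the time spent decoding:

```python
def gen_phase_tput(output_tokens: int, e2e_latency_s: float, ttft_s: float) -> float:
    """Per-request generation-phase throughput in tokens/sec.

    Subtracting time-to-first-token removes the context (prefill) phase,
    leaving only the token-by-token generation phase.
    """
    decode_time = e2e_latency_s - ttft_s
    if output_tokens <= 1 or decode_time <= 0:
        return 0.0
    return (output_tokens - 1) / decode_time
```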
mikeiovine and others added 29 commits May 5, 2025 10:24
* add deepseek-r1 reasoning parser

Signed-off-by: pansicheng <[email protected]>

* fix test

Signed-off-by: Pengyun Lin <[email protected]>

---------

Signed-off-by: pansicheng <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Co-authored-by: Pengyun Lin <[email protected]>
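DeepSeek-R1 wraps its chain-of-thought in `<think>...</think>` tags before emitting the final answer, so a reasoning parser's job is to split the two. A minimal illustrative sketch (the parser added in this PR may differ in details such as streaming support):

```python
import re

# DeepSeek-R1-style output: "<think>...reasoning...</think>final answer"
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_reasoning(text):
    """Split model output into (reasoning, content)."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    content = text[match.end():].strip()
    return reasoning, content
```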
* fix bug of qwen3 moe

Signed-off-by: bhsueh <[email protected]>

* update threshold

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
* update qwen3 document

Signed-off-by: bhsueh <[email protected]>

* remove unused code

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
…4024)

* reuse batch_indices, positions across layers

Signed-off-by: Suyog Gupta <[email protected]>

* fix flashinfer unit tests

Signed-off-by: Suyog Gupta <[email protected]>

* simplify call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <[email protected]>

* fix call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
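The reuse described above amounts to computing per-batch metadata once per forward pass and caching it for every layer, instead of rebuilding it inside each attention layer. A minimal sketch with hypothetical names (the real flashinfer metadata is tensor-based):

```python
class AttnMetadataSketch:
    """Per-forward-pass metadata shared across all attention layers."""

    def __init__(self, seq_lens):
        self.seq_lens = seq_lens
        self._cache = None  # filled lazily, once per forward pass

    def batch_indices_positions(self):
        if self._cache is None:
            # Flatten [batch, seq] into parallel index/position lists,
            # computed once and reused by every layer.
            batch_indices, positions = [], []
            for b, n in enumerate(self.seq_lens):
                batch_indices.extend([b] * n)
                positions.extend(range(n))
            self._cache = (batch_indices, positions)
        return self._cache
```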
Shape was wrongly changed in DecoderState introduction.

Signed-off-by: Robin Kobus <[email protected]>
…fo in the Jenkins job page (#3859)

* infra: Support showing base info and links on the pipeline page

Signed-off-by: ZhanruiSunCh <[email protected]>

* Move code to shared lib

Signed-off-by: ZhanruiSunCh <[email protected]>

* Remove unused code

Signed-off-by: ZhanruiSunCh <[email protected]>

* Update Build.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

* Update L0_MergeRequest.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

* Update L0_Test.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

---------

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
…te (#3836)

* Remove stdout pipe for genai-perf and make stress time a public parameter.

Signed-off-by: Wangshanshan <[email protected]>

* Update llmRequest based on comment.

Signed-off-by: Wangshanshan <[email protected]>

* launch process function refactor.

Signed-off-by: Wangshanshan <[email protected]>

---------

Signed-off-by: Wangshanshan <[email protected]>
* disable overlap in encoder

Signed-off-by: Robin Kobus <[email protected]>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <[email protected]>

* feat: overlap same batch

Signed-off-by: Robin Kobus <[email protected]>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <[email protected]>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <[email protected]>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <[email protected]>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <[email protected]>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <[email protected]>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <[email protected]>

* feat: Improve early exit

Signed-off-by: Robin Kobus <[email protected]>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <[email protected]>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <[email protected]>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <[email protected]>

---------

Signed-off-by: Robin Kobus <[email protected]>
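Conceptually, overlap scheduling launches decode step n+1 before the host has consumed step n's outputs, so CPU-side postprocessing overlaps device-side execution. A toy sketch using a worker thread in place of the GPU (not the actual TRTLLMDecoder logic):

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(step):
    # Stand-in for a decoder forward pass running on the device.
    return [step] * 4

def overlap_decode(num_steps):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(run_step, 0)        # step 0 in flight
        for step in range(1, num_steps):
            nxt = pool.submit(run_step, step)     # launch the next step early...
            outputs.append(pending.result())      # ...while consuming the previous one
            pending = nxt
        outputs.append(pending.result())          # drain the final step
    return outputs
```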
* Properly get decoding mode according to the same logic as cpp.

Signed-off-by: Daniel Campora <[email protected]>

* Cross reference getDecodingMode implementations in pytorch - cpp.

Signed-off-by: Daniel Campora <[email protected]>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <[email protected]>

* Revert to version in main.

Signed-off-by: Daniel Campora <[email protected]>

* Fix.

Signed-off-by: Daniel Campora <[email protected]>

* Revert configuration.py.

Signed-off-by: Daniel Campora <[email protected]>

---------

Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Alexandre Milesi <[email protected]>
Co-authored-by: Alexandre Milesi <[email protected]>
Co-authored-by: Haohang Huang <[email protected]>
*   **Model:** Llama-3.1-Nemotron-Nano-8B-v1
*   **Precision:** float16
*   **Environment:**
    *   GPUs: 1 H100 PCIe
    *   Driver: 570.86.15

| Test String | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Avg. Request Latency (ms) |
| --- | --- | --- | --- |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128` | 81.86 | 20956.44 | 5895.24 |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000` | 1.45 | 5783.92 | 211541.08 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128` | 52.75 | 13505.00 | 5705.50 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000` | 1.41 | 5630.76 | 217139.59 |

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
Signed-off-by: jiahanc <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
#3985)

* fix

Signed-off-by: Enwei Zhu <[email protected]>

* fix

Signed-off-by: Enwei Zhu <[email protected]>

---------

Signed-off-by: Enwei Zhu <[email protected]>
* fix bug of fused_moe on tp > 1

Signed-off-by: bhsueh <[email protected]>

* refine codes

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
* beam_width and max_new_token

Signed-off-by: Superjomn <[email protected]>

* remove beam_width

Signed-off-by: Superjomn <[email protected]>

* remove min_length

Signed-off-by: Superjomn <[email protected]>

* remove return_num_sequences

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

---------

Signed-off-by: Superjomn <[email protected]>
…1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models (#3864)

* tests: skip writing prepare_dataset output to logs

Signed-off-by: Ruodi <[email protected]>

* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md

Signed-off-by: Ruodi <[email protected]>

---------

Signed-off-by: Ruodi <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
* [TRTLLM-4051] Support running only selected backend-type tests

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix name

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix pre-commit

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix groovy error

Signed-off-by: ZhanruiSunCh <[email protected]>

* Update L0_Test.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

---------

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
@pull pull bot merged commit 62cfe74 into LarryXFly:main May 7, 2025