[pull] main from NVIDIA:main #73


Merged
merged 125 commits into LarryXFly:main on May 7, 2025

Conversation

pull[bot]

@pull pull[bot] commented Apr 27, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

byshiue and others added 3 commits April 27, 2025 09:10
* Remove results.xml when no cases ran

Signed-off-by: qqiao <[email protected]>

* Change some test config to verify

Signed-off-by: qqiao <[email protected]>

* Update for quotes

Signed-off-by: qqiao <[email protected]>

* Move the removal of results.xml into the catch section

Signed-off-by: qqiao <[email protected]>

* Add missing path

Signed-off-by: qqiao <[email protected]>

* Change back the test stage setting

Signed-off-by: qqiao <[email protected]>

---------

Signed-off-by: qqiao <[email protected]>
* Update num_of_ctx_tokens in iteration stats
* Revert unnecessary change to module import
@pull pull bot added the ⤵️ pull label Apr 27, 2025
chuangz0 and others added 26 commits April 27, 2025 11:48
* cacheTransceiver buffer manager

Signed-off-by: Chuang Zhu <[email protected]>

* fix args

Signed-off-by: Chuang Zhu <[email protected]>

* cpp kvCacheManager

Signed-off-by: Chuang Zhu <[email protected]>

* format

Signed-off-by: Chuang Zhu <[email protected]>

---------

Signed-off-by: Chuang Zhu <[email protected]>
…ng wa… (#3852)

* add warmup flag into py_executor to prevent enabling the profiler during warmup

Signed-off-by: bhsueh <[email protected]>

* fix bug of pre-commit

Signed-off-by: bhsueh <[email protected]>

* change setting warmup to all ranks

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
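The warmup gating described here is simple: every rank sets a warmup flag before the dry-run iterations, and the profiling hooks check it. A hypothetical sketch of the pattern (not the actual py_executor code):

```python
class PyExecutorSketch:
    """Illustrative stand-in for an executor with an optional profiler."""

    def __init__(self, profiler=None):
        self.profiler = profiler
        self.is_warmup = False  # set True on all ranks during warmup

    def step(self, batch):
        # Skip profiling while warming up so autotuning/JIT noise
        # does not end up in the captured profile.
        profiling = self.profiler is not None and not self.is_warmup
        if profiling:
            self.profiler.start()
        result = [2 * x for x in batch]  # placeholder for a forward pass
        if profiling:
            self.profiler.stop()
        return result
```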
* add submit_sync to RemoteMpiSessionClient

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

add barrier

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

fix comment

Signed-off-by: Superjomn <[email protected]>

disable test

Signed-off-by: Superjomn <[email protected]>

* fix

Signed-off-by: Superjomn <[email protected]>

---------

Signed-off-by: Superjomn <[email protected]>
* infra: install Triton in the base image

Signed-off-by: Iman Tabrizian <[email protected]>

* install Triton from the base image

Signed-off-by: Iman Tabrizian <[email protected]>

* update base image

Signed-off-by: Iman Tabrizian <[email protected]>

* Address review comments

Signed-off-by: Iman Tabrizian <[email protected]>

* update base image

Signed-off-by: Iman Tabrizian <[email protected]>

* waive test

Signed-off-by: Iman Tabrizian <[email protected]>

---------

Signed-off-by: Iman Tabrizian <[email protected]>
#3764)

* fix bug where a CUDA stream created as a default parameter is initialized at import time

Signed-off-by: bhsueh <[email protected]>

* add torch.cuda.Stream() for the leader node

Signed-off-by: bhsueh <[email protected]>

* fix pre-commit issue

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
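The bug fixed here is an instance of a general Python pitfall: default argument values are evaluated once, when the `def` statement runs, so a signature like `def f(stream=torch.cuda.Stream())` would create a CUDA stream the moment the module is imported. A CUDA-free illustration of the pitfall and the fix:

```python
import time

def make_resource():
    # Stand-in for an expensive handle such as torch.cuda.Stream().
    return {"created_at": time.monotonic()}

# Pitfall: the default is evaluated exactly once, when the def statement
# runs -- i.e. at import time for a module-level function.
def bad(resource=make_resource()):
    return resource

# Fix: use a None sentinel and create the resource lazily at call time.
def good(resource=None):
    if resource is None:
        resource = make_resource()
    return resource
```

`bad()` hands back the same shared object on every call, while `good()` builds a fresh one per call, after any device setup has happened.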
* update waive list

Signed-off-by: xinhe-nv <[email protected]>

* update waives

Signed-off-by: xinhe-nv <[email protected]>

---------

Signed-off-by: xinhe-nv <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
Signed-off-by: taoli <[email protected]>
Co-authored-by: taoli <[email protected]>
* Add docs about DeepSeek-R1 long context support

Signed-off-by: Xianjie <[email protected]>

* update docs

Signed-off-by: Xianjie <[email protected]>

* reformat

Signed-off-by: Xianjie <[email protected]>

---------

Signed-off-by: Xianjie <[email protected]>
* test: add deepseek v3 & r1 cases

Signed-off-by: Xiwen Yu <[email protected]>
…or reproducibility in attention tests (#3919)

Signed-off-by: qixiang-99 <[email protected]>
Signed-off-by: fredw (generated by with_the_same_user script) <[email protected]>
Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>
Co-authored-by: Hao Lu <[email protected]@users.noreply.github.com>
* update cubins

Signed-off-by: Perkz Zheng <[email protected]>

* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction

Signed-off-by: Perkz Zheng <[email protected]>

* address the comments

Signed-off-by: Perkz Zheng <[email protected]>

---------

Signed-off-by: Perkz Zheng <[email protected]>
* Update gen tps calculation.

Signed-off-by: Frank Di Natale <[email protected]>

* Add back output speed for comparison.

Signed-off-by: Frank Di Natale <[email protected]>

* Fix issue with f-string.

Signed-off-by: Frank Di Natale <[email protected]>

* Fix some spacing.

Signed-off-by: Frank Di Natale <[email protected]>

* Replace output speed with per-request genphase tput.

Signed-off-by: Frank Di Natale <[email protected]>

* Add gen TPS breakdown.

Signed-off-by: Frank Di Natale <[email protected]>

* Update some tagging.

Signed-off-by: Frank Di Natale <[email protected]>

---------

Signed-off-by: Frank Di Natale <[email protected]>
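One common way to define the per-request generation-phase throughput referenced above (the PR's exact formula may differ) is tokens produced after the first token, divided by the time spent decoding:

```python
def gen_phase_tput(output_tokens: int, e2e_latency_s: float, ttft_s: float) -> float:
    """Per-request generation-phase throughput in tokens/sec.

    Subtracting time-to-first-token removes the context (prefill) phase,
    leaving only the token-by-token generation phase.
    """
    decode_time = e2e_latency_s - ttft_s
    if output_tokens <= 1 or decode_time <= 0:
        return 0.0
    return (output_tokens - 1) / decode_time
```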
mikeiovine and others added 29 commits May 5, 2025 10:24
* add deepseek-r1 reasoning parser

Signed-off-by: pansicheng <[email protected]>

* fix test

Signed-off-by: Pengyun Lin <[email protected]>

---------

Signed-off-by: pansicheng <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Co-authored-by: Pengyun Lin <[email protected]>
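DeepSeek-R1 wraps its chain-of-thought in `<think>...</think>` tags before emitting the final answer, so a reasoning parser's job is to split the two. A minimal illustrative sketch (the parser added in this PR may differ in details such as streaming support):

```python
import re

# DeepSeek-R1-style output: "<think>...reasoning...</think>final answer"
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_reasoning(text):
    """Split model output into (reasoning, content)."""
    match = THINK_RE.search(text)
    if match is None:
        return None, text.strip()  # no reasoning block present
    reasoning = match.group(1).strip()
    content = text[match.end():].strip()
    return reasoning, content
```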
* fix bug of qwen3 moe

Signed-off-by: bhsueh <[email protected]>

* update threshold

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
* update qwen3 document

Signed-off-by: bhsueh <[email protected]>

* remove unused code

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
…4024)

* reuse batch_indices, positions across layers

Signed-off-by: Suyog Gupta <[email protected]>

* fix flashinfer unit tests

Signed-off-by: Suyog Gupta <[email protected]>

* simplify call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <[email protected]>

* fix call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <[email protected]>

---------

Signed-off-by: Suyog Gupta <[email protected]>
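The reuse described above amounts to computing per-batch metadata once per forward pass and caching it for every layer, instead of rebuilding it inside each attention layer. A minimal sketch with hypothetical names (the real flashinfer metadata is tensor-based):

```python
class AttnMetadataSketch:
    """Per-forward-pass metadata shared across all attention layers."""

    def __init__(self, seq_lens):
        self.seq_lens = seq_lens
        self._cache = None  # filled lazily, once per forward pass

    def batch_indices_positions(self):
        if self._cache is None:
            # Flatten [batch, seq] into parallel index/position lists,
            # computed once and reused by every layer.
            batch_indices, positions = [], []
            for b, n in enumerate(self.seq_lens):
                batch_indices.extend([b] * n)
                positions.extend(range(n))
            self._cache = (batch_indices, positions)
        return self._cache
```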
Shape was wrongly changed in DecoderState introduction.

Signed-off-by: Robin Kobus <[email protected]>
…fo in the Jenkins job page (#3859)

* infra: Support showing base info and links on the pipeline page

Signed-off-by: ZhanruiSunCh <[email protected]>

* Move code to shared lib

Signed-off-by: ZhanruiSunCh <[email protected]>

* Remove unused code

Signed-off-by: ZhanruiSunCh <[email protected]>

* Update Build.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

* Update L0_MergeRequest.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

* Update L0_Test.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

---------

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
…te (#3836)

* Remove stdout pipe for genai-perf and make stress time a public parameter.

Signed-off-by: Wangshanshan <[email protected]>

* Update llmRequest based on comment.

Signed-off-by: Wangshanshan <[email protected]>

* launch process function refactor.

Signed-off-by: Wangshanshan <[email protected]>

---------

Signed-off-by: Wangshanshan <[email protected]>
* disable overlap in encoder

Signed-off-by: Robin Kobus <[email protected]>

* feat: invokeGatherBatch

Signed-off-by: Robin Kobus <[email protected]>

* feat: overlap same batch

Signed-off-by: Robin Kobus <[email protected]>

* chore: add enableTrtOverlap to ExecutorConfig

Signed-off-by: Robin Kobus <[email protected]>

* disable overlap for beam search and spec decode

Signed-off-by: Robin Kobus <[email protected]>

* skip overlap tests with beam search or speculative decoding

Signed-off-by: Robin Kobus <[email protected]>

* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests

Signed-off-by: Robin Kobus <[email protected]>

* enable overlap in GptChunkedLongContextTests

Signed-off-by: Robin Kobus <[email protected]>

* feat: Enable overlap in gptManagerBenchmark

Signed-off-by: Robin Kobus <[email protected]>

* feat: Improve early exit

Signed-off-by: Robin Kobus <[email protected]>

* refactor: Use OptionalRef for newOutputTokens tensor

Signed-off-by: Robin Kobus <[email protected]>

* feat: Add overlap scheduling support to TRTLLMDecoder

- Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
- Modified the decoder's internal logic to utilize the overlap scheduling feature.
- Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
- Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.

Signed-off-by: Robin Kobus <[email protected]>

* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <[email protected]>

---------

Signed-off-by: Robin Kobus <[email protected]>
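Conceptually, overlap scheduling launches decode step n+1 before the host has consumed step n's outputs, so CPU-side postprocessing overlaps device-side execution. A toy sketch using a worker thread in place of the GPU (not the actual TRTLLMDecoder logic):

```python
from concurrent.futures import ThreadPoolExecutor

def run_step(step):
    # Stand-in for a decoder forward pass running on the device.
    return [step] * 4

def overlap_decode(num_steps):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(run_step, 0)        # step 0 in flight
        for step in range(1, num_steps):
            nxt = pool.submit(run_step, step)     # launch the next step early...
            outputs.append(pending.result())      # ...while consuming the previous one
            pending = nxt
        outputs.append(pending.result())          # drain the final step
    return outputs
```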
* Properly get decoding mode according to the same logic as cpp.

Signed-off-by: Daniel Campora <[email protected]>

* Cross reference getDecodingMode implementations in pytorch - cpp.

Signed-off-by: Daniel Campora <[email protected]>

* Better bindings for DecodingMode.

Signed-off-by: Daniel Campora <[email protected]>

* Revert to version in main.

Signed-off-by: Daniel Campora <[email protected]>

* Fix.

Signed-off-by: Daniel Campora <[email protected]>

* Revert configuration.py.

Signed-off-by: Daniel Campora <[email protected]>

---------

Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Alexandre Milesi <[email protected]>
Co-authored-by: Alexandre Milesi <[email protected]>
Co-authored-by: Haohang Huang <[email protected]>
*   **Model:** Llama-3.1-Nemotron-Nano-8B-v1
*   **Precision:** float16
*   **Environment:**
    *   GPUs: 1 H100 PCIe
    *   Driver: 570.86.15

| Test String | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Avg. Request Latency (ms) |
| --- | --- | --- | --- |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128` | 81.86 | 20956.44 | 5895.24 |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000` | 1.45 | 5783.92 | 211541.08 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128` | 52.75 | 13505.00 | 5705.50 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000` | 1.41 | 5630.76 | 217139.59 |

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
Signed-off-by: jiahanc <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
#3985)

* fix

Signed-off-by: Enwei Zhu <[email protected]>

* fix

Signed-off-by: Enwei Zhu <[email protected]>

---------

Signed-off-by: Enwei Zhu <[email protected]>
* fix bug of fused_moe on tp > 1

Signed-off-by: bhsueh <[email protected]>

* refine codes

Signed-off-by: bhsueh <[email protected]>

---------

Signed-off-by: bhsueh <[email protected]>
* beam_width and max_new_token

Signed-off-by: Superjomn <[email protected]>

* remove beam_width

Signed-off-by: Superjomn <[email protected]>

* remove min_length

Signed-off-by: Superjomn <[email protected]>

* remove return_num_sequences

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

Signed-off-by: Superjomn <[email protected]>

---------

Signed-off-by: Superjomn <[email protected]>
…1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models (#3864)

* tests: skip writing prepare_dataset output to logs

Signed-off-by: Ruodi <[email protected]>

* test: add llama_v3.1_8b_fp8 model, llama_v3.1_405b model and llama_nemotron_49b model in perf test, and modify original llama models dtype from float16 to bfloat16 according to README.md

Signed-off-by: Ruodi <[email protected]>

---------

Signed-off-by: Ruodi <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
* [TRTLLM-4051] Support running only selected backend-type tests

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix name

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix pre-commit

Signed-off-by: ZhanruiSunCh <[email protected]>

* Fix groovy error

Signed-off-by: ZhanruiSunCh <[email protected]>

* Update L0_Test.groovy

Signed-off-by: Zhanrui Sun <[email protected]>

---------

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
@pull pull bot merged commit 62cfe74 into LarryXFly:main May 7, 2025