forked from NVIDIA/TensorRT-LLM
[pull] main from NVIDIA:main #73
Merged
Conversation
Signed-off-by: bhsueh <[email protected]>
* Remove results.xml when no cases ran
* Change some test config to verify
* Update for quotes
* Move the results.xml removal into the catch section
* Add missed path
* Change back the test stage setting

Signed-off-by: qqiao <[email protected]>
* Update num_of_ctx_tokens in iteration stats
* Revert unnecessary change to module importing
* cacheTransceiver buffer manager
* fix args
* cpp kvCacheManager
* format

Signed-off-by: Chuang Zhu <[email protected]>
…ng wa… (#3852)

* Add warmup flag to py_executor to prevent enabling the profiler during warmup
* Fix pre-commit bug
* Set the warmup flag on all ranks

Signed-off-by: bhsueh <[email protected]>
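The commit above guards the profiler behind a warmup flag so warmup iterations never pollute profiles. A minimal sketch of that pattern, with illustrative names only (`PyExecutor`, `profile_events` are not TensorRT-LLM's actual API):

```python
# Hypothetical sketch: gate profiling behind a warmup flag so that
# warmup iterations are never recorded, only real execution steps.

class PyExecutor:
    def __init__(self):
        self.is_warmup = False
        self.profile_events = []

    def _maybe_profile(self, step):
        # Skip profiling entirely while warmup is in progress.
        if self.is_warmup:
            return
        self.profile_events.append(step)

    def warmup(self, steps=3):
        self.is_warmup = True
        try:
            for i in range(steps):
                self._maybe_profile(("warmup", i))
        finally:
            # Always clear the flag, even if warmup raises.
            self.is_warmup = False

    def run(self, steps=2):
        for i in range(steps):
            self._maybe_profile(("run", i))

ex = PyExecutor()
ex.warmup()
ex.run()
print(ex.profile_events)  # only the real run steps are recorded
```

The `try/finally` matters: if a warmup iteration throws, the flag is still reset, so later real iterations are profiled normally.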
* add submit_sync to RemoteMpiSessionClient
* add barrier
* fix comment
* disable test
* fix

Signed-off-by: Superjomn <[email protected]>
* infra: install Triton in the base image
* install Triton from the base image
* update base image
* Address review comments
* update base image
* waive test

Signed-off-by: Iman Tabrizian <[email protected]>
#3764)

* Fix bug where a CUDA stream created as a default parameter was initialized at import time
* Add torch.cuda.Stream() for the leader node
* Fix pre-commit issue

Signed-off-by: bhsueh <[email protected]>
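The bug fixed above is the classic Python pitfall that default arguments are evaluated once, at function-definition time, i.e. when the module is imported; a `torch.cuda.Stream()` default would therefore touch the GPU on import. A self-contained illustration of the pitfall and the lazy `None` fix, using a stand-in class instead of a real CUDA stream:

```python
# Illustrative sketch (not TensorRT-LLM code): Python evaluates default
# arguments when the `def` statement runs, i.e. at import time.

created = []

class FakeStream:
    """Stand-in for torch.cuda.Stream() to show *when* defaults run."""
    def __init__(self):
        created.append(self)

# BAD: the stream is constructed as soon as this def executes (import time),
# before the function is ever called.
def launch_bad(stream=FakeStream()):
    return stream

# GOOD: defer creation until the function is actually called.
def launch_good(stream=None):
    if stream is None:
        stream = FakeStream()
    return stream

# One FakeStream already exists, even though no function was called yet:
assert len(created) == 1
launch_good()
assert len(created) == 2
```

Every call to `launch_bad()` also shares that single import-time instance, which is a second reason this pattern causes subtle bugs.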
Signed-off-by: Yanchao Lu <[email protected]>
Signed-off-by: Zhenhuan Chen <[email protected]>
Signed-off-by: xinhe-nv <[email protected]>
* update waive list
* update waives

Signed-off-by: xinhe-nv <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
Signed-off-by: taoli <[email protected]>
Co-authored-by: taoli <[email protected]>
…olding (#3807) Signed-off-by: Zhenhuan Chen <[email protected]>
* Add docs about DeepSeek-R1 long context support
* update docs
* reformat

Signed-off-by: Xianjie <[email protected]>
…3906) Signed-off-by: Zhenhuan Chen <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Yukun He <[email protected]>
Co-authored-by: Kefeng-Duan <[email protected]>
* test: add deepseek v3 & r1 cases

Signed-off-by: Xiwen Yu <[email protected]>
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Yuxian Qiu <[email protected]>
…or reproducibility in attention tests (#3919) Signed-off-by: qixiang-99 <[email protected]>
Signed-off-by: fredw (generated by with_the_same_user script) <[email protected]>
Signed-off-by: Hao Lu <[email protected]@users.noreply.github.com>
Co-authored-by: Hao Lu <[email protected]@users.noreply.github.com>
* update cubins
* add trtllm-gen kernels for eagle3 and also kernels with cga-reduction
* address the comments

Signed-off-by: Perkz Zheng <[email protected]>
Signed-off-by: junq <[email protected]>
* Update gen TPS calculation.
* Add back output speed for comparison.
* Fix issue with f-string.
* Fix some spacing.
* Replace output speed with per-request gen-phase throughput.
* Add gen TPS breakdown.
* Update some tagging.

Signed-off-by: Frank Di Natale <[email protected]>
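The distinction the commit above draws is between overall "output speed" and per-request generation-phase throughput, which excludes the context (prefill) phase. A hedged sketch of such a computation; the function and field names are illustrative, not trtllm-bench's actual record format:

```python
# Hypothetical per-request generation-phase tokens/sec calculation.
# The first output token is produced by the context phase, so only the
# remaining tokens count toward generation-phase throughput.

def gen_phase_tps(num_output_tokens, first_token_time_s, end_time_s):
    """Tokens generated per second, measured from first token to completion."""
    gen_time = end_time_s - first_token_time_s
    if gen_time <= 0 or num_output_tokens <= 1:
        return 0.0
    return (num_output_tokens - 1) / gen_time

# Example: 129 output tokens, first token at t=0.5s, finished at t=2.5s
# -> 128 tokens over 2.0s of pure generation.
print(gen_phase_tps(129, 0.5, 2.5))  # 64.0
```

Averaging this per-request value over a run gives a different (and usually higher) number than dividing total output tokens by total wall time, which is why both are reported for comparison.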
Signed-off-by: Mike Iovine <[email protected]>
Signed-off-by: Balaram Buddharaju <[email protected]>
* add deepseek-r1 reasoning parser
* fix test

Signed-off-by: pansicheng <[email protected]>
Signed-off-by: Pengyun Lin <[email protected]>
Co-authored-by: Pengyun Lin <[email protected]>
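A reasoning parser of the kind added above separates a model's chain-of-thought from its final answer. A minimal sketch for the DeepSeek-R1-style `<think>…</think>` delimiters; this shows the idea only and does not mirror the real parser, which also has to handle streamed, partially-delivered tags:

```python
import re

# Split DeepSeek-R1-style output into (reasoning, content).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def parse_reasoning(text):
    """Return (reasoning, content); reasoning is empty if no think block."""
    m = THINK_RE.search(text)
    if m is None:
        return "", text.strip()
    reasoning = m.group(1).strip()
    # Everything outside the think block is the user-visible answer.
    content = (text[:m.start()] + text[m.end():]).strip()
    return reasoning, content

r, c = parse_reasoning("<think>2+2 is 4</think>The answer is 4.")
print(r)  # 2+2 is 4
print(c)  # The answer is 4.
```

Returning the no-match case unchanged keeps the parser safe for models (or prompts) that emit no reasoning block at all.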
* fix bug of qwen3 moe
* update threshold

Signed-off-by: bhsueh <[email protected]>
* update qwen3 document
* remove useless codes

Signed-off-by: bhsueh <[email protected]>
…4024)

* reuse batch_indices, positions across layers
* fix flashinfer unit tests
* simplify call to get_batch_indices_positions
* fix call to get_batch_indices_positions

Signed-off-by: Suyog Gupta <[email protected]>
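The optimization above computes batch indices and positions once per forward pass and shares the result across all attention layers, instead of recomputing identical tensors per layer. A sketch of that caching pattern with assumed names (`AttentionMetadata` here is illustrative, not flashinfer's or TensorRT-LLM's actual class):

```python
# Illustrative sketch: per-iteration metadata computed once and reused
# by every layer in the same forward pass.

def compute_batch_indices_positions(seq_lens):
    """Flatten (request, position) pairs for a batch of sequences."""
    batch_indices, positions = [], []
    for req_id, n in enumerate(seq_lens):
        for pos in range(n):
            batch_indices.append(req_id)
            positions.append(pos)
    return batch_indices, positions

class AttentionMetadata:
    """Holds metadata shared by all layers of one forward pass."""
    def __init__(self, seq_lens):
        self.seq_lens = seq_lens
        self._cache = None

    def batch_indices_positions(self):
        # Computed lazily on first use; every later layer reuses it.
        if self._cache is None:
            self._cache = compute_batch_indices_positions(self.seq_lens)
        return self._cache

meta = AttentionMetadata([2, 3])
first_layer = meta.batch_indices_positions()
second_layer = meta.batch_indices_positions()
assert first_layer is second_layer  # no recomputation
print(first_layer[0])  # [0, 0, 1, 1, 1]
```

Since the metadata object is rebuilt each iteration, the cache never goes stale across batches; within one iteration the savings scale with the number of layers.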
Signed-off-by: Jinyang Yuan <[email protected]>
Signed-off-by: Yuan Tong <[email protected]>
Signed-off-by: Hui Gao <[email protected]>
Signed-off-by: Yuxian Qiu <[email protected]>
The shape was incorrectly changed in the DecoderState introduction.

Signed-off-by: Robin Kobus <[email protected]>
…fo in the Jenkins job page (#3859)

* infra: Support showing base info and links for the pipeline
* Move code to shared lib
* Remove unused code
* Update Build.groovy
* Update L0_MergeRequest.groovy
* Update L0_Test.groovy

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
…te (#3836)

* Remove stdout pipe for genai-perf and make stress time a public parameter.
* Update llmRequest based on comment.
* Refactor launch process function.

Signed-off-by: Wangshanshan <[email protected]>
* disable overlap in encoder
* feat: invokeGatherBatch
* feat: overlap same batch
* chore: add enableTrtOverlap to ExecutorConfig
* disable overlap for beam search and spec decode
* skip overlap tests with beam search or speculative decoding
* moveFinishedContextRequestsToGeneration and skip unfinished requests in updateRequests
* enable overlap in GptChunkedLongContextTests
* feat: Enable overlap in gptManagerBenchmark
* feat: Improve early exit
* refactor: Use OptionalRef for newOutputTokens tensor
* feat: Add overlap scheduling support to TRTLLMDecoder
  - Updated TRTLLMDecoder to accept an `enable_overlap_scheduler` parameter.
  - Modified the decoder's internal logic to utilize the overlap scheduling feature.
  - Adjusted the sequence lengths handling to ensure compatibility with the new scheduling approach.
  - Enhanced unit tests to include cases for the overlap scheduler with the TRTLLMDecoder.
* fix: allNewTokens in PP

Signed-off-by: Robin Kobus <[email protected]>
* Properly get decoding mode using the same logic as the C++ implementation.
* Cross-reference getDecodingMode implementations in PyTorch and C++.
* Better bindings for DecodingMode.
* Revert to version in main.
* Fix.
* Revert configuration.py.

Signed-off-by: Daniel Campora <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Alexandre Milesi <[email protected]>
Co-authored-by: Alexandre Milesi <[email protected]>
Co-authored-by: Haohang Huang <[email protected]>
* **Model:** Llama-3.1-Nemotron-Nano-8B-v1
* **Precision:** float16
* **Environment:**
  * GPUs: 1 H100 PCIe
  * Driver: 570.86.15

| Test String | Request Throughput (req/sec) | Total Token Throughput (tokens/sec) | Average Request Latency (ms) |
|---|---|---|---|
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:128,128` | 81.86 | 20956.44 | 5895.24 |
| `llama_v3.1_nemotron_nano_8b-bench-pytorch-float16-input_output_len:2000,2000` | 1.45 | 5783.92 | 211541.08 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:128,128` | 52.75 | 13505.00 | 5705.50 |
| `llama_v3.1_nemotron_nano_8b-bench-float16-maxbs:128-input_output_len:2000,2000` | 1.41 | 5630.76 | 217139.59 |

Signed-off-by: Venky Ganesh <[email protected]>
Signed-off-by: Kaiyu Xie <[email protected]>
Signed-off-by: jiahanc <[email protected]>
Co-authored-by: jiahanc <[email protected]>
Signed-off-by: Chuang Zhu <[email protected]>
#3985)

* fix
* fix

Signed-off-by: Enwei Zhu <[email protected]>
* fix bug of fused_moe on tp > 1
* refine codes

Signed-off-by: bhsueh <[email protected]>
) Signed-off-by: Rakib Hasan <[email protected]>
* beam_width and max_new_token
* remove beam_width
* remove min_length
* remove return_num_sequences

Signed-off-by: Superjomn <[email protected]>
Signed-off-by: Yanchao Lu <[email protected]>
…1_8b_fp8, llama_v3.3_70b_fp8, llama_v3.1_405b_fp4 models (#3864)

* tests: skip writing prepare_dataset output to logs
* test: add llama_v3.1_8b_fp8, llama_v3.1_405b, and llama_nemotron_49b models to the perf test, and change the original llama models' dtype from float16 to bfloat16 per README.md

Signed-off-by: Ruodi <[email protected]>
Signed-off-by: Larry <[email protected]>
Co-authored-by: Larry <[email protected]>
…mpletion (#3888) Signed-off-by: Pengyun Lin <[email protected]>
* [TRTLLM-4051] Support running only selected backend type tests
* Fix
* Fix name
* Fix pre-commit
* Fix groovy error
* Update L0_Test.groovy

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Signed-off-by: nv-guomingz <[email protected]>
Created by
pull[bot] (v2.0.0-alpha.1)