[None][feat] AutoDeploy: Perf improvement for mamba layers #8991

nvchenghaoz · 2025-11-07T01:25:08Z

Summary by CodeRabbit

Bug Fixes
- Fixed decoding phase calculations in Mamba model operations for improved correctness during inference.

Signed-off-by: Chenghao Zhang <[email protected]>

coderabbitai · 2025-11-07T01:28:05Z

Walkthrough

Both files contain decode-phase optimizations for the Mamba model. The CUDA backend simplifies decoding index calculation by replacing index-based copying with direct slicing using offsets. The Triton backend removes redundant dt_pre computation, instead passing dt_hp directly to selective_state_update with softplus enabled.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Mamba CUDA backend decode optimization `tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py` | Replaces decoding index calculation and in-place index_copy_ operation with direct sliced copy_. Uses total_prefill_tokens and num_decode offsets for explicit slice bounds, ensuring dtype consistency via to(). |
| Mamba Triton backend dt computation `tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py` | Removes dt_pre computation (softplus and clipping) in decode path. Passes dt_hp directly to selective_state_update with dt_bias_hp as bias and dt_softplus enabled, replacing previous zero-bias non-softplus path. |
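
The CUDA backend change described above can be illustrated with a minimal sketch. It is not the PR's code: plain Python lists stand in for GPU tensors, the function names are hypothetical, and it assumes the decode tokens sit contiguously right after the prefill tokens in the flattened output, which is what the `total_prefill_tokens` and `num_decode` slice bounds suggest.

```python
# Illustrative only: plain Python lists stand in for GPU tensors.
# All function names are hypothetical, not taken from the PR.


def write_decode_by_index(y_flat, decode_out, total_prefill_tokens):
    """Index-based variant: build an explicit index list, then scatter."""
    num_decode = len(decode_out)
    decode_idx = [total_prefill_tokens + i for i in range(num_decode)]
    for src, dst in enumerate(decode_idx):
        y_flat[dst] = decode_out[src]
    return y_flat


def write_decode_by_slice(y_flat, decode_out, total_prefill_tokens):
    """Slice-based variant: the decode region is contiguous, so one slice suffices."""
    num_decode = len(decode_out)
    start = total_prefill_tokens
    y_flat[start:start + num_decode] = decode_out
    return y_flat


total_prefill_tokens = 5          # assumed layout: prefill tokens come first
decode_out = [0.1, 0.2, 0.3]      # one output per decode sequence

a = write_decode_by_index([0.0] * 8, decode_out, total_prefill_tokens)
b = write_decode_by_slice([0.0] * 8, decode_out, total_prefill_tokens)
print(a == b)
```

Under that layout assumption both variants write the same values to the same positions; the slice form simply avoids materializing an index list first.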

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20–25 minutes

- Triton backend changes may require verification that the dt_softplus parameter produces numerically equivalent results and doesn't affect model accuracy.
- CUDA backend dtype handling should be verified to ensure the explicit .to(y_flat.dtype) conversion doesn't introduce unexpected precision changes.
- Both files involve low-level GPU kernels where subtle logic changes could have significant performance or numerical implications.
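
The first verification point can be sketched as a scalar check. This is only an illustration under stated assumptions: the helper names are hypothetical, the softplus threshold of 20 is an assumed kernel convention, and the clip bounds `lo` and `hi` are stand-ins for whatever limits the removed dt_pre path applied. It compares a "softplus and clip outside the kernel" path against a "bias and softplus fused into the update" path.

```python
import math


def softplus(x, threshold=20.0):
    # For large x, softplus(x) is approximately x; the threshold is an assumption.
    return x if x > threshold else math.log1p(math.exp(x))


def dt_precomputed(dt, dt_bias, lo, hi):
    """Old-style path (hypothetical): bias, softplus, then clip before the update."""
    return min(max(softplus(dt + dt_bias), lo), hi)


def dt_fused(dt, dt_bias):
    """New-style path (hypothetical): bias add and softplus applied inside the update."""
    return softplus(dt + dt_bias)


samples = [(-3.0, 0.5), (0.0, 0.0), (2.5, -1.0), (25.0, 1.0)]

# With clip bounds that never bind, the two paths agree exactly.
no_clip_equal = all(
    dt_precomputed(d, b, 0.0, math.inf) == dt_fused(d, b) for d, b in samples
)

# With a binding upper bound, dropping the clip changes the result.
tight_clip_differs = dt_precomputed(5.0, 0.0, 0.0, 1.0) != dt_fused(5.0, 0.0)

print(no_clip_equal, tight_clip_differs)
```

The sketch suggests the equivalence hinges on the removed clipping being a no-op for the model's configuration, which is the case worth checking against real checkpoints.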

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is incomplete. Only '@coderabbitai summary' was provided without any actual description, test coverage, or checklist completion. | Complete the PR description with sections explaining the issue/solution, test coverage, and completion of the PR checklist as specified in the template. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Title check | ✅ Passed | The title clearly and directly summarizes the main change: a performance improvement for mamba layers in the AutoDeploy module, which matches the file modifications and optimization changes described in the raw summary. |

nvchenghaoz · 2025-11-07T21:22:18Z

/bot run

tensorrt-cicd · 2025-11-07T21:31:12Z

PR_Github #23878 [ run ] triggered by Bot. Commit: 45fbb9d

tensorrt-cicd · 2025-11-07T22:51:11Z

PR_Github #23878 [ run ] completed with state SUCCESS. Commit: 45fbb9d
/LLM/main/L0_MergeRequest_PR pipeline #17975 completed with status: 'FAILURE'

Signed-off-by: Chenghao Zhang <[email protected]>

suyoggupta · 2025-11-08T17:39:07Z

/bot run

tensorrt-cicd · 2025-11-08T17:45:04Z

PR_Github #23903 [ run ] triggered by Bot. Commit: 76530a4

tensorrt-cicd · 2025-11-08T18:59:13Z

PR_Github #23903 [ run ] completed with state SUCCESS. Commit: 76530a4
/LLM/main/L0_MergeRequest_PR pipeline #17995 completed with status: 'FAILURE'

suyoggupta · 2025-11-08T20:55:34Z

/bot run

tensorrt-cicd · 2025-11-08T21:01:45Z

PR_Github #23905 [ run ] triggered by Bot. Commit: c63abe0

tensorrt-cicd · 2025-11-08T22:18:18Z

PR_Github #23905 [ run ] completed with state SUCCESS. Commit: c63abe0
/LLM/main/L0_MergeRequest_PR pipeline #17997 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <[email protected]>

suyoggupta · 2025-11-09T01:16:09Z

/bot run

tensorrt-cicd · 2025-11-09T01:22:38Z

PR_Github #23907 [ run ] triggered by Bot. Commit: 8eb0c25

tensorrt-cicd · 2025-11-09T02:54:40Z

PR_Github #23907 [ run ] completed with state SUCCESS. Commit: 8eb0c25
/LLM/main/L0_MergeRequest_PR pipeline #17999 completed with status: 'FAILURE'

Signed-off-by: Suyog Gupta <[email protected]>

suyoggupta · 2025-11-09T03:32:05Z

/bot run

tensorrt-cicd · 2025-11-09T03:37:45Z

PR_Github #23908 [ run ] triggered by Bot. Commit: 324181a

tensorrt-cicd · 2025-11-09T04:50:12Z

PR_Github #23908 [ run ] completed with state SUCCESS. Commit: 324181a
/LLM/main/L0_MergeRequest_PR pipeline #18000 completed with status: 'FAILURE'

tensorrt-cicd · 2025-11-09T06:54:50Z

PR_Github #23914 [ run ] triggered by Bot. Commit: 324181a

tensorrt-cicd · 2025-11-09T07:50:40Z

PR_Github #23914 [ run ] completed with state SUCCESS. Commit: 324181a
/LLM/main/L0_MergeRequest_PR pipeline #18004 completed with status: 'FAILURE'

suyoggupta · 2025-11-09T08:30:41Z

/bot run

tensorrt-cicd · 2025-11-09T08:37:07Z

PR_Github #23916 [ run ] triggered by Bot. Commit: 324181a

tensorrt-cicd · 2025-11-09T09:33:47Z

PR_Github #23916 [ run ] completed with state SUCCESS. Commit: 324181a
/LLM/main/L0_MergeRequest_PR pipeline #18006 completed with status: 'FAILURE'

suyoggupta · 2025-11-09T19:04:53Z

/bot run

tensorrt-cicd · 2025-11-09T19:11:20Z

PR_Github #23930 [ run ] triggered by Bot. Commit: 324181a

tensorrt-cicd · 2025-11-09T20:10:58Z

PR_Github #23930 [ run ] completed with state SUCCESS. Commit: 324181a
/LLM/main/L0_MergeRequest_PR pipeline #18019 completed with status: 'FAILURE'

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py

nvchenghaoz · 2025-11-10T17:35:59Z

/bot run

tensorrt-cicd · 2025-11-10T17:42:00Z

PR_Github #24039 [ run ] triggered by Bot. Commit: 324181a

tensorrt-cicd · 2025-11-10T18:44:33Z

PR_Github #24039 [ run ] completed with state SUCCESS. Commit: 324181a
/LLM/main/L0_MergeRequest_PR pipeline #18113 completed with status: 'FAILURE'

Signed-off-by: Chenghao Zhang <[email protected]>

nvchenghaoz · 2025-11-10T19:08:32Z

/bot run

tensorrt-cicd · 2025-11-10T19:14:33Z

PR_Github #24044 [ run ] triggered by Bot. Commit: eb7c92b

tensorrt-cicd · 2025-11-11T00:27:10Z

PR_Github #24044 [ run ] completed with state SUCCESS. Commit: eb7c92b
/LLM/main/L0_MergeRequest_PR pipeline #18118 completed with status: 'FAILURE'

nvchenghaoz · 2025-11-11T00:29:25Z

/bot run

tensorrt-cicd · 2025-11-11T00:35:36Z

PR_Github #24060 [ run ] triggered by Bot. Commit: eb7c92b

tensorrt-cicd · 2025-11-11T04:27:31Z

PR_Github #24060 [ run ] completed with state SUCCESS. Commit: eb7c92b
/LLM/main/L0_MergeRequest_PR pipeline #18132 completed with status: 'FAILURE'

nvchenghaoz · 2025-11-11T04:35:16Z

/bot run

tensorrt-cicd · 2025-11-11T04:40:35Z

PR_Github #24105 [ run ] triggered by Bot. Commit: eb7c92b

tensorrt-cicd · 2025-11-11T06:52:41Z

PR_Github #24105 [ run ] completed with state SUCCESS. Commit: eb7c92b
/LLM/main/L0_MergeRequest_PR pipeline #18168 completed with status: 'SUCCESS'

Signed-off-by: Chenghao Zhang <[email protected]> Signed-off-by: Suyog Gupta <[email protected]> Co-authored-by: Suyog Gupta <[email protected]>

Perf improvement: Minor fixes

350a613

Signed-off-by: Chenghao Zhang <[email protected]>

nvchenghaoz requested a review from suyoggupta November 7, 2025 01:25

nvchenghaoz requested a review from a team as a code owner November 7, 2025 01:25

github-project-automation bot added this to AutoDeploy Board Nov 7, 2025

github-project-automation bot moved this to Backlog in AutoDeploy Board Nov 7, 2025

nvchenghaoz changed the title ~~[None][Feat] AutoDeploy: Perf improvement for mamba layers.~~ [None][feat] AutoDeploy: Perf improvement for mamba layers. Nov 7, 2025

nvchenghaoz changed the title ~~[None][feat] AutoDeploy: Perf improvement for mamba layers.~~ [None][feat] AutoDeploy: Perf improvement for mamba layers Nov 7, 2025

suyoggupta approved these changes Nov 7, 2025

github-project-automation bot moved this from Backlog to In review in AutoDeploy Board Nov 7, 2025

Merge branch 'main' into chenghao/perf-nemotron-1106

45fbb9d

Add conv act fusion

76530a4

Signed-off-by: Chenghao Zhang <[email protected]>

Merge branch 'main' into chenghao/perf-nemotron-1106

c63abe0

fix unit tests

8eb0c25

Signed-off-by: Suyog Gupta <[email protected]>

fix tests

324181a

Signed-off-by: Suyog Gupta <[email protected]>

lucaslie reviewed Nov 10, 2025

nvchenghaoz added 2 commits November 10, 2025 11:06

Address reviewer's comments

e833d2f

Signed-off-by: Chenghao Zhang <[email protected]>

Merge branch 'main' into chenghao/perf-nemotron-1106

eb7c92b

nvchenghaoz self-assigned this Nov 10, 2025

lucaslie mentioned this pull request Nov 10, 2025

[Feature]: AutoDeploy: update fuse_causal_conv pattern matcher with new utility #9051 (Closed, 1 task)

lucaslie approved these changes Nov 10, 2025

Wanli-Jiang mentioned this pull request Nov 11, 2025

[None][feat] Nano-v3 stack PRs v2 #9062 (Draft)

nvchenghaoz merged commit ec9cf71 into NVIDIA:main Nov 11, 2025
5 checks passed

github-project-automation bot moved this from In review to Done in AutoDeploy Board Nov 11, 2025
