Add cutlass decode kernel to TritonBench #4853

Aya-ZIbra · 2025-09-10T17:25:58Z

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1875

Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite.

Reviewed By: sryap

Differential Revision: D80041532

netlify · 2025-09-10T17:26:04Z

✅ Deploy Preview for pytorch-fbgemm-docs ready!

Name	Link
🔨 Latest commit	`5589cbb`
🔍 Latest deploy log	https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68d209ae04735500084257c1
😎 Deploy Preview	https://deploy-preview-4853--pytorch-fbgemm-docs.netlify.app


facebook-github-bot · 2025-09-10T17:26:09Z

This pull request was exported from Phabricator. Differential Revision: D80041532

Summary: D80992628 introduced SWA FWD kernel changes that did not support decode kernels (i.e., they support sm100_fmha_fwd but not sm100_fmha_gen). Similarly, softmax_scale, introduced in D82788784, did not support decode kernels either. In blackwell_fmha_test, these parameters are dropped during decode kernel selection (https://www.internalfb.com/code/fbsource/[cd7066706035]/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/test/attention/blackwell_fmha_test.py?lines=182). To avoid confusion, do not run test_decode with ignored parameters. Differential Revision: D82991496
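The rule in this summary (only run the decode test with parameters the decode kernel actually honors) can be sketched as a filter over a test matrix. This is a minimal illustration, not the code in blackwell_fmha_test.py: `PARAM_GRID`, `DECODE_NEUTRAL`, `all_cases`, and `decode_cases` are hypothetical names, and the parameter values are made up for the example.

```python
from itertools import product

# Hypothetical parameter grid; the real test's parameters and values may differ.
PARAM_GRID = {
    "dtype": ["fp8", "bf16"],
    "window_size": [(-1, -1), (128, 0)],  # (-1, -1) stands for "no sliding window"
    "softmax_scale": [None, 0.125],       # None stands for "kernel default"
}

# Assumed neutral values for the parameters the decode path would ignore.
DECODE_NEUTRAL = {"window_size": (-1, -1), "softmax_scale": None}


def all_cases(grid):
    """Expand the grid into one dict per parameter combination."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def decode_cases(grid):
    """Keep only combinations where every decode-ignored parameter is neutral."""
    return [
        case
        for case in all_cases(grid)
        if all(case[key] == value for key, value in DECODE_NEUTRAL.items())
    ]


if __name__ == "__main__":
    for case in decode_cases(PARAM_GRID):
        print(case)
```

Filtering the matrix up front, rather than silently dropping the values inside the test, keeps each reported decode case aligned with what the kernel was actually asked to do.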

Summary:
1. Reduce pipeline stages to avoid exceeding the smem limit.
2. Add a static_assert so that an smem capacity violation is raised at compile time rather than at runtime.
3. Select the TMEM intrinsics based on sizeof(Element).
4. Update the unit test to include bf16.
5. Label decode kernel test names with their corresponding test parameters.

Differential Revision: D82991495
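The link between pipeline stages, element size, and the smem limit can be illustrated with some back-of-the-envelope arithmetic. This is a sketch under stated assumptions, not the kernel's actual layout: the 227 KiB capacity, the 128x128 tile, and the "one K tile plus one V tile per stage" model are all assumptions, and the helpers `stage_bytes`, `max_stages`, and `check_fits` are hypothetical. The C++ change uses a static_assert; `check_fits` is only a runtime analogue of that check.

```python
# Assumed per-block shared memory capacity in bytes; check the target GPU's documentation.
ASSUMED_SMEM_CAPACITY = 227 * 1024


def stage_bytes(tile_rows, head_dim, element_bytes):
    """Bytes for one pipeline stage, assuming one K tile and one V tile per stage."""
    return 2 * tile_rows * head_dim * element_bytes


def max_stages(capacity, tile_rows, head_dim, element_bytes):
    """Largest stage count whose buffers fit in the given capacity."""
    return capacity // stage_bytes(tile_rows, head_dim, element_bytes)


def check_fits(stages, capacity, tile_rows, head_dim, element_bytes):
    """Raise if the requested stage count would exceed the capacity."""
    needed = stages * stage_bytes(tile_rows, head_dim, element_bytes)
    if needed > capacity:
        raise ValueError(f"needs {needed} bytes of smem, capacity is {capacity}")
    return needed


if __name__ == "__main__":
    for name, size in (("1-byte element", 1), ("2-byte element (e.g. bf16)", 2)):
        print(name, "->", max_stages(ASSUMED_SMEM_CAPACITY, 128, 128, size), "stages")
```

Under these assumptions, doubling the element size roughly halves the number of stages that fit, which is one plausible reason a stage count tuned for a narrower type would need to shrink once bf16 is covered.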

facebook-github-bot · 2025-09-22T20:16:53Z

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating diff in D80041532.

Summary: Pull Request resolved: pytorch#4853 X-link: facebookresearch/FBGEMM#1875 Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite. Reviewed By: sryap Differential Revision: D80041532

facebook-github-bot · 2025-09-22T20:28:54Z

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating diff in D80041532.

Summary: Pull Request resolved: pytorch#4853 X-link: facebookresearch/FBGEMM#1875 Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite. Reviewed By: sryap Differential Revision: D80041532

Summary: X-link: pytorch/FBGEMM#4853 X-link: facebookresearch/FBGEMM#1875 Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite. Reviewed By: sryap Differential Revision: D80041532

Summary: X-link: meta-pytorch/tritonbench#376 Pull Request resolved: pytorch#4853 X-link: facebookresearch/FBGEMM#1875 Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite. Reviewed By: sryap Differential Revision: D80041532

facebook-github-bot · 2025-09-23T02:44:56Z

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating diff in D80041532.

meta-cla bot added the cla signed label Sep 10, 2025

facebook-github-bot added the fb-exported label Sep 10, 2025

Gefei Zuo added 2 commits September 22, 2025 12:54

facebook-github-bot added the meta-exported label Sep 22, 2025

Aya-ZIbra force-pushed the export-D80041532 branch from 8bead29 to 3b2e442 Compare September 22, 2025 20:16

Aya-ZIbra force-pushed the export-D80041532 branch from 3b2e442 to 68081f5 Compare September 22, 2025 20:28

Aya-ZIbra force-pushed the export-D80041532 branch from 68081f5 to 5589cbb Compare September 23, 2025 02:45
