Aya-ZIbra

Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1875

Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite.

Reviewed By: sryap

Differential Revision: D80041532

netlify bot commented Sep 10, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

Name Link
🔨 Latest commit 5589cbb
🔍 Latest deploy log https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/68d209ae04735500084257c1
😎 Deploy Preview https://deploy-preview-4853--pytorch-fbgemm-docs.netlify.app

@meta-cla meta-cla bot added the cla signed label Sep 10, 2025
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D80041532

Gefei Zuo added 2 commits September 22, 2025 12:54
Summary:
D80992628 introduced SWA FWD kernel changes which did not support decode kernels (i.e., supporting sm100_fmha_fwd but not sm100_fmha_gen).
Similarly, softmax_scale introduced in D82788784 did not support decode kernels either.

In blackwell_fmha_test, these parameters are dropped during decode kernel selection (https://www.internalfb.com/code/fbsource/[cd7066706035]/fbcode/deeplearning/fbgemm/fbgemm_gpu/experimental/gen_ai/test/attention/blackwell_fmha_test.py?lines=182).

To avoid confusion, do not run test_decode with parameters that the decode kernel ignores.

Differential Revision: D82991496
Summary:
1. Reduce the number of pipeline stages to avoid exceeding the shared-memory (smem) limit.
2. Add a static_assert so that an smem capacity violation is raised at compile time rather than at runtime.
3. Select the TMEM intrinsics based on sizeof(Element).
4. Update the unit test to include bf16.
5. Label each decode kernel test name with its corresponding test parameters.

Differential Revision: D82991495
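
Items 2 and 3 of the summary above can be illustrated with a minimal C++ sketch. It is not the code from this diff: the names `kSmemCapacityBytes`, `PipelineStorage`, `TmemAccess`, `select_tmem_access`, and `DecodeKernelConfig`, the 128x128 tile shape, and the 227 KiB budget are all hypothetical placeholders chosen for illustration. The idea shown is only that a storage type's `sizeof` can be checked against a capacity constant at compile time, and that a width-specific code path can be chosen from `sizeof(Element)`.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed shared-memory budget in bytes (placeholder, not taken from the PR).
constexpr std::size_t kSmemCapacityBytes = 227 * 1024;

// Hypothetical per-kernel shared storage: one tile buffer per pipeline stage.
template <typename Element, int TileRows, int TileCols, int Stages>
struct PipelineStorage {
  Element tiles[Stages][TileRows][TileCols];
};

// True when the storage type fits inside the assumed budget.
template <typename Storage>
constexpr bool fits_in_smem() {
  return sizeof(Storage) <= kSmemCapacityBytes;
}

// Hypothetical tags standing in for width-specific TMEM access variants.
enum class TmemAccess { k8bit, k16bit, k32bit };

// Pick an access variant from the element width alone.
template <typename Element>
constexpr TmemAccess select_tmem_access() {
  static_assert(sizeof(Element) == 1 || sizeof(Element) == 2 ||
                    sizeof(Element) == 4,
                "unsupported element width");
  return sizeof(Element) == 1   ? TmemAccess::k8bit
         : sizeof(Element) == 2 ? TmemAccess::k16bit
                                : TmemAccess::k32bit;
}

// Instantiating this config fails to compile if the storage is too large,
// so the capacity violation never reaches runtime.
template <typename Element, int Stages>
struct DecodeKernelConfig {
  using Storage = PipelineStorage<Element, 128, 128, Stages>;
  static_assert(fits_in_smem<Storage>(),
                "shared storage exceeds smem capacity; reduce pipeline stages");
  static constexpr TmemAccess kTmemAccess = select_tmem_access<Element>();
};
```

With these placeholder numbers, `DecodeKernelConfig<std::uint16_t, 4>` (a 16-bit element standing in for bf16) needs 128 KiB and compiles, while `DecodeKernelConfig<std::uint16_t, 8>` would need 256 KiB and is rejected by the `static_assert` at compile time, which is the motivation for reducing the stage count.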
@facebook-github-bot

@Aya-ZIbra has exported this pull request. If you are a Meta employee, you can view the originating diff in D80041532.

Aya-ZIbra added a commit to Aya-ZIbra/FBGEMM that referenced this pull request Sep 22, 2025
Summary:
Pull Request resolved: pytorch#4853

X-link: facebookresearch/FBGEMM#1875

Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite.

Reviewed By: sryap

Differential Revision: D80041532
Aya-ZIbra added a commit to Aya-ZIbra/tritonbench that referenced this pull request Sep 23, 2025
Summary:
X-link: pytorch/FBGEMM#4853

X-link: facebookresearch/FBGEMM#1875

Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite.

Reviewed By: sryap

Differential Revision: D80041532
Summary:
X-link: meta-pytorch/tritonbench#376

Pull Request resolved: pytorch#4853

X-link: facebookresearch/FBGEMM#1875

Add the CUTLASS Blackwell FMHA decode kernel implementation to the TritonBench benchmarking suite.

Reviewed By: sryap

Differential Revision: D80041532