[PyTorch] Support for cuDNN-backed flex attention by vcherepanov-nv · Pull Request #2984 · NVIDIA/TransformerEngine

vcherepanov-nv · 2026-05-13T03:18:15Z

Description

Adds experimental PyTorch support for cuDNN-backed flex attention in DotProductAttention via a new score_mod callback path.

Users can pass:

score_mod(graph, score, tensors) -> score for forward score modification
optional score_mod_bprop(graph, dP, tensors) -> dP for backward
optional runtime tensor dictionaries for forward/backward score-mod graph inputs

When score_mod_bprop is supplied, it is the user's responsibility to make it mathematically consistent with score_mod. TE forwards this callback to cuDNN as provided and does not derive or validate the backward score transformation automatically.
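
To illustrate what "mathematically consistent" means here, the sketch below pairs a softcap forward transformation with its derivative and checks them against a finite-difference estimate. It uses plain Python floats rather than cuDNN graph tensors, and every name in it (`softcap_fwd`, `softcap_bprop`, `CAP`) is hypothetical and not part of TE or cuDNN.

```python
import math

CAP = 30.0  # hypothetical softcap value, chosen only for illustration


def softcap_fwd(score: float, cap: float = CAP) -> float:
    """Forward score modification: f(s) = cap * tanh(s / cap)."""
    return cap * math.tanh(score / cap)


def softcap_bprop(d_out: float, score: float, cap: float = CAP) -> float:
    """Backward transformation: d_out * f'(s), where f'(s) = 1 - tanh(s / cap)**2."""
    return d_out * (1.0 - math.tanh(score / cap) ** 2)


def finite_difference(f, x: float, eps: float = 1e-5) -> float:
    """Central-difference estimate of df/dx, used as an independent reference."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


def max_mismatch(points) -> float:
    """Largest gap between the analytic backward and the numeric derivative."""
    return max(
        abs(softcap_bprop(1.0, s) - finite_difference(softcap_fwd, s)) for s in points
    )


worst = max_mismatch([-50.0, -1.0, 0.0, 2.5, 40.0])
```

A real score_mod_bprop would express the same derivative with cuDNN graph operations; a scalar check of this kind is only a cheap way to catch a mismatched pair before building the graph.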

Supported score_mod configuration

The current cuDNN-backed Flex Attention path supports:

PyTorch DotProductAttention / FusedAttention
FP16 or BF16 unquantized torch.Tensor Q/K/V inputs
SBHD or BSHD Q/K/V layouts
cuDNN F16/BF16 arbitrary-seqlen fused attention backend
attn_mask_type="no_mask"
core_attention_bias_type="no_bias" with no explicit bias tensor
vanilla softmax
attention_dropout=0.0
num_splits=1

The path is currently not supported with FP8, fp8_output, THD format, explicit cu_seqlens inputs, pad_between_seqs, attention masks, attention bias, ALiBi, sliding-window attention, sink attention, dropout, KV cache, context parallelism, CUDA graph capture, checkpointed core attention, or return_max_logit.
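
The constraints above can be read as a checklist. The following is a minimal sketch of such a checklist as a standalone function; `FlexAttentionConfig` and `unsupported_reasons` are hypothetical names, not TE classes or functions, and TE's actual validation lives in DotProductAttention and backend selection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FlexAttentionConfig:
    """Hypothetical bundle of settings for illustration; not a TE class."""

    dtype: str = "bf16"
    qkv_format: str = "bshd"
    attn_mask_type: str = "no_mask"
    core_attention_bias_type: str = "no_bias"
    softmax_type: str = "vanilla"
    attention_dropout: float = 0.0
    num_splits: int = 1
    fp8: bool = False
    context_parallel: bool = False


def unsupported_reasons(cfg: FlexAttentionConfig) -> List[str]:
    """Return one message per constraint from the list above that cfg violates."""
    reasons = []
    if cfg.fp8 or cfg.dtype not in ("fp16", "bf16"):
        reasons.append("Q/K/V must be unquantized FP16 or BF16")
    if cfg.qkv_format not in ("sbhd", "bshd"):
        reasons.append("only SBHD and BSHD layouts are supported")
    if cfg.attn_mask_type != "no_mask":
        reasons.append('attn_mask_type must be "no_mask"')
    if cfg.core_attention_bias_type != "no_bias":
        reasons.append('core_attention_bias_type must be "no_bias"')
    if cfg.softmax_type != "vanilla":
        reasons.append("only vanilla softmax is supported")
    if cfg.attention_dropout != 0.0:
        reasons.append("attention_dropout must be 0.0")
    if cfg.num_splits != 1:
        reasons.append("num_splits must be 1")
    if cfg.context_parallel:
        reasons.append("context parallelism is not supported")
    return reasons


ok = unsupported_reasons(FlexAttentionConfig())
bad = unsupported_reasons(FlexAttentionConfig(qkv_format="thd", attention_dropout=0.1))
```

Collecting all violations instead of failing on the first one is a design choice of this sketch; it makes it easier to report everything a user needs to change at once.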

For deterministic execution, TE passes the deterministic setting through backend selection and forwards it to cuDNN Frontend sdpa_backward as use_deterministic_algorithm. The score_mod forward sdpa call does not take a separate deterministic flag.

Fixes #2492.

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Adds FusedAttentionWithScoreModFunc, a cuDNN frontend Python graph path for SDPA forward/backward with score_mod and score_mod_bprop.
Extends DotProductAttention / FusedAttention APIs with score_mod, score_mod_bprop, score_mod_tensors, and score_mod_bprop_tensors.
Adds backend-selection filtering so score_mod only selects supported cuDNN fused attention configurations.
Adds execution-plan caching for forward and backward score-mod graphs, keyed by tensor metadata, layout, scale, callback topology, and runtime tensor metadata.
Supports explicit score_mod_graph_cache_key() for stateful callbacks, while leaving unsafe unkeyed bound methods uncached.
Executes cuDNN graphs on PyTorch's current CUDA stream and preserves SBHD/BSHD layouts without extra BHSD copies.
Adds validation for unsupported combinations including FP8, context parallelism, THD, KV cache, explicit masks/biases, dropout, non-vanilla softmax, CUDA graph capture, and checkpointed core attention.
Adds tests for cache-key behavior, unsafe callback caching, runtime tensor version checking, and CUDA correctness cases covering causal masking, softcap, and post-scale-bias-style score modification.
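
The callback part of the cache key could be derived roughly as follows. This is a sketch under stated assumptions, not the logic in flex_attention.py: only the method name score_mod_graph_cache_key() comes from this PR, while `callback_cache_key` and the rule of keying plain functions by identity are assumptions made for illustration.

```python
import inspect
from typing import Any, Callable, Hashable, Optional


def callback_cache_key(callback: Callable[..., Any]) -> Optional[Hashable]:
    """Return a hashable key for callback, or None when caching would be unsafe."""
    # For a bound method, look at the instance it is bound to; otherwise at the callable itself.
    owner = getattr(callback, "__self__", callback)
    key_fn = getattr(owner, "score_mod_graph_cache_key", None)
    if callable(key_fn):
        # Explicit opt-in: the stateful callback describes what affects its graph.
        name = getattr(callback, "__name__", "__call__")
        return (type(owner).__qualname__, name, key_fn())
    if inspect.isfunction(callback):
        # Assumption: a plain function or lambda is keyed by its identity.
        return ("function", callback)
    # Unkeyed bound methods and callable instances may hide mutable state: do not cache.
    return None


class Softcap:
    """Stateful callback that opts in to caching via an explicit key."""

    def __init__(self, cap: float) -> None:
        self.cap = cap

    def __call__(self, graph, score, tensors):
        return score  # placeholder body; a real callback would add graph nodes

    def score_mod_graph_cache_key(self):
        return ("softcap", self.cap)


class Unkeyed:
    def mod(self, graph, score, tensors):
        return score  # placeholder body


def plain(graph, score, tensors):
    return score  # placeholder body
```

With these definitions, two `Softcap(30.0)` instances map to the same key, `Softcap(50.0)` maps to a different one, and `Unkeyed().mod` yields `None`, so its graph would be rebuilt on every call.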

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

for more information, see https://pre-commit.ci

greptile-apps · 2026-05-13T03:26:45Z

Greptile Summary

This PR adds experimental cuDNN-backed Flex Attention to DotProductAttention and FusedAttention via a score_mod callback path. A new FusedAttentionWithScoreModFunc autograd Function builds and caches cuDNN frontend Python graphs for forward and backward, with a sophisticated cache-key scheme that safely handles lambdas, bound methods, and stateful callable instances.

Introduces flex_attention.py with cuDNN graph construction, execution, and cache-key logic; extends DotProductAttention / FusedAttention APIs with score_mod, score_mod_bprop, score_mod_tensors, and score_mod_bprop_tensors parameters.
Adds backend-selection filtering in get_attention_backend gating score_mod to the F16_arbitrary_seqlen fused-attention sub-backend only, and disabling FlashAttention and unfused paths.
Ships a comprehensive test suite covering cache-key correctness, version-counter safety, and CUDA correctness for causal, softcap, and post-scale-bias score modifications.

Confidence Score: 5/5

The PR is safe to merge. No code path executes incorrect attention computation under the supported configurations, and the gate assertions in DotProductAttention block every unsupported combination before reaching cuDNN.

The core cuDNN graph construction, execution, caching, and autograd wiring are all correct. The two findings are both non-blocking style/documentation concerns: the flash-attention master-switch helper relies on a late AND-reduction that works correctly today, and the silent gradient drop for requires_grad tensors in score_mod dicts is a footgun to document rather than a runtime defect for the described use cases.

flex_attention.py (requires_grad footgun for score_mod_tensors) and utils.py (_disable_all_flash_attention robustness) are the two files worth a second look before follow-on work extends the backend-selection filter chain.

Important Files Changed

Filename	Overview
transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py	New cuDNN frontend Python SDPA path — graph construction, caching (with safe lambda/bound-method keying), execution, and autograd Function. Core logic is sound; requires_grad on score_mod_tensors silently produces no gradients.
transformer_engine/pytorch/attention/dot_product_attention/utils.py	Adds score_mod backend filtering and new AttentionParams fields; fixes the fp8_meta None guard in __eq__. _disable_all_flash_attention only directly disables the master flag, relying on a late AND-reduction, which works today but is fragile.
transformer_engine/pytorch/attention/dot_product_attention/backends.py	Adds score_mod dispatch branch in FusedAttention.forward; correctly placed as elif before the standard FusedAttnFunc path.
transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py	Extends DotProductAttention API with score_mod parameters; validation block correctly asserts structural constraints including score_mod_bprop_tensors requiring score_mod_bprop.
tests/pytorch/attention/test_flex_attention.py	New test file; covers cache key correctness, version-counter safety, and CUDA correctness for causal/softcap/post-scale-bias cases. CPU tensors passed to CUDA cuDNN score_mod_tensors (noted in prior threads).
tests/pytorch/utils.py	Adds has_score_mod/has_score_mod_bprop flags to get_available_attention_backends; straightforward addition.
qa/L0_pytorch_unittest/test.sh	Adds test_flex_attention.py to the CI test suite.

Sequence Diagram

sequenceDiagram
    participant User
    participant DPA as DotProductAttention
    participant BS as get_attention_backend
    participant FA as FusedAttention
    participant Func as FusedAttentionWithScoreModFunc
    participant Cache as _cudnn_score_mod_graph_cache
    participant cuDNN as cuDNN Frontend

    User->>DPA: "forward(q,k,v, score_mod=..., score_mod_tensors=...)"
    DPA->>BS: "get_attention_backend(has_score_mod=True)"
    BS-->>DPA: "use_fused_attention=True (F16_arbitrary_seqlen only)"
    DPA->>FA: "forward(q,k,v, score_mod=..., score_mod_tensors=...)"
    FA->>Func: apply(is_training, q,k,v, score_mod, ...)

    Func->>Cache: _get_cudnn_score_mod_fwd_graph(key)
    alt Cache miss
        Cache->>cuDNN: "build pygraph + sdpa(score_mod=wrapped_cb)"
        cuDNN->>User: score_mod(graph, score_tensor, tensors) to score
        cuDNN-->>Cache: compiled graph entry
        Cache-->>Func: _CudnnScoreModFwdGraphEntry
    else Cache hit
        Cache-->>Func: cached entry
    end

    Func->>cuDNN: "execute(variant_pack={q,k,v,output,stats,score_mod_tensors})"
    cuDNN-->>Func: output, stats
    Func-->>User: output

    User->>Func: backward(d_out)
    Func->>Cache: _get_cudnn_score_mod_bwd_graph(key)
    alt Cache miss
        Cache->>cuDNN: build pygraph + sdpa_backward(score_mod, score_mod_bprop)
        cuDNN->>User: score_mod_bprop(graph, dP, tensors) to dP
        cuDNN-->>Cache: compiled backward graph entry
    end
    Func->>cuDNN: "execute(variant_pack={q,k,v,o,dO,stats,dq,dk,dv})"
    cuDNN-->>Func: dq, dk, dv
    Func-->>User: dq, dk, dv

Reviews (13): Last reviewed commit: "Address Flex Attention review comments"

KshitijLakhani · 2026-05-19T19:50:45Z

Thanks for creating this PR @vcherepanov-nv
This is great!

I was curious about:

1. Do you have benchmark numbers based on any toy test cases you might have run? It would be good to have those in here for users of the API:
   1. native PyT flex vs TE PyT flex
   2. traditional causal TE via cuDNN vs flex-expressed causal TE via cuDNN
2. I've linked the GH issue in the PR description. Could you please update / close it appropriately when this PR is merged?

Thanks!

vcherepanov-nv · 2026-05-19T22:11:03Z

Thanks for the thorough review!

Do you have benchmark numbers based on any toy test cases you might have run? It would be good to have those in here for users of the API.

native PyT flex vs TE PyT flex

traditional causal TE via cuDNN vs flex-expressed causal TE via cuDNN

I haven't done any benchmarking. Reportedly (from a Slack thread), score_mod can lead to significant perf gains when it avoids mask materialization. For causal, I think I observed cuDNN choosing exactly the same kernel with score_mod as with the explicit causal flag.

I've linked the GH issue in the PR description. Could you please update / close it appropriately when this PR is merged

Sure, thanks for linking!

sudhakarsingh27

Thanks for the PR! A few comments:
0. Agree with all the comments from @KshitijLakhani and @cyanguwa, so just +1ed them

1. A user doc specifying the design choices and the building blocks of graph caching would be valuable.
2. score_mod seems like an argument more than a feature, so the error messaging could use something more substantial like "(TE/cuDNN) Flex Attention"
3. New arguments of the form has_* in AttentionParams could be avoided. If passing score_mod, score_mod_tensors (which are hefty) is the blocker, could we create an encapsulating dataclass and pass that instead?
4. user_supplied_seqlens is a bit vague; it seems like just a derived variable. Does it degenerate to mean pad_between_seqs=True?
5. Among other nits

cyanguwa · 2026-05-31T22:01:31Z

+                or cu_seqlens_q_padded is not None
+                or cu_seqlens_kv_padded is not None
+            )
+


This can be removed and replaced by checking whether "padding" is in attention_mask_type, because that's when those tensors are used (i.e. THD, or non-THD + padding_xxx mask).

cyanguwa · 2026-05-31T22:22:18Z

+        use_flash_attention = False
+        use_flash_attention_2 = False
+        use_flash_attention_3 = False
+        use_flash_attention_4 = False


Wouldn't use_flash_attention=False disable all use_flash_attention_x? I thought the relationship was use_flash_attention=True when one of use_flash_attention_x is True, but when we set use_flash_attention=False, we're effectively disabling all use_flash_attention_x.
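
The relationship in question can be sketched as a master switch that is AND-ed with each per-version flag, which is the late reduction the Greptile summary mentions. `resolve_flash_flags` is a hypothetical helper written for this illustration; it is not the code in utils.py.

```python
from typing import Dict


def resolve_flash_flags(use_flash_attention: bool, versions: Dict[str, bool]) -> Dict[str, bool]:
    """AND each per-version flag with the master switch (illustrative only)."""
    return {name: use_flash_attention and enabled for name, enabled in versions.items()}


requested = {
    "use_flash_attention_2": True,
    "use_flash_attention_3": True,
    "use_flash_attention_4": False,
}

# Turning the master switch off disables every version, whatever was requested.
resolved = resolve_flash_flags(False, requested)
```

Under this model, setting only use_flash_attention=False is sufficient as long as the reduction always runs before the per-version flags are read; assigning each flag explicitly, as in the diff above, removes that ordering dependency.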

cyanguwa · 2026-06-03T20:50:02Z

+    if any(cudnn_frontend_package.glob("_compiled_module*")):
+        if cudnn_frontend_path not in sys.path:
+            sys.path.insert(0, cudnn_frontend_path)
+        return importlib.import_module("cudnn")


Do we expect users to compile in the 3rdparty/cudnn-frontend directory first before using this feature? i.e. how do they get the _compiled_module? Do we need to set this up in our setup.py file so users won't have this issue?

This was just a question of submodule FE vs system FE that we discussed on Slack.

Will investigate this in a follow-up PR.

cyanguwa · 2026-06-03T20:57:30Z

+    stats: Optional[torch.Tensor],
+) -> _CudnnScoreModFwdGraphEntry:
+    """Build a cached cuDNN frontend graph for score_mod fprop."""
+    cudnn = _import_cudnn_frontend()


We're calling this in pretty much every function in this file. We could do this once at the top of the file. Thanks.

So what's your concern here? Python caches repeated importlib.import_module("cudnn").

I think it's written like this because backends.py imports flex_attention.py eagerly, and cudnn python package might be optional?

CPU overhead. I think this could be called once at the beginning of the file?

Will investigate this in a follow-up PR.
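
For reference, one way to keep the import lazy (so an optional package is only required when the feature is used) while paying the lookup cost once is to memoize the helper. This is a sketch, not the PR's `_import_cudnn_frontend`; `import_optional` is a hypothetical name, and the standard-library `json` module stands in for the optional `cudnn` package.

```python
import functools
import importlib
import sys
from types import ModuleType


@functools.lru_cache(maxsize=None)
def import_optional(name: str) -> ModuleType:
    """Import a module on first use and return the memoized module afterwards."""
    return importlib.import_module(name)


# "json" is a stand-in so the sketch runs anywhere; the PR's helper imports "cudnn".
first = import_optional("json")
second = import_optional("json")
```

Even without `lru_cache`, a repeated `importlib.import_module` call returns the entry already stored in `sys.modules`; the memoized wrapper only skips that lookup and any path handling done before it.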

Also, if there's any way to make the feature easier to understand to users.

cyanguwa · 2026-06-04T16:30:24Z

/te-ci PyTorch L0

KshitijLakhani

Deferring to @cyanguwa's thorough review of this PR and to @vlad having perused/addressed my review comments from before
Approving the PR so as to not hold it back
Good to merge once CI passes

vcherepanov-nv added 7 commits May 8, 2026 21:41

Add cuDNN score_mod attention path

11c3ed2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Avoid BHSD copies in score_mod attention

eb35191

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Test relative position score_mod attention

57ce106

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Test softcap score_mod attention

e6ba0ea

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Run score_mod graphs on current CUDA stream

dcb6b49

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Add PyTorch score_mod execution plan cache

fefcbe7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Fix score_mod cache edge cases

ac4c60d

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

vcherepanov-nv requested a review from cyanguwa as a code owner May 13, 2026 03:18

vcherepanov-nv added the 2.16.0 label May 13, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

6446825

for more information, see https://pre-commit.ci

vcherepanov-nv mentioned this pull request May 13, 2026

[Draft]Support for score_mod and score_mod_bprop in cuDNN's sdpa #2767

Closed

13 tasks

greptile-apps Bot reviewed May 13, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/backends.py Outdated

Comment thread transformer_engine/pytorch/attention/dot_product_attention/backends.py

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

cyanguwa reviewed May 14, 2026

vcherepanov-nv and others added 3 commits May 15, 2026 00:48

Fix score_mod callback graph cache keys

58a5fb5

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Address score_mod review feedback

c00a0b7

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

a8ed67e

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 15, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/backends.py Outdated

Fix score_mod lambda cache keys

e2a69e1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

greptile-apps Bot reviewed May 15, 2026

Comment thread tests/pytorch/attention/test_attention.py Outdated

KshitijLakhani requested changes May 19, 2026

vcherepanov-nv and others added 2 commits May 19, 2026 21:17

Address flex attention review feedback

96f8ab2

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e11cc23

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 19, 2026

Comment thread tests/pytorch/attention/test_flex_attention.py

Comment thread tests/pytorch/attention/test_flex_attention.py

sudhakarsingh27 reviewed May 21, 2026

Comment thread tests/pytorch/attention/test_flex_attention.py

Address flex attention backend review feedback

f0f4f7b

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

github-actions Bot added the community-contribution label May 21, 2026

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py Outdated

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py Outdated

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py

cyanguwa reviewed May 31, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/utils.py

cyanguwa requested changes May 31, 2026

vcherepanov-nv added 7 commits June 2, 2026 19:07

Address flex attention review feedback

06605aa

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Address attention backend review nits

9cbab29

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Remove duplicate flex attention asserts

297e66e

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Clarify score mod tensor keys

7f514e1

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Normalize Flex Attention naming

fb51502

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Simplify score mod backward graph cache lookup

b82830e

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

Return only cuDNN graph from helper

4deeb2a

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

vcherepanov-nv requested a review from cyanguwa June 2, 2026 20:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

e9fcf7c

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed Jun 2, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/utils.py

Refer to Flex Attention in error messages

2e1874e

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

cyanguwa reviewed Jun 3, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/flex_attention.py

cyanguwa reviewed Jun 3, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

cyanguwa reviewed Jun 3, 2026

Comment thread transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py Outdated

cyanguwa requested changes Jun 3, 2026

Address Flex Attention review comments

984cf47

Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>

cyanguwa approved these changes Jun 4, 2026

KshitijLakhani approved these changes Jun 4, 2026

vcherepanov-nv merged commit 97a9bfe into NVIDIA:main Jun 4, 2026
12 of 15 checks passed
