[gemma4_31b][cuda] length-aware bf16 global attention + head_dim-agno…#20506

Draft
Gasoonjia wants to merge 1 commit into gemma4_31b-cuda-decode-speedup from gemma4_31b-cuda-attn-perf-git

@Gasoonjia

Contributor

…stic prefill autotune

  • Global (full-attention) bf16 layers: bound SDPA to a runtime kv_len scalar (CUDA-graph-safe) instead of the full max_seq_len KV buffer -> O(context) decode; restores decode scaling (was flat ~36.5 t/s at all depths -> 46.5@512, 34.9@127K). (sdpa.py kv_len path + cuda_source_transformations.py _lenaware_attention_forward; global layers only, sliding + turbo untouched)
  • Prefill global full-attention: replace fixed m32/m64 BLOCK_M selection with a head_dim-keyed autotuned _sdpa_fwd_kernel + register-budget prune (BLOCK_M*HEAD_DIM <= 4096*num_warps), fixing acc[64,512] fp32 register spill at head_dim=512. Prefill +24% @8K, +63% @32K, +117% @127K; head_dim-agnostic (no split-D needed for D<=512). (sdpa.py)
  • Runner: add --tokens_file (pre-tokenized input) and --ignore_eos (fixed decode length) for benchmarking. (main.cpp)
  • Validated: output bitwise-identical to prior kernel (cos=1.0, D=64/128/256/512), no decode regression; non-tq prefill now beats llama.cpp at all 5 cells and turbo TQ4 at 4/5. Op-level autotune profiling (A100) confirms the config set is near-optimal (in-set optimum at every regime; only <=1.3% marginal candidates).
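
The register-budget prune from the second bullet can be illustrated with a small standalone sketch. It is not the code in sdpa.py: the `Config` class, the `prune_by_register_budget` helper, and the candidate grid are hypothetical names and values chosen for illustration. Only the inequality BLOCK_M*HEAD_DIM <= 4096*num_warps comes from the description above. One way to read the constant, assuming 32 threads per warp, is that it caps the fp32 accumulator at 4096 / 32 = 128 values per thread.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Config:
    """Hypothetical stand-in for one autotune candidate (not the Triton class)."""
    block_m: int
    num_warps: int


def prune_by_register_budget(
    configs: Iterable[Config], head_dim: int, budget_per_warp: int = 4096
) -> List[Config]:
    """Keep candidates whose acc[BLOCK_M, HEAD_DIM] tile fits the stated budget.

    Implements the rule quoted in the PR description:
        BLOCK_M * HEAD_DIM <= 4096 * num_warps
    """
    return [
        c for c in configs
        if c.block_m * head_dim <= budget_per_warp * c.num_warps
    ]


# Assumed candidate grid, for illustration only.
CANDIDATES = [Config(bm, w) for bm in (16, 32, 64, 128) for w in (4, 8)]

kept_512 = prune_by_register_budget(CANDIDATES, head_dim=512)
kept_64 = prune_by_register_budget(CANDIDATES, head_dim=64)

for c in kept_512:
    print(f"head_dim=512 keeps BLOCK_M={c.block_m}, num_warps={c.num_warps}")
```

Under this rule, BLOCK_M=64 with 4 warps is rejected at head_dim=512 (64*512 = 32768 > 16384), which corresponds to the acc[64,512] spill case, while the same tile with 8 warps sits exactly on the budget and is kept. At head_dim=64 every candidate in this assumed grid passes.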

@pytorch-bot

pytorch-bot Bot commented Jun 25, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20506

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 3 New Failures, 1 Pending, 3 Unrelated Failures, 3 Unclassified Failures

As of commit ce442fe with merge base 1b726b2 (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 25, 2026

CLA Missing ID

  • ❌ The email address for the commit (ce442fe) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

@meta-cla meta-cla Bot added the CLA Signed label Jun 25, 2026
…stic prefill autotune

- Global (full-attention) bf16 layers: bound SDPA to a runtime kv_len scalar
  (CUDA-graph-safe) instead of the full max_seq_len KV buffer -> O(context)
  decode; restores decode scaling (was flat ~36.5 t/s at all depths ->
  46.5@512, 34.9@127K). (sdpa.py kv_len path + cuda_source_transformations.py
  _lenaware_attention_forward; global layers only, sliding + turbo untouched)
- Prefill global full-attention: replace fixed m32/m64 BLOCK_M selection with a
  head_dim-keyed autotuned _sdpa_fwd_kernel + register-budget prune
  (BLOCK_M*HEAD_DIM <= 4096*num_warps), fixing acc[64,512] fp32 register spill
  at head_dim=512. Prefill +24% @8K, +63% @32k, +117% @127k; head_dim-agnostic
  (no split-D needed for D<=512). (sdpa.py)
- Validated: output bitwise-identical to prior kernel (cos=1.0, D=64/128/256/512),
  no decode regression; non-tq prefill now beats llama.cpp at all 5 cells and
  turbo TQ4 at 4/5. Op-level autotune profiling (A100) confirms the config set is
  near-optimal (in-set optimum at every regime; only <=1.3% marginal candidates).
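
The length-aware decode path in the first bullet can be sketched as a plain-Python reference, under stated assumptions. This is not the Triton kernel in sdpa.py or the _lenaware_attention_forward transformation: `decode_attention` is a hypothetical single-head, single-query helper, the 1/sqrt(head_dim) scale is an assumption, and `kv_len` is an ordinary int here rather than a device scalar. What it shows is the core idea: the cache stays preallocated at max_seq_len, but only the first kv_len rows are read, so the work grows with the current context instead of the buffer size.

```python
import math


def decode_attention(q, k_cache, v_cache, kv_len):
    """Single-query attention over the first kv_len rows of a fixed-size cache."""
    if not 0 < kv_len <= len(k_cache):
        raise ValueError("kv_len must be in 1..max_seq_len")
    scale = 1.0 / math.sqrt(len(q))  # assumed scaling, for illustration
    # Scores are computed only for valid positions; rows >= kv_len are never read.
    scores = [
        scale * sum(a * b for a, b in zip(q, k_cache[t])) for t in range(kv_len)
    ]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    out = [0.0] * len(v_cache[0])
    for t in range(kv_len):
        p = weights[t] / total
        for d, v in enumerate(v_cache[t]):
            out[d] += p * v
    return out


MAX_SEQ_LEN, HEAD_DIM = 8, 4
q = [0.5, -1.0, 0.25, 2.0]
k_cache = [[float((t + d) % 3) for d in range(HEAD_DIM)] for t in range(MAX_SEQ_LEN)]
v_cache = [[float(t * HEAD_DIM + d) for d in range(HEAD_DIM)] for t in range(MAX_SEQ_LEN)]
kv_len = 3

out = decode_attention(q, k_cache, v_cache, kv_len)

# Overwrite the unused tail of the cache; the result must not change.
for t in range(kv_len, MAX_SEQ_LEN):
    k_cache[t] = [1e9] * HEAD_DIM
    v_cache[t] = [-1e9] * HEAD_DIM
out_after = decode_attention(q, k_cache, v_cache, kv_len)
```

Because the buffer shapes never change and only the value of `kv_len` varies between steps, this kind of bound is compatible with replaying a captured CUDA graph, which is the property the commit message calls CUDA-graph-safe.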