[executorch][gemma4] fuse MLP gate/up at GGUF load #20481

Draft
Gasoonjia wants to merge 2 commits into main from gemma4_31b-mlp-fusion-unified

@Gasoonjia

Contributor

Summary:
Move the gemma4 MLP gate_proj|up_proj fusion to a single backend-agnostic point in the GGUF loader, and make the model forward consume it. Supersedes the earlier CUDA-only export-time fusion (reverted here).

  • gguf_loader.py: before any backend conversion (_convert_weight), buffer each layer's raw gate/up ExportableGGUFTensor and, once both arrive, row-concat their raw GGUF blocks along the output dim into one fused gate_up ExportableGGUFTensor (gate rows then up rows). Both backends then pack the already-fused weight with NO per-type concat: CUDA (Q4_K -> CudaCoalescedInt4Tensor, Q6_K -> CudaDp4aPlanarInt6Tensor) and MLX (ExportableGGUFTensor). Guards: same ggml_type + K; non-fuseable pairs and unpaired leftovers fall through unfused.
  • Gemma4MLP: when a fused gate_up_proj is present, run one matmul and split the [.., 2*intermediate_size] output back into gate/up; otherwise use the separate projections. The shared MLP stays safe for unfused checkpoints and the prequant/HF load paths (no gate_up_proj -> original path, no crash).
  • Revert the previous CUDA-localized fusion (cuda_source_transformations.py and export.py back to their original form). The kv_len-bounded tq4_sdpa kernel + call-site (already on main) are unchanged.
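
The loader-side pairing from the first bullet can be sketched roughly as below. This is an illustration, not the PR's code: `RawQuantTensor`, `try_fuse`, `fuse_gate_up`, and the fused name `ffn_gate_up` are hypothetical. It assumes raw GGUF blocks are stored row by row, so stacking rows along the output dim is a plain byte concatenation when both tensors share the quant type and K.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple


@dataclass
class RawQuantTensor:
    """Hypothetical stand-in for a raw (still-quantized) GGUF weight."""
    ggml_type: str  # e.g. "Q4_K" or "Q6_K"
    n_rows: int     # output dim N
    n_cols: int     # input dim K
    data: bytes     # raw quantized blocks, assumed stored row by row


def try_fuse(gate: RawQuantTensor, up: RawQuantTensor) -> Optional[RawQuantTensor]:
    """Return gate rows followed by up rows, or None if the guards fail."""
    if gate.ggml_type != up.ggml_type or gate.n_cols != up.n_cols:
        return None
    return RawQuantTensor(
        gate.ggml_type, gate.n_rows + up.n_rows, gate.n_cols, gate.data + up.data
    )


def fuse_gate_up(
    tensors: Iterable[Tuple[str, RawQuantTensor]]
) -> Dict[str, RawQuantTensor]:
    """Buffer each layer's gate/up until both arrive, then fuse if possible."""
    out: Dict[str, RawQuantTensor] = {}
    pending: Dict[str, Dict[str, Tuple[str, RawQuantTensor]]] = {}
    kinds = {"gate.weight": "gate", "up.weight": "up"}
    for name, tensor in tensors:
        prefix, _, leaf = name.rpartition(".ffn_")
        kind = kinds.get(leaf) if prefix else None
        if kind is None:
            out[name] = tensor  # not a gate/up weight: pass through
            continue
        slot = pending.setdefault(prefix, {})
        slot[kind] = (name, tensor)
        if len(slot) == 2:
            del pending[prefix]
            fused = try_fuse(slot["gate"][1], slot["up"][1])
            if fused is None:
                # Non-fuseable pair: keep both weights unfused.
                for orig_name, orig_tensor in slot.values():
                    out[orig_name] = orig_tensor
            else:
                out[prefix + ".ffn_gate_up.weight"] = fused
    # Unpaired leftovers fall through unfused.
    for slot in pending.values():
        for orig_name, orig_tensor in slot.values():
            out[orig_name] = orig_tensor
    return out


# Placeholder bytes stand in for real quantized blocks.
example = fuse_gate_up([
    ("blk.0.ffn_up.weight", RawQuantTensor("Q4_K", 2, 256, b"uu")),
    ("blk.0.ffn_gate.weight", RawQuantTensor("Q4_K", 2, 256, b"gg")),
    ("blk.0.ffn_down.weight", RawQuantTensor("Q4_K", 256, 2, b"dd")),
])
print(sorted(example))
```

Because fusion happens on the raw tensor before any backend conversion, each backend only has to pack one already-stacked weight, which is why no per-type concat is needed downstream.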

Single fusion point widens applicability (CUDA + MLX, incl. Q6_K) and keeps the model def backend-agnostic. Decode win is unchanged (same fused matmul, produced at load instead of at export).
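
The model-side behavior (one matmul, then a split back into gate/up, with a fallback to separate projections) can be illustrated with a small pure-Python sketch. It is not the actual `Gemma4MLP`: the helper names are hypothetical, weights are plain float lists rather than quantized tensors, and the tanh-approximated GELU is an assumption.

```python
import math
from typing import List, Optional

Matrix = List[List[float]]  # row-major: [out_features][in_features]


def linear(x: List[float], w: Matrix) -> List[float]:
    """y = W @ x for a single token (the T=1 decode case)."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gelu_tanh(v: float) -> float:
    # Assumption: tanh-approximated GELU as the gate activation.
    inner = math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)
    return 0.5 * v * (1.0 + math.tanh(inner))


def mlp_forward(
    x: List[float],
    down: Matrix,
    gate: Optional[Matrix] = None,
    up: Optional[Matrix] = None,
    gate_up: Optional[Matrix] = None,
) -> List[float]:
    if gate_up is not None:
        fused = linear(x, gate_up)          # one matmul -> [2 * intermediate]
        half = len(fused) // 2
        g, u = fused[:half], fused[half:]   # gate rows first, then up rows
    else:
        g, u = linear(x, gate), linear(x, up)  # original unfused path
    hidden = [gelu_tanh(a) * b for a, b in zip(g, u)]
    return linear(hidden, down)


x = [0.5, -1.0, 2.0]
gate_w = [[0.1, 0.2, 0.3], [0.4, -0.5, 0.6]]
up_w = [[-0.7, 0.8, 0.9], [1.0, 1.1, -1.2]]
down_w = [[0.3, -0.2], [0.1, 0.4], [-0.6, 0.5]]

unfused_out = mlp_forward(x, down_w, gate=gate_w, up=up_w)
fused_out = mlp_forward(x, down_w, gate_up=gate_w + up_w)  # row concat
print(fused_out == unfused_out)
```

In this toy version each output row is an independent dot product, so the fused and unfused paths agree exactly. A real GEMM may pick a different reduction order for the larger fused shape.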

Test Plan:

  • Raw concat (real GGUF blk.0 ffn, q4_k): fused.dequantize() == [gate; up] stacked, bit-exact; fused CudaCoalescedInt4Tensor rows [:N]/[N:] qdata+scale+zero bit-identical to gate/up.
  • Model-def fused vs unfused forward through real W4A8 int4_plain_mm: decode (T=1) bit-exact (cos 1.000000); prefill (T=4) cos 0.999988 -- the only delta is cuBLAS GEMM shape-dependent fp ordering (N=43008 vs 21504, identical weights), benign and inherent to any gate/up fusion.
  • Full CUDA GGUF export (gemma4_31b, --turboquant, max-seq-len 131072): loader logs "Fused gate+up on 60 MLP layers", TurboQuant swaps 10 layers, AOTI build clean (model.pte + 26.18GB aoti_cuda_blob.ptd, "Done.").
  • Decode via gemma4_31b_runner on the new build: coherent output, no NaN; prefill 1375 tok/s, decode 38.3 tok/s (no cuda_graph sanity).
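
For readers unfamiliar with the metrics quoted in the test plan, here is a minimal sketch of a max-abs-diff and cosine-similarity check, plus a toy demonstration that floating-point sums depend on accumulation order. The function names are hypothetical and this is not the PR's test harness; it only illustrates why a shape-dependent reduction order can move the cosine slightly below 1 with identical weights.

```python
import math
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def naive_sum(values: Sequence[float]) -> float:
    """Plain left-to-right accumulation, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same addends, different order: the rounded results differ.
forward = naive_sum([0.1, 0.2, 0.3])
backward = naive_sum([0.3, 0.2, 0.1])
print(forward == backward)

# The fused output width in the test plan is twice the per-projection width.
assert 2 * 21504 == 43008
```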

gasoonjia added 2 commits June 23, 2026 15:21
Summary:
Fuse each gemma4_31b MLP's gate_proj|up_proj into a single
[2*intermediate, hidden] coalesced-int4 matmul, applied by default in the CUDA
export. This issues one activation-quant + one W4A8 matvec per layer instead of
two, cutting per-token launch + activation-quant overhead in the launch-bound
decode path. Only Q4_K (CudaCoalescedInt4Tensor) gate/up pairs are fused; any
other quant type (e.g. Q6_K) is left as two matmuls (guarded, still correct).
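
As a back-of-the-envelope view of the launch savings, the arithmetic below counts only the gate/up launches per decoded token. It assumes each activation-quant and each W4A8 matvec is a single kernel launch, and takes the 60 MLP layers from the export log quoted in the test plan. The function name is hypothetical.

```python
def gate_up_launches_per_token(num_layers: int, fused: bool) -> int:
    """Count gate/up-related launches: one act-quant + one matvec per matmul."""
    matmuls_per_layer = 1 if fused else 2
    launches_per_matmul = 2  # assumption: act-quant and matvec are one launch each
    return num_layers * matmuls_per_layer * launches_per_matmul


unfused = gate_up_launches_per_token(60, fused=False)
fused = gate_up_launches_per_token(60, fused=True)
print(unfused, fused, unfused - fused)
```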

Builds on the already-landed kv_len-bounded tq4_sdpa kernel + gemma4_31b
call-site (kv_len + mask_is_causal), which recovered 128k decode from ~2.8 to
~43 tok/s. With both, ET gemma4_31b 128k+TurboQuant decode beats llama.cpp at
every measured context (cuda_graph ON):

  ctx    ET      llama.cpp   (decode tok/s)
  512    44.80   42.77
  2K     43.20   41.97
  8K     42.23   41.23
  32K    41.64   40.27
  127K   38.41   35.97

TurboQuant KV compression kept; prefill restored (6-8x) with no regression;
output quality preserved.
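
The relative margin over llama.cpp can be derived from the table above. The snippet below only re-expresses the reported numbers as percentages; the variable and function names are illustrative.

```python
# (ET tok/s, llama.cpp tok/s) per context length, copied from the table above.
DECODE_TOK_S = {
    "512": (44.80, 42.77),
    "2K": (43.20, 41.97),
    "8K": (42.23, 41.23),
    "32K": (41.64, 40.27),
    "127K": (38.41, 35.97),
}


def speedup_pct(et: float, ref: float) -> float:
    """Percentage by which `et` exceeds `ref`."""
    return (et / ref - 1.0) * 100.0


for ctx, (et, ref) in DECODE_TOK_S.items():
    print(f"{ctx:>5}: {speedup_pct(et, ref):+.2f}%")
```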

Test Plan:
- Fusion numerics: fused vs unfused MLP through the real W4A8 int4_plain_mm
  kernel = bit-exact (max_abs_diff 0.0, cos 1.000000) for decode (T=1) and
  prefill (T=4).
- Export + run: fused module exported via CudaPartitioner and executed through
  executor_runner (RC=0, cos 0.999915 vs eager). Full 31B export logs
  "Fused gate+up on 60 MLP layers".
- Decode A/B (gemma4_31b 128k+TQ, cuda_graph ON, 5x median): table above; beats
  llama.cpp at 512 -> 127K. nsys: tq4_sdpa 91.7% -> 2.9% of decode.
…a+mlx)

Summary:
Move the gemma4 MLP gate_proj|up_proj fusion to a single backend-agnostic point
in the GGUF loader, and make the model forward consume it. Supersedes the
earlier CUDA-only export-time fusion (reverted here).

- gguf_loader.py: before any backend conversion (_convert_weight), buffer each
  layer's raw gate/up ExportableGGUFTensor and, once both arrive, row-concat
  their raw GGUF blocks along the output dim into one fused gate_up
  ExportableGGUFTensor (gate rows then up rows). Both backends then pack the
  already-fused weight with NO per-type concat: CUDA (Q4_K ->
  CudaCoalescedInt4Tensor, Q6_K -> CudaDp4aPlanarInt6Tensor) and MLX
  (ExportableGGUFTensor). Guards: same ggml_type + K; non-fuseable pairs and
  unpaired leftovers fall through unfused.
- Gemma4MLP: when a fused gate_up_proj is present, run one matmul and split the
  [.., 2*intermediate_size] output back into gate/up; otherwise use the separate
  projections. The shared MLP stays safe for unfused checkpoints and the
  prequant/HF load paths (no gate_up_proj -> original path, no crash).
- Revert the previous CUDA-localized fusion (cuda_source_transformations.py and
  export.py back to their original form). The kv_len-bounded tq4_sdpa kernel +
  call-site (already on main) are unchanged.

Single fusion point widens applicability (CUDA + MLX, incl. Q6_K) and keeps the
model def backend-agnostic. Decode win is unchanged (same fused matmul, produced
at load instead of at export).

Test Plan:
- Raw concat (real GGUF blk.0 ffn, q4_k): fused.dequantize() == [gate; up]
  stacked, bit-exact; fused CudaCoalescedInt4Tensor rows [:N]/[N:]
  qdata+scale+zero bit-identical to gate/up.
- Model-def fused vs unfused forward through real W4A8 int4_plain_mm: decode
  (T=1) bit-exact (cos 1.000000); prefill (T=4) cos 0.999988 -- the only delta
  is cuBLAS GEMM shape-dependent fp ordering (N=43008 vs 21504, identical
  weights), benign and inherent to any gate/up fusion.
- Full CUDA GGUF export (gemma4_31b, --turboquant, max-seq-len 131072): loader
  logs "Fused gate+up on 60 MLP layers", TurboQuant swaps 10 layers, AOTI build
  clean (model.pte + 26.18GB aoti_cuda_blob.ptd, "Done.").
- Decode via gemma4_31b_runner on the new build: coherent output, no NaN;
  prefill 1375 tok/s, decode 38.3 tok/s (no cuda_graph sanity).
@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20481

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 3 Unrelated Failures, 2 Unclassified Failures

As of commit 638f07a with merge base 65bc0ca (image):

NEW FAILURES - The following jobs have failed:

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed label Jun 24, 2026
@linux-foundation-easycla


CLA Missing ID

  • ❌ The email address for the commit (638f07a) is not linked to the GitHub account, preventing the EasyCLA check. Consult this Help Article and GitHub Help to resolve. (To view the commit's email address, add .patch at the end of this PR page's URL.) For further assistance with EasyCLA, please visit our EasyCLA portal and chat with our support bot.

@Gasoonjia Gasoonjia force-pushed the gemma4_31b-cuda-decode-speedup branch 2 times, most recently from 1c371e2 to 4025660 Compare June 25, 2026 17:23
Gasoonjia added a commit that referenced this pull request Jun 25, 2026
Summary:
Fuse each gemma4_31b MLP's gate_proj|up_proj into a single
[2*intermediate, hidden] coalesced-int4 matmul, applied by default in
the CUDA export. This issues one activation-quant + one W4A8 matvec per
layer instead of two, cutting per-token launch + activation-quant
overhead in the launch-bound decode path. Only Q4_K
(CudaCoalescedInt4Tensor) gate/up pairs are fused; any other quant type
(e.g. Q6_K) is left as two matmuls (guarded, still correct).

| decode length | main branch (tok/s) | current branch (tok/s) |
|---|---|---|
| 512 | 42.2 | 44.80 |
| 2K | 40.8 | 43.20 |
| 8K | 40.0 | 42.23 |
| 32K | 39.4 | 41.64 |
| 127K | 35.5 | 38.41 |

Next step: we will upstream this kind of operator fusion to the gemma4-31b
model level, applied when loading GGUF.
#20481 is the draft PR.
Base automatically changed from gemma4_31b-cuda-decode-speedup to main June 25, 2026 22:29