Commit 1c371e2
[executorch][cuda] gemma4_31b: fuse gate/up MLP projections (default-on)
Summary:
Fuse each gemma4_31b MLP's gate_proj|up_proj into a single
[2*intermediate, hidden] coalesced-int4 matmul, applied by default in the CUDA
export. This issues one activation-quant + one W4A8 matvec per layer instead of
two, cutting per-token launch + activation-quant overhead in the launch-bound
decode path. Only Q4_K (CudaCoalescedInt4Tensor) gate/up pairs are fused; any
other quant type (e.g. Q6_K) is left as two matmuls (guarded, still correct).
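
The fusion described above can be sketched in plain Python on float weights. This is only an illustration of the idea, not the code in this commit: the real path uses coalesced-int4 weights and a W4A8 kernel, and the names `matvec`, `fuse_gate_up`, `can_fuse`, and the tanh-approximate GELU activation are assumptions made for this sketch.

```python
import math

def matvec(weight, x):
    """Multiply a [rows, hidden] weight (a list of rows) by a length-hidden vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def gelu_tanh(v):
    """tanh-approximate GELU; assumed here, the commit does not name the activation."""
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def can_fuse(gate_kind, up_kind):
    """Hypothetical guard: fuse only when both projections are Q4_K."""
    return gate_kind == "Q4_K" and up_kind == "Q4_K"

def fuse_gate_up(gate_w, up_w):
    """Stack gate rows above up rows, giving a [2*intermediate, hidden] weight."""
    if len(gate_w) != len(up_w) or len(gate_w[0]) != len(up_w[0]):
        raise ValueError("gate_proj and up_proj must have the same shape")
    return gate_w + up_w

def mlp_hidden_unfused(gate_w, up_w, x):
    """Two separate projections, then the gated product."""
    gate = matvec(gate_w, x)
    up = matvec(up_w, x)
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]

def mlp_hidden_fused(fused_w, x):
    """One projection over the stacked weight, then split into gate and up halves."""
    y = matvec(fused_w, x)
    intermediate = len(fused_w) // 2
    gate, up = y[:intermediate], y[intermediate:]
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]

# Tiny toy sizes; the real model dimensions are not stated in the commit message.
hidden, intermediate = 4, 3
gate_w = [[0.1 * (r + c) for c in range(hidden)] for r in range(intermediate)]
up_w = [[0.05 * (r - c) for c in range(hidden)] for r in range(intermediate)]
x = [1.0, -2.0, 0.5, 3.0]

fused_w = fuse_gate_up(gate_w, up_w)
fused_out = mlp_hidden_fused(fused_w, x)
unfused_out = mlp_hidden_unfused(gate_w, up_w, x)
```

Because each output row is still an independent dot product against the same input, stacking the rows changes how many matmuls are issued but not the arithmetic performed per row, which is why the fused and unfused results match in this toy version.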
Builds on the already-landed kv_len-bounded tq4_sdpa kernel + gemma4_31b
call-site (kv_len + mask_is_causal), which recovered 128k decode from ~2.8 to
~43 tok/s. With both, ET gemma4_31b 128k+TurboQuant decode beats llama.cpp at
every measured context (cuda_graph ON):
ctx    ET (tok/s)   llama.cpp (tok/s)
512    44.80        42.77
2K     43.20        41.97
8K     42.23        41.23
32K    41.64        40.27
127K   38.41        35.97
TurboQuant KV compression kept; prefill restored (6-8x) with no regression;
output quality preserved.
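
To make the margin over llama.cpp concrete, the relative decode speedup at each context can be computed directly from the numbers reported above. This is a small arithmetic sketch; the dictionary layout and variable names are chosen here for illustration.

```python
# Decode throughput in tok/s as reported in the commit message: (ET, llama.cpp).
results = {
    "512": (44.80, 42.77),
    "2K": (43.20, 41.97),
    "8K": (42.23, 41.23),
    "32K": (41.64, 40.27),
    "127K": (38.41, 35.97),
}

# Relative speedup of ET over llama.cpp, in percent, rounded to one decimal.
speedup_pct = {
    ctx: round((et / llama - 1.0) * 100.0, 1)
    for ctx, (et, llama) in results.items()
}

for ctx, pct in speedup_pct.items():
    print(f"{ctx:>5}: ET is {pct:.1f}% faster")
```

On these figures the margin is a few percent at every context and is largest at 127K.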
Test Plan:
- Fusion numerics: fused vs unfused MLP through the real W4A8 int4_plain_mm
kernel = bit-exact (max_abs_diff 0.0, cos 1.000000) for decode (T=1) and
prefill (T=4).
- Export + run: fused module exported via CudaPartitioner and executed through
executor_runner (RC=0, cos 0.999915 vs eager). Full 31B export logs
"Fused gate+up on 60 MLP layers".
- Decode A/B (gemma4_31b 128k+TQ, cuda_graph ON, 5x median): table above; beats
llama.cpp at 512 -> 127K. nsys: tq4_sdpa 91.7% -> 2.9% of decode.
1 parent 993cff5 commit 1c371e2
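
The two numeric checks quoted in the test plan (max_abs_diff and cos) can be computed as in the following sketch. It operates on flat Python lists and is not the test harness used for this commit; the function names and the sample vectors are assumptions.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference; 0.0 means the outputs are identical."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative values only, not outputs from the model.
reference = [0.25, -1.5, 3.0, 0.75]
identical = list(reference)
perturbed = [0.25, -1.5, 3.0, 0.80]

exact_diff = max_abs_diff(reference, identical)
exact_cos = cosine_similarity(reference, identical)
drift_diff = max_abs_diff(reference, perturbed)
drift_cos = cosine_similarity(reference, perturbed)
```

A bit-exact comparison needs max_abs_diff to be exactly 0.0, while a cosine just below 1 (as in the exported-vs-eager check) indicates small numeric drift rather than identical outputs.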
2 files changed
Lines changed: 111 additions & 5 deletions
File 1 of 2 (file name and diff content not preserved): 107 additions & 0 deletions.
Added lines (new-file numbering): 33, 114-212, 220-224, 233-234.
File 2 of 2 (file name and diff content not preserved): 4 additions & 5 deletions.
Removed lines (old numbering): 185-188, 190. Added lines (new numbering): 185-187, 189.