Commit beacd23
Update base for Update on "Add quantized input support to cpu_sdpa"
cpu_sdpa (unfused SDPA) previously only supported float inputs.
When the model uses quantized Q/K/V (int8 with per-channel scales
and zero_points), decode fell back to cpu_flash_attention, missing
the ~25-30% throughput improvement from unfused SDPA.
This adds quantized support to cpu_sdpa by:
- Accepting optional quantization params (zero_points, scales for Q/K/V)
- Using _q_at_k_gemm for QK^T (handles both int8 and float)
- Using _qk_at_v_gemm for scores @ V (handles both int8 and float)
- Applying scaling factor separately (fused with mask add or max reduction)
- Allocating a dequantization buffer for V when quantized
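The steps above can be sketched for a single head and a single query row (the decode case). This is an illustrative sketch only, not the cpu_sdpa code: `QuantRows` and `quantized_decode_sdpa` are hypothetical names, the layout (one scale and one zero point per row) is an assumption about what "per-channel" means here, and plain loops stand in for `_q_at_k_gemm` and `_qk_at_v_gemm`. Masking is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical container: row-major int8 data with one scale and one
// zero point per row (assumed layout, not taken from the commit).
struct QuantRows {
  std::vector<std::int8_t> data;
  std::vector<float> scales;
  std::vector<std::int32_t> zero_points;
  std::size_t rows;
  std::size_t cols;
};

// Unfused attention for one query row: q is 1 x D, k and v are T x D.
std::vector<float> quantized_decode_sdpa(
    const QuantRows& q, const QuantRows& k, const QuantRows& v, float scaling) {
  assert(q.rows == 1 && q.cols == k.cols && k.rows == v.rows);
  const std::size_t T = k.rows;
  const std::size_t D = k.cols;
  const std::size_t Dv = v.cols;

  // QK^T: accumulate in int32 on zero-point-corrected values, then apply
  // the quantization scales of the query row and of each key row.
  std::vector<float> scores(T);
  for (std::size_t t = 0; t < T; ++t) {
    std::int32_t acc = 0;
    for (std::size_t d = 0; d < D; ++d) {
      const std::int32_t qv = q.data[d] - q.zero_points[0];
      const std::int32_t kv = k.data[t * D + d] - k.zero_points[t];
      acc += qv * kv;
    }
    scores[t] = static_cast<float>(acc) * q.scales[0] * k.scales[t];
  }

  // The attention scaling factor is applied separately from the GEMM,
  // here in the same pass as the max reduction.
  float max_score = -std::numeric_limits<float>::infinity();
  for (std::size_t t = 0; t < T; ++t) {
    scores[t] *= scaling;
    max_score = std::max(max_score, scores[t]);
  }

  // Numerically stable softmax over the scaled scores.
  float sum = 0.0f;
  for (std::size_t t = 0; t < T; ++t) {
    scores[t] = std::exp(scores[t] - max_score);
    sum += scores[t];
  }
  for (std::size_t t = 0; t < T; ++t) {
    scores[t] /= sum;
  }

  // Dequantize V into a float buffer, then compute scores @ V.
  std::vector<float> v_buf(T * Dv);
  for (std::size_t t = 0; t < T; ++t) {
    for (std::size_t d = 0; d < Dv; ++d) {
      v_buf[t * Dv + d] = v.scales[t] *
          static_cast<float>(v.data[t * Dv + d] - v.zero_points[t]);
    }
  }
  std::vector<float> out(Dv, 0.0f);
  for (std::size_t t = 0; t < T; ++t) {
    for (std::size_t d = 0; d < Dv; ++d) {
      out[d] += scores[t] * v_buf[t * Dv + d];
    }
  }
  return out;
}
```

With two identical key rows, both softmax weights are 0.5, so the output is the mean of the two dequantized V rows, which makes a convenient sanity check.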
The dispatch in op_sdpa.cpp is updated to route quantized decode
(seq_len==1) through cpu_sdpa instead of cpu_flash_attention.
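As a rough illustration of the routing described above (not the actual code in op_sdpa.cpp), the choice might look like the sketch below. `SdpaKernel` and `select_sdpa_kernel` are hypothetical names, and sending every non-decode call to cpu_flash_attention is an assumption; the real dispatch may check further conditions.

```cpp
#include <cstdint>

// Hypothetical tags for the two CPU attention paths named in this commit.
enum class SdpaKernel { CpuSdpa, CpuFlashAttention };

// Hypothetical selector: decode (seq_len == 1) goes to the unfused path.
// Per the commit message, the old behavior sent quantized decode to
// flash attention, which corresponds to `is_decode && !is_quantized`.
inline SdpaKernel select_sdpa_kernel(std::int64_t seq_len, bool is_quantized) {
  const bool is_decode = (seq_len == 1);
  (void)is_quantized;  // no longer affects the choice in this sketch
  return is_decode ? SdpaKernel::CpuSdpa : SdpaKernel::CpuFlashAttention;
}
```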
Differential Revision: [D96044310](https://our.internmc.facebook.com/intern/diff/D96044310/)
[ghstack-poisoned]
1 parent a5821d0 commit beacd23
1 file changed
Lines changed: 4 additions & 4 deletions
[Diff body not captured. Hunk 1: lines 138-149, 3 deletions and 3 additions. Hunk 2: lines 203-209, 1 deletion and 1 addition.]