Update on "Add quantized input support to cpu_sdpa"
cpu_sdpa (unfused SDPA) previously supported only float inputs.
When the model uses quantized Q/K/V (int8 with per-channel scales
and zero_points), decode fell back to cpu_flash_attention and missed
the ~25-30% throughput improvement from unfused SDPA.
This adds quantized support to cpu_sdpa by:
- Accepting optional quantization params (zero_points, scales for Q/K/V)
- Using _q_at_k_gemm for QK^T (handles both int8 and float)
- Using _qk_at_v_gemm for scores @ V (handles both int8 and float)
- Applying the scaling factor separately (fused with mask add or max reduction)
- Allocating a dequantization buffer for V when quantized
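
The steps above can be sketched for a single head and a single query row (decode). This is an illustrative sketch, not the code in this diff: `QuantRows` and `sdpa_decode_quantized` are hypothetical names, the layout assumes one scale and zero point per row, and the plain loops stand in for `_q_at_k_gemm` and `_qk_at_v_gemm`, whose real signatures are not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical container for this sketch: row-major int8 data with one
// (scale, zero_point) pair per row. The real kernel's layout may differ.
struct QuantRows {
  std::vector<int8_t> data;
  std::vector<float> scales;
  std::vector<int32_t> zero_points;
  size_t rows;
  size_t cols;
};

// Unfused SDPA for one head and one query row (decode, seq_len == 1).
std::vector<float> sdpa_decode_quantized(
    const QuantRows& q, const QuantRows& k, const QuantRows& v) {
  assert(q.rows == 1 && q.cols == k.cols && k.rows == v.rows);
  const size_t kv_len = k.rows;
  const size_t head_dim = q.cols;
  const float scaling = 1.0f / std::sqrt(static_cast<float>(head_dim));

  // QK^T: accumulate in int32, then rescale by the Q and K row scales.
  // The softmax scaling factor is applied in the same pass as the max
  // reduction rather than inside the integer product.
  std::vector<float> scores(kv_len);
  float max_score = std::numeric_limits<float>::lowest();
  for (size_t t = 0; t < kv_len; ++t) {
    int32_t acc = 0;
    for (size_t d = 0; d < head_dim; ++d) {
      const int32_t qv = static_cast<int32_t>(q.data[d]) - q.zero_points[0];
      const int32_t kv =
          static_cast<int32_t>(k.data[t * head_dim + d]) - k.zero_points[t];
      acc += qv * kv;
    }
    scores[t] = scaling * q.scales[0] * k.scales[t] * static_cast<float>(acc);
    max_score = std::max(max_score, scores[t]);
  }

  // Numerically stable softmax over the kv_len scores.
  float sum = 0.0f;
  for (float& s : scores) {
    s = std::exp(s - max_score);
    sum += s;
  }
  for (float& s : scores) {
    s /= sum;
  }

  // Dequantize V into a float buffer, then compute scores @ V in float.
  std::vector<float> v_buf(v.rows * v.cols);
  for (size_t t = 0; t < v.rows; ++t) {
    for (size_t d = 0; d < v.cols; ++d) {
      const int32_t centered =
          static_cast<int32_t>(v.data[t * v.cols + d]) - v.zero_points[t];
      v_buf[t * v.cols + d] = v.scales[t] * static_cast<float>(centered);
    }
  }
  std::vector<float> out(v.cols, 0.0f);
  for (size_t t = 0; t < v.rows; ++t) {
    for (size_t d = 0; d < v.cols; ++d) {
      out[d] += scores[t] * v_buf[t * v.cols + d];
    }
  }
  return out;
}
```

Keeping the scaling factor out of the integer product lets the same QK^T path serve int8 and float inputs; only the rescale step differs.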
The dispatch in op_sdpa.cpp is updated to route quantized decode
(seq_len==1) through cpu_sdpa instead of cpu_flash_attention.
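
A minimal sketch of the routing change, for illustration only. The enum and both helper functions are hypothetical and do not appear in op_sdpa.cpp; the sketch also assumes that float decode already used cpu_sdpa and that seq_len > 1 continues to use cpu_flash_attention.

```cpp
#include <cstdint>

// Hypothetical labels for the two kernels named in this change.
enum class SdpaKernel { kUnfusedSdpa, kFlashAttention };

// Before: only float decode could use the unfused path.
constexpr SdpaKernel choose_kernel_before(int64_t seq_len, bool is_quantized) {
  return (seq_len == 1 && !is_quantized) ? SdpaKernel::kUnfusedSdpa
                                         : SdpaKernel::kFlashAttention;
}

// After: decode uses the unfused path whether or not inputs are quantized.
// is_quantized is kept in the signature only to mirror the "before" helper.
constexpr SdpaKernel choose_kernel_after(int64_t seq_len, bool /*is_quantized*/) {
  return seq_len == 1 ? SdpaKernel::kUnfusedSdpa
                      : SdpaKernel::kFlashAttention;
}
```

Under these assumptions the only case whose routing changes is quantized decode.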
Differential Revision: [D96044310](https://our.internmc.facebook.com/intern/diff/D96044310/)
[ghstack-poisoned]