Update base for Update on "Add quantized input support to cpu_sdpa"

kimishpatel · kimishpatel · commit beacd2383d39 · 2026-05-05T12:59:16.000-07:00
cpu_sdpa (unfused SDPA) previously only supported float inputs. When the model uses quantized Q/K/V (int8 with per-channel scales and zero_points), decode fell back to cpu_flash_attention, missing the ~25-30% throughput improvement from unfused SDPA.

This adds quantized support to cpu_sdpa by:
- Accepting optional quantization params (zero_points, scales for Q/K/V)
- Using _q_at_k_gemm for QK^T (handles both int8 and float)
- Using _qk_at_v_gemm for scores @ V (handles both int8 and float)
- Applying scaling factor separately (fused with mask add or max reduction)
- Allocating a dequantization buffer for V when quantized

The dispatch in op_sdpa.cpp is updated to route quantized decode (seq_len==1) through cpu_sdpa instead of cpu_flash_attention.

Differential Revision: [D96044310](https://our.internmc.facebook.com/intern/diff/D96044310/)

[ghstack-poisoned]
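
For reviewers who want a quick reference for the math, the steps listed above can be sketched in plain Python for the single-query decode case. This is only an illustration under stated assumptions, not the kernel code: the names `dequantize` and `quantized_decode_sdpa` are hypothetical, each row is assumed to carry its own (scale, zero_point) pair (the real layout of the per-channel params may differ), and the loops stand in for what _q_at_k_gemm and _qk_at_v_gemm do in the actual implementation.

```python
import math


def dequantize(values, scale, zero_point):
    """Map int8 values back to floats: (x - zero_point) * scale."""
    return [(x - zero_point) * scale for x in values]


def quantized_decode_sdpa(q, k, v, mask=None):
    """Unfused attention for one query row (seq_len == 1).

    q is a (values, scale, zero_point) tuple; k and v are lists of such
    tuples, one per cached token. This layout is an assumption made for
    the sketch, not the layout used by cpu_sdpa.
    """
    q_vals, q_scale, q_zp = q
    head_dim = len(q_vals)

    # QK^T: accumulate in integers, then apply the combined scales.
    scores = []
    for k_vals, k_scale, k_zp in k:
        acc = sum((a - q_zp) * (b - k_zp) for a, b in zip(q_vals, k_vals))
        scores.append(acc * q_scale * k_scale)

    # The 1/sqrt(head_dim) factor is applied separately, here together
    # with the additive mask.
    factor = 1.0 / math.sqrt(head_dim)
    if mask is None:
        mask = [0.0] * len(scores)
    scores = [s * factor + m for s, m in zip(scores, mask)]

    # Softmax with max subtraction for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Dequantize V into a float buffer, then take the weighted sum.
    v_buf = [dequantize(*row) for row in v]
    return [
        sum(p * row[d] for p, row in zip(probs, v_buf))
        for d in range(len(v_buf[0]))
    ]
```

With two identical keys the probabilities are equal, so the output is the mean of the dequantized V rows; masking a key with negative infinity removes its contribution entirely.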
diff --git a/.ci/scripts/test_lora.sh b/.ci/scripts/test_lora.sh
@@ -138,12 +138,12 @@ EXPECTED_QUANT_PREFIX="<|im_start|>user Calculate 15% of 80?<|im_end|><|im_start
 Okay, so I need to calculate 15% of 80."
 EXPECTED_QUANT_LORA_PREFIX="
 <|im_start|>user Calculate 15% of 80?<|im_end|><|im_start|>assistant
-To calculate 15% of 80, we can multiply 80 by 15/100.
-80 * 15/100 = 12.
-So, 15% of 80 is 12.
+To calculate 15% of 80, we can multiply 80 by 15/100 and then simplify the fraction.
+So, 15% of 80 is equal to (80 * 15) / 100 = 1200 / 100 = 12.
 #### 12
 The answer is: 12<|im_end|>"
 
+
 # Export Quantized PTE, PTD file, no LoRA.
 # override base.lora_config=null to avoid creating a lora model
 # and loading lora weights.
@@ -203,7 +203,7 @@ fi
 NOW=$(date +"%H:%M:%S")
 echo "Test 4: Quantized, program-data separation lora. Starting to run llama runner at ${NOW}"
 # shellcheck source=/dev/null
-cmake-out/examples/models/llama/llama_main --model_path=qwen_lora_math_q.pte --data_paths="qwen_foundation_q.ptd,qwen_lora_math_q.ptd" --prompt="${PROMPT}" ${RUNTIME_ARGS} > result.txt
+cmake-out/examples/models/llama/llama_main --model_path=qwen_lora_math_q.pte --data_paths="qwen_foundation_q.ptd,qwen_lora_math_q.ptd" --prompt="${PROMPT}" ${RUNTIME_ARGS} --seq_len=104 > result.txt
 NOW=$(date +"%H:%M:%S")
 echo "Finished at ${NOW}"