Commit 58aed0e

psiddh and claude committed
Qualcomm: cap inf replacement value to fix 16a16w accuracy regression
PR #19660 folded ReplaceInfValues into QnnQuantizer._replace_inf and made the inf stand-in equal to the full quant range. For 16a16w that is 65535 (vs the previous fixed 255), which blows up the attention-mask quant scale and breaks stories110M decoding in test-llama-runner-qnn-linux. Cap the magnitude at 255 to restore prior behavior; 8a8w is unaffected.

Co-authored-by: Claude <noreply@anthropic.com>
1 parent aada6d7 commit 58aed0e

1 file changed

Lines changed: 6 additions & 1 deletion

backends/qualcomm/quantizer/quantizer.py

@@ -416,7 +416,12 @@ def _get_quant_range(self, node):
             if quant_info.output_qspec.quant_min is None
             else quant_info.output_qspec.quant_min
         )
-        return quant_range
+        # Cap the inf stand-in so it does not dominate the tensor's
+        # dynamic range. For >8-bit activations the full range (e.g.
+        # 65535 for uint16) would blow up the attention-mask quant scale
+        # and wreck accuracy; 255 keeps a reasonable scale for
+        # Llama-style attention masks.
+        return min(quant_range, 255)
 
     def _get_candidates_with_infinity_args(self, graph_module: GraphModule):
         binary_op_sources = [
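
The arithmetic behind the cap can be sketched as follows. This is a rough illustration, not the quantizer's actual code: it assumes a plain asymmetric min/max quantizer and a mask tensor that holds 0 for visible positions and the negated stand-in for masked ones. The helpers `affine_scale` and `capped_stand_in` are hypothetical names introduced here; only the `min(quant_range, 255)` expression comes from the diff.

```python
def affine_scale(min_val: float, max_val: float, num_bits: int) -> float:
    """Step size of an asymmetric quantizer spanning [min_val, max_val].

    Assumption: a simple min/max observer; the real QNN observer may differ.
    """
    return (max_val - min_val) / (2**num_bits - 1)


def capped_stand_in(quant_range: int, cap: int = 255) -> int:
    """Mirrors the return statement added in the diff: min(quant_range, 255)."""
    return min(quant_range, cap)


for label, bits in (("8-bit", 8), ("16-bit", 16)):
    full_range = 2**bits - 1  # 255 for 8-bit, 65535 for 16-bit
    stand_in = capped_stand_in(full_range)

    # Hypothetical mask tensor: 0 where attention is allowed,
    # -stand_in where it is masked out.
    uncapped_scale = affine_scale(-float(full_range), 0.0, bits)
    capped_scale = affine_scale(-float(stand_in), 0.0, bits)

    print(
        f"{label}: stand-in {full_range} -> {stand_in}, "
        f"scale {uncapped_scale:.6f} -> {capped_scale:.6f}"
    )
```

Under these assumptions the 8-bit case is unchanged, since the full range already equals 255. In the 16-bit case the uncapped stand-in makes the quantization step roughly 257 times coarser (65535 / 255) than the capped one, which is consistent with the commit message's note that only 16a16w regressed.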
