Skip to content

Qualcomm: cap inf replacement value to fix 16a16w accuracy regression#20471

Open
psiddh wants to merge 4 commits into
pytorch:mainfrom
psiddh:fix-qnn-16a16w-inf-replacement
Open

Qualcomm: cap inf replacement value to fix 16a16w accuracy regression#20471
psiddh wants to merge 4 commits into
pytorch:mainfrom
psiddh:fix-qnn-16a16w-inf-replacement

Conversation

@psiddh

@psiddh psiddh commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

PR #19660 folded ReplaceInfValues into QnnQuantizer._replace_inf and made the inf stand-in equal to the full quant range. For 16a16w that is 65535 (vs the previous fixed 255), which blows up the attention-mask quant scale and breaks stories110M decoding in test-llama-runner-qnn-linux. Cap the magnitude at 255 to restore prior behavior; 8a8w is unaffected.

Copilot AI review requested due to automatic review settings June 24, 2026 01:14
@psiddh psiddh requested a review from abhinaykukkadapu as a code owner June 24, 2026 01:14
@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20471

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There are 1 currently active SEVs. If your PR is affected, please view them below:

⏳ No Failures, 12 Pending

As of commit 8875528 with merge base da9158b (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 24, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: psiddh / name: Siddartha Pothapragada (2333727)

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

PR pytorch#19660 folded ReplaceInfValues into QnnQuantizer._replace_inf and made
the inf stand-in equal to the full quant range. For 16a16w that is 65535
(vs the previous fixed 255), which blows up the attention-mask quant scale
and breaks stories110M decoding in test-llama-runner-qnn-linux. Cap the
magnitude at 255 to restore prior behavior; 8a8w is unaffected.
@psiddh psiddh force-pushed the fix-qnn-16a16w-inf-replacement branch from 58aed0e to 2333727 Compare June 24, 2026 01:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Caps the stand-in value used when replacing ±inf constants during QNN quantization annotation so that higher-bit activation quantization (notably 16a16w) doesn’t inflate attention-mask dynamic range and degrade Llama decoding accuracy.

Changes:

  • Cap _get_quant_range()’s returned value to <= 255 to prevent large integer ranges (e.g., uint16) from dominating observer calibration.
  • Add an explanatory code comment documenting the rationale and the Llama attention-mask motivation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +419 to +423
# Cap the inf stand-in so it does not dominate the tensor's
# dynamic range. For >8-bit activations the full range (e.g.
# 65535 for uint16) would blow up the attention-mask quant scale
# and wreck accuracy; 255 keeps a reasonable scale for
# Llama-style attention masks.
# 65535 for uint16) would blow up the attention-mask quant scale
# and wreck accuracy; 255 keeps a reasonable scale for
# Llama-style attention masks.
return min(quant_range, 255)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @psiddh,
Thanks for identifying the issue.
Could you please let me do a quick test on the CI testing model and validate the accuracy issue?
Ideally, we don't want hard coded numbers, and that is why we replaced them with quant range in the PR: #19660

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, please go ahead and validate on CI, happy to wait for your results.

On the hardcoded 255: agree we should avoid magic numbers. perhaps a cleaner version would be to derive the cap from the int8 range rather than hardcoding: ..

return min(quant_range, torch.iinfo(torch.int8).max - torch.iinfo(torch.int8).min)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winskuo-quic Any updates ? We would like to land this asap as this is stalling viable/strict branch from moving forward.

@psiddh psiddh Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@winskuo-quic The CI runner for test-llama-runner-qnn-linux (qnn_16a16w) is OOM-killing the export step, exit code 137 (SIGKILL), The job is never getting far enough , the Python export process gets killed before it finishes , it looks like.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @psiddh,
Thanks for sharing all the test results.
I am also encountering the same issue.
I noticed it is killed during qnn_preprocess. I believe there's a chance OOM happened inside QNN SDK, which is weird as this is a small model.
I am still working on finding a solution to resolve the issue.
If you happened to find a work around, please feel free to merge first to unblock the CI error.
Thanks again for all the help.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the quick work-around : #20511 , This will give us time to investigate OOM issue

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the workaround. I will try to find re-enable 16a16w test.

Copilot AI review requested due to automatic review settings June 24, 2026 03:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Copilot AI review requested due to automatic review settings June 25, 2026 07:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

@psiddh psiddh force-pushed the fix-qnn-16a16w-inf-replacement branch from 67f78b6 to bf8bdbf Compare June 25, 2026 16:11
The 16a16w export+compile is OOM-killed on linux.2xlarge (23 min) and
runner crashes on linux.4xlarge.memory (93 min). Use linux.8xlarge.memory
to verify whether the accuracy fix works once memory is sufficient.
Copilot AI review requested due to automatic review settings June 25, 2026 16:12
@psiddh psiddh force-pushed the fix-qnn-16a16w-inf-replacement branch from bf8bdbf to 8875528 Compare June 25, 2026 16:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

fail-fast: false
with:
runner: linux.2xlarge
runner: linux.8xlarge.memory
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants