Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes #19146
chenweng-quic wants to merge 1 commit into
b06f047 to 66fafc5
|
Hi @abhinaykukkadapu, it will cause the following error in op_fallback.cpp. To fix the dtype issue, I added logic in spec_prop_pass to avoid restoring. |
|
Hi @abhinaykukkadapu, |
|
@claude review this PR |
|
Claude finished @abhinaykukkadapu's task in 3m 50s
PR Review: Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes
Overall Assessment
This is a well-motivated refactor that removes compile-time dtype dispatch ( The PR is large (33 files, ~800 additions / ~700 deletions), but most changes are mechanical: removing template parameters, switching from typed pointers to
Issues Found
1. Bug:
|
- Summary: To enable GPU backend support in the Llama runner, refactoring is required because the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, which prevents floating-point models from running. This PR fixes the issue by removing the hardcoded dtypes from the runner.
13fbb40 to 8222b4b
|
Hi @psiddh, |

Summary
To enable GPU backend support in the Llama runner, refactoring is required because the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, preventing floating‑point models from running.
This PR focuses on removing the hardcoded dtypes for these tensors.
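To illustrate the general idea of choosing an IO dtype at runtime rather than through a template parameter, here is a minimal sketch. It is not the PR's code: `IoDType`, `IoBuffer`, `store`, `load_as_float`, and `argmax` are hypothetical names invented for this example, and a real runner would presumably read the dtype from the model's tensor metadata.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical dtype tag for illustration; not an ExecuTorch type.
enum class IoDType { UInt8, UInt16, Float32 };

// Element size in bytes, resolved at runtime instead of via a template parameter.
inline std::size_t element_size(IoDType dtype) {
  switch (dtype) {
    case IoDType::UInt8:   return sizeof(std::uint8_t);
    case IoDType::UInt16:  return sizeof(std::uint16_t);
    case IoDType::Float32: return sizeof(float);
  }
  throw std::invalid_argument("unsupported dtype");
}

// Type-erased buffer: the dtype is fixed at construction time.
struct IoBuffer {
  IoDType dtype;
  std::vector<std::uint8_t> bytes;

  IoBuffer(IoDType d, std::size_t numel)
      : dtype(d), bytes(numel * element_size(d)) {}

  std::size_t numel() const { return bytes.size() / element_size(dtype); }
};

// Writes `value` at index i in the buffer's storage type.
// The caller must pass a value representable in that type.
inline void store(IoBuffer& buf, std::size_t i, float value) {
  if (i >= buf.numel()) throw std::out_of_range("index out of range");
  std::uint8_t* p = buf.bytes.data() + i * element_size(buf.dtype);
  switch (buf.dtype) {
    case IoDType::UInt8:   { auto v = static_cast<std::uint8_t>(value);  std::memcpy(p, &v, sizeof v); return; }
    case IoDType::UInt16:  { auto v = static_cast<std::uint16_t>(value); std::memcpy(p, &v, sizeof v); return; }
    case IoDType::Float32: { std::memcpy(p, &value, sizeof value); return; }
  }
  throw std::invalid_argument("unsupported dtype");
}

// Reads element i as float. No dequantization (scale/offset) is applied here.
inline float load_as_float(const IoBuffer& buf, std::size_t i) {
  if (i >= buf.numel()) throw std::out_of_range("index out of range");
  const std::uint8_t* p = buf.bytes.data() + i * element_size(buf.dtype);
  switch (buf.dtype) {
    case IoDType::UInt8:   return static_cast<float>(*p);
    case IoDType::UInt16:  { std::uint16_t v; std::memcpy(&v, p, sizeof v); return static_cast<float>(v); }
    case IoDType::Float32: { float v; std::memcpy(&v, p, sizeof v); return v; }
  }
  throw std::invalid_argument("unsupported dtype");
}

// Greedy token pick over logits of any supported dtype. Comparing raw
// quantized values is order-preserving only if the quantization scale is
// positive, which this sketch assumes.
inline std::size_t argmax(const IoBuffer& logits) {
  if (logits.numel() == 0) throw std::invalid_argument("empty logits");
  std::size_t best = 0;
  float best_val = load_as_float(logits, 0);
  for (std::size_t i = 1; i < logits.numel(); ++i) {
    const float v = load_as_float(logits, i);
    if (v > best_val) { best_val = v; best = i; }
  }
  return best;
}
```

With this shape, the same `argmax` call works for a quantized `UInt16` logits buffer and a `Float32` one, at the cost of a per-element switch that a templated version would resolve at compile time.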
Key changes
MultimodalPromptProcessor, and related runner classes
construction time instead of compile-time bitwidth detection
offsets; add fill_mask() helper for multi-dtype attention mask filling
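As a rough illustration of what a multi-dtype mask-filling helper could look like, the sketch below dispatches on a runtime dtype tag and fills a range of an attention mask. It is not the PR's `fill_mask()`: the name `fill_mask_sketch`, the `MaskDType` enum, the signature, and the specific "attend" and "masked" values (maximum integer value versus 0 for quantized masks, 0.0 versus -255.0 for float masks) are all assumptions made for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical dtype tag for illustration; not an ExecuTorch type.
enum class MaskDType { UInt8, UInt16, Float32 };

// Fills positions [begin, end) of a typed buffer with `value`.
template <typename T>
void fill_typed(void* data, std::size_t begin, std::size_t end, T value) {
  T* p = static_cast<T*>(data);
  std::fill(p + begin, p + end, value);
}

// Marks positions [begin, end) of one mask row as attendable or masked.
// `data` must point to storage that really holds elements of `dtype`.
// The fill values below are assumptions, not values taken from the PR.
inline void fill_mask_sketch(void* data, MaskDType dtype, std::size_t begin,
                             std::size_t end, bool attend) {
  if (data == nullptr || begin > end) {
    throw std::invalid_argument("invalid mask range");
  }
  switch (dtype) {
    case MaskDType::UInt8:
      fill_typed<std::uint8_t>(data, begin, end, attend ? 255 : 0);
      return;
    case MaskDType::UInt16:
      fill_typed<std::uint16_t>(data, begin, end, attend ? 65535 : 0);
      return;
    case MaskDType::Float32:
      fill_typed<float>(data, begin, end, attend ? 0.0f : -255.0f);
      return;
  }
  throw std::invalid_argument("unsupported dtype");
}
```

Keeping the switch in one helper means callers such as a prompt processor or token generator would not need their own per-dtype branches when updating the mask.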
Test plan
cc @cccclai @cbilgin @abhinaykukkadapu