Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes by chenweng-quic · Pull Request #19146 · pytorch/executorch

chenweng-quic · 2026-04-27T05:36:08Z

Summary

To enable GPU backend support in the Llama runner, a refactor is required: the dtypes of kv_cache, attention_mask, and logits are currently hardcoded, which prevents floating-point models from running.
This PR removes those hardcoded dtypes.

Key changes

- Remove the template parameter from KVManager, LhdTokenGenerator, MultimodalPromptProcessor, and related runner classes
- Detect kv_cache and attention_mask dtypes dynamically from MethodMeta at construction time instead of using compile-time bitwidth detection
- Switch to std::byte* pointer arithmetic with getDtypeSize() for all buffer offsets; add a fill_mask() helper for multi-dtype attention mask filling
- Update spec_prop_pass to handle the custom llama op when num_sharding is greater than 1
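For readers unfamiliar with the byte-pointer approach, here is a minimal sketch of how a buffer offset can be computed when the element type is only known at run time. The `DType` enum, `dtype_size()`, and `element_ptr()` are hypothetical stand-ins written for illustration; they are not the runner's actual `getDtypeSize()` or the ExecuTorch scalar type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a runtime scalar type enum (not the ExecuTorch type).
enum class DType { Byte, UInt16, Float };

// Element size in bytes for each dtype this sketch knows about.
inline std::size_t dtype_size(DType t) {
  switch (t) {
    case DType::Byte:
      return sizeof(std::uint8_t);
    case DType::UInt16:
      return sizeof(std::uint16_t);
    case DType::Float:
      return sizeof(float);
  }
  assert(false && "unsupported dtype");
  return 0;
}

// Address of element (row, col) in a row-major buffer whose dtype is chosen at run time.
inline std::byte* element_ptr(
    std::byte* base,
    std::size_t row,
    std::size_t col,
    std::size_t row_len,
    DType t) {
  return base + (row * row_len + col) * dtype_size(t);
}
```

The same index yields a different byte offset per dtype, which is why every offset has to be scaled by the element size once typed pointers are gone.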

Test plan

python ${EXECUTORCH_ROOT}/backends/qualcomm/tests/test_qnn_delegate.py -k TestExampleLLMScript.test_llama_stories_110m --model SM8650 --build_folder /local/mnt/workspace/chenweng/executorch/executorch/build-android --host mlgtw-linux --device bebcca9b --executorch_root ${EXECUTORCH_ROOT} --artifact_dir ./stories_110m_pte_size --llama_artifacts . --use_fp16

cc @cccclai @cbilgin @abhinaykukkadapu

pytorch-bot · 2026-04-27T05:36:12Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19146

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

Run pull request jobs on OSDC runners in shadow mode

❌ 2 New Failures, 2 Unclassified Failures

As of commit 8222b4b with merge base 824cbff:

NEW FAILURES - The following jobs have failed:

pull / test-sqnr-static-llm-qnn-linux (smollm2_135m) / linux-job (gh)
test_static_llm_model
pull / test-static-llama-qnn-linux (stories_260k_bc) / linux-job (gh)
RuntimeError: Command docker exec -t 22c5a40be5ba0084ad4b28454c02275a0890ead7ad14486998f273c66641b193 /exec failed with exit code 1

UNCLASSIFIED FAILURES - DrCI could not classify the following jobs because the workflow did not run on the merge base. The failures may be pre-existing on trunk or introduced by this PR:

Build Windows Wheels / pytorch/executorch / build-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Process completed with exit code 1.
Build Windows Wheels / pytorch/executorch / upload / upload-wheel-py3_10-cpu (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
Unable to download artifact(s): Artifact not found for name: pytorch_executorch__3.10_cpu_x64

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-04-27T05:37:05Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

chenweng-quic · 2026-04-28T06:36:34Z

Hi @abhinaykukkadapu,
This PR touches spec_prop_pass because the spec written by build_quant_io is restored to its original dtype by spec_prop_pass when num_sharding is greater than 1.
e.g.,

after build_quant_io
input -> call_delegate1 (spec=uint16) -> llama.fallback (spec=uint16) -> call_delegate2 (spec=uint16) -> output
after spec_prop_pass 
input -> call_delegate1 (spec=uint16) -> llama.fallback (spec=float) -> call_delegate2 (spec=uint16) -> output

This causes a dtype error in op_fallback.cpp.

To fix the dtype issue, I added logic to spec_prop_pass that skips the restore.
I am not sure whether this is the correct fix; could you share your opinion?

chenweng-quic · 2026-04-29T03:39:21Z

Hi @abhinaykukkadapu,
Could you tag claude to review this PR?
Thanks

abhinaykukkadapu · 2026-04-29T04:19:24Z

@claude review this PR

claude · 2026-04-29T04:20:05Z

Claude finished @abhinaykukkadapu's task in 3m 50s

PR Review: Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes

Read and analyze all changed files
Review C++ runtime changes (kv_manager, token_generator, prompt_processor, etc.)
Review Python export/pass changes (build_quant_io, spec_prop_pass, llama.py, etc.)
Review new op_fallback.py file
Check for correctness, safety, and style issues
Post review findings

Overall Assessment

This is a well-motivated refactor that removes compile-time dtype dispatch (template <typename T>) in favor of runtime dtype detection via MethodMeta. The approach is sound — detecting kv_cache_dtype_ and attention_mask_dtype_ from the model metadata at construction time is cleaner and more extensible than the previous KvBitWidth enum + template instantiation approach.

The PR is large (33 files, ~800 additions / ~700 deletions) but most changes are mechanical: removing template parameters, switching from typed pointers to std::byte* with getDtypeSize() arithmetic, and threading MethodMeta through constructors.

Issues Found

1. Bug: `fill_mask` uses wrong "negative" value for float masks

kv_manager.cpp:51 — When use_pos_value=false and dtype is Float, the mask is filled with -65535.0. For quantized integer types, 0 is the "masked" value — but for float attention masks, the standard convention is -inf (or a very large negative like -1e9). Using -65535.0 as a magic number is fragile and may not match what the model expects if it was trained with standard float masking.

Consider using -std::numeric_limits<float>::infinity() or at least documenting why -65535.0 was chosen. If the model's attention implementation specifically expects this value, a comment would clarify the intent.
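To make the trade-off between the two mask values concrete, here is a small self-contained sketch of a max-subtracted softmax over one attention row. It is an illustration only and assumes nothing about the model's actual attention implementation; `softmax()` is a hypothetical helper. It shows one case where the choice matters: a row in which every position is masked.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Numerically stable softmax over one attention row (max-subtracted).
inline std::vector<float> softmax(const std::vector<float>& x) {
  assert(!x.empty());
  float max_v = -std::numeric_limits<float>::infinity();
  for (float v : x) {
    max_v = std::fmax(max_v, v);
  }
  std::vector<float> out(x.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i] - max_v);
    sum += out[i];
  }
  for (float& v : out) {
    v /= sum;
  }
  return out;
}
```

With a partially masked row such as `{0.0f, m}`, both `m = -inf` and `m = -65535.0f` give the masked position a weight of zero. With a fully masked row, `-inf` produces NaN (because `-inf - (-inf)` is NaN), while a finite negative produces a uniform, finite distribution. Whether a fully masked row can occur in this runner is not stated in the PR.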

2. Bug: `getDtypeSize` missing `Half` / `Float16` support

utils.h:30-51 — The getDtypeSize() function handles Float, Double, Int, Long, Byte, and UInt16 but does not handle Half (float16). Given the PR's stated goal is to support floating-point models (including fp16 via --use_fp16), this seems like a critical omission. If the kv_cache or attention_mask is exported with Half dtype, getDtypeSize() will hit the ET_CHECK_MSG(false, ...) path at runtime.

3. Bug: `fill_mask` also missing `Half` support

kv_manager.cpp:19-61 — Similarly, fill_mask() only handles UInt16, Byte, and Float. If the attention mask ends up as Half dtype, this will crash. Should add a Half case.
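A rough sketch of what a `Half` case could look like is given below. Everything here is hypothetical: the `DType` enum, the `fill_typed()` helper, and the `attend` flag are illustrative names, and the "attend" values (255, 65535, 0.0) are assumptions. The review only states that 0 is the masked value for quantized integer types and that -65535.0 is used for masked float entries. For `Half`, the sketch writes IEEE 754 binary16 bit patterns directly, using -65504 (the lowest finite half value) for masked entries, since -65535 is outside the finite half range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a runtime scalar type enum (not the ExecuTorch type).
enum class DType { Byte, UInt16, Half, Float };

// Copy one value of type T into each of n consecutive slots of a raw byte buffer.
template <typename T>
inline void fill_typed(std::byte* dst, std::size_t n, T value) {
  for (std::size_t i = 0; i < n; ++i) {
    std::memcpy(dst + i * sizeof(T), &value, sizeof(T));
  }
}

// Fill n mask entries; attend=true marks visible positions, false marks masked ones.
inline void fill_mask(std::byte* dst, std::size_t n, DType t, bool attend) {
  switch (t) {
    case DType::Byte:
      fill_typed<std::uint8_t>(dst, n, attend ? 255 : 0);
      return;
    case DType::UInt16:
      fill_typed<std::uint16_t>(dst, n, attend ? 65535 : 0);
      return;
    case DType::Half:
      // binary16 bit patterns: 0x0000 is +0.0, 0xFBFF is -65504.
      fill_typed<std::uint16_t>(dst, n, attend ? 0x0000 : 0xFBFF);
      return;
    case DType::Float:
      fill_typed<float>(dst, n, attend ? 0.0f : -65535.0f);
      return;
  }
  assert(false && "unsupported attention mask dtype");
}
```

Using `std::memcpy` for the writes sidesteps alignment and aliasing concerns when the destination is a raw `std::byte*`.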

4. `decoder_runner.h:logits_to_token` also missing `Half` support

decoder_runner.h:74-89 — The logits_to_token switch only handles UInt16, Byte, and Float. If logits come back as Half, this will fail.
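As an illustration of runtime dtype dispatch on the output side, the sketch below picks the highest-scoring token from a raw logits buffer. It is a simplification and not the runner's `logits_to_token`: it only does greedy argmax, and `DType`, `argmax_as()`, and `greedy_token()` are hypothetical names. For affine-quantized logits with a positive scale, the argmax over the raw integers matches the argmax over the dequantized values, so no conversion is needed for this particular operation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a runtime scalar type enum (not the ExecuTorch type).
enum class DType { Byte, UInt16, Float };

// Index of the largest of n elements of type T stored in a raw byte buffer.
template <typename T>
inline std::size_t argmax_as(const std::byte* data, std::size_t n) {
  std::size_t best = 0;
  T best_v{};
  for (std::size_t i = 0; i < n; ++i) {
    T v;
    std::memcpy(&v, data + i * sizeof(T), sizeof(T));
    if (i == 0 || v > best_v) {
      best = i;
      best_v = v;
    }
  }
  return best;
}

// Greedy token selection with the element type chosen at run time.
inline std::size_t greedy_token(
    const std::byte* logits,
    std::size_t vocab,
    DType t) {
  assert(vocab > 0);
  switch (t) {
    case DType::Byte:
      return argmax_as<std::uint8_t>(logits, vocab);
    case DType::UInt16:
      return argmax_as<std::uint16_t>(logits, vocab);
    case DType::Float:
      return argmax_as<float>(logits, vocab);
  }
  assert(false && "unsupported logits dtype");
  return 0;
}
```

Supporting another dtype in a design like this means adding one enum value and one `case`, which is the kind of gap the `Half` findings above point at.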

5. Redundant semicolon (style nit)

kv_manager.cpp:140 — There's a trailing double semicolon:

      std::byte* parent_ptr = attention_mask +
          pidx * metadata_.context_len * getDtypeSize(attention_mask_dtype_);
      ;

6. KV cache dtype detection heuristic is fragile

kv_manager.cpp:71-89 — The constructor iterates over inputs starting at index 2 to find the kv_cache by matching tensor_nbytes == expected_tensor_nbytes. This heuristic could misidentify a non-kv-cache tensor that happens to match the expected byte size. It also assumes that all inputs from index 2 onward that don't match this size are non-kv-cache tensors (like input_pos or sliding window attention mask), but there's no validation that at least one kv_cache tensor was actually found.

Additionally, the mixed-dtype check tensor_meta->scalar_type() != kv_cache_dtype_ (lines 74-78) runs before the size check (lines 80-89). The check only fires once kv_cache_dtype_ is no longer Undefined, and kv_cache_dtype_ is only set when a tensor matches the expected size (lines 83-85). So the logic is:

First, the loop scans for the first tensor matching the expected kv_cache size and records its dtype. Non-kv-cache inputs that come before that tensor are skipped safely.
Then, for every subsequent tensor, the dtype comparison runs regardless of size. If kv_cache_dtype_ is already set and the loop encounters input_pos (which is Int and has a different size), it enters the if at line 74 and hits the error "Currently mixed scalar type of kv_cache is not allowed".

This seems like a real bug: the dtype check should be guarded by the size check.
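One way to order the two checks is sketched below: the size guard comes first, so only tensors that look like kv_cache entries take part in the dtype comparison, and a missing kv_cache is reported instead of silently ignored. This is a hypothetical reconstruction for illustration, not the code in kv_manager.cpp. `InputMeta`, `detect_kv_dtype()`, and the caller-supplied `expected_nbytes` are assumed names.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a runtime scalar type enum (not the ExecuTorch type).
enum class DType { Undefined, Byte, UInt16, Int, Float };

// Hypothetical, reduced view of one input tensor's metadata.
struct InputMeta {
  std::size_t nbytes;
  DType dtype;
};

// Returns the dtype shared by all kv_cache-sized inputs starting at first_idx,
// or nullopt if no such input exists or their dtypes disagree.
inline std::optional<DType> detect_kv_dtype(
    const std::vector<InputMeta>& inputs,
    std::size_t first_idx,
    std::size_t expected_nbytes) {
  assert(first_idx <= inputs.size());
  DType found = DType::Undefined;
  for (std::size_t i = first_idx; i < inputs.size(); ++i) {
    // Size guard first: tensors of other sizes (e.g. input_pos) are skipped.
    if (inputs[i].nbytes != expected_nbytes) {
      continue;
    }
    if (found == DType::Undefined) {
      found = inputs[i].dtype;
    } else if (inputs[i].dtype != found) {
      return std::nullopt;  // Mixed kv_cache dtypes.
    }
  }
  if (found == DType::Undefined) {
    return std::nullopt;  // No kv_cache tensor was found.
  }
  return found;
}
```

The size match is still a heuristic, so a non-kv-cache tensor with the same byte size would be treated as kv_cache here as well.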

7. spec_prop_pass change — approach seems reasonable

exir/passes/spec_prop_pass.py:82-95 — The change adds exir_ops.edge.llama.fallback.default alongside executorch_call_delegate in the condition that preserves existing specs instead of regenerating them. This makes sense: the llama.fallback op acts as a passthrough between delegates, and its spec should be preserved just like delegate call specs. The comment from @chenweng-quic explains the motivation well — spec_prop_pass was restoring the spec set by build_quant_io, breaking the dtype for the fallback op in the sharding case.

One concern: the import executorch.extension.llm.custom_ops.op_fallback at the top of spec_prop_pass.py (line 13) is a side-effect import that registers the llama.fallback op. This creates a dependency from the core exir package on an LLM-specific custom op. If op_fallback.py isn't available (e.g., in a minimal install), this import would fail and break spec_prop_pass for all users, not just QNN/llama users. Consider making this a lazy import or guarding it with a try/except.

8. `op_fallback.py` extraction is clean

The extraction of the llama.fallback op registration from model_sharding.py into its own op_fallback.py file is a good separation of concerns. The import in model_sharding.py (line 10) ensures backward compatibility.

CI Failures

The PR has 2 CI failures:

test-sqnr-static-llm-qnn-linux (smollm2_135m) — test_static_llm_model
test-static-llama-qnn-linux (stories_260k_bc) — test_llama_stories_260k

These may be related to the spec_prop_pass change or the build_quant_io changes affecting non-fp16 quantized models. Worth investigating whether the kv_cache dtype detection logic in the C++ runner works correctly for the 8-bit quantized case that these tests exercise.

Summary

The core refactoring approach (removing templates, using runtime dtype detection) is the right direction. The main concerns are:

Critical: Missing Half dtype support in getDtypeSize(), fill_mask(), and logits_to_token() — this could cause runtime crashes for the fp16 use case this PR is enabling.
Critical: The kv_cache dtype detection loop in KVManager constructor may trigger false "mixed scalar type" errors on non-kv-cache input tensors.
Medium: The unconditional import of op_fallback in spec_prop_pass.py could break non-LLM users.
Minor: -65535.0 magic number in float mask, trailing semicolon.

chenweng-quic · 2026-05-18T02:53:34Z

Hi @psiddh,
Could you take a look at this PR?
Thanks.

psiddh · 2026-05-20T00:55:43Z

@claude Re-review the PR thoroughly

claude · 2026-05-20T00:56:32Z

Claude finished @psiddh's task in 2m 41s

Re-review: Qualcomm AI Engine Direct - Refactor llama runner for dynamic IO dtypes

Read and analyze all C++ runner changes (kv_manager, token_generator, prompt_processor, decoder_runner, utils)
Read and analyze Python export/pass changes (build_quant_io, spec_prop_pass, llama.py, op_fallback)
Read multimodal runner changes
Read test and JNI changes
Check if previous review issues have been addressed
Post comprehensive review findings

Overall Assessment

This is a well-executed refactoring that removes compile-time dtype dispatch (template <typename T>) in favor of runtime dtype detection via MethodMeta. The approach is correct: detecting kv_cache_dtype_ and attention_mask_dtype_ from model metadata at construction time is cleaner, more extensible, and enables GPU backend support for floating-point models.

The changes are primarily mechanical — removing template parameters, switching from typed pointers to std::byte* with getDtypeSize() arithmetic, and threading MethodMeta through constructors. The code quality is good and the architecture is sound.

Compared to the previous review, Issue #6 (kv_cache dtype detection ordering bug) has been fixed — the size check now occurs before the dtype consistency check (lines 80-90 in kv_manager.cpp), and there's a post-loop assertion that kv_cache_dtype_ was actually found.

Issues Found

1. Missing `Half` (float16) support in `getDtypeSize()` — Medium severity

examples/qualcomm/oss_scripts/llama/runner/utils.h:30-51

The getDtypeSize() function handles Float, Double, Int, Long, Byte, and UInt16 but does not handle Half (float16). While the current GPU backend may not use Half tensors directly (GPU typically uses Float), this is a gap given the PR's goal of supporting floating-point models. If a future model exports with Half dtype for kv_cache or attention_mask, this will crash at runtime.

Since the PR is specifically enabling --use_fp16 support, it's worth confirming whether the QNN GPU backend maps fp16 to Float tensors at the ExecuTorch level. If it does, this is a latent issue rather than an immediate bug.

2. Missing `Half` support in `fill_mask()` and `logits_to_token()` — Same as above

examples/qualcomm/oss_scripts/llama/runner/kv_manager.cpp:19-63 and examples/qualcomm/oss_scripts/llama/runner/decoder_runner.h:74-89

Same concern — if Half dtype is ever encountered, these will crash. The comment at line 50-52 documents that -65535.0 is deliberately chosen as a finite negative, which addresses the previous review's magic number concern.

3. `spec_prop_pass.py` unconditional import — Medium severity

exir/passes/spec_prop_pass.py:13

import executorch.extension.llm.custom_ops.op_fallback  # noqa: F401

This is an unconditional side-effect import in a core exir pass that registers the llama.fallback op. This creates a coupling from the infrastructure layer (exir) to an application layer (extension/llm/custom_ops). If op_fallback.py or its transitive dependencies aren't installed (e.g., minimal runtime-only install), SpecPropPass breaks for all users, not just QNN/llama users.

Recommendation: Guard with try/except, or restructure so spec_prop_pass doesn't need to import the op definition:

try:
    import executorch.extension.llm.custom_ops.op_fallback  # noqa: F401
except ImportError:
    pass

Alternatively, the condition at line 82-85 could use hasattr or getattr with a fallback instead of referencing exir_ops.edge.llama.fallback.default directly.

4. `spec_prop_pass.py` logic correctness — Low risk but worth noting

exir/passes/spec_prop_pass.py:82-95

The change preserves the spec for exir_ops.edge.llama.fallback.default nodes, which is correct for the sharding use case described in the PR comment. However, the condition if "spec" not in node.meta (line 94) means that if the fallback node already has a spec from build_quant_io, it will be preserved. If it doesn't have a spec (first time through), it will be generated from meta_val. This is the right behavior — it prevents spec_prop_pass from overwriting specs set by build_quant_io in the sharding path.

5. `static` local variable in `logits_to_token` — Thread safety concern

examples/qualcomm/oss_scripts/llama/runner/decoder_runner.h:63

static std::vector<float> logits_f(vocab_size);

This static local vector is not thread-safe. If logits_to_token is ever called from multiple threads (e.g., multi-batch inference), this will corrupt data. This isn't a new issue (likely existed before this PR), but now that the dtype handling is more complex, it's worth noting. For the current single-threaded runner this is fine.
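If this ever needs addressing, one option is to let the caller own the scratch buffer, as in the sketch below. It is an assumption-laden illustration: `dequantize_logits()` is a hypothetical helper, and the `scale` and `zero_point` parameters are not taken from the PR. Beyond thread safety, a caller-owned buffer is also resized on every call, whereas a function-local static vector is constructed only once, with the `vocab_size` seen on the first call.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Dequantize logits into a caller-owned scratch buffer instead of a
// function-local static, so each thread (or call site) can use its own storage.
inline void dequantize_logits(
    const std::uint16_t* q,
    std::size_t vocab,
    float scale,
    std::int32_t zero_point,
    std::vector<float>& scratch) {
  assert(q != nullptr || vocab == 0);
  scratch.resize(vocab);  // Tracks the current vocab size on every call.
  for (std::size_t i = 0; i < vocab; ++i) {
    scratch[i] = static_cast<float>(static_cast<std::int32_t>(q[i]) - zero_point) * scale;
  }
}
```

Reusing one `scratch` vector across calls keeps the allocation cost comparable to the static version once the buffer has reached its final size.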

6. Redundant `using` namespace directives in `MethodMeta` usage

In multiple places (runner.cpp, token_generator.cpp, prompt_processor.cpp), MethodMeta is used both as executorch::runtime::MethodMeta and via using executorch::runtime::MethodMeta. This is consistent and correct, but I note that method_meta is sometimes passed as std::unique_ptr<MethodMeta> and sometimes as Result<MethodMeta>. This dual pattern is intentional (constructor vs. init_io), but future readers should understand that the constructor gets a unique_ptr for ownership semantics (detecting dtype) while init_io gets a Result for metadata queries.

Positive Aspects

Clean architecture: The TensorStructRaw type (with runtime dtype) vs TensorStruct<T> (with compile-time type) distinction is well-designed. TensorStructRaw for attention_mask, window_attention_mask, and logits cleanly handles the dynamic dtype case.
KV cache detection logic is now robust: The constructor first checks size to identify kv_cache tensors, then validates dtype consistency only among matching tensors. The post-loop assertion kv_cache_dtype_ != Undefined ensures early failure if no kv_cache was found.
JNI simplification: Removing the KvBitWidth enum and template dispatch in jni_layer_llama.cpp is a significant cleanup — the runtime dtype detection makes this complexity unnecessary.
GPU backend support: The llama.py changes properly handle the gpu backend option throughout, and skip_quantize[modality] = args.use_fp16 correctly bypasses quantization for fp16 mode.
op_fallback.py extraction: Moving the op registration from model_sharding.py to its own file is good separation of concerns.
Test coverage: Adding the --use_fp16 flag to the test harness enables CI validation of the new code path.

CI Failures

The two CI failures are:

test-sqnr-static-llm-qnn-linux (smollm2_135m) — test_static_llm_model
test-static-llama-qnn-linux (stories_260k_bc) — docker exec failure (likely infrastructure)

The first failure could be related to the spec_prop_pass change affecting the quantized path. Worth checking if the llama.fallback spec preservation impacts the SQNR evaluation flow for sharded models.

Summary

This is a solid refactoring. The main actionable issue is the unconditional import in spec_prop_pass.py (#3 above), which creates an unwanted coupling from core infrastructure to an LLM-specific extension. The missing Half dtype support (#1, #2) is a latent gap that should be tracked but may not be blocking depending on how QNN maps fp16.

The previous review's critical issue #6 (kv_cache dtype detection ordering bug) has been properly fixed in this version.

chenweng-quic requested review from JacobSzwejbka, abhinaykukkadapu, kirklandsign and larryliu0820 as code owners April 27, 2026 05:36

meta-cla Bot added the CLA Signed label Apr 27, 2026

chenweng-quic added the module: qnn label Apr 27, 2026

chenweng-quic force-pushed the dev1/chenweng/support_llama_dynamic_io_dtypes branch from b06f047 to 66fafc5 April 28, 2026 01:37

chenweng-quic requested a review from mergennachin as a code owner April 28, 2026 06:26

shewu-quic mentioned this pull request May 15, 2026 in "Qualcomm AI Engine Direct - Add fp16a8w quantization config" #19537 (Open)

chenweng-quic force-pushed the dev1/chenweng/support_llama_dynamic_io_dtypes branch from 13fbb40 to 8222b4b May 17, 2026 11:47

chenweng-quic requested a review from psiddh as a code owner May 17, 2026 11:47

chenweng-quic wants to merge 1 commit into pytorch:main from CodeLinaro:dev1/chenweng/support_llama_dynamic_io_dtypes

3 participants
