
Add: HF compare testing framework for Qwen3-14B model validation #184

Open

wangqin1723-max wants to merge 3 commits into hw-native-sys:main from wangqin1723-max:feat/hf-compare-framework

Conversation

@wangqin1723-max
Contributor

Summary

  • Adds a HuggingFace comparison framework under llm/testing/hf_compare/ for validating pypto kernels against HF reference implementations
  • Supports three test case types: prefill, decode, and end-to-end
  • Includes batch configuration, weight adapter utilities, and input sampling
  • Adds unit and integration tests in tests/llm/testing/

Test plan

  • Run unit tests: python -m pytest tests/llm/testing/
  • Run HF compare end-to-end: python -m llm.testing.hf_compare run qwen3_14b.e2e -k hf_model_path=<model> -k cpu_only=true

🤖 Generated with Claude Code

coderabbitai Bot commented Apr 27, 2026

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

This PR introduces a fused multi-layer Qwen3-14B decode kernel compiled into a single PyPTO program and updates the executor and engine to use stacked weights and padded block tables for paged attention. It also adds a comprehensive HuggingFace comparison testing framework with reusable abstractions for reference models, target models, weight adapters, and comparison runners, including concrete implementations for single-layer prefill/decode, multi-layer decode, and end-to-end generation.

Changes

Fused Multi-Layer Decode & Host Integration

| Layer / File(s) | Summary |
| --- | --- |
| Fused Decode Kernel<br>`models/qwen3/14b/qwen3_14b_decode_full.py` | Qwen3-14B fused decode kernel processes all transformer layers end-to-end with RMSNorm, Q/K/V projections, RoPE, grouped-head attention, output projection, post-RMSNorm, and MLP gating in a single PyPTO kernel; uses dynamic dimensions for variable batch size and includes a golden PyTorch reference. |
| KV Cache Utilities<br>`llm/core/kv_cache.py` | `block_table_for_batch_padded()` builds flattened, -1-padded block tables for paged attention; `materialize_decode_cache_all_layers()` returns reshaped key/value caches across all layers. |
| Executor Fused Decode Path<br>`llm/core/pypto_executor.py` | `PyptoQwen14BExecutor` refactored to execute fused multi-layer decode instead of per-layer loops; materializes all-layer KV caches, unpacks stacked weight tensors during compilation via `_stack_decode_weights()`, and adds timing instrumentation for prefill/decode phases. |
| Engine Integration<br>`llm/core/engine.py` | `generate_batch()` and `generate_stream()` updated to call `block_table_for_batch_padded()` instead of `block_table_for_batch()`. |
| Model & Config Updates<br>`llm/model/qwen3_14b_decode.py`, `llm/examples/qwen3_14b_npu_generate.py` | Qwen3 decode module path updated to reference the new fused kernel file; example batch size increased from 1 to 16. |
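The padded block-table helper (`block_table_for_batch_padded()`) named above is described only briefly; a minimal sketch of what such a helper might look like follows. The function name and the -1 padding come from the PR summary, but the exact shapes, dtype, and flattening layout here are assumptions:

```python
import torch

def block_table_for_batch_padded(per_seq_blocks, max_blocks):
    # Flatten per-sequence block lists into one [batch * max_blocks]
    # int32 tensor; short rows are padded with the sentinel -1 so every
    # sequence occupies a fixed-width stride (assumed layout).
    table = torch.full((len(per_seq_blocks), max_blocks), -1, dtype=torch.int32)
    for row, blocks in enumerate(per_seq_blocks):
        table[row, : len(blocks)] = torch.tensor(blocks, dtype=torch.int32)
    return table.flatten()

# Two sequences holding different numbers of KV-cache blocks.
flat = block_table_for_batch_padded([[0, 1, 2], [3]], max_blocks=4)
```

A fixed-width padded table lets the kernel index any sequence's blocks with simple stride arithmetic instead of per-sequence offsets.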

HF Comparison Testing Framework

| Layer / File(s) | Summary |
| --- | --- |
| Core Abstractions<br>`llm/testing/hf_compare/base.py` | Protocol definitions for `ReferenceModel`, `TargetModel`, `WeightAdapter` and dataclasses for `TensorSpec`, `InputSpec`, `OutputSelector`, `Tolerance`, `ComparisonCase`, `CompareReport`; includes a case registry with the `@register_case()` decorator and `get_case()` / `list_cases()` helpers. |
| Reference & Target Models<br>`llm/testing/hf_compare/reference.py`, `llm/testing/hf_compare/target.py` | `CallableReference` and `HFModuleReference` wrap HF forward functions and modules; `CallableTarget`, `PyPTOKernelTarget`, and `TorchTarget` provide target implementations for custom code, compiled kernels, and PyTorch modules with backend/platform mapping. |
| Weight Adapters & Sampling<br>`llm/testing/hf_compare/weight_adapter.py`, `llm/testing/hf_compare/input_sampler.py` | Declarative weight-adapter framework with tensor ops (`Cast`, `Transpose`, `View`, `Reshape`); `DictAdapter` and `PassthroughAdapter` implementations; `load_hf_state()` for safetensors loading; input samplers for uniform, normal, and constant distributions plus RoPE table generation. |
| Comparison & Runner<br>`llm/testing/hf_compare/comparator.py`, `llm/testing/hf_compare/runner.py` | `Comparator` computes metrics (`max_abs`, `mean_abs`, `max_rel`, `cosine`, `pass_rate`); `run_case()` orchestrates reference/target execution and reporting; `run_chain()` supports multi-step chained comparisons. |
| CLI & Package<br>`llm/testing/hf_compare/__init__.py`, `llm/testing/hf_compare/__main__.py` | Package re-exports the public API; CLI entry point with a `list` subcommand to enumerate cases and a `run` subcommand with `--json` output and `--kwarg`/`-k` parameter passing. |
| Qwen3 Test Cases<br>`llm/testing/hf_compare/cases/qwen3_14b_*.py` | Four registered cases: `qwen3_14b.prefill` (single-layer), `qwen3_14b.decode` (single-layer), `qwen3_14b.decode_full` (multi-layer fused with `StackedAdapter`), and `qwen3_14b.e2e` (full generation with CPU and NPU backends); each includes HF reference forward, kernel loading, weight adaptation, and paged-attention initialization. |
| Tests<br>`tests/llm/testing/test_hf_compare_*.py` | Batch-configuration validation for prefill/decode case builders; end-to-end case execution with device resolution, state filtering, and NPU runner integration tests. |
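The comparator metrics listed above (`max_abs`, `mean_abs`, `max_rel`, `cosine`, `pass_rate`) follow common conventions for kernel validation. A sketch of plausible definitions is below; the PR's exact formulas are not shown, so treat these as assumptions:

```python
import torch

def compare_metrics(ref: torch.Tensor, out: torch.Tensor, atol=5e-3, rtol=5e-3):
    # Illustrative versions of the walkthrough's metrics; the PR's exact
    # definitions may differ (e.g. in eps handling or pass criteria).
    ref_f, out_f = ref.float(), out.float()
    diff = (ref_f - out_f).abs()
    rel = diff / ref_f.abs().clamp_min(1e-12)  # guard against division by zero
    cos = torch.nn.functional.cosine_similarity(
        ref_f.flatten(), out_f.flatten(), dim=0
    )
    # Elementwise allclose-style criterion, aggregated as a pass rate.
    passed = diff <= atol + rtol * ref_f.abs()
    return {
        "max_abs": diff.max().item(),
        "mean_abs": diff.mean().item(),
        "max_rel": rel.max().item(),
        "cosine": cos.item(),
        "pass_rate": passed.float().mean().item(),
    }

m = compare_metrics(torch.ones(8), torch.full((8,), 1.001))
```

A pass-rate metric is often more robust than a single max-abs threshold for low-precision kernels, since a handful of outliers need not fail the whole tensor.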

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • hw-native-sys/pypto-lib#160: Extended KV-cache and executor foundations from PR #160 to add padded block-table support and fused multi-layer decode path with stacked weights.
  • hw-native-sys/pypto-lib#173: Coordinates with host/kernel changes for padded block-tables and paged-attention layout supporting dynamic user batches.
  • hw-native-sys/pypto-lib#153: Related through shared switch to paged/block-based KV addressing (block_table/slot_mapping) and decode kernel interfaces.

Poem

🐰 A rabbit hops through fused decode dreams,
Stacking weights and cache streams,
Comparing kernels with HF's care,
Block tables padded everywhere!
End-to-end generation flows,
Testing framework that grows and grows! 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 6.06%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title 'Add: HF compare testing framework for Qwen3-14B model validation' is a concise, descriptive summary that clearly indicates the main addition: a HuggingFace comparison testing framework for Qwen3-14B validation. |
| Description check | ✅ Passed | The description matches the changeset: it explains the framework structure (under llm/testing/hf_compare/), the test case types (prefill, decode, e2e), and includes a test plan matching the implemented changes. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |





@wangqin1723-max wangqin1723-max force-pushed the feat/hf-compare-framework branch from b2bee35 to 9ce4312 Compare April 27, 2026 04:17

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a Hugging Face comparison framework for verifying PyPTO kernels and models, including core abstractions, weight adapters, and specific cases for Qwen3-14B. The review feedback identifies several performance bottlenecks where models and kernels are re-initialized or re-compiled on every run instead of once during the preparation phase. Additionally, there are logic errors in the case-chaining implementation, missing tensors in kernel signatures, and inappropriate tolerance settings for token-ID comparisons in end-to-end tests.

Comment on lines +358 to +368
"hidden_states", "seq_lens", "input_rms_weight",
"wq", "wk", "wv", "q_norm_weight", "k_norm_weight",
"rope_cos", "rope_sin", "k_cache", "v_cache",
"wo", "post_rms_weight", "w_gate", "w_up", "w_down", "out",
]
self._decode_spec_order = [
"hidden_states", "input_rms_weight",
"wq", "wk", "wv", "q_norm_weight", "k_norm_weight",
"seq_lens", "rope_cos", "rope_sin", "k_cache", "v_cache",
"wo", "post_rms_weight", "w_gate", "w_up", "w_down", "out",
]

critical

The _prefill_spec_order and _decode_spec_order are missing block_table and slot_mapping tensors, which are required by the Qwen3 kernels. This will cause a mismatch with the kernel signature during NPU execution.

Comment on lines +88 to +90
cfg = AutoConfig.from_pretrained(model_path)
cfg._attn_implementation = "eager"
layer = Qwen3DecoderLayer(cfg, layer_idx=0).to(hf_dtype).eval()

high

Instantiating the Hugging Face model and loading weights inside the forward function (via CallableReference) will cause significant performance overhead as it happens on every run. Consider using HFModuleReference which handles module initialization and weight loading once during the prepare phase.

Comment on lines +370 to +373
compiled_prefill = ir.compile(prefill.build_qwen3_14b_prefill_program(batch=batch, max_seq=max_seq), backend_type=backend)
self._prefill_dir = compiled_prefill.output_dir
compiled_decode = ir.compile(dec.build_qwen3_decode_program(batch=batch, max_seq=max_seq), backend_type=backend)
self._decode_dir = compiled_decode.output_dir

high

Compiling the PyPTO programs inside the _NPURunner constructor, which is itself instantiated inside the target's run function, will trigger a slow compilation on every test iteration. Compilation should be performed once in the prepare phase, similar to how PyPTOKernelTarget is implemented.

cfg._attn_implementation = "eager"
cfg.num_hidden_layers = num_layers

model = Qwen3ForCausalLM(cfg).to(hf_dtype).eval()

high

The Hugging Face model is instantiated inside the forward function. This will be extremely slow for end-to-end tests as it reloads the model on every run. Use HFModuleReference to cache the model instance.

cast_to=torch.float32,
),
],
tolerance=Tolerance(atol=atol, rtol=rtol, pass_rate_threshold=1.0),

high

The default atol and rtol (5e-3) are applied to generated_ids. Since these are token IDs (integers), a non-zero relative tolerance can allow incorrect tokens to pass if their IDs are numerically close. End-to-end generation should typically use exact matching (atol=0, rtol=0) for token IDs.
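The point above about integer token IDs can be made concrete. Token IDs are categorical, so ID 1000 and ID 1001 denote unrelated tokens and any nonzero tolerance is meaningless; a hedged sketch of an exact-match check (the helper name is hypothetical, not from the PR):

```python
import torch

def token_ids_match(ref_ids: torch.Tensor, gen_ids: torch.Tensor) -> bool:
    # Token IDs are categorical labels, not magnitudes: being "numerically
    # close" carries no meaning, so compare exactly (atol=0, rtol=0).
    return bool(torch.equal(ref_ids.long(), gen_ids.long()))

exact = token_ids_match(torch.tensor([5, 7, 9]), torch.tensor([5, 7, 9]))
off_by_one = token_ids_match(torch.tensor([5, 7, 9]), torch.tensor([5, 7, 10]))
```

With the PR's `Tolerance` dataclass, the equivalent configuration would presumably be `Tolerance(atol=0, rtol=0, pass_rate_threshold=1.0)`.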

Comment on lines +83 to +85
cfg = AutoConfig.from_pretrained(model_path)
cfg._attn_implementation = "eager"
layer = Qwen3DecoderLayer(cfg, layer_idx=0).to(hf_dtype).eval()

high

Model initialization and weight loading should be moved to the prepare phase to avoid performance degradation during repeated test runs.

# (ComparisonCase.run did the work but did not return ref outputs, so
# we re-materialize on the fly. For now we skip and rely on tests
# using this chain to either keep state externally or extend the API.)
prev_ref_out = {}

high

The prev_ref_out dictionary is explicitly cleared at the end of each step, and it is never populated with the actual reference outputs (ref_out). This breaks the chaining logic where step N+1 is supposed to receive inputs from step N's reference outputs. Additionally, run_case needs to be updated to return the reference outputs to the caller.
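The fix suggested above — propagating each step's reference outputs into the next step — can be sketched as follows. This is a stand-in, not the PR's `run_chain`: the step dicts and the `run_case` callable returning `(report, ref_outputs)` are assumptions about the API shape:

```python
def run_chain(steps):
    # Each step maps selected outputs of the previous step's reference
    # run into its own inputs via forward_map, then runs its case.
    prev_ref_out = {}
    reports = []
    for step in steps:
        inputs = {dst: prev_ref_out[src] for dst, src in step["forward_map"].items()}
        report, ref_out = step["run_case"](inputs)
        reports.append(report)
        prev_ref_out = ref_out  # propagate — do NOT clear between steps

    return reports

# Hypothetical two-step chain: step 2 consumes step 1's reference output "x".
steps = [
    {"forward_map": {}, "run_case": lambda inp: ("ok1", {"x": 1})},
    {"forward_map": {"y": "x"}, "run_case": lambda inp: (("ok2", inp["y"]), {})},
]
reports = run_chain(steps)
```

The key change relative to the reviewed code is that `prev_ref_out` is replaced with the fresh reference outputs each iteration rather than being reset to an empty dict.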

return all_weights[prefix + name]

def proj(name: str) -> torch.Tensor:
return w(name).t().contiguous().to(torch.bfloat16)

medium

The projection weights are hardcoded to torch.bfloat16. This may cause precision issues or runtime errors if the comparison case is configured with a different hf_dtype (e.g., fp32). It should use the hf_dtype_t resolved in the build function.

hi = ki * q_per_kv + qi
q_head = q_rot[:, hi, :]
scores = (q_head.float() @ k_cached.float().T) * scale
causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)

medium

The causal mask is being re-created inside the head loops. Moving this outside the loops will improve CPU performance during prefill validation.
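The hoisting suggested above is a one-line move; a minimal sketch (variable names follow the snippet, sizes are illustrative):

```python
import torch

seq, head_dim, scale = 8, 16, 0.25

# Build the causal mask ONCE, outside the per-head loops: it depends only
# on seq, not on the head indices ki/qi.
causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)

def head_scores(q_head: torch.Tensor, k_cached: torch.Tensor) -> torch.Tensor:
    # Inside the head loop: reuse the precomputed mask instead of
    # re-allocating an [seq, seq] tensor per head.
    return (q_head.float() @ k_cached.float().T) * scale + causal

s = head_scores(torch.randn(seq, head_dim), torch.randn(seq, head_dim))
```

Adding the mask instead of masking in place keeps the loop body allocation-free apart from the matmul result.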

@wangqin1723-max wangqin1723-max force-pushed the feat/hf-compare-framework branch from 7009dee to 69db614 Compare May 6, 2026 02:34

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 12

🧹 Nitpick comments (4)
llm/core/pypto_executor.py (2)

19-19: ⚡ Quick win

Hard-coded _TIMING_ENABLED = True with print() for every prefill/decode call.

The flag is set unconditionally to True and prints on every prefill/decode invocation, which will be noisy on hot inference paths and is awkward to silence for end-users (they'd have to edit the source). Consider gating the timing on an env var (or a constructor flag on PyptoQwen14BExecutor) and routing output through logging instead of print so consumers can control verbosity at runtime.

♻️ Sketch
-import time
+import logging
+import os
+import time
 from dataclasses import dataclass
 from pathlib import Path

 import torch

-_TIMING_ENABLED = True
+_TIMING_ENABLED = os.environ.get("PYPTO_EXECUTOR_TIMING", "0") == "1"
+_LOG = logging.getLogger(__name__)
@@
-        if _TIMING_ENABLED:
-            print(
-                f"[timing] prefill: {len(model.layers)} layers, "
-                f"{(time.perf_counter() - t_prefill_start) * 1000:.2f} ms",
-                flush=True,
-            )
+        if _TIMING_ENABLED:
+            _LOG.info(
+                "prefill: %d layers, %.2f ms",
+                len(model.layers),
+                (time.perf_counter() - t_prefill_start) * 1000,
+            )

Apply the same swap for the decode timing block.

Also applies to: 188-194, 237-243

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/core/pypto_executor.py` at line 19, replace the hard-coded
_TIMING_ENABLED = True and the ad-hoc print() calls in the prefill and decode
timing blocks with a configurable, non-invasive mechanism: add a boolean
constructor parameter (e.g., timing_enabled) to PyptoQwen14BExecutor that
defaults to reading an environment variable (e.g., PYPTO_EXECUTOR_TIMING) and
remove the global hard-coded flag; replace all print(...) usages in the prefill
and decode timing sections with logger.debug/info calls using the standard
logging module so users can control verbosity via logging configuration, and
gate the timing measurements behind the new instance attribute
(self.timing_enabled) so timing is silent unless explicitly enabled.

334-361: 🏗️ Heavy lift

Peak memory from list accumulation in torch.cat stacking.

The _stack_decode_weights method materializes transient buffers by accumulating all per-layer results in a list before concatenating. For Qwen3-14B with 40 layers, each stack_kernel_weight call collects 40 per-layer tensors (each freshly produced by _kernel_weight—transpose + bf16 + contiguous operations) and holds them all simultaneously. For weights like w_gate/w_up/w_down (~170 MB/layer in bf16 each) across all 11 stacked weights, this pattern keeps peak memory at roughly (N+1) * per_layer_size for each weight type.

If memory is constrained in your target environment, pre-allocate each stacked tensor and copy per-layer slices into it one at a time (releasing each _kernel_weight result before producing the next), reducing peak allocation to ~2 * per_layer_size per weight.
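The pre-allocation pattern described above can be sketched generically. The helper name and the `make_layer_weight` callback are hypothetical stand-ins for `_stack_decode_weights` / `_kernel_weight`:

```python
import torch

def stack_preallocated(num_layers, make_layer_weight):
    # Pre-allocate the stacked tensor up front and copy one layer at a
    # time, so peak memory stays around 2x one layer's weight instead of
    # (num_layers + 1)x with the list-then-torch.cat pattern.
    first = make_layer_weight(0)
    out = torch.empty((num_layers, *first.shape), dtype=first.dtype)
    out[0].copy_(first)
    del first  # release before producing the next layer's tensor
    for i in range(1, num_layers):
        out[i].copy_(make_layer_weight(i))
    return out

# Illustrative use: each "layer weight" is a small tensor filled with its index.
stacked = stack_preallocated(4, lambda i: torch.full((2, 3), float(i)))
```

For ~170 MB/layer MLP weights across 40 layers, this bounds the transient overhead to a single extra layer buffer per weight type.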

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/core/pypto_executor.py` around lines 334 - 361, The current
_stack_decode_weights implementation (stack_kernel_weight and stack_norm_weight)
builds Python lists of per-layer tensors then torch.cat's them, causing high
peak memory; change each stack_* to pre-allocate the final tensor sized
(len(model.layers), ...) and fill slices from each layer one-by-one by calling
_kernel_weight or getattr(layer, ...) per layer and copying into the
pre-allocated buffer, releasing the per-layer result immediately; update
stack_kernel_weight to compute the final shape from a single layer (using
_kernel_weight on layer 0 without keeping it long) and then loop layers to copy
into buffer, and similarly update stack_norm_weight to allocate a (num_layers,
feature) float tensor and fill row-wise from getattr(layer,
attr).cpu().contiguous() so decode_input_rms_weight, decode_wq, decode_wk,
decode_wv, decode_q_norm_weight, decode_k_norm_weight, decode_wo,
decode_post_rms_weight, decode_w_gate, decode_w_up, and decode_w_down are built
without materializing long lists.
llm/testing/hf_compare/cases/qwen3_14b_prefill.py (2)

194-201: 💤 Low value

Repo-root resolution via parents[4] is fragile.

Path(__file__).resolve().parents[4] silently produces a wrong root if the file is ever moved or the package structure changes (you'll then load a kernel from the wrong tree or get a confusing ImportError). Consider centralizing this in a small helper that walks up to a known sentinel (e.g. pyproject.toml or examples/), or pinning it via a package-relative resource. Optional given this is test-only code.
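The sentinel-walking helper suggested above is small enough to sketch in full; the helper name is hypothetical and the sentinel choice (`pyproject.toml`) is one of the options the comment names:

```python
import tempfile
from pathlib import Path

def find_repo_root(start: Path, sentinel: str = "pyproject.toml") -> Path:
    # Walk upward from `start` until a directory containing the sentinel
    # file is found, instead of hard-coding parents[4].
    for candidate in [start, *start.parents]:
        if (candidate / sentinel).exists():
            return candidate
    raise FileNotFoundError(f"no {sentinel!r} found above {start}")

# Demonstrate on a throwaway tree: root/pyproject.toml with a nested dir.
root = Path(tempfile.mkdtemp())
(root / "pyproject.toml").touch()
nested = root / "a" / "b"
nested.mkdir(parents=True)
found = find_repo_root(nested)
```

In the test file the call site would be `find_repo_root(Path(__file__).resolve())`, which keeps working if the module is moved within the repo.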

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py` around lines 194 - 201,
The repo-root computation using Path(__file__).resolve().parents[4] is fragile;
replace it with a small helper (e.g., find_repo_root_by_sentinel) that walks
upward from Path(__file__).resolve() until it finds a known sentinel like
"pyproject.toml" or the "examples" directory and returns that path, then use
that helper to derive _KERNEL_PATH instead of repo_root = Path(...).parents[4];
update the Qwen3 loader code that sets _KERNEL_PATH and the ImportError message
to use the helper and ensure it raises a clear error if the sentinel isn't
found.

131-158: 💤 Low value

Vectorize _select_npu_kv to avoid per-element Python loops.

For typical configs (e.g. batch=16, seq_len=128, NUM_KV_HEADS=8), this runs ~16k Python-level row copies plus a .item() sync per (b, pos). It's easy to express with index arithmetic in a single index_select, which removes the host/device sync and the inner Python loops. Also, out is created via torch.empty and never aliased, so the trailing .clone() on line 158 is unnecessary.

♻️ Vectorized sketch
def _select_npu_kv(flat, *, slot_mapping, block_size, batch, max_seq, seq_len):
    pos = torch.arange(seq_len, device=flat.device)
    b   = torch.arange(batch, device=flat.device).unsqueeze(1)            # [B, 1]
    slots = slot_mapping.view(batch, max_seq)[:, :seq_len].long()         # [B, S]
    slot_block = slots // block_size
    slot_off   = slots %  block_size
    base = slot_block * (NUM_KV_HEADS * block_size) + slot_off            # [B, S]
    kvh  = torch.arange(NUM_KV_HEADS, device=flat.device)                 # [nkv]
    rows = base.unsqueeze(1) + kvh.view(1, -1, 1) * block_size            # [B, nkv, S]
    return flat.index_select(0, rows.reshape(-1)).view(batch, NUM_KV_HEADS, seq_len, -1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py` around lines 131 - 158,
The _select_npu_kv function is doing heavy Python-level loops and device-sync
.item() calls; replace the per-(b,pos) loops with a vectorized index computation
using torch integer ops on slot_mapping to compute slot_block and slot_off,
build a base index = slot_block*(NUM_KV_HEADS*block_size)+slot_off, add a kvh
offset (kvh * block_size) to form a [B, NUM_KV_HEADS, S] rows tensor, then use
flat.index_select(0, rows.reshape(-1)) and reshape to (batch, NUM_KV_HEADS,
seq_len, HEAD_DIM) to return; also remove the unnecessary out.clone() at the
end. Reference symbols: _select_npu_kv, flat, slot_mapping, block_size, batch,
max_seq, seq_len, NUM_KV_HEADS, HEAD_DIM.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@llm/testing/hf_compare/__main__.py`:
- Around line 9-15: The module docstring at the top of
llm/testing/hf_compare/__main__.py advertises CLI flags (--hf-model-path,
--platform, --cpu-only) that the parser doesn't register; update the docstring
to show the actual usage of the implemented CLI (the "list" and "run"
subcommands) and example invocation for "run" using the registered positional
"case", the --json flag and the kwarg option (-k/--kwarg), e.g. show usages like
"python -m llm.testing.hf_compare list" and "python -m llm.testing.hf_compare
run qwen3_14b.decode -k hf_model_path=/path/to/Qwen3-14B -k cpu_only=true" so
copy-paste works and matches how the run subparser (the parser.add_subparsers /
run subparser and its --json and -k/--kwarg definitions) is actually
implemented.
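The `list`/`run` CLI shape referenced above can be sketched with argparse. The subcommand and option names (`run`, `list`, `--json`, `-k`/`--kwarg`) follow the PR description; everything else here is an assumed skeleton, not the actual `__main__.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="llm.testing.hf_compare")
    sub = p.add_subparsers(dest="cmd", required=True)
    sub.add_parser("list", help="enumerate registered cases")
    run = sub.add_parser("run", help="run one comparison case")
    run.add_argument("case", help="registered case name, e.g. qwen3_14b.e2e")
    run.add_argument("--json", action="store_true", help="emit the report as JSON")
    run.add_argument("-k", "--kwarg", action="append", default=[],
                     help="key=value forwarded to the case builder (repeatable)")
    return p

# Mirrors: python -m llm.testing.hf_compare run qwen3_14b.e2e -k cpu_only=true --json
args = build_parser().parse_args(["run", "qwen3_14b.e2e", "-k", "cpu_only=true", "--json"])
kwargs = dict(kv.split("=", 1) for kv in args.kwarg)
```

Keeping the docstring's examples in this exact `list`/`run` form makes them copy-pasteable against the parser that is actually registered.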

In `@llm/testing/hf_compare/cases/qwen3_14b_decode_full.py`:
- Around line 240-263: Validate runtime parameters immediately after parsing:
ensure batch > 0 (already checked), raise ValueError if seq_len is None or
seq_len <= 0, raise ValueError if num_layers <= 0, and validate hf_dtype is one
of the supported strings (e.g. "fp32" or "bf16") and raise ValueError for
others; perform these checks before computing context_len/max_seq_eff
(variables: seq_len, num_layers, hf_dtype, context_len, max_seq_eff,
dec.MAX_SEQ) so invalid cases fail during case construction rather than during
decoding.

In `@llm/testing/hf_compare/cases/qwen3_14b_decode.py`:
- Around line 202-231: Validate inputs early: ensure seq_len (if provided) and
ctx_len (when used as fallback) are positive and raise ValueError if <= 0, and
validate hf_dtype against allowed values ("fp32" or "bf16") before mapping to
hf_dtype_t (currently set in hf_dtype_t = torch.float32 if hf_dtype == "fp32"
else torch.bfloat16). Specifically, add a check near where seq_len/ctx_len are
processed to reject non-positive lengths (reference variables seq_len, ctx_len,
context_len) and add an explicit hf_dtype validation that raises a ValueError
for unknown strings before assigning hf_dtype_t so typos don't silently become
bfloat16.

In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py`:
- Around line 66-72: The _parse_prompt_ids function currently converts
list/tuple inputs to integers but doesn't reject an empty list/tuple; ensure you
validate and raise the same ValueError("prompt_ids must contain at least one
token id") for list/tuple inputs too by checking the computed out list after
converting the values (from the isinstance(prompt_ids, (list, tuple)) branch)
and raising if empty so callers like the e2e runner won't receive an empty list;
keep the same error message and behavior for the string path.
- Around line 619-639: Validate inputs early: after coercing and converting
inputs (num_layers, max_new_tokens, hf_dtype) add checks that num_layers > 0 and
max_new_tokens > 0 and that hf_dtype is one of the allowed strings ("fp32" or
"bf16"); if any check fails, raise a ValueError with a clear message. Replace
the current silent fallback logic that sets hf_dtype_t = torch.float32 if
hf_dtype == "fp32" else torch.bfloat16 with an explicit mapping that first
validates hf_dtype and then assigns hf_dtype_t so typos don't silently pick
bf16; ensure these checks sit near the existing conversions around the variables
num_layers, max_new_tokens, and hf_dtype_t in this builder function.
- Around line 490-516: The call to _load_qwen3_modules() runs unconditionally
and pulls in backend-only code (e.g., pypto.language) even on the CPU-only path;
move the _load_qwen3_modules() invocation into the non-CPU branch (the branch
guarded by the cpu_only flag) or else hoist the static constants HIDDEN,
HEAD_DIM, and NUM_KV_HEADS into module-level constants so the CPU-only path does
not import prefill; update uses of prefill.HIDDEN, prefill.HEAD_DIM, and
prefill.NUM_KV_HEADS in the code that builds layer_weights and cache_rows to
reference the new constants or to only access prefill after the conditional.

In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py`:
- Around line 182-192: The code silently maps any hf_dtype string other than
"fp32" to bfloat16; change this to validate hf_dtype explicitly and raise a
clear error for unknown values: accept a defined set (e.g. "fp32", "float32" ->
torch.float32; "bfloat16", "bf16" -> torch.bfloat16; optionally "fp16" ->
torch.float16), map the allowed strings to the correct torch dtype and set
hf_dtype_t accordingly, and throw a ValueError with a descriptive message if
hf_dtype is not in the allowed set (update the assignment around hf_dtype_t to
perform the validation/mapping).
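The explicit dtype mapping described above is a few lines; a hedged sketch (the accepted alias set is the one the comment suggests, not necessarily what the PR ships):

```python
import torch

_DTYPES = {
    "fp32": torch.float32, "float32": torch.float32,
    "bf16": torch.bfloat16, "bfloat16": torch.bfloat16,
    "fp16": torch.float16,
}

def resolve_hf_dtype(name: str) -> torch.dtype:
    # Fail loudly on typos instead of silently falling back to bfloat16.
    try:
        return _DTYPES[name]
    except KeyError:
        raise ValueError(
            f"unsupported hf_dtype {name!r}; expected one of {sorted(_DTYPES)}"
        ) from None

dt = resolve_hf_dtype("fp32")
```

The same helper would replace the `torch.float32 if hf_dtype == "fp32" else torch.bfloat16` expressions flagged in the decode and e2e case builders as well.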

In `@llm/testing/hf_compare/input_sampler.py`:
- Around line 46-57: The build_rope_tables function must reject odd head_dim to
avoid silently producing wrong-width RoPE tables; add an early validation in
build_rope_tables that checks head_dim is even (e.g., head_dim % 2 == 0) and
raise a ValueError with a clear message if not, then proceed to compute
inv_freq, freqs, emb, and return emb.cos()/emb.sin() as before; reference the
existing symbols head_dim, inv_freq, freqs, and emb so the check is inserted
before inv_freq is computed.
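The validated `build_rope_tables` described above can be sketched end to end. The duplicated-halves layout matches the HF convention the comment's symbols (`inv_freq`, `freqs`, `emb`) imply; the `base` default is an assumption:

```python
import torch

def build_rope_tables(head_dim: int, max_seq: int, base: float = 10000.0):
    # Reject odd head_dim up front rather than silently emitting
    # wrong-width tables (arange(0, head_dim, 2) would truncate).
    if head_dim % 2 != 0:
        raise ValueError(f"head_dim must be even, got {head_dim}")
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    pos = torch.arange(max_seq).float()
    freqs = torch.outer(pos, inv_freq)       # [max_seq, head_dim // 2]
    emb = torch.cat((freqs, freqs), dim=-1)  # [max_seq, head_dim]
    return emb.cos(), emb.sin()

cos, sin = build_rope_tables(head_dim=8, max_seq=16)
```

Position 0 always yields an all-ones cosine row, which makes a handy sanity check in tests.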

In `@llm/testing/hf_compare/reference.py`:
- Around line 99-103: forward() currently passes raw input tensors to
forward_fn, which can mismatch the module dtype; modify forward() to first cast
any floating-point tensors in the inputs mapping to self.dtype (use
torch.is_floating_point to detect) into a new inputs dict (preserving
non-floating tensors unchanged and preserving device), then call
self.forward_fn(self._module, cast_inputs) inside the existing torch.no_grad()
block; reference the forward method, self.dtype, forward_fn, and prepare() to
locate the related code.
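The cast-only-floats normalization described above is the same pattern on both the reference and target sides; a minimal sketch (the helper name is hypothetical):

```python
import torch

def cast_float_inputs(inputs: dict, dtype: torch.dtype) -> dict:
    # Cast only floating-point tensors to the module's dtype; integer
    # tensors (token ids, seq lens) and non-tensor values pass through.
    return {
        k: v.to(dtype)
        if torch.is_tensor(v) and torch.is_floating_point(v)
        else v
        for k, v in inputs.items()
    }

casted = cast_float_inputs(
    {"hidden": torch.zeros(2, 4), "ids": torch.tensor([1, 2])},
    torch.bfloat16,
)
```

Inside `forward()` this would run just before `self.forward_fn(self._module, casted)` under the existing `torch.no_grad()` block.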

In `@llm/testing/hf_compare/runner.py`:
- Around line 113-140: The run_chain loop never propagates reference outputs and
permanently mutates user cases: stop clearing prev_ref_out each iteration and
actually capture the reference outputs from run_case (e.g., change run_case to
return (CompareReport, dict[str,torch.Tensor]) or expose ref outputs on
ComparisonCase) and assign that to prev_ref_out so forward_map can work; also
avoid mutating step.case.on_inputs in place—either restore the original hook
after calling run_case or run a shallow copy of step.case with the wrapped
on_inputs when invoking run_case (do not leave the wrapper assigned to the
original case).

In `@llm/testing/hf_compare/target.py`:
- Around line 173-195: In run(), before calling forward_fn(self._module,
inputs), ensure all floating-point tensors in the inputs Mapping are converted
to self.dtype to avoid dtype mismatches with the module (e.g., convert any
tensor where tensor.is_floating_point() is True to tensor.to(self.dtype)); do
this conversion only when self._module is prepared (after module load) and pass
the normalized inputs into forward_fn; reference symbols: run, forward_fn,
module_factory, state_dict_prefix, strict_load, and self._module.

In `@llm/testing/hf_compare/weight_adapter.py`:
- Around line 49-56: The Transpose Op implementation violates its docstring: in
Transpose -> _op it uses t.t() which only works for 2D matrices; change the
default branch to call t.transpose(-2, -1) (or tensor.transpose) to swap the
last two dimensions for tensors with ndim>=2, leaving scalars/1D untouched, and
keep the existing t.permute(*dims) behavior when dims are provided; update the
_op in Transpose accordingly.
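The corrected default branch suggested above amounts to swapping the last two axes; a sketch of the op as a plain function (the PR's `Transpose` is presumably a dataclass-style Op, so this is shape-behavior only):

```python
import torch

def transpose_op(t: torch.Tensor, dims=None) -> torch.Tensor:
    # Explicit dims: full permute, as before. Default: swap the last two
    # axes, which works for any ndim >= 2 (t.t() is restricted to 2-D).
    if dims is not None:
        return t.permute(*dims)
    return t.transpose(-2, -1)

x = torch.zeros(3, 4, 5)
y = transpose_op(x)               # swaps the trailing pair: [3, 5, 4]
z = transpose_op(x, dims=(2, 0, 1))  # explicit permutation: [5, 3, 4]
```

`transpose(-2, -1)` also matches what callers adapting stacked `[num_layers, out, in]` weights would expect, since the leading layer axis is left untouched.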

---

Nitpick comments:
In `@llm/core/pypto_executor.py`:
- Line 19: Replace the hard-coded _TIMING_ENABLED = True and the ad-hoc
print()calls in the prefill and decode timing blocks with a configurable,
non-invasive mechanism: add a boolean constructor parameter (e.g.,
timing_enabled) to PyptoQwen14BExecutor that defaults to reading an environment
variable (e.g., PYRPT_TIMING) and remove the global hard-coded flag; replace all
print(...) usages in the prefill and decode timing sections with
logger.debug/info calls using the standard logging module so users can control
verbosity via logging configuration, and gate the timing measurements behind the
new instance attribute (self.timing_enabled) so timing is silent unless
explicitly enabled.
- Around line 334-361: The current _stack_decode_weights implementation
(stack_kernel_weight and stack_norm_weight) builds Python lists of per-layer
tensors then torch.cat's them, causing high peak memory; change each stack_* to
pre-allocate the final tensor sized (len(model.layers), ...) and fill slices
from each layer one-by-one by calling _kernel_weight or getattr(layer, ...) per
layer and copying into the pre-allocated buffer, releasing the per-layer result
immediately; update stack_kernel_weight to compute the final shape from a single
layer (using _kernel_weight on layer 0 without keeping it long) and then loop
layers to copy into buffer, and similarly update stack_norm_weight to allocate a
(num_layers, feature) float tensor and fill row-wise from getattr(layer,
attr).cpu().contiguous() so decode_input_rms_weight, decode_wq, decode_wk,
decode_wv, decode_q_norm_weight, decode_k_norm_weight, decode_wo,
decode_post_rms_weight, decode_w_gate, decode_w_up, and decode_w_down are built
without materializing long lists.
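The pre-allocation pattern described above can be sketched as follows. `stack_norm_weights` is an illustrative stand-in for the real `stack_norm_weight` closure; it probes layer 0 for the row shape, allocates the final buffer once, and copies row by row so each per-layer tensor is released immediately:

```python
import torch

def stack_norm_weights(layers, attr: str) -> torch.Tensor:
    """Pre-allocate the (num_layers, ...) buffer and fill it row by row, so no
    long-lived Python list of per-layer tensors is built before a final cat."""
    probe = getattr(layers[0], attr)
    out = torch.empty((len(layers), *probe.shape), dtype=probe.dtype)
    for i, layer in enumerate(layers):
        # Each per-layer tensor is released as soon as its row is copied.
        out[i].copy_(getattr(layer, attr).cpu().contiguous())
    return out
```

Peak memory stays at one buffer plus one per-layer tensor, instead of the full list of per-layer tensors plus the concatenated result.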

In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py`:
- Around line 194-201: The repo-root computation using
Path(__file__).resolve().parents[4] is fragile; replace it with a small helper
(e.g., find_repo_root_by_sentinel) that walks upward from
Path(__file__).resolve() until it finds a known sentinel like "pyproject.toml"
or the "examples" directory and returns that path, then use that helper to
derive _KERNEL_PATH instead of repo_root = Path(...).parents[4]; update the
Qwen3 loader code that sets _KERNEL_PATH and the ImportError message to use the
helper and ensure it raises a clear error if the sentinel isn't found.
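A possible shape for that helper (the sentinel `pyproject.toml` is an assumption; any file known to exist only at the repo root works):

```python
from pathlib import Path

def find_repo_root_by_sentinel(start: Path, sentinel: str = "pyproject.toml") -> Path:
    """Walk upward from `start` until a directory containing `sentinel` is found.

    Unlike Path(__file__).resolve().parents[4], this keeps working if the file
    is moved to a different depth in the tree.
    """
    for candidate in (start, *start.parents):
        if (candidate / sentinel).exists():
            return candidate
    raise FileNotFoundError(f"No {sentinel!r} found in any parent of {start}")
```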
- Around line 131-158: The _select_npu_kv function is doing heavy Python-level
loops and device-sync .item() calls; replace the per-(b,pos) loops with a
vectorized index computation using torch integer ops on slot_mapping to compute
slot_block and slot_off, build a base index =
slot_block*(NUM_KV_HEADS*block_size)+slot_off, add a kvh offset (kvh *
block_size) to form a [B, NUM_KV_HEADS, S] rows tensor, then use
flat.index_select(0, rows.reshape(-1)) and reshape to (batch, NUM_KV_HEADS,
seq_len, HEAD_DIM) to return; also remove the unnecessary out.clone() at the
end. Reference symbols: _select_npu_kv, flat, slot_mapping, block_size, batch,
max_seq, seq_len, NUM_KV_HEADS, HEAD_DIM.
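The vectorized index computation described above can be sketched like this. The function name and the flat cache layout (rows ordered block-major, then kv-head, then in-block offset) are assumptions matching the index formula in the comment, not the actual `_select_npu_kv` signature:

```python
import torch

def select_kv_vectorized(
    flat: torch.Tensor,          # [num_blocks * num_kv_heads * block_size, head_dim]
    slot_mapping: torch.Tensor,  # [batch, seq_len] integer slot ids
    block_size: int,
    num_kv_heads: int,
    head_dim: int,
) -> torch.Tensor:
    """Replace per-(b, pos) Python loops and .item() syncs with tensor ops."""
    batch, seq_len = slot_mapping.shape
    slot_block = slot_mapping // block_size                       # [B, S]
    slot_off = slot_mapping % block_size                          # [B, S]
    base = slot_block * (num_kv_heads * block_size) + slot_off    # [B, S]
    kvh = torch.arange(num_kv_heads).view(1, num_kv_heads, 1) * block_size
    rows = base.unsqueeze(1) + kvh                                # [B, H_kv, S]
    out = flat.index_select(0, rows.reshape(-1))
    return out.view(batch, num_kv_heads, seq_len, head_dim)
```

A single `index_select` replaces the nested loops, and no trailing `.clone()` is needed since `index_select` already returns a fresh tensor.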

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: d5086288-7b7b-471c-969a-79ea791ebefd

📥 Commits

Reviewing files that changed from the base of the PR and between 9d74d6f and 7009dee.

📒 Files selected for processing (21)
  • examples/models/qwen3/14b/qwen3_14b_decode_full.py
  • llm/core/kv_cache.py
  • llm/core/pypto_executor.py
  • llm/model/qwen3_14b_decode.py
  • llm/testing/__init__.py
  • llm/testing/hf_compare/__init__.py
  • llm/testing/hf_compare/__main__.py
  • llm/testing/hf_compare/base.py
  • llm/testing/hf_compare/cases/__init__.py
  • llm/testing/hf_compare/cases/qwen3_14b_decode.py
  • llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
  • llm/testing/hf_compare/cases/qwen3_14b_e2e.py
  • llm/testing/hf_compare/cases/qwen3_14b_prefill.py
  • llm/testing/hf_compare/comparator.py
  • llm/testing/hf_compare/input_sampler.py
  • llm/testing/hf_compare/reference.py
  • llm/testing/hf_compare/runner.py
  • llm/testing/hf_compare/target.py
  • llm/testing/hf_compare/weight_adapter.py
  • tests/llm/testing/test_hf_compare_batch_config.py
  • tests/llm/testing/test_hf_compare_e2e.py

Comment on lines +9 to +15
"""CLI for the HF comparison framework.

Usage:
python -m llm.testing.hf_compare list
python -m llm.testing.hf_compare run qwen3_14b.decode \\
--hf-model-path /path/to/Qwen3-14B --platform a2a3 [--cpu-only]
"""

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Module docstring advertises CLI flags that don't exist.

The example shows --hf-model-path, --platform, and --cpu-only, but the run subparser only registers case, --json, and --kwarg/-k. Anyone copy-pasting this command will get error: unrecognized arguments. The PR test plan correctly uses -k hf_model_path=... -k cpu_only=true, so update the docstring to match.

📝 Suggested docstring fix
 """CLI for the HF comparison framework.
 
 Usage:
     python -m llm.testing.hf_compare list
-    python -m llm.testing.hf_compare run qwen3_14b.decode \\
-        --hf-model-path /path/to/Qwen3-14B --platform a2a3 [--cpu-only]
+    python -m llm.testing.hf_compare run qwen3_14b.decode \\
+        -k hf_model_path=/path/to/Qwen3-14B -k platform=a2a3 [-k cpu_only=true]
 """
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/__main__.py` around lines 9 - 15, The module docstring
at the top of llm/testing/hf_compare/__main__.py advertises CLI flags
(--hf-model-path, --platform, --cpu-only) that the parser doesn't register;
update the docstring to show the actual usage of the implemented CLI (the "list"
and "run" subcommands) and example invocation for "run" using the registered
positional "case", the --json flag and the kwarg option (-k/--kwarg), e.g. show
usages like "python -m llm.testing.hf_compare list" and "python -m
llm.testing.hf_compare run qwen3_14b.decode -k hf_model_path=/path/to/Qwen3-14B
-k cpu_only=true" so copy-paste works and matches how the run subparser (the
parser.add_subparsers / run subparser and its --json and -k/--kwarg definitions)
is actually implemented.
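To make the documented surface concrete, here is a minimal sketch of a `run` subparser that accepts only the registered options (`case`, `--json`, `-k/--kwarg`) plus a `KEY=VALUE` splitter. The helper names and exact flags are assumptions inferred from this review, not the real `__main__.py`:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Sketch of the CLI surface: `list`, plus `run CASE [-k KEY=VALUE]...`."""
    parser = argparse.ArgumentParser(prog="llm.testing.hf_compare")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list")
    run = sub.add_parser("run")
    run.add_argument("case")
    run.add_argument("--json", action="store_true")
    run.add_argument("-k", "--kwarg", action="append", default=[], metavar="KEY=VALUE")
    return parser

def parse_kwargs(pairs: list[str]) -> dict[str, str]:
    """Turn repeated -k KEY=VALUE options into a kwargs dict."""
    out: dict[str, str] = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {pair!r}")
        out[key] = value
    return out
```

With this shape, the PR test plan's invocation (`run qwen3_14b.e2e -k hf_model_path=<model> -k cpu_only=true`) parses cleanly, while the flags previously shown in the docstring would be rejected.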

Comment on lines +240 to +263
    device_id = int(device_id) if device_id is not None else None
    batch = int(batch)
    seed = int(seed)
    num_layers = int(num_layers)
    seq_len = int(seq_len) if seq_len is not None else None
    max_seq = int(max_seq) if max_seq is not None else None
    atol = float(atol)
    rtol = float(rtol)
    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16

    repo_root = Path(__file__).resolve().parents[4]
    _KERNEL_PATH = repo_root / "examples" / "models" / "qwen3" / "14b" / "qwen3_14b_decode_full.py"
    _SPEC = importlib.util.spec_from_file_location("_qwen3_14b_decode_full_kernel", _KERNEL_PATH)
    if _SPEC is None or _SPEC.loader is None:
        raise ImportError(f"Cannot load fused decode kernel from {_KERNEL_PATH}")
    dec = importlib.util.module_from_spec(_SPEC)
    _SPEC.loader.exec_module(dec)

    if batch <= 0:
        raise ValueError(f"batch must be positive, got {batch}")
    context_len = seq_len if seq_len is not None else dec.MAX_SEQ
    max_seq_eff = max_seq or dec.MAX_SEQ
    if context_len > max_seq_eff:
        raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject zero/invalid runtime parameters up front.

seq_len=0 makes the decode position negative in the reference path, num_layers=0 creates a meaningless "fused" comparison, and any hf_dtype other than "fp32" currently falls through to bf16. These should fail during case construction.

Proposed fix
     batch = int(batch)
     seed = int(seed)
     num_layers = int(num_layers)
     seq_len = int(seq_len) if seq_len is not None else None
     max_seq = int(max_seq) if max_seq is not None else None
     atol = float(atol)
     rtol = float(rtol)
-    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
+    if num_layers <= 0:
+        raise ValueError(f"num_layers must be positive, got {num_layers}")
+    if hf_dtype not in {"fp32", "bf16"}:
+        raise ValueError(f"hf_dtype must be 'fp32' or 'bf16', got {hf_dtype!r}")
+    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
@@
     if batch <= 0:
         raise ValueError(f"batch must be positive, got {batch}")
     context_len = seq_len if seq_len is not None else dec.MAX_SEQ
     max_seq_eff = max_seq or dec.MAX_SEQ
+    if context_len <= 0:
+        raise ValueError(f"seq_len must be positive, got {context_len}")
     if context_len > max_seq_eff:
         raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_decode_full.py` around lines 240 -
263, Validate runtime parameters immediately after parsing: ensure batch > 0
(already checked), raise ValueError if seq_len is None or seq_len <= 0, raise
ValueError if num_layers <= 0, and validate hf_dtype is one of the supported
strings (e.g. "fp32" or "bf16") and raise ValueError for others; perform these
checks before computing context_len/max_seq_eff (variables: seq_len, num_layers,
hf_dtype, context_len, max_seq_eff, dec.MAX_SEQ) so invalid cases fail during
case construction rather than during decoding.

Comment on lines +202 to +231
    cpu_only = _coerce_bool(cpu_only)
    device_id = int(device_id) if device_id is not None else None
    batch = int(batch)
    seed = int(seed)
    seq_len = int(seq_len) if seq_len is not None else None
    max_seq = int(max_seq) if max_seq is not None else None
    ctx_len = int(ctx_len) if ctx_len is not None else None
    atol = float(atol)
    rtol = float(rtol)
    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16

    repo_root = Path(__file__).resolve().parents[4]
    # Load the kernel directly from examples (bypass llm.model shim).
    _KERNEL_PATH = repo_root / "examples" / "models" / "qwen3" / "14b" / "qwen3_14b_decode.py"
    _SPEC = importlib.util.spec_from_file_location("_qwen3_14b_decode_kernel", _KERNEL_PATH)
    if _SPEC is None or _SPEC.loader is None:
        raise ImportError(f"Cannot load Qwen3-14B decode kernel from {_KERNEL_PATH}")
    dec = importlib.util.module_from_spec(_SPEC)
    _SPEC.loader.exec_module(dec)

    if batch <= 0:
        raise ValueError(f"batch must be positive, got {batch}")
    # Decode is a single-token step that assumes position = (seq_len - 1).
    # `seq_len` is the context length for this step; `max_seq` controls the
    # cache capacity / RoPE table size (defaults to the kernel's MAX_SEQ).
    # Back-compat: `ctx_len` is an alias for `seq_len` if `seq_len` is unset.
    context_len = seq_len if seq_len is not None else (ctx_len or dec.MAX_SEQ)
    max_seq_eff = max_seq or dec.MAX_SEQ
    if context_len > max_seq_eff:
        raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject seq_len <= 0 and unknown hf_dtype during case construction.

The reference path uses pos = seq_len - 1, so a zero/negative decode length is invalid, and the current ternary silently treats any hf_dtype typo as bf16. It is better to fail here than during input materialization or execution.

Proposed fix
     batch = int(batch)
     seed = int(seed)
     seq_len = int(seq_len) if seq_len is not None else None
     max_seq = int(max_seq) if max_seq is not None else None
     ctx_len = int(ctx_len) if ctx_len is not None else None
     atol = float(atol)
     rtol = float(rtol)
-    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
+    if hf_dtype not in {"fp32", "bf16"}:
+        raise ValueError(f"hf_dtype must be 'fp32' or 'bf16', got {hf_dtype!r}")
+    hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
@@
     # Decode is a single-token step that assumes position = (seq_len - 1).
     # `seq_len` is the context length for this step; `max_seq` controls the
     # cache capacity / RoPE table size (defaults to the kernel's MAX_SEQ).
     # Back-compat: `ctx_len` is an alias for `seq_len` if `seq_len` is unset.
     context_len = seq_len if seq_len is not None else (ctx_len or dec.MAX_SEQ)
     max_seq_eff = max_seq or dec.MAX_SEQ
+    if context_len <= 0:
+        raise ValueError(f"seq_len must be positive, got {context_len}")
     if context_len > max_seq_eff:
         raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_decode.py` around lines 202 - 231,
Validate inputs early: ensure seq_len (if provided) and ctx_len (when used as
fallback) are positive and raise ValueError if <= 0, and validate hf_dtype
against allowed values ("fp32" or "bf16") before mapping to hf_dtype_t
(currently set in hf_dtype_t = torch.float32 if hf_dtype == "fp32" else
torch.bfloat16). Specifically, add a check near where seq_len/ctx_len are
processed to reject non-positive lengths (reference variables seq_len, ctx_len,
context_len) and add an explicit hf_dtype validation that raises a ValueError
for unknown strings before assigning hf_dtype_t so typos don't silently become
bfloat16.

Comment on lines +66 to +72
def _parse_prompt_ids(prompt_ids: str | list[int] | tuple[int, ...]) -> list[int]:
    if isinstance(prompt_ids, (list, tuple)):
        return [int(v) for v in prompt_ids]
    out = [int(part.strip()) for part in str(prompt_ids).split(",") if part.strip()]
    if not out:
        raise ValueError("prompt_ids must contain at least one token id")
    return out

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate non-empty prompts for list/tuple inputs too.

The string path rejects an empty prompt, but the list/tuple path returns [] unchanged. That later reaches prompt_len - 1 in the e2e runner and blows up far away from the actual input error.

Proposed fix
 def _parse_prompt_ids(prompt_ids: str | list[int] | tuple[int, ...]) -> list[int]:
     if isinstance(prompt_ids, (list, tuple)):
-        return [int(v) for v in prompt_ids]
+        out = [int(v) for v in prompt_ids]
+        if not out:
+            raise ValueError("prompt_ids must contain at least one token id")
+        return out
     out = [int(part.strip()) for part in str(prompt_ids).split(",") if part.strip()]
     if not out:
         raise ValueError("prompt_ids must contain at least one token id")
     return out
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py` around lines 66 - 72, The
_parse_prompt_ids function currently converts list/tuple inputs to integers but
doesn't reject an empty list/tuple; ensure you validate and raise the same
ValueError("prompt_ids must contain at least one token id") for list/tuple
inputs too by checking the computed out list after converting the values (from
the isinstance(prompt_ids, (list, tuple)) branch) and raising if empty so
callers like the e2e runner won't receive an empty list; keep the same error
message and behavior for the string path.

Comment on lines +490 to +516
    dec, prefill = _load_qwen3_modules()

    input_ids = inputs["input_ids"].to(torch.long)
    batch = int(input_ids.shape[0])
    if batch != 1:
        raise ValueError(f"qwen3_14b.e2e expects batch=1, got {batch}")

    prompt_len = int(input_ids.shape[1])
    if prompt_len + max_new_tokens - 1 > max_seq:
        raise ValueError(
            f"prompt_len + max_new_tokens - 1 = {prompt_len + max_new_tokens - 1} exceeds max_seq={max_seq}"
        )

    hidden = int(prefill.HIDDEN)
    head_dim = int(prefill.HEAD_DIM)
    num_kv_heads = int(prefill.NUM_KV_HEADS)
    cache_rows = batch * num_kv_heads * max_seq

    layer_weights = [
        _extract_layer_weights(dict(weights), layer_idx=idx, hidden=hidden, head_dim=head_dim)
        for idx in range(num_layers)
    ]
    embed_weight = weights["model.embed_tokens.weight"]
    norm_weight = weights["model.norm.weight"].view(1, hidden)
    lm_weight = weights["lm_head.weight"]
    rope_cos, rope_sin = _build_rope(max_seq, head_dim)


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

for f in \
  examples/models/qwen3/14b/qwen3_14b_decode.py \
  examples/models/qwen3/14b/qwen3_14b_prefill.py
do
  printf '\n== %s ==\n' "$f"
  rg -n '^(from|import)\s+' "$f"
done

Repository: hw-native-sys/pypto-lib

Length of output: 236


🏁 Script executed:

# First, find where _load_qwen3_modules is defined and how it's used
rg -n "_load_qwen3_modules" llm/testing/hf_compare/cases/qwen3_14b_e2e.py

# Then check for cpu_only usage in the file
rg -n "cpu_only" llm/testing/hf_compare/cases/qwen3_14b_e2e.py

# Get the file size to determine how to read it
wc -l llm/testing/hf_compare/cases/qwen3_14b_e2e.py

Repository: hw-native-sys/pypto-lib

Length of output: 573


🏁 Script executed:

# Get context around the _load_qwen3_modules call and cpu_only handling
sed -n '480,535p' llm/testing/hf_compare/cases/qwen3_14b_e2e.py

Repository: hw-native-sys/pypto-lib

Length of output: 1898


🏁 Script executed:

# Get the definition of _load_qwen3_modules
sed -n '41,60p' llm/testing/hf_compare/cases/qwen3_14b_e2e.py

Repository: hw-native-sys/pypto-lib

Length of output: 833


🏁 Script executed:

# Check what constants are being extracted from the modules
sed -n '41,70p' llm/testing/hf_compare/cases/qwen3_14b_e2e.py | head -30

# Look at the entire flow after _load_qwen3_modules() to see which paths actually use prefill
sed -n '490,580p' llm/testing/hf_compare/cases/qwen3_14b_e2e.py

Repository: hw-native-sys/pypto-lib

Length of output: 4487


🏁 Script executed:

# Check if pypto is actually a backend-only package by looking at its usage in the helper files
cat examples/models/qwen3/14b/qwen3_14b_prefill.py | head -50

# Also check if the constants HIDDEN, HEAD_DIM, NUM_KV_HEADS are defined independently or depend on pypto
rg -A 5 -B 5 "HIDDEN|HEAD_DIM|NUM_KV_HEADS" examples/models/qwen3/14b/qwen3_14b_prefill.py | head -40

Repository: hw-native-sys/pypto-lib

Length of output: 3626


Move _load_qwen3_modules() behind the non-CPU branch or hoist constants.

_load_qwen3_modules() at line 490 executes unconditionally before the cpu_only check at line 526, importing pypto.language even when the CPU-only path runs. The prefill module is only used to extract static constants (HIDDEN, HEAD_DIM, NUM_KV_HEADS), so either load the modules conditionally within the NPU branch or define these constants inline in this case module to keep the CPU-only path free of backend dependencies.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py` around lines 490 - 516, The
call to _load_qwen3_modules() runs unconditionally and pulls in backend-only
code (e.g., pypto.language) even on the CPU-only path; move the
_load_qwen3_modules() invocation into the non-CPU branch (the branch guarded by
the cpu_only flag) or else hoist the static constants HIDDEN, HEAD_DIM, and
NUM_KV_HEADS into module-level constants so the CPU-only path does not import
prefill; update uses of prefill.HIDDEN, prefill.HEAD_DIM, and
prefill.NUM_KV_HEADS in the code that builds layer_weights and cache_rows to
reference the new constants or to only access prefill after the conditional.

Comment on lines +46 to +57
def build_rope_tables(
    max_seq: int,
    head_dim: int,
    base: float = 1_000_000.0,
    dtype: torch.dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Build cos/sin tables compatible with HF rotate_half form."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    pos = torch.arange(max_seq, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)
    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos().to(dtype), emb.sin().to(dtype)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject odd head_dim in build_rope_tables.

For odd values, torch.arange(0, head_dim, 2) has ceil(head_dim / 2) entries, so torch.cat([freqs, freqs], dim=-1) returns one extra column. That silently produces RoPE tables with the wrong width instead of failing fast.

Proposed fix
 def build_rope_tables(
     max_seq: int,
     head_dim: int,
     base: float = 1_000_000.0,
     dtype: torch.dtype = torch.float32,
 ) -> tuple[torch.Tensor, torch.Tensor]:
     """Build cos/sin tables compatible with HF rotate_half form."""
+    if head_dim % 2 != 0:
+        raise ValueError(f"head_dim must be even for RoPE tables, got {head_dim}")
     inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
     pos = torch.arange(max_seq, dtype=torch.float32)
     freqs = torch.outer(pos, inv_freq)
     emb = torch.cat([freqs, freqs], dim=-1)
     return emb.cos().to(dtype), emb.sin().to(dtype)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/input_sampler.py` around lines 46 - 57, The
build_rope_tables function must reject odd head_dim to avoid silently producing
wrong-width RoPE tables; add an early validation in build_rope_tables that
checks head_dim is even (e.g., head_dim % 2 == 0) and raise a ValueError with a
clear message if not, then proceed to compute inv_freq, freqs, emb, and return
emb.cos()/emb.sin() as before; reference the existing symbols head_dim,
inv_freq, freqs, and emb so the check is inserted before inv_freq is computed.
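To see why the guard matters, the width mismatch can be reproduced directly. This is a standalone demonstration of the failure mode, not code from the repository:

```python
import torch

# For an odd head_dim, arange(0, head_dim, 2) has ceil(head_dim / 2) entries,
# so the concatenated cos/sin table comes out one column wider than head_dim.
head_dim = 5
inv_freq = 1.0 / (1_000_000.0 ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
pos = torch.arange(4, dtype=torch.float32)
freqs = torch.outer(pos, inv_freq)       # shape (4, 3)
emb = torch.cat([freqs, freqs], dim=-1)  # shape (4, 6), not (4, 5)
```

Downstream code that multiplies these tables against `head_dim`-wide query/key slices would then fail (or broadcast incorrectly) far from the real cause, which is why the early `ValueError` is preferable.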

Comment on lines +99 to +103
def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    if self._module is None:
        raise RuntimeError(f"[{self.name}] prepare() must be called first.")
    with torch.no_grad():
        return self.forward_fn(self._module, inputs)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -type f -name "reference.py" | head -20

Repository: hw-native-sys/pypto-lib

Length of output: 105


🏁 Script executed:

wc -l ./llm/testing/hf_compare/reference.py

Repository: hw-native-sys/pypto-lib

Length of output: 109


🏁 Script executed:

cat -n ./llm/testing/hf_compare/reference.py

Repository: hw-native-sys/pypto-lib

Length of output: 5445


🏁 Script executed:

find ./llm/testing/hf_compare -type f -name "*.py" | head -20

Repository: hw-native-sys/pypto-lib

Length of output: 656


🏁 Script executed:

rg "HFModuleReference" ./llm/testing/hf_compare --type py -B 2 -A 5

Repository: hw-native-sys/pypto-lib

Length of output: 1404


🏁 Script executed:

rg "HFModuleReference\(" ./llm/testing/hf_compare --type py -B 3 -A 10

Repository: hw-native-sys/pypto-lib

Length of output: 1275


🏁 Script executed:

rg "HFModuleReference\(" ./llm/testing/hf_compare/cases --type py -B 2 -A 15

Repository: hw-native-sys/pypto-lib

Length of output: 49


🏁 Script executed:

rg "HFModuleReference" --type py -B 0 -A 0 | grep -v "reference.py"

Repository: hw-native-sys/pypto-lib

Length of output: 49


🏁 Script executed:

rg "dtype=" ./llm/testing/hf_compare --type py | head -20

Repository: hw-native-sys/pypto-lib

Length of output: 2464


🏁 Script executed:

rg "HFModuleReference\s*\(" --type py -B 5 -A 10

Repository: hw-native-sys/pypto-lib

Length of output: 1398


🏁 Script executed:

rg "HFModuleReference\(" --type py -A 8 | grep -v "^class\|^@dataclass\|^--$"

Repository: hw-native-sys/pypto-lib

Length of output: 826


🏁 Script executed:

rg "dtype\s*=" ./llm/testing/hf_compare/cases --type py -B 3 -A 3 | head -50

Repository: hw-native-sys/pypto-lib

Length of output: 4527


🏁 Script executed:

cat -n ./llm/testing/hf_compare/cases/qwen3_14b_prefill.py | grep -A 30 "def run"

Repository: hw-native-sys/pypto-lib

Length of output: 49


🏁 Script executed:

cat ./llm/testing/hf_compare/cases/qwen3_14b_prefill.py | head -100

Repository: hw-native-sys/pypto-lib

Length of output: 3842


🏁 Script executed:

rg "dtype" ./llm/testing/hf_compare/reference.py -B 2 -A 2

Repository: hw-native-sys/pypto-lib

Length of output: 1191


🏁 Script executed:

rg "def forward_fn\|forward_fn\s*=" ./llm/testing/hf_compare --type py -B 2 -A 5 | head -80

Repository: hw-native-sys/pypto-lib

Length of output: 49


Cast floating inputs to self.dtype before calling the module.

The class documentation advertises running forward "in a chosen dtype", and prepare() explicitly casts both the module and its state dict values to self.dtype (lines 83, 90). However, forward() passes raw input tensors directly to forward_fn without casting them. When inputs are in a different dtype (e.g., bf16 while the module is fp32), this causes dtype mismatches inside the wrapped module or silently exercises an unintended compute path. Real-world code in this repository manually casts inputs to work around this (see qwen3_14b_prefill.py lines 74–76).

Proposed fix
     def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
         if self._module is None:
             raise RuntimeError(f"[{self.name}] prepare() must be called first.")
+        cast_inputs = {
+            key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
+            for key, value in inputs.items()
+        }
         with torch.no_grad():
-            return self.forward_fn(self._module, inputs)
+            return self.forward_fn(self._module, cast_inputs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
-        if self._module is None:
-            raise RuntimeError(f"[{self.name}] prepare() must be called first.")
-        with torch.no_grad():
-            return self.forward_fn(self._module, inputs)
+    def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
+        if self._module is None:
+            raise RuntimeError(f"[{self.name}] prepare() must be called first.")
+        cast_inputs = {
+            key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
+            for key, value in inputs.items()
+        }
+        with torch.no_grad():
+            return self.forward_fn(self._module, cast_inputs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/reference.py` around lines 99 - 103, forward()
currently passes raw input tensors to forward_fn, which can mismatch the module
dtype; modify forward() to first cast any floating-point tensors in the inputs
mapping to self.dtype (use torch.is_floating_point to detect) into a new inputs
dict (preserving non-floating tensors unchanged and preserving device), then
call self.forward_fn(self._module, cast_inputs) inside the existing
torch.no_grad() block; reference the forward method, self.dtype, forward_fn, and
prepare() to locate the related code.

Comment on lines +113 to +140
def run_chain(chain: ChainedComparisonCase) -> list[CompareReport]:
    reports: list[CompareReport] = []
    prev_ref_out: dict[str, torch.Tensor] = {}
    for step in chain.steps:
        if step.forward_map and prev_ref_out:
            # Inject values from the previous step's reference outputs into
            # this step's input spec by overriding materialize() via on_inputs.
            mapping = step.forward_map
            snapshot = {dst: prev_ref_out[src].clone() for src, dst in mapping.items() if src in prev_ref_out}
            original_hook = step.case.on_inputs

            def _hook(inputs: dict[str, torch.Tensor], _snap=snapshot, _orig=original_hook) -> None:
                for name, value in _snap.items():
                    inputs[name] = value
                if _orig is not None:
                    _orig(inputs)

            step.case.on_inputs = _hook
        report = run_case(step.case)
        reports.append(report)
        if not report.passed and chain.stop_on_fail:
            break
        # Re-run the reference just to capture its outputs for the next step.
        # (ComparisonCase.run did the work but did not return ref outputs, so
        # we re-materialize on the fly. For now we skip and rely on tests
        # using this chain to either keep state externally or extend the API.)
        prev_ref_out = {}
    return reports

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

forward_map is dead and run_chain permanently mutates the user's case.

Two related defects in this loop:

  1. prev_ref_out is unconditionally reset to {} at the bottom of every iteration (line 139), so the if step.forward_map and prev_ref_out guard on line 117 is never true except possibly for the very first step (where prev_ref_out is also empty). The feature is non-functional today — the comment at lines 135-138 acknowledges it but ships the broken path anyway. Either capture ref_out from run_case (return it alongside the report, or expose it via the case) and propagate it, or remove forward_map from the public API until it is actually wired.
  2. Line 130 reassigns step.case.on_inputs = _hook. Since ComparisonCase instances are user-supplied, this leaks the wrapper hook onto the original case object — re-running the same chain (or running the same case standalone afterwards) compounds the wrappers. Wrap the hook locally and pass it to run_case without mutating step.case, e.g. by restoring the original after the call or by constructing a shallow copy of the case for the run.
♻️ Sketch: restore the original hook and stop relying on a non-existent prev_ref_out
 def run_chain(chain: ChainedComparisonCase) -> list[CompareReport]:
     reports: list[CompareReport] = []
-    prev_ref_out: dict[str, torch.Tensor] = {}
     for step in chain.steps:
-        if step.forward_map and prev_ref_out:
-            # Inject values from the previous step's reference outputs into
-            # this step's input spec by overriding materialize() via on_inputs.
-            mapping = step.forward_map
-            snapshot = {dst: prev_ref_out[src].clone() for src, dst in mapping.items() if src in prev_ref_out}
-            original_hook = step.case.on_inputs
-
-            def _hook(inputs: dict[str, torch.Tensor], _snap=snapshot, _orig=original_hook) -> None:
-                for name, value in _snap.items():
-                    inputs[name] = value
-                if _orig is not None:
-                    _orig(inputs)
-
-            step.case.on_inputs = _hook
-        report = run_case(step.case)
+        original_hook = step.case.on_inputs
+        try:
+            # ... install combined hook here once forward_map is actually wired ...
+            report = run_case(step.case)
+        finally:
+            step.case.on_inputs = original_hook
         reports.append(report)
         if not report.passed and chain.stop_on_fail:
             break
-        prev_ref_out = {}
     return reports
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/runner.py` around lines 113 - 140, The run_chain loop
never propagates reference outputs and permanently mutates user cases: stop
clearing prev_ref_out each iteration and actually capture the reference outputs
from run_case (e.g., change run_case to return (CompareReport,
dict[str,torch.Tensor]) or expose ref outputs on ComparisonCase) and assign that
to prev_ref_out so forward_map can work; also avoid mutating step.case.on_inputs
in place—either restore the original hook after calling run_case or run a
shallow copy of step.case with the wrapped on_inputs when invoking run_case (do
not leave the wrapper assigned to the original case).
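The repair the prompt asks for can be sketched without the framework itself. In the sketch below, plain dicts stand in for tensors, and `run_case` returning `(passed, ref_outputs)` is an assumed signature for illustration, not the module's actual one:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Case:
    on_inputs: Optional[Callable[[dict], None]] = None


@dataclass
class Step:
    case: Case
    forward_map: dict = field(default_factory=dict)  # ref-output name -> input name


def run_case(case, inputs):
    # Stand-in "reference model": apply the hook, then double every input.
    if case.on_inputs is not None:
        case.on_inputs(inputs)
    return True, {k: v * 2 for k, v in inputs.items()}


def run_chain(steps, base_inputs):
    reports = []
    prev_ref_out: dict = {}
    for step in steps:
        inputs = dict(base_inputs)
        # Propagate the previous step's reference outputs via forward_map
        # instead of clearing prev_ref_out every iteration.
        for src, dst in step.forward_map.items():
            if src in prev_ref_out:
                inputs[dst] = prev_ref_out[src]
        original_hook = step.case.on_inputs
        try:
            passed, ref_out = run_case(step.case, inputs)
        finally:
            # Never leave a wrapper hook installed on the caller's case.
            step.case.on_inputs = original_hook
        prev_ref_out = ref_out
        reports.append(passed)
    return reports
```

Both behaviors from the review are visible here: reference outputs flow forward between steps, and `on_inputs` is restored even if `run_case` raises.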

Comment on lines +173 to +195
def run(
self,
inputs: Mapping[str, torch.Tensor],
weights: Mapping[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
if self._module is None:
module = self.module_factory().to(self.dtype).eval()
prefix = self.state_dict_prefix
sd: dict[str, torch.Tensor] = {}
for k, v in weights.items():
if prefix and not k.startswith(prefix):
continue
stripped = k[len(prefix):] if prefix else k
sd[stripped] = v.to(self.dtype)
missing, unexpected = module.load_state_dict(sd, strict=False)
if self.strict_load:
if missing:
raise RuntimeError(f"[{self.name}] missing keys: {missing}")
if unexpected:
raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
self._module = module
with torch.no_grad():
return self.forward_fn(self._module, inputs)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

find . -name "target.py" -type f | grep hf_compare

Repository: hw-native-sys/pypto-lib

Length of output: 102


🏁 Script executed:

fd -t f "target.py" | head -20

Repository: hw-native-sys/pypto-lib

Length of output: 100


🏁 Script executed:

cat -n llm/testing/hf_compare/target.py | head -200

Repository: hw-native-sys/pypto-lib

Length of output: 9219


🏁 Script executed:

rg -n "TorchTarget" llm/testing/hf_compare/ -A 2 -B 2 | head -100

Repository: hw-native-sys/pypto-lib

Length of output: 914


🏁 Script executed:

rg "TorchTarget" llm/testing/hf_compare/ --type py -l

Repository: hw-native-sys/pypto-lib

Length of output: 100


🏁 Script executed:

rg "forward_fn|inputs\[" llm/testing/hf_compare/ --type py -B 2 -A 2 | head -80

Repository: hw-native-sys/pypto-lib

Length of output: 6393


🏁 Script executed:

cat -n llm/testing/hf_compare/reference.py | head -220

Repository: hw-native-sys/pypto-lib

Length of output: 5445


🏁 Script executed:

rg "\.to\(.*dtype" llm/testing/hf_compare/ --type py -B 2 -A 2 | head -60

Repository: hw-native-sys/pypto-lib

Length of output: 4397


🏁 Script executed:

cat -n llm/testing/hf_compare/cases/qwen3_14b_prefill.py | sed -n '45,100p'

Repository: hw-native-sys/pypto-lib

Length of output: 2291


🏁 Script executed:

rg "TorchTarget\(" llm/testing/hf_compare/ --type py -B 3 -A 3 | head -100

Repository: hw-native-sys/pypto-lib

Length of output: 636


🏁 Script executed:

rg "def _hf.*forward|def.*forward_fn" llm/testing/hf_compare/cases/ --type py -A 10 | head -120

Repository: hw-native-sys/pypto-lib

Length of output: 3758


🏁 Script executed:

rg "TorchTarget\(" llm/testing/hf_compare/cases/ --type py -B 2 -A 8

Repository: hw-native-sys/pypto-lib

Length of output: 49


🏁 Script executed:

rg "TorchTarget" --type py -B 2 -A 8

Repository: hw-native-sys/pypto-lib

Length of output: 1721


🏁 Script executed:

cat -n llm/testing/hf_compare/reference.py | sed -n '99,105p'

Repository: hw-native-sys/pypto-lib

Length of output: 389


Normalize floating inputs to self.dtype before invoking the target module.

The module and loaded weights are converted to self.dtype, but the input mapping is passed through unchanged. That makes TorchTarget(dtype=torch.float32) unsafe for bf16-materialized test inputs and can fail in nn.Linear/matmul operations when dtypes mismatch.

Proposed fix
     def run(
         self,
         inputs: Mapping[str, torch.Tensor],
         weights: Mapping[str, torch.Tensor],
     ) -> dict[str, torch.Tensor]:
         if self._module is None:
             module = self.module_factory().to(self.dtype).eval()
             prefix = self.state_dict_prefix
             sd: dict[str, torch.Tensor] = {}
             for k, v in weights.items():
                 if prefix and not k.startswith(prefix):
                     continue
                 stripped = k[len(prefix):] if prefix else k
                 sd[stripped] = v.to(self.dtype)
             missing, unexpected = module.load_state_dict(sd, strict=False)
             if self.strict_load:
                 if missing:
                     raise RuntimeError(f"[{self.name}] missing keys: {missing}")
                 if unexpected:
                     raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
             self._module = module
+        cast_inputs = {
+            key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
+            for key, value in inputs.items()
+        }
         with torch.no_grad():
-            return self.forward_fn(self._module, inputs)
+            return self.forward_fn(self._module, cast_inputs)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def run(
self,
inputs: Mapping[str, torch.Tensor],
weights: Mapping[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
if self._module is None:
module = self.module_factory().to(self.dtype).eval()
prefix = self.state_dict_prefix
sd: dict[str, torch.Tensor] = {}
for k, v in weights.items():
if prefix and not k.startswith(prefix):
continue
stripped = k[len(prefix):] if prefix else k
sd[stripped] = v.to(self.dtype)
missing, unexpected = module.load_state_dict(sd, strict=False)
if self.strict_load:
if missing:
raise RuntimeError(f"[{self.name}] missing keys: {missing}")
if unexpected:
raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
self._module = module
with torch.no_grad():
return self.forward_fn(self._module, inputs)
def run(
self,
inputs: Mapping[str, torch.Tensor],
weights: Mapping[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
if self._module is None:
module = self.module_factory().to(self.dtype).eval()
prefix = self.state_dict_prefix
sd: dict[str, torch.Tensor] = {}
for k, v in weights.items():
if prefix and not k.startswith(prefix):
continue
stripped = k[len(prefix):] if prefix else k
sd[stripped] = v.to(self.dtype)
missing, unexpected = module.load_state_dict(sd, strict=False)
if self.strict_load:
if missing:
raise RuntimeError(f"[{self.name}] missing keys: {missing}")
if unexpected:
raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
self._module = module
cast_inputs = {
key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
for key, value in inputs.items()
}
with torch.no_grad():
return self.forward_fn(self._module, cast_inputs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/target.py` around lines 173-195, in run(), before
calling forward_fn(self._module, inputs), ensure all floating-point tensors in
the inputs Mapping are converted to self.dtype to avoid dtype mismatches with
the module (e.g., convert any tensor where tensor.is_floating_point() is True to
tensor.to(self.dtype)); do this conversion only when self._module is prepared
(after module load) and pass the normalized inputs into forward_fn; reference
symbols: run, forward_fn, module_factory, state_dict_prefix, strict_load, and
self._module.

Comment on lines +49 to +56
def Transpose(*dims: int) -> Op:
"""Transpose the last two dims by default; pass dims for arbitrary perm."""
def _op(t: torch.Tensor) -> torch.Tensor:
if not dims:
if t.ndim < 2:
return t
return t.t()
return t.permute(*dims)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use transpose(-2, -1) for the default branch here.

The docstring promises "transpose the last two dims by default", but t.t() only implements a 2-D matrix transpose and raises a RuntimeError for tensors of rank > 2, so any higher-rank input breaks the advertised behavior.

Proposed fix
 def Transpose(*dims: int) -> Op:
     """Transpose the last two dims by default; pass dims for arbitrary perm."""
     def _op(t: torch.Tensor) -> torch.Tensor:
         if not dims:
             if t.ndim < 2:
                 return t
-            return t.t()
+            return t.transpose(-2, -1)
         return t.permute(*dims)
     _op.__name__ = f"Transpose{dims or ''}"
     return _op
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def Transpose(*dims: int) -> Op:
"""Transpose the last two dims by default; pass dims for arbitrary perm."""
def _op(t: torch.Tensor) -> torch.Tensor:
if not dims:
if t.ndim < 2:
return t
return t.t()
return t.permute(*dims)
def Transpose(*dims: int) -> Op:
"""Transpose the last two dims by default; pass dims for arbitrary perm."""
def _op(t: torch.Tensor) -> torch.Tensor:
if not dims:
if t.ndim < 2:
return t
return t.transpose(-2, -1)
return t.permute(*dims)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/weight_adapter.py` around lines 49-56, the Transpose
Op implementation violates its docstring: in Transpose -> _op it uses t.t()
which only works for 2D matrices; change the default branch to call
t.transpose(-2, -1) (or tensor.transpose) to swap the last two dimensions for
tensors with ndim>=2, leaving scalars/1D untouched, and keep the existing
t.permute(*dims) behavior when dims are provided; update the _op in Transpose
accordingly.
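A shape-level sketch makes the difference concrete: swapping only the last two dims preserves any leading batch dims, whereas a full matrix-style transpose is not even defined once ndim > 2. The helpers below operate on plain shape tuples, so nothing here depends on the actual weight_adapter API:

```python
def swap_last_two(shape: tuple) -> tuple:
    """What transpose(-2, -1) does to a shape: leading dims untouched."""
    if len(shape) < 2:
        return shape
    return shape[:-2] + (shape[-1], shape[-2])


def permute(shape: tuple, dims: tuple) -> tuple:
    """What permute(*dims) does to a shape."""
    return tuple(shape[d] for d in dims)
```

For a stacked weight of shape (num_layers, out, in), `swap_last_two` yields (num_layers, in, out), exactly the per-layer transpose a weight adapter wants from the default branch.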

@wangqin1723-max force-pushed the feat/hf-compare-framework branch from 69db614 to 7fdd092 on May 7, 2026 at 11:46
@coderabbitai Bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
llm/core/pypto_executor.py (1)

329-368: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

The fused decode weights add a second full resident copy of the model layers.

_stack_decode_weights() concatenates every per-layer tensor into a new set of buffers, but compiled.layers keeps the original per-layer copies alive for prefill. For a 14B model that is another model-sized allocation at init time and is very likely to push CPU RAM over the edge. Please derive prefill views from the stacked buffers, or free the per-layer copies after creating equivalent slice views.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/core/pypto_executor.py` around lines 329-368, the current flow in
_stack_decode_weights plus returning layers in _CompiledKernels keeps two full
copies of per-layer tensors; after you build the stacked decode buffers in
_stack_decode_weights, replace per-layer tensors used for prefill with slice
views into those stacked buffers (e.g., create per-layer tensors by
slicing/narrowing the stacked tensors) and store those views in the
compiled.layers structure (or free the original per-layer buffers immediately
after creating the views); update the return path that currently passes layers
into _CompiledKernels so it carries these view-backed _KernelLayerWeights (or
call _release_layer_weights after creating views) to avoid retaining a second
full resident copy for models like 14B.
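The no-second-copy idea can be demonstrated in plain Python with `memoryview`, which slices without copying, much as `torch.Tensor.narrow` returns a view into a stacked buffer. The layout here (layers stored contiguously, `per_layer` rows each) is illustrative, not the executor's actual layout:

```python
from array import array

# Pretend: 3 "layers" of 4 weights each, stacked into one contiguous buffer.
stacked = array("f", range(12))
mv = memoryview(stacked)


def layer_view(layer_idx: int, per_layer: int = 4) -> memoryview:
    """Slice of the stacked buffer for one layer; shares storage, no copy."""
    start = layer_idx * per_layer
    return mv[start:start + per_layer]


view = layer_view(1)
stacked[4] = 100.0  # mutate through the stacked buffer...
# ...and the per-layer view observes the change, proving no copy exists.
```

Deriving per-layer prefill weights this way keeps exactly one model-sized allocation resident, which is the point of the review comment.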
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@llm/core/pypto_executor.py`:
- Around line 19-20: The module currently leaves _TIMING_ENABLED = True which
causes noisy stdout prints on every prefill; change the default to False and
make timing opt-in: expose a configuration flag (e.g., module-level env var read
or a constructor/initializer parameter like timing_enabled on the main executor
class) so callers can enable timing explicitly, and update the code paths
referenced around lines 160-198 (the prefill/timing instrumentation) to consult
that flag. While doing so, replace direct stdout prints and flushes with a
conditional use of a logger or return the timing data to the caller when
enabled, and remove any unconditional sys.stdout.flush() calls so the hot path
is not forced to flush when timing is disabled. Ensure the symbol
_TIMING_ENABLED is used only as a default and can be overridden by the new
parameter (e.g., timing_enabled) in the executor's constructor or factory.

In `@llm/testing/hf_compare/comparator.py`:
- Around line 38-40: ref_f and tgt_f are only cast for dtype via
to(selector.cast_to) but may still live on different devices causing the
subtraction to crash; detach them, move both tensors to a single common device
(e.g., CPU) before reshaping and computing diff, e.g. perform
detach().to(device=common_device, dtype=selector.cast_to).reshape(-1) for both
ref and tgt (use a consistent device like torch.device('cpu') or .cpu()), then
compute diff = (ref_f - tgt_f).abs(); update the code around ref_f, tgt_f and
diff accordingly.

In `@models/qwen3/14b/qwen3_14b_decode_full.py`:
- Around line 145-176: The code allocates tensors (current_hidden, q_proj,
k_proj, v_proj, q_proj_norm, k_proj_norm, attn_out, next_hidden) with size
[batch,...] but writes into them using loops over batch_padded, causing
out-of-bounds when batch is not BATCH_TILE-aligned; fix by changing the
pl.create_tensor calls for these symbols to use batch_padded instead of batch
(e.g., pl.create_tensor([batch_padded, ...], ...)) so allocated buffers match
the assemble/write loops, or alternatively add an explicit validation early
(before the parallel loop) that enforces batch % BATCH_TILE == 0 and
error/adjust accordingly; update all occurrences in the shown block
(current_hidden creation and the
q_proj/k_proj/v_proj/q_proj_norm/k_proj_norm/attn_out/next_hidden creations) to
be consistent.

---

Outside diff comments:
In `@llm/core/pypto_executor.py`:
- Around line 329-368: The current flow in _stack_decode_weights plus returning
layers in _CompiledKernels keeps two full copies of per-layer tensors; after you
build the stacked decode buffers in _stack_decode_weights, replace per-layer
tensors used for prefill with slice views into those stacked buffers (e.g.,
create per-layer tensors by slicing/narrowing the stacked tensors) and store
those views in the compiled.layers structure (or free the original per-layer
buffers immediately after creating the views); update the return path that
currently passes layers into _CompiledKernels so it carries these view-backed
_KernelLayerWeights (or call _release_layer_weights after creating views) to
avoid retaining a second full resident copy for models like 14B.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8b1333ef-8341-4b3e-a2d0-6f69bf5ac766

📥 Commits

Reviewing files that changed from the base of the PR and between 7009dee and 7fdd092.

📒 Files selected for processing (23)
  • llm/core/engine.py
  • llm/core/kv_cache.py
  • llm/core/pypto_executor.py
  • llm/examples/qwen3_14b_npu_generate.py
  • llm/model/qwen3_14b_decode.py
  • llm/testing/__init__.py
  • llm/testing/hf_compare/__init__.py
  • llm/testing/hf_compare/__main__.py
  • llm/testing/hf_compare/base.py
  • llm/testing/hf_compare/cases/__init__.py
  • llm/testing/hf_compare/cases/qwen3_14b_decode.py
  • llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
  • llm/testing/hf_compare/cases/qwen3_14b_e2e.py
  • llm/testing/hf_compare/cases/qwen3_14b_prefill.py
  • llm/testing/hf_compare/comparator.py
  • llm/testing/hf_compare/input_sampler.py
  • llm/testing/hf_compare/reference.py
  • llm/testing/hf_compare/runner.py
  • llm/testing/hf_compare/target.py
  • llm/testing/hf_compare/weight_adapter.py
  • models/qwen3/14b/qwen3_14b_decode_full.py
  • tests/llm/testing/test_hf_compare_batch_config.py
  • tests/llm/testing/test_hf_compare_e2e.py
✅ Files skipped from review due to trivial changes (7)
  • llm/testing/__init__.py
  • llm/model/qwen3_14b_decode.py
  • llm/testing/hf_compare/__main__.py
  • tests/llm/testing/test_hf_compare_batch_config.py
  • llm/testing/hf_compare/cases/__init__.py
  • llm/testing/hf_compare/input_sampler.py
  • llm/testing/hf_compare/target.py
🚧 Files skipped from review as they are similar to previous changes (7)
  • llm/testing/hf_compare/runner.py
  • llm/testing/hf_compare/__init__.py
  • llm/core/kv_cache.py
  • llm/testing/hf_compare/reference.py
  • llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
  • llm/testing/hf_compare/cases/qwen3_14b_prefill.py
  • llm/testing/hf_compare/base.py

Comment on lines +19 to +20
_TIMING_ENABLED = True


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't leave prefill timing permanently enabled in the core executor.

This now prints to stdout on every prefill and forces a flush in the hot path. That is noisy for library consumers, contaminates benchmark output, and adds avoidable latency unless profiling was explicitly requested.

Also applies to: 160-198

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/core/pypto_executor.py` around lines 19-20, the module currently leaves
_TIMING_ENABLED = True which causes noisy stdout prints on every prefill; change
the default to False and make timing opt-in: expose a configuration flag (e.g.,
module-level env var read or a constructor/initializer parameter like
timing_enabled on the main executor class) so callers can enable timing
explicitly, and update the code paths referenced around lines 160-198 (the
prefill/timing instrumentation) to consult that flag. While doing so, replace
direct stdout prints and flushes with a conditional use of a logger or return
the timing data to the caller when enabled, and remove any unconditional
sys.stdout.flush() calls so the hot path is not forced to flush when timing is
disabled. Ensure the symbol _TIMING_ENABLED is used only as a default and can be
overridden by the new parameter (e.g., timing_enabled) in the executor's
constructor or factory.
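An opt-in pattern matching the prompt might look like the sketch below; `PYPTO_TIMING` is a hypothetical environment variable name, and the logger replaces direct stdout prints so nothing is formatted or flushed on the hot path when timing is off:

```python
import logging
import os
import time
from contextlib import contextmanager

logger = logging.getLogger("pypto_executor")

# Off by default; callers opt in via the (hypothetical) env var or per call.
_TIMING_DEFAULT = os.environ.get("PYPTO_TIMING", "0") == "1"


@contextmanager
def timed(label: str, enabled: bool = _TIMING_DEFAULT):
    if not enabled:
        yield  # disabled path: no clock reads, no logging, no flush
        return
    t0 = time.perf_counter()
    try:
        yield
    finally:
        logger.info("%s took %.3f ms", label, (time.perf_counter() - t0) * 1e3)
```

An executor would then accept something like `timing_enabled: bool = _TIMING_DEFAULT` in its constructor and wrap only the instrumented spans in `with timed(...)`.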

Comment on lines +38 to +40
ref_f = ref.detach().to(selector.cast_to).reshape(-1)
tgt_f = tgt.detach().to(selector.cast_to).reshape(-1)
diff = (ref_f - tgt_f).abs()

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize tensor devices before subtraction to avoid runtime mismatch

to(selector.cast_to) only changes dtype; it does not guarantee a common device. If ref and tgt are produced on different devices, subtraction crashes at runtime. Move both to a single device (typically CPU for test comparators) before computing diff.

Proposed fix
-        ref_f = ref.detach().to(selector.cast_to).reshape(-1)
-        tgt_f = tgt.detach().to(selector.cast_to).reshape(-1)
+        ref_f = ref.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
+        tgt_f = tgt.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
         diff = (ref_f - tgt_f).abs()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
ref_f = ref.detach().to(selector.cast_to).reshape(-1)
tgt_f = tgt.detach().to(selector.cast_to).reshape(-1)
diff = (ref_f - tgt_f).abs()
ref_f = ref.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
tgt_f = tgt.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
diff = (ref_f - tgt_f).abs()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@llm/testing/hf_compare/comparator.py` around lines 38 - 40, ref_f and tgt_f
are only cast for dtype via to(selector.cast_to) but may still live on different
devices causing the subtraction to crash; detach them, move both tensors to a
single common device (e.g., CPU) before reshaping and computing diff, e.g.
perform detach().to(device=common_device, dtype=selector.cast_to).reshape(-1)
for both ref and tgt (use a consistent device like torch.device('cpu') or
.cpu()), then compute diff = (ref_f - tgt_f).abs(); update the code around
ref_f, tgt_f and diff accordingly.

Comment on lines +145 to +176
user_batch = pl.tensor.dim(hidden_states, 0)
batch_padded = ((user_batch + BATCH_TILE - 1) // BATCH_TILE) * BATCH_TILE

current_hidden = pl.create_tensor([batch, hidden], dtype=pl.BF16)
for b0 in pl.parallel(0, batch_padded, BATCH_TILE):
cur_valid = pl.min(BATCH_TILE, user_batch - b0)
with pl.at(level=pl.Level.CORE_GROUP):
for kb in pl.range(hidden_blocks):
k0 = kb * K_CHUNK
hidden_chunk = pl.slice(
hidden_states,
[BATCH_TILE, K_CHUNK],
[b0, k0],
valid_shape=[cur_valid, K_CHUNK],
)
current_hidden = pl.assemble(current_hidden, hidden_chunk, [b0, k0])

for layer_idx in pl.range(num_layers):
layer_hidden_base = layer_idx * hidden
layer_inter_base = layer_idx * inter
layer_cache_base = layer_idx * layer_cache_rows
next_hidden = pl.create_tensor([batch, hidden], dtype=pl.BF16)

# Intermediate FP32 tensors between scope 1 and scope 2.
# Allocated at runtime batch_padded; pl.create_tensor zero-inits
# so trailing (batch_padded - user_batch) padded rows are 0,
# which is the invariant relied on by Q/K-norm and scope-3.
q_proj = pl.create_tensor([batch, hidden], dtype=pl.FP32)
k_proj = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)
v_proj = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)
q_proj_norm = pl.create_tensor([batch, hidden], dtype=pl.FP32)
k_proj_norm = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)

⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

🧩 Analysis chain

🏁 Script executed:

fd -type f -name "*qwen3_14b_decode_full.py"

Repository: hw-native-sys/pypto-lib

Length of output: 238


🏁 Script executed:

fd "qwen3_14b_decode_full.py"

Repository: hw-native-sys/pypto-lib

Length of output: 163


🏁 Script executed:

wc -l models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 114


🏁 Script executed:

sed -n '145,176p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 1931


🏁 Script executed:

sed -n '1,50p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 2069


🏁 Script executed:

sed -n '50,145p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 4013


🏁 Script executed:

sed -n '176,250p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 4686


🏁 Script executed:

sed -n '100,145p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 2648


🏁 Script executed:

sed -n '250,350p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 6764


🏁 Script executed:

sed -n '260,300p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 2901


🏁 Script executed:

rg "batch_padded" models/qwen3/14b/qwen3_14b_decode_full.py -A 2 -B 2

Repository: hw-native-sys/pypto-lib

Length of output: 3842


🏁 Script executed:

rg "pl\.create_tensor.*batch" models/qwen3/14b/qwen3_14b_decode_full.py -B 2 -A 1

Repository: hw-native-sys/pypto-lib

Length of output: 1321


🏁 Script executed:

sed -n '160,200p' models/qwen3/14b/qwen3_14b_decode_full.py | cat -n

Repository: hw-native-sys/pypto-lib

Length of output: 2943


🏁 Script executed:

sed -n '1000,1061p' models/qwen3/14b/qwen3_14b_decode_full.py

Repository: hw-native-sys/pypto-lib

Length of output: 2190


🏁 Script executed:

rg "attn_out\|all_q_padded" models/qwen3/14b/qwen3_14b_decode_full.py -B 2 -A 3

Repository: hw-native-sys/pypto-lib

Length of output: 49


🏁 Script executed:

rg "attn_out" models/qwen3/14b/qwen3_14b_decode_full.py -B 1 -A 1

Repository: hw-native-sys/pypto-lib

Length of output: 1661


Buffer allocation mismatch between dynamic padding and static allocation.

The tensors current_hidden, q_proj, k_proj, v_proj, q_proj_norm, k_proj_norm, attn_out, and next_hidden are allocated with the batch parameter, but the loops that write to them iterate over batch_padded (rounded up to BATCH_TILE). If the host passes batch=1 or any non-BATCH_TILE-aligned value, the loops can assemble data past the allocated buffer bounds. Either allocate these tensors using batch_padded instead of batch, or enforce that batch must be a multiple of BATCH_TILE.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@models/qwen3/14b/qwen3_14b_decode_full.py` around lines 145-176, the code
allocates tensors (current_hidden, q_proj, k_proj, v_proj, q_proj_norm,
k_proj_norm, attn_out, next_hidden) with size [batch,...] but writes into them
using loops over batch_padded, causing out-of-bounds when batch is not
BATCH_TILE-aligned; fix by changing the pl.create_tensor calls for these symbols
to use batch_padded instead of batch (e.g., pl.create_tensor([batch_padded,
...], ...)) so allocated buffers match the assemble/write loops, or
alternatively add an explicit validation early (before the parallel loop) that
enforces batch % BATCH_TILE == 0 and error/adjust accordingly; update all
occurrences in the shown block (current_hidden creation and the
q_proj/k_proj/v_proj/q_proj_norm/k_proj_norm/attn_out/next_hidden creations) to
be consistent.
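The padding arithmetic the kernel uses, and the invariant the fix relies on, can be checked in isolation (BATCH_TILE = 8 here is illustrative, not necessarily the kernel's actual tile size):

```python
BATCH_TILE = 8  # illustrative tile size


def pad_to_tile(batch: int, tile: int = BATCH_TILE) -> int:
    """Round batch up to the next multiple of tile (ceil-division trick)."""
    return ((batch + tile - 1) // tile) * tile
```

Allocating with `pad_to_tile(batch)` rows means the `b0` loop over `batch_padded` can never write past the buffer; allocating with `batch`, as the current code does, makes every non-tile-aligned value (e.g. `batch=1`) overrun on the last tile.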

@wangqin1723-max force-pushed the feat/hf-compare-framework branch 2 times, most recently from 0fa3e0d to 5e14bd0 on May 11, 2026 at 02:30
Introduces a HuggingFace comparison framework under llm/testing/hf_compare/
that validates pypto kernels against HF reference implementations. Supports
prefill, decode, and end-to-end test cases with batch configuration and
weight adapter utilities.
Stacks weights across num_layers and compares the fused pypto kernel
output against num_layers HF Qwen3DecoderLayer instances in sequence.
- Extract shared constants, weight adapter, and kernel loader into
  cases/_qwen3_14b_common.py so decode/decode_full/prefill cases reuse
  one source of truth (-391/+150 across the three cases).
- Add cases/qwen3_14b_e2e_fused.py: end-to-end comparison driven by the
  production LLMEngine + PyptoQwen14BExecutor (fused multi-layer decode
  kernel) against a stock HF Qwen3ForCausalLM run.
- Add hf_compare/paged_kv.py: shared paged-attention KV-cache helpers
  (block-table init, slot mapping, decode/prefill KV gather, paged->dense
  stitching) used by fused/per-layer cases.
- Fix qwen3_14b_e2e.py kernel load paths to point at models/qwen3/14b/
  instead of the legacy examples/models/ location.
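The slot-mapping arithmetic such paged-KV helpers typically implement can be sketched as follows; the function name and layout (slot = block_id * block_size + offset) illustrate the common paged-attention scheme, not the actual paged_kv.py API:

```python
def slot_of(position: int, block_table: list, block_size: int) -> int:
    """Flat KV-cache slot for the token at `position` in one sequence.

    block_table maps the sequence's logical block index to a physical
    block id; the slot is that block's base plus the in-block offset.
    """
    block_id = block_table[position // block_size]
    return block_id * block_size + position % block_size
```

With `block_table=[3, 0, 7]` and `block_size=4`, token position 5 lives in logical block 1, which is physical block 0, giving slot 1; position 10 lands in physical block 7, giving slot 30.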
@wangqin1723-max force-pushed the feat/hf-compare-framework branch 2 times, most recently from f2dded0 to 311cbfd on May 11, 2026 at 03:02