Add: HF compare testing framework for Qwen3-14B model validation #184
wangqin1723-max wants to merge 3 commits into
Conversation
Note: Reviews paused
It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough
This PR introduces a fused multi-layer Qwen3-14B decode kernel compiled into a single PyPTO program, updates the executor and engine to use stacked weights and padded block tables for paged attention, and adds a comprehensive HuggingFace comparison testing framework with reusable abstractions for reference models, target models, weight adapters, and comparison runners, including concrete implementations for single-layer prefill/decode, multi-layer decode, and end-to-end generation.
Changes
- Fused Multi-Layer Decode & Host Integration
- HF Comparison Testing Framework
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Force-pushed from b2bee35 to 9ce4312 (Compare)
Code Review
This pull request introduces a Hugging Face comparison framework for verifying PyPTO kernels and models, including core abstractions, weight adapters, and specific cases for Qwen3-14B. The review feedback identifies several performance bottlenecks where models and kernels are re-initialized or re-compiled during every run instead of the preparation phase. Additionally, there are logic errors in the case chaining implementation, missing tensors in kernel signatures, and inappropriate tolerance settings for token ID comparisons in end-to-end tests.
    "hidden_states", "seq_lens", "input_rms_weight",
    "wq", "wk", "wv", "q_norm_weight", "k_norm_weight",
    "rope_cos", "rope_sin", "k_cache", "v_cache",
    "wo", "post_rms_weight", "w_gate", "w_up", "w_down", "out",
]
self._decode_spec_order = [
    "hidden_states", "input_rms_weight",
    "wq", "wk", "wv", "q_norm_weight", "k_norm_weight",
    "seq_lens", "rope_cos", "rope_sin", "k_cache", "v_cache",
    "wo", "post_rms_weight", "w_gate", "w_up", "w_down", "out",
]
cfg = AutoConfig.from_pretrained(model_path)
cfg._attn_implementation = "eager"
layer = Qwen3DecoderLayer(cfg, layer_idx=0).to(hf_dtype).eval()
compiled_prefill = ir.compile(prefill.build_qwen3_14b_prefill_program(batch=batch, max_seq=max_seq), backend_type=backend)
self._prefill_dir = compiled_prefill.output_dir
compiled_decode = ir.compile(dec.build_qwen3_decode_program(batch=batch, max_seq=max_seq), backend_type=backend)
self._decode_dir = compiled_decode.output_dir
cfg._attn_implementation = "eager"
cfg.num_hidden_layers = num_layers

model = Qwen3ForCausalLM(cfg).to(hf_dtype).eval()
            cast_to=torch.float32,
        ),
    ],
    tolerance=Tolerance(atol=atol, rtol=rtol, pass_rate_threshold=1.0),
cfg = AutoConfig.from_pretrained(model_path)
cfg._attn_implementation = "eager"
layer = Qwen3DecoderLayer(cfg, layer_idx=0).to(hf_dtype).eval()
# (ComparisonCase.run did the work but did not return ref outputs, so
# we re-materialize on the fly. For now we skip and rely on tests
# using this chain to either keep state externally or extend the API.)
prev_ref_out = {}
The prev_ref_out dictionary is explicitly cleared at the end of each step, and it is never populated with the actual reference outputs (ref_out). This breaks the chaining logic where step N+1 is supposed to receive inputs from step N's reference outputs. Additionally, run_case needs to be updated to return the reference outputs to the caller.
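A minimal sketch of the API change this implies (hypothetical internals; only the return shape matters here, and the real ComparisonCase/run_case signatures in this PR may differ):

def run_case(case) -> tuple["CompareReport", dict[str, "torch.Tensor"]]:
    inputs = case.materialize()              # hypothetical: however the case builds its inputs
    ref_out = case.run_reference(inputs)     # hypothetical helper names
    tgt_out = case.run_target(inputs)
    report = case.compare(ref_out, tgt_out)
    return report, ref_out                   # expose ref outputs so run_chain can forward them

# run_chain could then propagate outputs between steps:
#     report, prev_ref_out = run_case(step.case)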
    return all_weights[prefix + name]

def proj(name: str) -> torch.Tensor:
    return w(name).t().contiguous().to(torch.bfloat16)
hi = ki * q_per_kv + qi
q_head = q_rot[:, hi, :]
scores = (q_head.float() @ k_cached.float().T) * scale
causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
Force-pushed from 7009dee to 69db614 (Compare)
Actionable comments posted: 12
🧹 Nitpick comments (4)
llm/core/pypto_executor.py (2)
19-19: ⚡ Quick win
Hard-coded _TIMING_ENABLED = True with print() for every prefill/decode call.
The flag is set unconditionally to True and prints on every prefill/decode invocation, which will be noisy on hot inference paths and is awkward to silence for end-users (they'd have to edit the source). Consider gating the timing on an env var (or a constructor flag on PyptoQwen14BExecutor) and routing output through logging instead of print.
♻️ Sketch
-import time
+import logging
+import os
+import time
 from dataclasses import dataclass
 from pathlib import Path

 import torch

-_TIMING_ENABLED = True
+_TIMING_ENABLED = os.environ.get("PYPTO_EXECUTOR_TIMING", "0") == "1"
+_LOG = logging.getLogger(__name__)
@@
-    if _TIMING_ENABLED:
-        print(
-            f"[timing] prefill: {len(model.layers)} layers, "
-            f"{(time.perf_counter() - t_prefill_start) * 1000:.2f} ms",
-            flush=True,
-        )
+    if _TIMING_ENABLED:
+        _LOG.info(
+            "prefill: %d layers, %.2f ms",
+            len(model.layers),
+            (time.perf_counter() - t_prefill_start) * 1000,
+        )
Apply the same swap for the decode timing block.
Also applies to: 188-194, 237-243
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/core/pypto_executor.py` at line 19, Replace the hard-coded _TIMING_ENABLED = True and the ad-hoc print()calls in the prefill and decode timing blocks with a configurable, non-invasive mechanism: add a boolean constructor parameter (e.g., timing_enabled) to PyptoQwen14BExecutor that defaults to reading an environment variable (e.g., PYRPT_TIMING) and remove the global hard-coded flag; replace all print(...) usages in the prefill and decode timing sections with logger.debug/info calls using the standard logging module so users can control verbosity via logging configuration, and gate the timing measurements behind the new instance attribute (self.timing_enabled) so timing is silent unless explicitly enabled.
334-361: 🏗️ Heavy lift
Peak memory from list accumulation in torch.cat stacking.
The _stack_decode_weights method materializes transient buffers by accumulating all per-layer results in a list before concatenating. For Qwen3-14B with 40 layers, each stack_kernel_weight call collects 40 per-layer tensors (each freshly produced by _kernel_weight: transpose + bf16 + contiguous operations) and holds them all simultaneously. For weights like w_gate/w_up/w_down (~170 MB/layer in bf16 each) across all 11 stacked weights, this pattern keeps peak memory at roughly (N+1) * per_layer_size for each weight type.
If memory is constrained in your target environment, pre-allocate each stacked tensor and copy per-layer slices into it one at a time (releasing each _kernel_weight result before producing the next), reducing peak allocation to ~2 * per_layer_size per weight.
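A rough sketch of that pre-allocation pattern, added here for illustration (it assumes a per-layer helper with the signature (layer, name) -> torch.Tensor, standing in for the PR's _kernel_weight, and a concat-along-dim-0 layout; both may differ in the actual code):

import torch

def stack_kernel_weight(layers, name, kernel_weight_fn):
    """Pre-allocate the stacked buffer and fill it layer by layer."""
    first = kernel_weight_fn(layers[0], name)                  # probe shape/dtype once
    rows = first.shape[0]
    out = torch.empty((len(layers) * rows, *first.shape[1:]), dtype=first.dtype)
    out[:rows].copy_(first)
    del first                                                  # release the probe buffer early
    for i, layer in enumerate(layers[1:], start=1):
        w = kernel_weight_fn(layer, name)
        out[i * rows:(i + 1) * rows].copy_(w)                  # copy, then drop the temporary
        del w
    return out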
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/core/pypto_executor.py` around lines 334 - 361, The current _stack_decode_weights implementation (stack_kernel_weight and stack_norm_weight) builds Python lists of per-layer tensors then torch.cat's them, causing high peak memory; change each stack_* to pre-allocate the final tensor sized (len(model.layers), ...) and fill slices from each layer one-by-one by calling _kernel_weight or getattr(layer, ...) per layer and copying into the pre-allocated buffer, releasing the per-layer result immediately; update stack_kernel_weight to compute the final shape from a single layer (using _kernel_weight on layer 0 without keeping it long) and then loop layers to copy into buffer, and similarly update stack_norm_weight to allocate a (num_layers, feature) float tensor and fill row-wise from getattr(layer, attr).cpu().contiguous() so decode_input_rms_weight, decode_wq, decode_wk, decode_wv, decode_q_norm_weight, decode_k_norm_weight, decode_wo, decode_post_rms_weight, decode_w_gate, decode_w_up, and decode_w_down are built without materializing long lists.
llm/testing/hf_compare/cases/qwen3_14b_prefill.py (2)
194-201: 💤 Low value
Repo-root resolution via parents[4] is fragile.
Path(__file__).resolve().parents[4] silently produces a wrong root if the file is ever moved or the package structure changes (you'll then load a kernel from the wrong tree or get a confusing ImportError). Consider centralizing this in a small helper that walks up to a known sentinel (e.g. pyproject.toml or examples/), or pinning it via a package-relative resource. Optional given this is test-only code.
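A minimal sketch of such a helper (the name and sentinel choice are illustrative, not part of the PR):

from pathlib import Path

def find_repo_root_by_sentinel(start: Path, sentinel: str = "pyproject.toml") -> Path:
    for candidate in [start.resolve(), *start.resolve().parents]:
        if (candidate / sentinel).exists():
            return candidate
    raise FileNotFoundError(f"no {sentinel} found above {start}")

repo_root = find_repo_root_by_sentinel(Path(__file__))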
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py` around lines 194 - 201, The repo-root computation using Path(__file__).resolve().parents[4] is fragile; replace it with a small helper (e.g., find_repo_root_by_sentinel) that walks upward from Path(__file__).resolve() until it finds a known sentinel like "pyproject.toml" or the "examples" directory and returns that path, then use that helper to derive _KERNEL_PATH instead of repo_root = Path(...).parents[4]; update the Qwen3 loader code that sets _KERNEL_PATH and the ImportError message to use the helper and ensure it raises a clear error if the sentinel isn't found.
131-158: 💤 Low value
Vectorize _select_npu_kv to avoid per-element Python loops.
For typical configs (e.g. batch=16, seq_len=128, NUM_KV_HEADS=8), this runs ~16k Python-level row copies plus a .item() sync per (b, pos). It's easy to express with index arithmetic in a single index_select, which removes the host/device sync and the inner Python loops. Also, out is created via torch.empty and never aliased, so the trailing .clone() on line 158 is unnecessary.
♻️ Vectorized sketch
def _select_npu_kv(flat, *, slot_mapping, block_size, batch, max_seq, seq_len):
    pos = torch.arange(seq_len, device=flat.device)
    b = torch.arange(batch, device=flat.device).unsqueeze(1)           # [B, 1]
    slots = slot_mapping.view(batch, max_seq)[:, :seq_len].long()      # [B, S]
    slot_block = slots // block_size
    slot_off = slots % block_size
    base = slot_block * (NUM_KV_HEADS * block_size) + slot_off         # [B, S]
    kvh = torch.arange(NUM_KV_HEADS, device=flat.device)               # [nkv]
    rows = base.unsqueeze(1) + kvh.view(1, -1, 1) * block_size         # [B, nkv, S]
    return flat.index_select(0, rows.reshape(-1)).view(batch, NUM_KV_HEADS, seq_len, -1)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py` around lines 131 - 158, The _select_npu_kv function is doing heavy Python-level loops and device-sync .item() calls; replace the per-(b,pos) loops with a vectorized index computation using torch integer ops on slot_mapping to compute slot_block and slot_off, build a base index = slot_block*(NUM_KV_HEADS*block_size)+slot_off, add a kvh offset (kvh * block_size) to form a [B, NUM_KV_HEADS, S] rows tensor, then use flat.index_select(0, rows.reshape(-1)) and reshape to (batch, NUM_KV_HEADS, seq_len, HEAD_DIM) to return; also remove the unnecessary out.clone() at the end. Reference symbols: _select_npu_kv, flat, slot_mapping, block_size, batch, max_seq, seq_len, NUM_KV_HEADS, HEAD_DIM.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@llm/testing/hf_compare/__main__.py`:
- Around line 9-15: The module docstring at the top of
llm/testing/hf_compare/__main__.py advertises CLI flags (--hf-model-path,
--platform, --cpu-only) that the parser doesn't register; update the docstring
to show the actual usage of the implemented CLI (the "list" and "run"
subcommands) and example invocation for "run" using the registered positional
"case", the --json flag and the kwarg option (-k/--kwarg), e.g. show usages like
"python -m llm.testing.hf_compare list" and "python -m llm.testing.hf_compare
run qwen3_14b.decode -k hf_model_path=/path/to/Qwen3-14B -k cpu_only=true" so
copy-paste works and matches how the run subparser (the parser.add_subparsers /
run subparser and its --json and -k/--kwarg definitions) is actually
implemented.
In `@llm/testing/hf_compare/cases/qwen3_14b_decode_full.py`:
- Around line 240-263: Validate runtime parameters immediately after parsing:
ensure batch > 0 (already checked), raise ValueError if seq_len is None or
seq_len <= 0, raise ValueError if num_layers <= 0, and validate hf_dtype is one
of the supported strings (e.g. "fp32" or "bf16") and raise ValueError for
others; perform these checks before computing context_len/max_seq_eff
(variables: seq_len, num_layers, hf_dtype, context_len, max_seq_eff,
dec.MAX_SEQ) so invalid cases fail during case construction rather than during
decoding.
In `@llm/testing/hf_compare/cases/qwen3_14b_decode.py`:
- Around line 202-231: Validate inputs early: ensure seq_len (if provided) and
ctx_len (when used as fallback) are positive and raise ValueError if <= 0, and
validate hf_dtype against allowed values ("fp32" or "bf16") before mapping to
hf_dtype_t (currently set in hf_dtype_t = torch.float32 if hf_dtype == "fp32"
else torch.bfloat16). Specifically, add a check near where seq_len/ctx_len are
processed to reject non-positive lengths (reference variables seq_len, ctx_len,
context_len) and add an explicit hf_dtype validation that raises a ValueError
for unknown strings before assigning hf_dtype_t so typos don't silently become
bfloat16.
In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py`:
- Around line 66-72: The _parse_prompt_ids function currently converts
list/tuple inputs to integers but doesn't reject an empty list/tuple; ensure you
validate and raise the same ValueError("prompt_ids must contain at least one
token id") for list/tuple inputs too by checking the computed out list after
converting the values (from the isinstance(prompt_ids, (list, tuple)) branch)
and raising if empty so callers like the e2e runner won't receive an empty list;
keep the same error message and behavior for the string path.
- Around line 619-639: Validate inputs early: after coercing and converting
inputs (num_layers, max_new_tokens, hf_dtype) add checks that num_layers > 0 and
max_new_tokens > 0 and that hf_dtype is one of the allowed strings ("fp32" or
"bf16"); if any check fails, raise a ValueError with a clear message. Replace
the current silent fallback logic that sets hf_dtype_t = torch.float32 if
hf_dtype == "fp32" else torch.bfloat16 with an explicit mapping that first
validates hf_dtype and then assigns hf_dtype_t so typos don't silently pick
bf16; ensure these checks sit near the existing conversions around the variables
num_layers, max_new_tokens, and hf_dtype_t in this builder function.
- Around line 490-516: The call to _load_qwen3_modules() runs unconditionally
and pulls in backend-only code (e.g., pypto.language) even on the CPU-only path;
move the _load_qwen3_modules() invocation into the non-CPU branch (the branch
guarded by the cpu_only flag) or else hoist the static constants HIDDEN,
HEAD_DIM, and NUM_KV_HEADS into module-level constants so the CPU-only path does
not import prefill; update uses of prefill.HIDDEN, prefill.HEAD_DIM, and
prefill.NUM_KV_HEADS in the code that builds layer_weights and cache_rows to
reference the new constants or to only access prefill after the conditional.
In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py`:
- Around line 182-192: The code silently maps any hf_dtype string other than
"fp32" to bfloat16; change this to validate hf_dtype explicitly and raise a
clear error for unknown values: accept a defined set (e.g. "fp32", "float32" ->
torch.float32; "bfloat16", "bf16" -> torch.bfloat16; optionally "fp16" ->
torch.float16), map the allowed strings to the correct torch dtype and set
hf_dtype_t accordingly, and throw a ValueError with a descriptive message if
hf_dtype is not in the allowed set (update the assignment around hf_dtype_t to
perform the validation/mapping).
In `@llm/testing/hf_compare/input_sampler.py`:
- Around line 46-57: The build_rope_tables function must reject odd head_dim to
avoid silently producing wrong-width RoPE tables; add an early validation in
build_rope_tables that checks head_dim is even (e.g., head_dim % 2 == 0) and
raise a ValueError with a clear message if not, then proceed to compute
inv_freq, freqs, emb, and return emb.cos()/emb.sin() as before; reference the
existing symbols head_dim, inv_freq, freqs, and emb so the check is inserted
before inv_freq is computed.
In `@llm/testing/hf_compare/reference.py`:
- Around line 99-103: forward() currently passes raw input tensors to
forward_fn, which can mismatch the module dtype; modify forward() to first cast
any floating-point tensors in the inputs mapping to self.dtype (use
torch.is_floating_point to detect) into a new inputs dict (preserving
non-floating tensors unchanged and preserving device), then call
self.forward_fn(self._module, cast_inputs) inside the existing torch.no_grad()
block; reference the forward method, self.dtype, forward_fn, and prepare() to
locate the related code.
In `@llm/testing/hf_compare/runner.py`:
- Around line 113-140: The run_chain loop never propagates reference outputs and
permanently mutates user cases: stop clearing prev_ref_out each iteration and
actually capture the reference outputs from run_case (e.g., change run_case to
return (CompareReport, dict[str,torch.Tensor]) or expose ref outputs on
ComparisonCase) and assign that to prev_ref_out so forward_map can work; also
avoid mutating step.case.on_inputs in place—either restore the original hook
after calling run_case or run a shallow copy of step.case with the wrapped
on_inputs when invoking run_case (do not leave the wrapper assigned to the
original case).
In `@llm/testing/hf_compare/target.py`:
- Around line 173-195: In run(), before calling forward_fn(self._module,
inputs), ensure all floating-point tensors in the inputs Mapping are converted
to self.dtype to avoid dtype mismatches with the module (e.g., convert any
tensor where tensor.is_floating_point() is True to tensor.to(self.dtype)); do
this conversion only when self._module is prepared (after module load) and pass
the normalized inputs into forward_fn; reference symbols: run, forward_fn,
module_factory, state_dict_prefix, strict_load, and self._module.
In `@llm/testing/hf_compare/weight_adapter.py`:
- Around line 49-56: The Transpose Op implementation violates its docstring: in
Transpose -> _op it uses t.t() which only works for 2D matrices; change the
default branch to call t.transpose(-2, -1) (or tensor.transpose) to swap the
last two dimensions for tensors with ndim>=2, leaving scalars/1D untouched, and
keep the existing t.permute(*dims) behavior when dims are provided; update the
_op in Transpose accordingly.
---
Nitpick comments:
In `@llm/core/pypto_executor.py`:
- Line 19: Replace the hard-coded _TIMING_ENABLED = True and the ad-hoc
print()calls in the prefill and decode timing blocks with a configurable,
non-invasive mechanism: add a boolean constructor parameter (e.g.,
timing_enabled) to PyptoQwen14BExecutor that defaults to reading an environment
variable (e.g., PYRPT_TIMING) and remove the global hard-coded flag; replace all
print(...) usages in the prefill and decode timing sections with
logger.debug/info calls using the standard logging module so users can control
verbosity via logging configuration, and gate the timing measurements behind the
new instance attribute (self.timing_enabled) so timing is silent unless
explicitly enabled.
- Around line 334-361: The current _stack_decode_weights implementation
(stack_kernel_weight and stack_norm_weight) builds Python lists of per-layer
tensors then torch.cat's them, causing high peak memory; change each stack_* to
pre-allocate the final tensor sized (len(model.layers), ...) and fill slices
from each layer one-by-one by calling _kernel_weight or getattr(layer, ...) per
layer and copying into the pre-allocated buffer, releasing the per-layer result
immediately; update stack_kernel_weight to compute the final shape from a single
layer (using _kernel_weight on layer 0 without keeping it long) and then loop
layers to copy into buffer, and similarly update stack_norm_weight to allocate a
(num_layers, feature) float tensor and fill row-wise from getattr(layer,
attr).cpu().contiguous() so decode_input_rms_weight, decode_wq, decode_wk,
decode_wv, decode_q_norm_weight, decode_k_norm_weight, decode_wo,
decode_post_rms_weight, decode_w_gate, decode_w_up, and decode_w_down are built
without materializing long lists.
In `@llm/testing/hf_compare/cases/qwen3_14b_prefill.py`:
- Around line 194-201: The repo-root computation using
Path(__file__).resolve().parents[4] is fragile; replace it with a small helper
(e.g., find_repo_root_by_sentinel) that walks upward from
Path(__file__).resolve() until it finds a known sentinel like "pyproject.toml"
or the "examples" directory and returns that path, then use that helper to
derive _KERNEL_PATH instead of repo_root = Path(...).parents[4]; update the
Qwen3 loader code that sets _KERNEL_PATH and the ImportError message to use the
helper and ensure it raises a clear error if the sentinel isn't found.
- Around line 131-158: The _select_npu_kv function is doing heavy Python-level
loops and device-sync .item() calls; replace the per-(b,pos) loops with a
vectorized index computation using torch integer ops on slot_mapping to compute
slot_block and slot_off, build a base index =
slot_block*(NUM_KV_HEADS*block_size)+slot_off, add a kvh offset (kvh *
block_size) to form a [B, NUM_KV_HEADS, S] rows tensor, then use
flat.index_select(0, rows.reshape(-1)) and reshape to (batch, NUM_KV_HEADS,
seq_len, HEAD_DIM) to return; also remove the unnecessary out.clone() at the
end. Reference symbols: _select_npu_kv, flat, slot_mapping, block_size, batch,
max_seq, seq_len, NUM_KV_HEADS, HEAD_DIM.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: d5086288-7b7b-471c-969a-79ea791ebefd
📒 Files selected for processing (21)
examples/models/qwen3/14b/qwen3_14b_decode_full.py
llm/core/kv_cache.py
llm/core/pypto_executor.py
llm/model/qwen3_14b_decode.py
llm/testing/__init__.py
llm/testing/hf_compare/__init__.py
llm/testing/hf_compare/__main__.py
llm/testing/hf_compare/base.py
llm/testing/hf_compare/cases/__init__.py
llm/testing/hf_compare/cases/qwen3_14b_decode.py
llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
llm/testing/hf_compare/cases/qwen3_14b_e2e.py
llm/testing/hf_compare/cases/qwen3_14b_prefill.py
llm/testing/hf_compare/comparator.py
llm/testing/hf_compare/input_sampler.py
llm/testing/hf_compare/reference.py
llm/testing/hf_compare/runner.py
llm/testing/hf_compare/target.py
llm/testing/hf_compare/weight_adapter.py
tests/llm/testing/test_hf_compare_batch_config.py
tests/llm/testing/test_hf_compare_e2e.py
"""CLI for the HF comparison framework.

Usage:
    python -m llm.testing.hf_compare list
    python -m llm.testing.hf_compare run qwen3_14b.decode \\
        --hf-model-path /path/to/Qwen3-14B --platform a2a3 [--cpu-only]
"""
Module docstring advertises CLI flags that don't exist.
The example shows --hf-model-path, --platform, and --cpu-only, but the run subparser only registers case, --json, and --kwarg/-k. Anyone copy-pasting this command will get error: unrecognized arguments. The PR test plan correctly uses -k hf_model_path=... -k cpu_only=true, so update the docstring to match.
📝 Suggested docstring fix
"""CLI for the HF comparison framework.
Usage:
python -m llm.testing.hf_compare list
- python -m llm.testing.hf_compare run qwen3_14b.decode \\
- --hf-model-path /path/to/Qwen3-14B --platform a2a3 [--cpu-only]
+ python -m llm.testing.hf_compare run qwen3_14b.decode \\
+ -k hf_model_path=/path/to/Qwen3-14B -k platform=a2a3 [-k cpu_only=true]
"""🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/__main__.py` around lines 9 - 15, The module docstring
at the top of llm/testing/hf_compare/__main__.py advertises CLI flags
(--hf-model-path, --platform, --cpu-only) that the parser doesn't register;
update the docstring to show the actual usage of the implemented CLI (the "list"
and "run" subcommands) and example invocation for "run" using the registered
positional "case", the --json flag and the kwarg option (-k/--kwarg), e.g. show
usages like "python -m llm.testing.hf_compare list" and "python -m
llm.testing.hf_compare run qwen3_14b.decode -k hf_model_path=/path/to/Qwen3-14B
-k cpu_only=true" so copy-paste works and matches how the run subparser (the
parser.add_subparsers / run subparser and its --json and -k/--kwarg definitions)
is actually implemented.
device_id = int(device_id) if device_id is not None else None
batch = int(batch)
seed = int(seed)
num_layers = int(num_layers)
seq_len = int(seq_len) if seq_len is not None else None
max_seq = int(max_seq) if max_seq is not None else None
atol = float(atol)
rtol = float(rtol)
hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16

repo_root = Path(__file__).resolve().parents[4]
_KERNEL_PATH = repo_root / "examples" / "models" / "qwen3" / "14b" / "qwen3_14b_decode_full.py"
_SPEC = importlib.util.spec_from_file_location("_qwen3_14b_decode_full_kernel", _KERNEL_PATH)
if _SPEC is None or _SPEC.loader is None:
    raise ImportError(f"Cannot load fused decode kernel from {_KERNEL_PATH}")
dec = importlib.util.module_from_spec(_SPEC)
_SPEC.loader.exec_module(dec)

if batch <= 0:
    raise ValueError(f"batch must be positive, got {batch}")
context_len = seq_len if seq_len is not None else dec.MAX_SEQ
max_seq_eff = max_seq or dec.MAX_SEQ
if context_len > max_seq_eff:
    raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")
Reject zero/invalid runtime parameters up front.
seq_len=0 makes the decode position negative in the reference path, num_layers=0 creates a meaningless "fused" comparison, and any hf_dtype other than "fp32" currently falls through to bf16. These should fail during case construction.
Proposed fix
batch = int(batch)
seed = int(seed)
num_layers = int(num_layers)
seq_len = int(seq_len) if seq_len is not None else None
max_seq = int(max_seq) if max_seq is not None else None
atol = float(atol)
rtol = float(rtol)
- hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
+ if num_layers <= 0:
+ raise ValueError(f"num_layers must be positive, got {num_layers}")
+ if hf_dtype not in {"fp32", "bf16"}:
+ raise ValueError(f"hf_dtype must be 'fp32' or 'bf16', got {hf_dtype!r}")
+ hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
@@
if batch <= 0:
raise ValueError(f"batch must be positive, got {batch}")
context_len = seq_len if seq_len is not None else dec.MAX_SEQ
max_seq_eff = max_seq or dec.MAX_SEQ
+ if context_len <= 0:
+ raise ValueError(f"seq_len must be positive, got {context_len}")
if context_len > max_seq_eff:
raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/cases/qwen3_14b_decode_full.py` around lines 240 -
263, Validate runtime parameters immediately after parsing: ensure batch > 0
(already checked), raise ValueError if seq_len is None or seq_len <= 0, raise
ValueError if num_layers <= 0, and validate hf_dtype is one of the supported
strings (e.g. "fp32" or "bf16") and raise ValueError for others; perform these
checks before computing context_len/max_seq_eff (variables: seq_len, num_layers,
hf_dtype, context_len, max_seq_eff, dec.MAX_SEQ) so invalid cases fail during
case construction rather than during decoding.
cpu_only = _coerce_bool(cpu_only)
device_id = int(device_id) if device_id is not None else None
batch = int(batch)
seed = int(seed)
seq_len = int(seq_len) if seq_len is not None else None
max_seq = int(max_seq) if max_seq is not None else None
ctx_len = int(ctx_len) if ctx_len is not None else None
atol = float(atol)
rtol = float(rtol)
hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16

repo_root = Path(__file__).resolve().parents[4]
# Load the kernel directly from examples (bypass llm.model shim).
_KERNEL_PATH = repo_root / "examples" / "models" / "qwen3" / "14b" / "qwen3_14b_decode.py"
_SPEC = importlib.util.spec_from_file_location("_qwen3_14b_decode_kernel", _KERNEL_PATH)
if _SPEC is None or _SPEC.loader is None:
    raise ImportError(f"Cannot load Qwen3-14B decode kernel from {_KERNEL_PATH}")
dec = importlib.util.module_from_spec(_SPEC)
_SPEC.loader.exec_module(dec)

if batch <= 0:
    raise ValueError(f"batch must be positive, got {batch}")
# Decode is a single-token step that assumes position = (seq_len - 1).
# `seq_len` is the context length for this step; `max_seq` controls the
# cache capacity / RoPE table size (defaults to the kernel's MAX_SEQ).
# Back-compat: `ctx_len` is an alias for `seq_len` if `seq_len` is unset.
context_len = seq_len if seq_len is not None else (ctx_len or dec.MAX_SEQ)
max_seq_eff = max_seq or dec.MAX_SEQ
if context_len > max_seq_eff:
    raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")
Reject seq_len <= 0 and unknown hf_dtype during case construction.
The reference path uses pos = seq_len - 1, so a zero/negative decode length is invalid, and the current ternary silently treats any hf_dtype typo as bf16. It is better to fail here than during input materialization or execution.
Proposed fix
batch = int(batch)
seed = int(seed)
seq_len = int(seq_len) if seq_len is not None else None
max_seq = int(max_seq) if max_seq is not None else None
ctx_len = int(ctx_len) if ctx_len is not None else None
atol = float(atol)
rtol = float(rtol)
- hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
+ if hf_dtype not in {"fp32", "bf16"}:
+ raise ValueError(f"hf_dtype must be 'fp32' or 'bf16', got {hf_dtype!r}")
+ hf_dtype_t = torch.float32 if hf_dtype == "fp32" else torch.bfloat16
@@
# Decode is a single-token step that assumes position = (seq_len - 1).
# `seq_len` is the context length for this step; `max_seq` controls the
# cache capacity / RoPE table size (defaults to the kernel's MAX_SEQ).
# Back-compat: `ctx_len` is an alias for `seq_len` if `seq_len` is unset.
context_len = seq_len if seq_len is not None else (ctx_len or dec.MAX_SEQ)
max_seq_eff = max_seq or dec.MAX_SEQ
+ if context_len <= 0:
+ raise ValueError(f"seq_len must be positive, got {context_len}")
if context_len > max_seq_eff:
raise ValueError(f"seq_len={context_len} > max_seq={max_seq_eff}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/cases/qwen3_14b_decode.py` around lines 202 - 231,
Validate inputs early: ensure seq_len (if provided) and ctx_len (when used as
fallback) are positive and raise ValueError if <= 0, and validate hf_dtype
against allowed values ("fp32" or "bf16") before mapping to hf_dtype_t
(currently set in hf_dtype_t = torch.float32 if hf_dtype == "fp32" else
torch.bfloat16). Specifically, add a check near where seq_len/ctx_len are
processed to reject non-positive lengths (reference variables seq_len, ctx_len,
context_len) and add an explicit hf_dtype validation that raises a ValueError
for unknown strings before assigning hf_dtype_t so typos don't silently become
bfloat16.
def _parse_prompt_ids(prompt_ids: str | list[int] | tuple[int, ...]) -> list[int]:
    if isinstance(prompt_ids, (list, tuple)):
        return [int(v) for v in prompt_ids]
    out = [int(part.strip()) for part in str(prompt_ids).split(",") if part.strip()]
    if not out:
        raise ValueError("prompt_ids must contain at least one token id")
    return out
Validate non-empty prompts for list/tuple inputs too.
The string path rejects an empty prompt, but the list/tuple path returns [] unchanged. That later reaches prompt_len - 1 in the e2e runner and blows up far away from the actual input error.
Proposed fix
def _parse_prompt_ids(prompt_ids: str | list[int] | tuple[int, ...]) -> list[int]:
if isinstance(prompt_ids, (list, tuple)):
- return [int(v) for v in prompt_ids]
+ out = [int(v) for v in prompt_ids]
+ if not out:
+ raise ValueError("prompt_ids must contain at least one token id")
+ return out
out = [int(part.strip()) for part in str(prompt_ids).split(",") if part.strip()]
if not out:
raise ValueError("prompt_ids must contain at least one token id")
return out🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py` around lines 66 - 72, The
_parse_prompt_ids function currently converts list/tuple inputs to integers but
doesn't reject an empty list/tuple; ensure you validate and raise the same
ValueError("prompt_ids must contain at least one token id") for list/tuple
inputs too by checking the computed out list after converting the values (from
the isinstance(prompt_ids, (list, tuple)) branch) and raising if empty so
callers like the e2e runner won't receive an empty list; keep the same error
message and behavior for the string path.
dec, prefill = _load_qwen3_modules()

input_ids = inputs["input_ids"].to(torch.long)
batch = int(input_ids.shape[0])
if batch != 1:
    raise ValueError(f"qwen3_14b.e2e expects batch=1, got {batch}")

prompt_len = int(input_ids.shape[1])
if prompt_len + max_new_tokens - 1 > max_seq:
    raise ValueError(
        f"prompt_len + max_new_tokens - 1 = {prompt_len + max_new_tokens - 1} exceeds max_seq={max_seq}"
    )

hidden = int(prefill.HIDDEN)
head_dim = int(prefill.HEAD_DIM)
num_kv_heads = int(prefill.NUM_KV_HEADS)
cache_rows = batch * num_kv_heads * max_seq

layer_weights = [
    _extract_layer_weights(dict(weights), layer_idx=idx, hidden=hidden, head_dim=head_dim)
    for idx in range(num_layers)
]
embed_weight = weights["model.embed_tokens.weight"]
norm_weight = weights["model.norm.weight"].view(1, hidden)
lm_weight = weights["lm_head.weight"]
rope_cos, rope_sin = _build_rope(max_seq, head_dim)
Move _load_qwen3_modules() behind the non-CPU branch or hoist constants.
_load_qwen3_modules() at line 490 executes unconditionally before the cpu_only check at line 526, importing pypto.language even when the CPU-only path runs. The prefill module is only used to extract static constants (HIDDEN, HEAD_DIM, NUM_KV_HEADS), so either load the modules conditionally within the NPU branch or define these constants inline in this case module to keep the CPU-only path free of backend dependencies.
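A rough sketch of the conditional-load option (it assumes the builder's existing cpu_only flag and _load_qwen3_modules() helper; the idea that an HF config object is in scope on the CPU-only path is an assumption of this sketch):

if cpu_only:
    # Pure-HF path: derive the static dims without importing the pypto-backed kernels.
    hidden = int(cfg.hidden_size)
    head_dim = int(cfg.head_dim)
    num_kv_heads = int(cfg.num_key_value_heads)
else:
    dec, prefill = _load_qwen3_modules()   # backend import only on the NPU path
    hidden = int(prefill.HIDDEN)
    head_dim = int(prefill.HEAD_DIM)
    num_kv_heads = int(prefill.NUM_KV_HEADS)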
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/cases/qwen3_14b_e2e.py` around lines 490 - 516, The
call to _load_qwen3_modules() runs unconditionally and pulls in backend-only
code (e.g., pypto.language) even on the CPU-only path; move the
_load_qwen3_modules() invocation into the non-CPU branch (the branch guarded by
the cpu_only flag) or else hoist the static constants HIDDEN, HEAD_DIM, and
NUM_KV_HEADS into module-level constants so the CPU-only path does not import
prefill; update uses of prefill.HIDDEN, prefill.HEAD_DIM, and
prefill.NUM_KV_HEADS in the code that builds layer_weights and cache_rows to
reference the new constants or to only access prefill after the conditional.
def build_rope_tables(
    max_seq: int,
    head_dim: int,
    base: float = 1_000_000.0,
    dtype: torch.dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Build cos/sin tables compatible with HF rotate_half form."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    pos = torch.arange(max_seq, dtype=torch.float32)
    freqs = torch.outer(pos, inv_freq)
    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos().to(dtype), emb.sin().to(dtype)
Reject odd head_dim in build_rope_tables.
For odd values, torch.arange(0, head_dim, 2) has ceil(head_dim / 2) entries, so torch.cat([freqs, freqs], dim=-1) returns one extra column. That silently produces RoPE tables with the wrong width instead of failing fast.
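A quick illustration of the width mismatch for an odd head_dim (standalone check, not code from the PR):

import torch

head_dim = 5
half = torch.arange(0, head_dim, 2)   # tensor([0, 2, 4]) -> ceil(5 / 2) = 3 entries
emb_width = 2 * half.numel()          # cat([freqs, freqs], dim=-1) yields 6 columns
assert emb_width != head_dim          # table width no longer matches the head dim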
Proposed fix
def build_rope_tables(
max_seq: int,
head_dim: int,
base: float = 1_000_000.0,
dtype: torch.dtype = torch.float32,
) -> tuple[torch.Tensor, torch.Tensor]:
"""Build cos/sin tables compatible with HF rotate_half form."""
+ if head_dim % 2 != 0:
+ raise ValueError(f"head_dim must be even for RoPE tables, got {head_dim}")
inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
pos = torch.arange(max_seq, dtype=torch.float32)
freqs = torch.outer(pos, inv_freq)
emb = torch.cat([freqs, freqs], dim=-1)
return emb.cos().to(dtype), emb.sin().to(dtype)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/input_sampler.py` around lines 46 - 57, The
build_rope_tables function must reject odd head_dim to avoid silently producing
wrong-width RoPE tables; add an early validation in build_rope_tables that
checks head_dim is even (e.g., head_dim % 2 == 0) and raise a ValueError with a
clear message if not, then proceed to compute inv_freq, freqs, emb, and return
emb.cos()/emb.sin() as before; reference the existing symbols head_dim,
inv_freq, freqs, and emb so the check is inserted before inv_freq is computed.
def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    if self._module is None:
        raise RuntimeError(f"[{self.name}] prepare() must be called first.")
    with torch.no_grad():
        return self.forward_fn(self._module, inputs)
Cast floating inputs to self.dtype before calling the module.
The class documentation advertises running forward "in a chosen dtype", and prepare() explicitly casts both the module and its state dict values to self.dtype (lines 83, 90). However, forward() passes raw input tensors directly to forward_fn without casting them. When inputs are in a different dtype (e.g., bf16 while the module is fp32), this causes dtype mismatches inside the wrapped module or silently exercises an unintended compute path. Real-world code in this repository manually casts inputs to work around this (see qwen3_14b_prefill.py lines 74–76).
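For reference, the failure mode is easy to reproduce outside the framework (illustrative snippet, not from the PR):

import torch

lin = torch.nn.Linear(8, 8).to(torch.float32)   # module prepared in fp32
x = torch.randn(2, 8, dtype=torch.bfloat16)     # bf16-materialized input
try:
    lin(x)
except RuntimeError as err:
    print(err)  # mat1 and mat2 must have the same dtype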
Proposed fix
def forward(self, inputs: Mapping[str, torch.Tensor]) -> dict[str, torch.Tensor]:
if self._module is None:
raise RuntimeError(f"[{self.name}] prepare() must be called first.")
+ cast_inputs = {
+ key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
+ for key, value in inputs.items()
+ }
with torch.no_grad():
- return self.forward_fn(self._module, inputs)
+        return self.forward_fn(self._module, cast_inputs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/reference.py` around lines 99 - 103, forward()
currently passes raw input tensors to forward_fn, which can mismatch the module
dtype; modify forward() to first cast any floating-point tensors in the inputs
mapping to self.dtype (use torch.is_floating_point to detect) into a new inputs
dict (preserving non-floating tensors unchanged and preserving device), then
call self.forward_fn(self._module, cast_inputs) inside the existing
torch.no_grad() block; reference the forward method, self.dtype, forward_fn, and
prepare() to locate the related code.
def run_chain(chain: ChainedComparisonCase) -> list[CompareReport]:
    reports: list[CompareReport] = []
    prev_ref_out: dict[str, torch.Tensor] = {}
    for step in chain.steps:
        if step.forward_map and prev_ref_out:
            # Inject values from the previous step's reference outputs into
            # this step's input spec by overriding materialize() via on_inputs.
            mapping = step.forward_map
            snapshot = {dst: prev_ref_out[src].clone() for src, dst in mapping.items() if src in prev_ref_out}
            original_hook = step.case.on_inputs

            def _hook(inputs: dict[str, torch.Tensor], _snap=snapshot, _orig=original_hook) -> None:
                for name, value in _snap.items():
                    inputs[name] = value
                if _orig is not None:
                    _orig(inputs)

            step.case.on_inputs = _hook
        report = run_case(step.case)
        reports.append(report)
        if not report.passed and chain.stop_on_fail:
            break
        # Re-run the reference just to capture its outputs for the next step.
        # (ComparisonCase.run did the work but did not return ref outputs, so
        # we re-materialize on the fly. For now we skip and rely on tests
        # using this chain to either keep state externally or extend the API.)
        prev_ref_out = {}
    return reports
forward_map is dead and run_chain permanently mutates the user's case.
Two related defects in this loop:
- prev_ref_out is unconditionally reset to {} at the bottom of every iteration (line 139), so the if step.forward_map and prev_ref_out guard on line 117 is never true except possibly for the very first step (where prev_ref_out is also empty). The feature is non-functional today; the comment at lines 135-138 acknowledges it but ships the broken path anyway. Either capture ref_out from run_case (return it alongside the report, or expose it via the case) and propagate it, or remove forward_map from the public API until it is actually wired.
- Line 130 reassigns step.case.on_inputs = _hook. Since ComparisonCase instances are user-supplied, this leaks the wrapper hook onto the original case object; re-running the same chain (or running the same case standalone afterwards) compounds the wrappers. Wrap the hook locally and pass it to run_case without mutating step.case, e.g. by restoring the original after the call or by constructing a shallow copy of the case for the run.
♻️ Sketch: restore the original hook and stop relying on a non-existent prev_ref_out
def run_chain(chain: ChainedComparisonCase) -> list[CompareReport]:
reports: list[CompareReport] = []
- prev_ref_out: dict[str, torch.Tensor] = {}
for step in chain.steps:
- if step.forward_map and prev_ref_out:
- # Inject values from the previous step's reference outputs into
- # this step's input spec by overriding materialize() via on_inputs.
- mapping = step.forward_map
- snapshot = {dst: prev_ref_out[src].clone() for src, dst in mapping.items() if src in prev_ref_out}
- original_hook = step.case.on_inputs
-
- def _hook(inputs: dict[str, torch.Tensor], _snap=snapshot, _orig=original_hook) -> None:
- for name, value in _snap.items():
- inputs[name] = value
- if _orig is not None:
- _orig(inputs)
-
- step.case.on_inputs = _hook
- report = run_case(step.case)
+ original_hook = step.case.on_inputs
+ try:
+ # ... install combined hook here once forward_map is actually wired ...
+ report = run_case(step.case)
+ finally:
+ step.case.on_inputs = original_hook
reports.append(report)
if not report.passed and chain.stop_on_fail:
break
- prev_ref_out = {}
return reports🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/runner.py` around lines 113 - 140, The run_chain loop
never propagates reference outputs and permanently mutates user cases: stop
clearing prev_ref_out each iteration and actually capture the reference outputs
from run_case (e.g., change run_case to return (CompareReport,
dict[str,torch.Tensor]) or expose ref outputs on ComparisonCase) and assign that
to prev_ref_out so forward_map can work; also avoid mutating step.case.on_inputs
in place—either restore the original hook after calling run_case or run a
shallow copy of step.case with the wrapped on_inputs when invoking run_case (do
not leave the wrapper assigned to the original case).
def run(
    self,
    inputs: Mapping[str, torch.Tensor],
    weights: Mapping[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
    if self._module is None:
        module = self.module_factory().to(self.dtype).eval()
        prefix = self.state_dict_prefix
        sd: dict[str, torch.Tensor] = {}
        for k, v in weights.items():
            if prefix and not k.startswith(prefix):
                continue
            stripped = k[len(prefix):] if prefix else k
            sd[stripped] = v.to(self.dtype)
        missing, unexpected = module.load_state_dict(sd, strict=False)
        if self.strict_load:
            if missing:
                raise RuntimeError(f"[{self.name}] missing keys: {missing}")
            if unexpected:
                raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
        self._module = module
    with torch.no_grad():
        return self.forward_fn(self._module, inputs)
Normalize floating inputs to self.dtype before invoking the target module.
The module and loaded weights are converted to self.dtype, but the input mapping is passed through unchanged. That makes TorchTarget(dtype=torch.float32) unsafe for bf16-materialized test inputs and can fail in nn.Linear/matmul operations when dtypes mismatch.
Proposed fix
def run(
self,
inputs: Mapping[str, torch.Tensor],
weights: Mapping[str, torch.Tensor],
) -> dict[str, torch.Tensor]:
if self._module is None:
module = self.module_factory().to(self.dtype).eval()
prefix = self.state_dict_prefix
sd: dict[str, torch.Tensor] = {}
for k, v in weights.items():
if prefix and not k.startswith(prefix):
continue
stripped = k[len(prefix):] if prefix else k
sd[stripped] = v.to(self.dtype)
missing, unexpected = module.load_state_dict(sd, strict=False)
if self.strict_load:
if missing:
raise RuntimeError(f"[{self.name}] missing keys: {missing}")
if unexpected:
raise RuntimeError(f"[{self.name}] unexpected keys: {unexpected}")
self._module = module
+ cast_inputs = {
+ key: value.to(self.dtype) if torch.is_tensor(value) and torch.is_floating_point(value) else value
+ for key, value in inputs.items()
+ }
with torch.no_grad():
- return self.forward_fn(self._module, inputs)
+        return self.forward_fn(self._module, cast_inputs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/target.py` around lines 173 - 195, In run(), before
calling forward_fn(self._module, inputs), ensure all floating-point tensors in
the inputs Mapping are converted to self.dtype to avoid dtype mismatches with
the module (e.g., convert any tensor where tensor.is_floating_point() is True to
tensor.to(self.dtype)); do this conversion only when self._module is prepared
(after module load) and pass the normalized inputs into forward_fn; reference
symbols: run, forward_fn, module_factory, state_dict_prefix, strict_load, and
self._module.
def Transpose(*dims: int) -> Op:
    """Transpose the last two dims by default; pass dims for arbitrary perm."""
    def _op(t: torch.Tensor) -> torch.Tensor:
        if not dims:
            if t.ndim < 2:
                return t
            return t.t()
        return t.permute(*dims)
Use transpose(-2, -1) for the default branch here.
The docstring promises "transpose the last two dims by default", but t.t() only supports tensors of rank 2 or lower and raises for anything higher, so any rank>2 input violates the advertised behavior.
Proposed fix

 def Transpose(*dims: int) -> Op:
     """Transpose the last two dims by default; pass dims for arbitrary perm."""
     def _op(t: torch.Tensor) -> torch.Tensor:
         if not dims:
             if t.ndim < 2:
                 return t
-            return t.t()
+            return t.transpose(-2, -1)
         return t.permute(*dims)
     _op.__name__ = f"Transpose{dims or ''}"
     return _op

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def Transpose(*dims: int) -> Op:
    """Transpose the last two dims by default; pass dims for arbitrary perm."""
    def _op(t: torch.Tensor) -> torch.Tensor:
        if not dims:
            if t.ndim < 2:
                return t
            return t.transpose(-2, -1)
        return t.permute(*dims)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/weight_adapter.py` around lines 49 - 56, The Transpose
Op implementation violates its docstring: in Transpose -> _op it uses t.t()
which only works for 2D matrices; change the default branch to call
t.transpose(-2, -1) (or tensor.transpose) to swap the last two dimensions for
tensors with ndim>=2, leaving scalars/1D untouched, and keep the existing
t.permute(*dims) behavior when dims are provided; update the _op in Transpose
accordingly.
69db614 to 7fdd092 (Compare)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
llm/core/pypto_executor.py (1)
329-368: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift
The fused decode weights add a second full resident copy of the model layers.
_stack_decode_weights() concatenates every per-layer tensor into a new set of buffers, but compiled.layers keeps the original per-layer copies alive for prefill. For a 14B model, that is another model-sized allocation at init time and is very likely to push CPU RAM over the edge. Please derive prefill views from the stacked buffers, or free the per-layer copies after creating equivalent slice views.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@llm/core/pypto_executor.py` around lines 329 - 368, The current flow in _stack_decode_weights plus returning layers in _CompiledKernels keeps two full copies of per-layer tensors; after you build the stacked decode buffers in _stack_decode_weights, replace per-layer tensors used for prefill with slice views into those stacked buffers (e.g., create per-layer tensors by slicing/narrowing the stacked tensors) and store those views in the compiled.layers structure (or free the original per-layer buffers immediately after creating the views); update the return path that currently passes layers into _CompiledKernels so it carries these view-backed _KernelLayerWeights (or call _release_layer_weights after creating views) to avoid retaining a second full resident copy for models like 14B.
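A minimal sketch of the "views instead of copies" idea in plain torch; the names stack_and_view and per_layer_wq are illustrative and do not mirror the executor's actual _stack_decode_weights/_KernelLayerWeights code:

import torch

def stack_and_view(per_layer_wq: list[torch.Tensor]) -> tuple[torch.Tensor, list[torch.Tensor]]:
    # One contiguous buffer holding every layer's weight for the fused decode kernel.
    stacked = torch.stack(per_layer_wq, dim=0)
    # Per-layer views into the stacked buffer; prefill can use these instead of
    # keeping its own per-layer copies, so only one model-sized buffer stays resident.
    views = [stacked[i] for i in range(stacked.shape[0])]
    return stacked, views

torch.stack materializes the single stacked buffer; once the original per-layer list is dropped, the views share storage with the stacked tensor and no second copy survives.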
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@llm/core/pypto_executor.py`:
- Around line 19-20: The module currently leaves _TIMING_ENABLED = True which
causes noisy stdout prints on every prefill; change the default to False and
make timing opt-in: expose a configuration flag (e.g., module-level env var read
or a constructor/initializer parameter like timing_enabled on the main executor
class) so callers can enable timing explicitly, and update the code paths
referenced around lines 160-198 (the prefill/timing instrumentation) to consult
that flag. While doing so, replace direct stdout prints and flushes with a
conditional use of a logger or return the timing data to the caller when
enabled, and remove any unconditional sys.stdout.flush() calls so the hot path
is not forced to flush when timing is disabled. Ensure the symbol
_TIMING_ENABLED is used only as a default and can be overridden by the new
parameter (e.g., timing_enabled) in the executor's constructor or factory.
In `@llm/testing/hf_compare/comparator.py`:
- Around line 38-40: ref_f and tgt_f are only cast for dtype via
to(selector.cast_to) but may still live on different devices causing the
subtraction to crash; detach them, move both tensors to a single common device
(e.g., CPU) before reshaping and computing diff, e.g. perform
detach().to(device=common_device, dtype=selector.cast_to).reshape(-1) for both
ref and tgt (use a consistent device like torch.device('cpu') or .cpu()), then
compute diff = (ref_f - tgt_f).abs(); update the code around ref_f, tgt_f and
diff accordingly.
In `@models/qwen3/14b/qwen3_14b_decode_full.py`:
- Around line 145-176: The code allocates tensors (current_hidden, q_proj,
k_proj, v_proj, q_proj_norm, k_proj_norm, attn_out, next_hidden) with size
[batch,...] but writes into them using loops over batch_padded, causing
out-of-bounds when batch is not BATCH_TILE-aligned; fix by changing the
pl.create_tensor calls for these symbols to use batch_padded instead of batch
(e.g., pl.create_tensor([batch_padded, ...], ...)) so allocated buffers match
the assemble/write loops, or alternatively add an explicit validation early
(before the parallel loop) that enforces batch % BATCH_TILE == 0 and
error/adjust accordingly; update all occurrences in the shown block
(current_hidden creation and the
q_proj/k_proj/v_proj/q_proj_norm/k_proj_norm/attn_out/next_hidden creations) to
be consistent.
---
Outside diff comments:
In `@llm/core/pypto_executor.py`:
- Around line 329-368: The current flow in _stack_decode_weights plus returning
layers in _CompiledKernels keeps two full copies of per-layer tensors; after you
build the stacked decode buffers in _stack_decode_weights, replace per-layer
tensors used for prefill with slice views into those stacked buffers (e.g.,
create per-layer tensors by slicing/narrowing the stacked tensors) and store
those views in the compiled.layers structure (or free the original per-layer
buffers immediately after creating the views); update the return path that
currently passes layers into _CompiledKernels so it carries these view-backed
_KernelLayerWeights (or call _release_layer_weights after creating views) to
avoid retaining a second full resident copy for models like 14B.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8b1333ef-8341-4b3e-a2d0-6f69bf5ac766
📒 Files selected for processing (23)
- llm/core/engine.py
- llm/core/kv_cache.py
- llm/core/pypto_executor.py
- llm/examples/qwen3_14b_npu_generate.py
- llm/model/qwen3_14b_decode.py
- llm/testing/__init__.py
- llm/testing/hf_compare/__init__.py
- llm/testing/hf_compare/__main__.py
- llm/testing/hf_compare/base.py
- llm/testing/hf_compare/cases/__init__.py
- llm/testing/hf_compare/cases/qwen3_14b_decode.py
- llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
- llm/testing/hf_compare/cases/qwen3_14b_e2e.py
- llm/testing/hf_compare/cases/qwen3_14b_prefill.py
- llm/testing/hf_compare/comparator.py
- llm/testing/hf_compare/input_sampler.py
- llm/testing/hf_compare/reference.py
- llm/testing/hf_compare/runner.py
- llm/testing/hf_compare/target.py
- llm/testing/hf_compare/weight_adapter.py
- models/qwen3/14b/qwen3_14b_decode_full.py
- tests/llm/testing/test_hf_compare_batch_config.py
- tests/llm/testing/test_hf_compare_e2e.py
✅ Files skipped from review due to trivial changes (7)
- llm/testing/__init__.py
- llm/model/qwen3_14b_decode.py
- llm/testing/hf_compare/__main__.py
- tests/llm/testing/test_hf_compare_batch_config.py
- llm/testing/hf_compare/cases/__init__.py
- llm/testing/hf_compare/input_sampler.py
- llm/testing/hf_compare/target.py
🚧 Files skipped from review as they are similar to previous changes (7)
- llm/testing/hf_compare/runner.py
- llm/testing/hf_compare/__init__.py
- llm/core/kv_cache.py
- llm/testing/hf_compare/reference.py
- llm/testing/hf_compare/cases/qwen3_14b_decode_full.py
- llm/testing/hf_compare/cases/qwen3_14b_prefill.py
- llm/testing/hf_compare/base.py
_TIMING_ENABLED = True
Don't leave prefill timing permanently enabled in the core executor.
This now prints to stdout on every prefill and forces a flush in the hot path. That is noisy for library consumers, contaminates benchmark output, and adds avoidable latency unless profiling was explicitly requested.
Also applies to: 160-198
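A hedged sketch of the opt-in pattern; the env var PYPTO_PREFILL_TIMING, the timing_enabled parameter, and the Executor class name are illustrative, not the module's actual API:

import logging
import os
import time

logger = logging.getLogger(__name__)

# Off by default; callers opt in via an env var or a constructor argument.
_TIMING_ENABLED = os.environ.get("PYPTO_PREFILL_TIMING", "0") == "1"

class Executor:
    def __init__(self, timing_enabled: bool = _TIMING_ENABLED) -> None:
        self.timing_enabled = timing_enabled

    def prefill(self, *args, **kwargs):
        start = time.perf_counter() if self.timing_enabled else None
        result = self._run_prefill(*args, **kwargs)
        if start is not None:
            # Logged rather than printed, so library consumers control verbosity.
            logger.info("prefill took %.3f ms", (time.perf_counter() - start) * 1e3)
        return result

    def _run_prefill(self, *args, **kwargs):
        raise NotImplementedError  # stands in for the real kernel dispatch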
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/core/pypto_executor.py` around lines 19 - 20, The module currently leaves
_TIMING_ENABLED = True which causes noisy stdout prints on every prefill; change
the default to False and make timing opt-in: expose a configuration flag (e.g.,
module-level env var read or a constructor/initializer parameter like
timing_enabled on the main executor class) so callers can enable timing
explicitly, and update the code paths referenced around lines 160-198 (the
prefill/timing instrumentation) to consult that flag. While doing so, replace
direct stdout prints and flushes with a conditional use of a logger or return
the timing data to the caller when enabled, and remove any unconditional
sys.stdout.flush() calls so the hot path is not forced to flush when timing is
disabled. Ensure the symbol _TIMING_ENABLED is used only as a default and can be
overridden by the new parameter (e.g., timing_enabled) in the executor's
constructor or factory.
ref_f = ref.detach().to(selector.cast_to).reshape(-1)
tgt_f = tgt.detach().to(selector.cast_to).reshape(-1)
diff = (ref_f - tgt_f).abs()
Normalize tensor devices before subtraction to avoid runtime mismatch
to(selector.cast_to) only changes dtype; it does not guarantee a common device. If ref and tgt are produced on different devices, subtraction crashes at runtime. Move both to a single device (typically CPU for test comparators) before computing diff.
Proposed fix

-    ref_f = ref.detach().to(selector.cast_to).reshape(-1)
-    tgt_f = tgt.detach().to(selector.cast_to).reshape(-1)
+    ref_f = ref.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
+    tgt_f = tgt.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
     diff = (ref_f - tgt_f).abs()

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
ref_f = ref.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
tgt_f = tgt.detach().to(device="cpu", dtype=selector.cast_to).reshape(-1)
diff = (ref_f - tgt_f).abs()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@llm/testing/hf_compare/comparator.py` around lines 38 - 40, ref_f and tgt_f
are only cast for dtype via to(selector.cast_to) but may still live on different
devices causing the subtraction to crash; detach them, move both tensors to a
single common device (e.g., CPU) before reshaping and computing diff, e.g.
perform detach().to(device=common_device, dtype=selector.cast_to).reshape(-1)
for both ref and tgt (use a consistent device like torch.device('cpu') or
.cpu()), then compute diff = (ref_f - tgt_f).abs(); update the code around
ref_f, tgt_f and diff accordingly.
user_batch = pl.tensor.dim(hidden_states, 0)
batch_padded = ((user_batch + BATCH_TILE - 1) // BATCH_TILE) * BATCH_TILE

current_hidden = pl.create_tensor([batch, hidden], dtype=pl.BF16)
for b0 in pl.parallel(0, batch_padded, BATCH_TILE):
    cur_valid = pl.min(BATCH_TILE, user_batch - b0)
    with pl.at(level=pl.Level.CORE_GROUP):
        for kb in pl.range(hidden_blocks):
            k0 = kb * K_CHUNK
            hidden_chunk = pl.slice(
                hidden_states,
                [BATCH_TILE, K_CHUNK],
                [b0, k0],
                valid_shape=[cur_valid, K_CHUNK],
            )
            current_hidden = pl.assemble(current_hidden, hidden_chunk, [b0, k0])

for layer_idx in pl.range(num_layers):
    layer_hidden_base = layer_idx * hidden
    layer_inter_base = layer_idx * inter
    layer_cache_base = layer_idx * layer_cache_rows
    next_hidden = pl.create_tensor([batch, hidden], dtype=pl.BF16)

    # Intermediate FP32 tensors between scope 1 and scope 2.
    # Allocated at runtime batch_padded; pl.create_tensor zero-inits
    # so trailing (batch_padded - user_batch) padded rows are 0,
    # which is the invariant relied on by Q/K-norm and scope-3.
    q_proj = pl.create_tensor([batch, hidden], dtype=pl.FP32)
    k_proj = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)
    v_proj = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)
    q_proj_norm = pl.create_tensor([batch, hidden], dtype=pl.FP32)
    k_proj_norm = pl.create_tensor([batch, kv_hidden], dtype=pl.FP32)
🧩 Analysis chain
🏁 Scripts executed against hw-native-sys/pypto-lib:
fd -type f -name "*qwen3_14b_decode_full.py"
fd "qwen3_14b_decode_full.py"
wc -l models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '145,176p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '1,50p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '50,145p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '176,250p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '100,145p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '250,350p' models/qwen3/14b/qwen3_14b_decode_full.py
sed -n '260,300p' models/qwen3/14b/qwen3_14b_decode_full.py
rg "batch_padded" models/qwen3/14b/qwen3_14b_decode_full.py -A 2 -B 2
rg "pl\.create_tensor.*batch" models/qwen3/14b/qwen3_14b_decode_full.py -B 2 -A 1
sed -n '160,200p' models/qwen3/14b/qwen3_14b_decode_full.py | cat -n
sed -n '1000,1061p' models/qwen3/14b/qwen3_14b_decode_full.py
rg "attn_out\|all_q_padded" models/qwen3/14b/qwen3_14b_decode_full.py -B 2 -A 3
rg "attn_out" models/qwen3/14b/qwen3_14b_decode_full.py -B 1 -A 1
Buffer allocation mismatch between dynamic padding and static allocation.
The tensors current_hidden, q_proj, k_proj, v_proj, q_proj_norm, k_proj_norm, attn_out, and next_hidden are allocated with the batch parameter, but the loops that write to them iterate over batch_padded (rounded up to BATCH_TILE). If the host passes batch=1 or any non-BATCH_TILE-aligned value, the loops can assemble data past the allocated buffer bounds. Either allocate these tensors using batch_padded instead of batch, or enforce that batch must be a multiple of BATCH_TILE.
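A sketch of the first option using the kernel's own pl.create_tensor calls, assuming pl.create_tensor accepts the runtime-derived batch_padded dimension the same way the assemble loops already do:

batch_padded = ((user_batch + BATCH_TILE - 1) // BATCH_TILE) * BATCH_TILE

# Allocate every intermediate that the BATCH_TILE-strided loops write into
# with the padded batch, so the assembles can never run past the buffer.
current_hidden = pl.create_tensor([batch_padded, hidden], dtype=pl.BF16)
q_proj = pl.create_tensor([batch_padded, hidden], dtype=pl.FP32)
k_proj = pl.create_tensor([batch_padded, kv_hidden], dtype=pl.FP32)
v_proj = pl.create_tensor([batch_padded, kv_hidden], dtype=pl.FP32)
q_proj_norm = pl.create_tensor([batch_padded, hidden], dtype=pl.FP32)
k_proj_norm = pl.create_tensor([batch_padded, kv_hidden], dtype=pl.FP32)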
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@models/qwen3/14b/qwen3_14b_decode_full.py` around lines 145 - 176, The code
allocates tensors (current_hidden, q_proj, k_proj, v_proj, q_proj_norm,
k_proj_norm, attn_out, next_hidden) with size [batch,...] but writes into them
using loops over batch_padded, causing out-of-bounds when batch is not
BATCH_TILE-aligned; fix by changing the pl.create_tensor calls for these symbols
to use batch_padded instead of batch (e.g., pl.create_tensor([batch_padded,
...], ...)) so allocated buffers match the assemble/write loops, or
alternatively add an explicit validation early (before the parallel loop) that
enforces batch % BATCH_TILE == 0 and error/adjust accordingly; update all
occurrences in the shown block (current_hidden creation and the
q_proj/k_proj/v_proj/q_proj_norm/k_proj_norm/attn_out/next_hidden creations) to
be consistent.
0fa3e0d to 5e14bd0 (Compare)
Introduces a HuggingFace comparison framework under llm/testing/hf_compare/ that validates pypto kernels against HF reference implementations. Supports prefill, decode, and end-to-end test cases with batch configuration and weight adapter utilities.
Stacks weights across num_layers and compares the fused pypto kernel output against num_layers HF Qwen3DecoderLayer instances in sequence.
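A rough sketch of that comparison shape in plain torch; the names stack_weights and reference_forward are illustrative, while the actual case wires up real Qwen3DecoderLayer modules and the compiled fused kernel:

import torch

def stack_weights(per_layer_state_dicts: list[dict[str, torch.Tensor]], key: str) -> torch.Tensor:
    # e.g. key = "self_attn.q_proj.weight" -> [num_layers, out_features, in_features]
    return torch.stack([sd[key] for sd in per_layer_state_dicts], dim=0)

def reference_forward(layer_fns, hidden_states: torch.Tensor) -> torch.Tensor:
    # One callable per HF decoder layer, applied in sequence as the reference path.
    out = hidden_states
    for fn in layer_fns:
        out = fn(out)
    return out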
- Extract shared constants, weight adapter, and kernel loader into cases/_qwen3_14b_common.py so decode/decode_full/prefill cases reuse one source of truth (-391/+150 across the three cases).
- Add cases/qwen3_14b_e2e_fused.py: end-to-end comparison driven by the production LLMEngine + PyptoQwen14BExecutor (fused multi-layer decode kernel) against a stock HF Qwen3ForCausalLM run.
- Add hf_compare/paged_kv.py: shared paged-attention KV-cache helpers (block-table init, slot mapping, decode/prefill KV gather, paged->dense stitching) used by fused/per-layer cases; the stitching idea is sketched below.
- Fix qwen3_14b_e2e.py kernel load paths to point at models/qwen3/14b/ instead of the legacy examples/models/ location.
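For the paged->dense stitching in particular, a hedged sketch of the idea; the real helpers in hf_compare/paged_kv.py may use different names and cache layouts:

import torch

def paged_to_dense(kv_cache: torch.Tensor, block_table: torch.Tensor, seq_len: int) -> torch.Tensor:
    # kv_cache: [num_blocks, block_size, num_kv_heads, head_dim]
    # block_table: physical block ids for one sequence, in logical order
    blocks = kv_cache[block_table]                    # gather this sequence's blocks
    dense = blocks.reshape(-1, *kv_cache.shape[2:])   # [blocks * block_size, heads, dim]
    return dense[:seq_len]                            # trim padding in the last block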
f2dded0 to 311cbfd (Compare)
Summary
- llm/testing/hf_compare/: framework for validating pypto kernels against HF reference implementations
- tests/llm/testing/: tests covering the framework
Test plan
python -m pytest tests/llm/testing/
python -m llm.testing.hf_compare run qwen3_14b.e2e -k hf_model_path=<model> -k cpu_only=true

🤖 Generated with Claude Code