Q4_K/Q6_K-direct CPU prefill + no-shims weights-immutability + kquant de-dup #171
Merged
…try_tag Dedupe the hardcoded tag→QuantFormat match arms in resolve_attn_weights / resolve_ffn_weights against the single source of truth (QuantFormat::from_registry_tag), keeping each surface's supported-subset guard. Behaviour-equivalent; orthogonal to BitNet.
…ward on it Add the int8-activation (A8) ternary matvec — scalar sign-select plus a bit-identical NEON kernel (aarch64) — and wire the larql-inference BitNet forward onto it. Q/K/V and gate/up quantise the shared activation once. Parity-tested vs the f32 reference; validated e2e on microsoft/bitnet-b1.58-2B-4T. x86_64 runs scalar A8; AVX2 twin is follow-up.
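For readers who have not seen the A8 ternary path, here is a minimal sketch of the scalar sign-select idea: quantise the activation once, then add, subtract, or skip each int8 value according to the ternary weight. The function names, the unpacked `i8` weight layout, and the absmax scaling are assumptions for illustration, not the larql kernel.

```rust
/// Quantise an f32 activation vector to int8 with one absmax scale.
/// Returns the int8 values and the scale that maps them back to f32.
/// (Hypothetical helper; the real A8 quantiser may differ.)
pub fn quantize_a8(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        return (vec![0; x.len()], 1.0);
    }
    let scale = absmax / 127.0;
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Ternary matvec over an int8 activation.
/// `weights` is row-major [rows, cols] with entries in {-1, 0, +1}
/// (unpacked here for clarity); `row_scales` has one f32 per output channel.
pub fn ternary_matvec_a8(
    weights: &[i8],
    row_scales: &[f32],
    rows: usize,
    cols: usize,
    act_q: &[i8],
    act_scale: f32,
) -> Vec<f32> {
    assert_eq!(weights.len(), rows * cols);
    assert_eq!(row_scales.len(), rows);
    assert_eq!(act_q.len(), cols);
    (0..rows)
        .map(|r| {
            let row = &weights[r * cols..(r + 1) * cols];
            let mut acc: i32 = 0;
            for (&w, &a) in row.iter().zip(act_q) {
                // Sign-select: no multiply, just add, subtract, or skip.
                match w {
                    1 => acc += a as i32,
                    -1 => acc -= a as i32,
                    _ => {}
                }
            }
            // Undo both quantisation scales on the way out.
            acc as f32 * act_scale * row_scales[r]
        })
        .collect()
}
```

Because `quantize_a8` is separate from the matvec, projections that share an input (Q/K/V, or gate/up) can call it once and reuse `act_q` and `act_scale`.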
Route bitnet-* model_type to a thin named BitnetArch instead of silently collapsing to GenericArch; norm_eps honoured from config. Prevents silent generic-config degradation and gives BitNet a home for first-class overrides. Covered by test_detect_bitnet_is_explicit_not_generic.
… (G1) Make the BitNet ternary kernel reachable through the backend registry, not only by direct call: add QuantFormat::I2S (+ registry_tag/from_registry_tag round-trip, is_ternary), a dedicated QuantMatVec::ternary_matvec method (BitLinearWeight carries the per-channel scales the &[u8] quant_matvec signature can't), and a CpuBackend impl on the best-available A8 kernel. quant_matvec returns None for I2S (loud, like Q8_0); Metal panics (no ternary shader). Foundation for vindex tag unification (G3).
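A small sketch of the single-source-of-truth pattern this commit and the earlier de-dup rely on: tag strings live in one place, the reverse lookup is derived from it, and each surface keeps its own supported-subset guard. The tag strings and `resolve_ffn_format` are hypothetical; only the `registry_tag` / `from_registry_tag` / `is_ternary` names come from the commit message.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum QuantFormat {
    Q4K,
    Q6K,
    I2S,
}

impl QuantFormat {
    pub const ALL: [QuantFormat; 3] = [QuantFormat::Q4K, QuantFormat::Q6K, QuantFormat::I2S];

    /// The only place tag strings are spelled out (strings are assumed).
    pub fn registry_tag(self) -> &'static str {
        match self {
            QuantFormat::Q4K => "q4_k",
            QuantFormat::Q6K => "q6_k",
            QuantFormat::I2S => "i2_s",
        }
    }

    /// Reverse lookup derived from `registry_tag`, so the two cannot drift.
    pub fn from_registry_tag(tag: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|f| f.registry_tag() == tag)
    }

    pub fn is_ternary(self) -> bool {
        matches!(self, QuantFormat::I2S)
    }
}

/// Hypothetical per-surface guard: parse via the shared lookup, then
/// reject formats this surface cannot consume (here: ternary).
pub fn resolve_ffn_format(tag: &str) -> Result<QuantFormat, String> {
    match QuantFormat::from_registry_tag(tag) {
        Some(f) if !f.is_ternary() => Ok(f),
        Some(f) => Err(format!("format {f:?} is not supported on this surface")),
        None => Err(format!("unknown quant tag {tag:?}")),
    }
}
```

Keeping the guard outside the enum lets each resolver reject a different subset without re-listing the tag strings.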
… falsified) Close contact with the code revised the BitNet graduation plan: G1 (QuantFormat ternary + dispatch) landed; G2 (first-class KvEngine) is blocked on KvEngine's dense &ModelWeights signature — needs a breaking trait change, not a type-lie; G3's fold-into-one-registry goal is falsified (bitnet vindexes are mixed dense+ternary, quant:None is correct, two-field design is right); G4 (AVX2 twin) is blocked on an x86 build/test box (can't compile-check the C-FFI crate cross from aarch64, can't validate SIMD).
…productize) Record the staged plan after choosing to make BitNet a served path: P-A wire larql run + server to the ternary path (no trait change, start here); P-B first-class KvEngine (8 engines + ~171 call sites; trait-design fork B1/B2/B3); P-C AVX2 twin on x86 CI.
Detect config.bitnet_layout at run() and route to the ternary forward (load_bitnet_model + generate_streaming_bitnet) instead of the dense engine dispatch, bypassing the walk_cmd path the dense run delegates to. Greedy streaming generation + a simple stdin chat REPL; EOS via EosConfig::from_vindex_dir (same as the dense path). MVP: raw prompt encode (no chat-template) + greedy sampling. Compile-checked + clippy-clean; needs a --keep-quant BitNet vindex to smoke-test. Server route wiring + chat-template parity are the remaining P-A increments.
… caveat Smoke test against ~/larql-vindex/bitnet-2b.vindex passes (Paris, deterministic) → P-A CLI behaviour-verified, output captured as the P-B oracle. B1 locked for P-B; read-only check done — dense/resident paths are read-only but prefill_quant/decode_step_quant take &mut ModelWeights (resident-quant memoization), so B1 bundles a sub-decision: relocate that cache to engine state (rec) or interior-mutability.
…attn_tensors_dequantised Real-code scope of clean B1a: &mut chokepoint is ensure_attn_tensors_dequantised (vindex/dequant.rs:35), which memoizes dequantised Q4K Q/K/V/O into weights.tensors (derivative state). Stages: P-B.1 relocate that cache to engine state + resolver through the forward read path (drops &mut); P-B.2 Arc-owned weights (~171 sites); P-B.3 BitnetEngine + dispatch; P-B.4 validate vs oracle. Run in a worktree.
…), not RwLock-in-ModelWeights Server serializes every generation behind an exclusive write lock (state.rs:186 lock_weights_for_gen, all OpenAI gen routes) specifically because the dequant cache mutates weights. An interior-mut RwLock field can't lift that serialization (transient evicting scratch races across shared-Arc forwards) and taxes the dense path per-resolve. Engine-owned scratch makes ModelWeights truly immutable → Arc shared across concurrent generations. Provisional RwLock impl tried + reverted before the read-site/trait sweep could cement it.
…ch) — P-B.1 foundation The correct-shaped scratch resolver for P-B.1: DequantScratch is engine-owned per-forward state (keyed by the same tensor names as ModelWeights::tensors); WeightsView bundles &ModelWeights + &DequantScratch, Derefs to ModelWeights, and tensor() resolves scratch-first-then-canonical returning a borrow (no lock, no clone — dense path pays nothing). Foundation only: nothing threads it yet. Next: relocate the ~13 bulk-path inserts + per-layer insert/evict to the engine scratch, thread WeightsView through the shared quant forward to the 19 read sites, drop &mut from prefill_quant/decode_step_quant; validate vs dense Q4K parity + an evict-identity test (insert L then evict L → gone).
…opy + dense()/with_scratch()/From The shared forward is reached by non-quant callers too, so the resolver must serve them without manufacturing an empty map: scratch is now Option, WeightsView::dense terminates the threading cascade at non-quant call sites, with_scratch carries the engine cache on the quant path. Copy + two words so threading by value is free.
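To make the resolver shape concrete, a minimal sketch of a scratch-first-then-canonical view follows. The type and method names mirror the commit messages; the field layout, the `Vec<f32>` tensor type, and `num_layers` are simplifying assumptions rather than the real definitions.

```rust
use std::collections::HashMap;
use std::ops::Deref;

/// Canonical, immutable weights (tensor type simplified to Vec<f32>).
pub struct ModelWeights {
    pub tensors: HashMap<String, Vec<f32>>,
    pub num_layers: usize,
}

/// Engine-owned dequant cache, keyed by the same tensor names.
#[derive(Default)]
pub struct DequantScratch {
    tensors: HashMap<String, Vec<f32>>,
}

impl DequantScratch {
    pub fn insert(&mut self, name: &str, data: Vec<f32>) {
        self.tensors.insert(name.to_string(), data);
    }
    pub fn evict(&mut self, name: &str) {
        self.tensors.remove(name);
    }
}

/// Two references, so passing it by value is free.
#[derive(Clone, Copy)]
pub struct WeightsView<'a> {
    weights: &'a ModelWeights,
    scratch: Option<&'a DequantScratch>,
}

impl<'a> WeightsView<'a> {
    /// For non-quant callers: no scratch, no empty map to build.
    pub fn dense(weights: &'a ModelWeights) -> Self {
        Self { weights, scratch: None }
    }
    /// For the quant path: carries the engine's cache.
    pub fn with_scratch(weights: &'a ModelWeights, scratch: &'a DequantScratch) -> Self {
        Self { weights, scratch: Some(scratch) }
    }
    /// Scratch first, then canonical; returns a borrow (no lock, no clone).
    /// A miss is `None`, which callers can turn into a loud error.
    pub fn tensor(&self, name: &str) -> Option<&'a [f32]> {
        let (weights, scratch) = (self.weights, self.scratch);
        scratch
            .and_then(|s| s.tensors.get(name))
            .or_else(|| weights.tensors.get(name))
            .map(|v| v.as_slice())
    }
}

impl<'a> Deref for WeightsView<'a> {
    type Target = ModelWeights;
    fn deref(&self) -> &ModelWeights {
        self.weights
    }
}

impl<'a> From<&'a ModelWeights> for WeightsView<'a> {
    fn from(weights: &'a ModelWeights) -> Self {
        Self::dense(weights)
    }
}
```

In this sketch a dense view over weights that hold no dequantised keys returns `None` for them, which is the behaviour the later "loud-break" commits depend on.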
…allers wrap dense() — P-B.1 Stage 1 Stage 1 of the engine-owned dequant relocation: the attention-side scratch reader (run_attention_with_kv_backend, in larql-compute) now takes a WeightsView and resolves Q/K/V/O via .tensor(); all ~22 callers across larql-compute/inference/kv wrap with WeightsView::dense(weights). Behavior-identical (dense view + .tensor() resolves to weights.tensors exactly as before, and the inserter still populates weights.tensors) — the workspace-spanning signature change lands decoupled from any behavior change. run_ffn unchanged (reads only canonical norms; the FFN-side scratch readers are FfnBackend impls, deferred to Stage 2). Guards: run_attention full-recompute/no-backend parity, 50 kquant_forward tests, dequant parity — all green; workspace --all-targets green; clippy clean. Stage 2 relocates the scratch off weights.tensors into a forward-local DequantScratch and flips the ~18 quant sites dense()→with_scratch.
…Ffn wrap dense() internally — P-B.1 Stage 2a FFN-side analog of Stage 1, and the reason the 326 WeightFfn construction sites stay untouched: dense_ffn_forward / dense_ffn_forward_backend now take a WeightsView and resolve gate/up/down via .tensor(); WeightFfn and BackendFfn wrap WeightsView::dense(self.weights) *internally*, so their ~326 call sites are unaffected. Behavior-identical (dense view → canonical resolve); 15 FFN-weight + 38 sparse tests pass incl. backend-matches-no-backend parity; workspace --all-targets green; clippy clean. Sets up Stage 2b: the ~5-10 quant loops construct a scratch-aware FFN (with_scratch view) so the dequantised FFN reads resolve from the engine scratch.
…n + entry conditions Stage 1 (run_attention→WeightsView) + Stage 2a (dense_ffn_forward→WeightsView) committed behavior-identical. Stage 2b (relocation) reverted to green on a CATEGORICAL distinction: the silent-break reader set is type-system-invisible (Deref reads canonical, wrong only on decode under a real Q4K vindex), unlike the prior 4 compiler-visible cost escalations. Closure is making misses LOUD (leave canonical empty of dequant keys → None → existing panic/?-bail fires on first decode), not grep enumeration (inventory is current, not complete). Entry conditions met: qwen3-0.6b-q4k.vindex exists; capture a multi-token DECODE oracle at Stage 2a as the regression spine.
…htsView — P-B.1 Stage 2b-pre Converts the third (and final) quant-path scratch reader — the DECODE attention reader Stage 1 missed (it only did the prefill reader) — to take WeightsView and resolve attn Q/K/V/O via .tensor(); ~14 callers across compute/inference/kv wrap dense(). Behavior-identical: 18 decode-attention + 50 kquant_forward tests pass, clippy clean, workspace --all-targets green. With all three readers (prefill attn, dense FFN, decode attn) now resolving through the view, the relocation can flip dense()→with_scratch at the quant loops without a silent-break — every reader they reach consults the scratch. Sets up the relocation proper (inserters→scratch, ViewFfn, loops, drop &mut), guarded by the captured Q4K decode oracle + loud-break (canonical empty of dequant keys → None → panic on miss).
…finding + loud-break safety All 3 primary quant-path readers converted to WeightsView (Stage 1/2a/2b-pre, committed behavior-identical) + Q4K decode oracle captured. Relocation reverted to green on the reader-set expanding again: hidden.rs/interventions.rs reach attention via run_layer_with_ffn → run_attention_inner/with_kv_cache → run_attention_block_core/block_gpu — un-converted readers the grep missed. True scope = convert the whole attention-reader family, each a Stage-1-style cascade. Loud-break (canonical empty → None → panic) makes the remaining conversions safe to do incrementally against the oracle.
…sView — P-B.1 Stage 2b-pre2 The remaining quant-path attention readers (run_attention_block_core, run_attention_block_gpu) + the run_layer_with_ffn chain (run_attention/run_attention_inner/run_attention_with_kv_cache/run_layer_with_ffn/run_layer_with_capture[_hooked]/run_attention_public + the block.rs family) now take WeightsView and resolve via .tensor(); ~100 callers across compute/inference/kv/cli/server/examples wrap WeightsView::dense (run_ffn/apply_layer_scalar stay &ModelWeights; the un-converted auto_inplace path stays &ModelWeights). Behavior-identical: 92 attention + 10 layer + 50 kquant_forward tests pass incl. full-recompute parity; workspace --all-targets green; clippy clean. With the ENTIRE reader family now resolving through the view, the relocation's loops can flip dense()→with_scratch with no silent-break — every reader they reach consults the scratch. Oracle confirmation pending the release rebuild.
…1 Stage 2b The production Q4K decode path (predict_kquant_prefill/decode_step + hidden + interventions) now dequantises attn/FFN into a forward-local DequantScratch and resolves it via WeightsView::with_scratch + ViewFfn — weights stay &ModelWeights (immutable, Arc-able) on the decode path, no more &mut. Inserters (insert_q4k_layer_tensors/remove_layer_tensors/ensure_attn_tensors_dequantised) take &mut DequantScratch + &ModelWeights. Bulk f32-fallback + dev/research drivers (KvEngine prefill_quant/decode_step_quant trait defaults, all larql-kv quant-engine overrides, apollo, the ov_rd CLI tooling, the lql relation resolver, the vision/image CLI path, examples) keep their in-weights behaviour via *_resident shims (dequant into a scratch, then merge into weights.tensors). Their &mut-drop needs engine-owned scratch state — a documented follow-up; loud-break (canonical empty → None → panic) guards any reader still on canonical. Validated: workspace --all-targets green; clippy clean (0 warnings); 50 kquant_forward + 13 dequant + resident_identity tests pass; decode against qwen3-0.6b-q4k BYTE-IDENTICAL to the Stage-2a oracle (bench/oracles/q4k_qwen3_history_of_computing.txt).
…nt relocated, oracle byte-identical
…inding (WIP stashed)
…tch, weights immutable Completes the P-B.1 relocation with ZERO weights.tensors merge shims on the engine/serving path. Each KvEngine (standard, no_cache, markov_residual, markov_residual_codec, boundary_per_layer, boundary_kv, turbo_quant, unlimited_context, apollo) owns a dequant_scratch field; prefill_quant/decode_step_quant dequantise into it and the forward resolves attention/FFN through WeightsView::with_scratch — no &mut ModelWeights, no weights.tensors mutation. Converted to WeightsView across the dispatch + engine layers: KvDispatch trait (5 methods) + cpu/metal/async impls + the 7 kv_*_via_dispatch helpers; coarse_prefill/coarse_decode_step drop &mut (delegate to the relocated predict_kquant_*); the SECOND kquant_forward (larql-compute's, the real oracle/coarse path) relocated to forward-local scratch + ViewFfn; LayerExecutor trait + local_walk; every engine's walk/compute/executor/dispatch/cold_tier modules; recompute_kv + attn_kv_projection_weights; run_attention_block_decode_step_auto/_auto_inplace; kv_prefill_run; forward_raw_logits/forward_from_layer. RetrievalEngine quant defaults → loud error (apollo overrides). The *_resident helpers remain for ~58 dev/research call sites (ov_rd CLI, lql resolver, vision CLI, examples) that own their weights and run one-off forwards. Validated: workspace --all-targets green, clippy 0 warnings, 766 larql-kv + 40 kquant + resident_identity + 4 dispatch_parity (cross-engine bit-parity) tests pass, decode BYTE-IDENTICAL to the oracle, and markov-rs/unlimited/turbo-quant/no-cache engines smoke-tested coherent at runtime.
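The payoff of engine-owned scratch is that weights can sit behind a plain `Arc` and be shared by concurrent generations without a lock. The sketch below illustrates that under heavy simplification: `Engine`, the byte-to-f32 "dequant", and `run_concurrent` are all hypothetical stand-ins, not the larql-kv types.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Immutable after load; shared by every engine.
pub struct ModelWeights {
    pub quantised: HashMap<String, Vec<u8>>,
}

/// Each engine owns its dequant cache; the weights are never mutated.
pub struct Engine {
    weights: Arc<ModelWeights>,
    dequant_scratch: HashMap<String, Vec<f32>>,
}

impl Engine {
    pub fn new(weights: Arc<ModelWeights>) -> Self {
        Self { weights, dequant_scratch: HashMap::new() }
    }

    /// Toy "decode": dequantise `name` into the engine's own scratch on
    /// first use, then dot it with a constant input. Panics on an unknown
    /// tensor name (a loud miss rather than a silent wrong answer).
    pub fn decode_step(&mut self, name: &str, x: f32) -> f32 {
        let weights = &self.weights;
        let tensor = self
            .dequant_scratch
            .entry(name.to_string())
            .or_insert_with(|| weights.quantised[name].iter().map(|&b| b as f32).collect());
        tensor.iter().map(|w| w * x).sum()
    }
}

/// One thread per input, all sharing the same Arc'd weights.
pub fn run_concurrent(weights: Arc<ModelWeights>, tensor: &str, inputs: &[f32]) -> Vec<f32> {
    let handles: Vec<_> = inputs
        .iter()
        .map(|&x| {
            let w = Arc::clone(&weights);
            let name = tensor.to_string();
            thread::spawn(move || Engine::new(w).decode_step(&name, x))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("generation thread panicked"))
        .collect()
}
```

No `RwLock` or `Mutex` appears anywhere: the only mutable state is per-engine, so the compiler itself rules out the cross-generation races described in the earlier RwLock commit.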
… to 0, oracle byte-identical
…forward internals Thread WeightsView (engine-owned dequant scratch) through every KvEngine's forward/recompute internals — walk.rs/compute.rs/executor.rs/engine.rs across markov_residual, markov_residual_codec, boundary_per_layer, turbo_quant, unlimited_context, no_cache, standard — plus generation.rs cached loops. Weights stay &ModelWeights on the engine/serving path: 0 &mut ModelWeights and 0 weights.tensors.extend in larql-kv engines. Validated: cargo check --workspace --all-targets green, fmt --all --check clean, larql-kv 766 + 4 dispatch_parity (cross-engine bit-parity), larql-inference 1248 (incl. A8 ternary FFN + streaming), larql-compute/larql-models green.
bench_decode_step reused one engine and called decode_step in b.iter without resetting, so the unbounded `standard` engine (window_size: None) appended to its K/V cache every iteration. Over criterion's 25k+ iterations the per-call cost grew linearly, making the measurement non-stationary and iteration-count-dependent: standard swung 12-25us run-to-run, and no-cache (which re-forwards the full context) read ~40us. Measure one decode step on a fresh prefilled engine via iter_batched_ref (setup untimed). standard now reports true single-step latency (~12.8us, matching the bounded window-4/unlimited engines) and no-cache drops to a deterministic ~18us. Bounded engines were already correct.
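The measurement bug is easy to reproduce without criterion. The toy below uses the number of K/V entries attended to as a cost proxy and compares a reused engine against a fresh prefilled engine per iteration, which is the shape criterion's `iter_batched_ref` gives when setup is untimed. `ToyEngine` and both helper functions are hypothetical.

```rust
/// Toy engine whose decode cost is proportional to its K/V length.
pub struct ToyEngine {
    kv_len: usize,
}

impl ToyEngine {
    pub fn prefilled(prompt_len: usize) -> Self {
        Self { kv_len: prompt_len }
    }

    /// Appends one K/V entry and returns how many entries were attended to,
    /// standing in for the time one decode step takes.
    pub fn decode_step(&mut self) -> usize {
        self.kv_len += 1;
        self.kv_len
    }
}

/// Old benchmark shape: one engine reused across every iteration, so the
/// unbounded cache grows and each call costs more than the last.
pub fn cost_reused(prompt_len: usize, iters: usize) -> Vec<usize> {
    let mut engine = ToyEngine::prefilled(prompt_len);
    (0..iters).map(|_| engine.decode_step()).collect()
}

/// Fixed shape: a fresh prefilled engine per iteration (setup not counted),
/// so every sample measures the same single step.
pub fn cost_fresh(prompt_len: usize, iters: usize) -> Vec<usize> {
    (0..iters)
        .map(|_| ToyEngine::prefilled(prompt_len).decode_step())
        .collect()
}
```

With the reused engine the reported mean depends on how many iterations the harness happens to run; with a fresh engine per sample it does not.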
Add q4k_matmul_into: decode each Q4_K super-block to f32 once and FMA it across all seq activation columns (8 independent accumulators so LLVM lowers the reduction to NEON), row-parallel, output transposed to the [seq, rows] contract. Wire CpuBackend::q4k_matmul to it — previously only Metal implemented the trait method, so CPU prefill fell back to dequant-whole-layer + sgemm. Reads the Q4_K weight once instead of seq x (per-position matvec) or 4x the bytes (dequant to f32); beats the per-position matvec loop at seq>=5. Parity-tested vs dequant->matmul and vs q4k_matvec row-for-row. Also: attn_prefill_f32_vs_q4k example now benches the real kernel and takes a seq arg to probe the short-prompt prefill regime.
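A sketch of the amortisation idea: decode each quantised block to f32 once, then reuse the decoded block for every sequence position before moving on. The block format here is a deliberately simple toy (an f32 scale plus 32 four-bit values) and is not the real Q4_K layout; the function names are hypothetical, and the sketch is single-threaded with a plain dot rather than the row-parallel multi-accumulator kernel.

```rust
/// Toy block: 4-byte little-endian f32 scale + 16 bytes of packed nibbles.
/// NOT the Q4_K super-block layout; chosen only to keep the sketch short.
pub const BLOCK_ELEMS: usize = 32;
pub const BLOCK_BYTES: usize = 4 + BLOCK_ELEMS / 2;

/// Decode one toy block: value = (nibble - 8) * scale.
pub fn decode_toy_block(block: &[u8], out: &mut [f32; BLOCK_ELEMS]) {
    let scale = f32::from_le_bytes([block[0], block[1], block[2], block[3]]);
    for (i, &byte) in block[4..BLOCK_BYTES].iter().enumerate() {
        out[2 * i] = ((byte & 0x0F) as f32 - 8.0) * scale;
        out[2 * i + 1] = ((byte >> 4) as f32 - 8.0) * scale;
    }
}

/// out[s * rows + r] = dot(W[r, :], x[s, :]).
/// `w` holds `rows` rows of quantised blocks, `x` is row-major [seq, cols],
/// `out` is row-major [seq, rows].
pub fn toy_quant_matmul_into(
    w: &[u8],
    rows: usize,
    cols: usize,
    x: &[f32],
    seq: usize,
    out: &mut [f32],
) {
    assert_eq!(cols % BLOCK_ELEMS, 0, "cols must be a whole number of blocks");
    let blocks_per_row = cols / BLOCK_ELEMS;
    assert_eq!(w.len(), rows * blocks_per_row * BLOCK_BYTES);
    assert_eq!(x.len(), seq * cols);
    assert_eq!(out.len(), seq * rows);

    out.fill(0.0);
    let mut decoded = [0.0f32; BLOCK_ELEMS];
    for r in 0..rows {
        for b in 0..blocks_per_row {
            let off = (r * blocks_per_row + b) * BLOCK_BYTES;
            // Decode once per block ...
            decode_toy_block(&w[off..off + BLOCK_BYTES], &mut decoded);
            // ... and reuse it for every sequence position.
            for s in 0..seq {
                let xs = &x[s * cols + b * BLOCK_ELEMS..][..BLOCK_ELEMS];
                let dot: f32 = decoded.iter().zip(xs).map(|(wv, a)| wv * a).sum();
                out[s * rows + r] += dot;
            }
        }
    }
}
```

A per-position matvec loop would run the decode `seq` times per block; here the decode count is independent of `seq`, which is why the approach pays off once the prompt is more than a few tokens long.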
Add Q4kMatmulFfn (FfnBackend) that runs gate/up/down straight on the vindex's Q4_K bytes via q4k_matmul, plus insert_q4k_attn_tensors which dequantises only Q/K/V/O. The Q4_K CPU prefill loop (predict_kquant_prefill_with_state) now uses both when the vindex exposes interleaved Q4_K FFN bytes and hidden is a 256-multiple, falling back to the full dequant + dense FFN otherwise. The FFN weights (gate/up/down) are ~4x the attention weights, so skipping their f32 materialisation removes the bulk of prefill's per-layer dequant cost. gemma3-4b Q4_K CPU prefill (standard engine): 2746ms -> 572ms (4.8x) on a 5-token prompt, generation unchanged. Q4kMatmulFfn is parity-tested against dequantising the same bytes + the dense FFN (q4k_matmul_ffn_matches_dequant_dense); the existing kquant prefill tests now exercise the q4k-direct path on the fixture.
The Q4_K CPU forward lived in two near-identical copies — larql-inference's vindex/kquant_forward and larql-compute's substrate copy (ADR-0022). That duplication is why the larql-cpu generate path missed the q4k-direct FFN prefill speedup the standard engine (on larql-compute's copy) already got. Replace the duplicated bodies of predict_kquant_prefill_with_state, predict_kquant_decode_step, and predict_kquant_decode_step_direct_with_state with thin delegations to larql-compute (coerce VectorIndex -> &dyn KvIndex, bridge the per-crate CachedTimings). Public API unchanged; ~150 duplicated lines + 6 now-unused imports removed. Effect: larql-cpu Q4_K prefill 2734ms -> 880ms (now shares the q4k-direct FFN). larql-inference 1248 + larql-kv 766 tests pass. Follow-up: the native decode kernels (attention/ffn_decode_step_native), the fused_* Metal path, and the supports_* predicates are still duplicated and can be delegated the same way.
…gation Finish de-duplicating the Q4_K CPU forward. supports_cached_decode, supports_direct_matvec_decode, fused_prefill, fused_decode_step[_with_state], and attention/ffn_decode_step_native now delegate to larql-compute's substrate copy (coerce VectorIndex -> &dyn KvIndex). cached.rs is now purely the VectorIndex-typed adapter over the substrate (ADR-0022) with no duplicated forward logic; module doc updated to state that role. Removed the test-only matvec_q4k_or_q6k_q8k / layer_supports_direct_matvec helper copies and their unit tests — larql-compute owns those (and tests them). cached.rs: 1640 -> 720 lines. larql-inference 1240 + larql-kv 766 tests pass, workspace clean, 0 warnings.
Q4kMatmulFfn assumed every FFN component was Q4_K, but the default q4k
vindex stores down_proj as Q6_K (210 B/block, not 144). Decoding the
Q6_K bytes through q4k_matmul produced garbage plus a wrong derived
column count, so the q4k-direct FFN prefill emitted nonsense on real
models — gemma3-4b "The capital of France is **Paris**" became
" peregr...Everett Monroe...". The all-Q4_K test fixture masked it.
Dispatch each FFN component on its stored format tag: Q4_K through the
amortised q4k_matmul (reads bytes once, no f32 staging — the prefill
win), Q6_K via dequantize_matrix + dot_proj. gate/up stay Q4_K (fast
path); only the Q6_K down is dequantised. Makes dequantize_matrix /
its module pub(crate) so the FFN backend can reach it.
Adds project_down_dispatches_on_q6k_format regression test (the all-Q4_K
fixture never exercised Q6_K). Real gemma3-4b CPU prefill is correct
again ("Paris") at 1646ms vs the 2746ms full-dequant baseline (1.67x);
the earlier 572ms was the broken path that skipped the down decode.
larql-compute 741 / inference / kv tests pass, fmt clean.
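The "wrong derived column count" symptom follows directly from the block sizes. A minimal sketch, assuming 256 elements per K-quant super-block and the 144-byte and 210-byte block sizes quoted above; `KQuant` and `derive_cols` are hypothetical names for illustration.

```rust
/// Elements per K-quant super-block.
pub const SUPER_BLOCK_ELEMS: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KQuant {
    Q4K,
    Q6K,
}

impl KQuant {
    /// Bytes per 256-element super-block.
    pub fn block_bytes(self) -> usize {
        match self {
            KQuant::Q4K => 144,
            KQuant::Q6K => 210,
        }
    }
}

/// Derive the column count of a [rows, cols] matrix from its byte length,
/// refusing lengths that are not a whole number of blocks per row.
pub fn derive_cols(format: KQuant, byte_len: usize, rows: usize) -> Result<usize, String> {
    if rows == 0 || byte_len % rows != 0 {
        return Err(format!("{byte_len} bytes do not split into {rows} rows"));
    }
    let row_bytes = byte_len / rows;
    let block = format.block_bytes();
    if row_bytes % block != 0 {
        return Err(format!(
            "row of {row_bytes} bytes is not a multiple of the {format:?} block ({block} bytes)"
        ));
    }
    Ok(row_bytes / block * SUPER_BLOCK_ELEMS)
}
```

A 2 x 512 Q6_K matrix occupies 2 * 2 * 210 = 840 bytes; read as Q4_K, each 420-byte row is not a multiple of 144, so an unchecked integer division would yield 2 blocks and silently drop bytes. The length check is only a backstop, though: a row of 5040 bytes is valid under both formats, so the stored format tag still has to drive the dispatch.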
The per-format fix dequantised the Q6_K down_proj to f32 before matmul,
leaving prefill at 1.67x. Add q6k_matmul_into — the Q6_K twin of
q4k_matmul_into — so the down also reads its quantised bytes once and
skips the f32 materialisation. Factor the shared amortised matmul loop
into kquant_matmul_into (decode closure + block_bytes); q4k/q6k differ
only in the per-block decode, addressing the kernel duplication the
review flagged.
Q4kMatmulFfn::matmul now dispatches Q4_K/Q6_K to their kernels and only
falls back to dequant for a format without one (none today). All FFN
projections are dequant-free.
gemma3-4b Q4_K CPU prefill (standard engine): 2746ms full-dequant ->
577ms (4.76x), correct ("Paris") — matching the originally-broken 572ms
but with right output. Decode unchanged.
larql-compute 741 (incl. project_down_dispatches_on_q6k_format) +
inference + kv tests pass, fmt clean.
Project the attention Q/K/V/O straight from the vindex's Q4_K/Q6_K bytes
(Q/K/O are Q4_K, V is Q6_K) via the amortised q4k/q6k matmul, skipping the
insert_q4k_attn_tensors f32 dequant — the last per-layer dequant on the
prefill path. run_attention_with_kv_backend gains an optional `index`: when
Some it projects from bytes, else the f32 view path (all existing callers
pass None). The prefill gates `use_q4k_attn` (attn bytes present, hidden and
q_dim 256-aligned) and skips the attn dequant when both attn and FFN go
direct.
Factor the Q4_K/Q6_K dispatch into shared quant_matmul / quant_proj (reused
by the FFN backend and attention), removing the duplication the review
flagged.
gemma3-4b Q4_K CPU prefill (standard engine): 2746ms full-dequant -> 291ms
(9.4x), correct ("Paris"). vs llama.cpp pp5 the gap is now ~4x (was 39x at
the start). Decode unchanged. larql-compute 741 + inference 1240 + kv 766
tests pass, workspace clean.
Replace the portable multi-accumulator dot in kquant_matmul_into with a
hand-written aarch64 NEON dot (dot_256_f32): 8 float32x4 accumulators +
vfmaq_f32, enough independent accumulators to hide the M-series ~4-cycle
FMA latency. Scalar 8-accumulator fallback for other targets + as the
NEON parity oracle. q4k/q6k share it via the generic kquant_matmul_into.
At seq=5 the q4k_matmul now BEATS f32 AMX sgemm (mm/f32 1.0-1.3x, was
0.4-0.7x) while still skipping the dequant. gemma3-4b Q4_K CPU prefill
(standard engine): 291ms -> 233ms (11.8x over the 2746ms baseline),
correct ("Paris").
Tests: dot_256_f32 NEON-vs-scalar parity, q4k_matmul rows>32 multi-chunk
+ seq=1, and a direct q6k_matmul_into-vs-dequant kernel parity — the
review-flagged coverage gaps. larql-compute 744 tests pass, the touched
code is clippy-clean and fmt clean.
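For reference, a sketch of what a multi-accumulator 256-element dot can look like: a portable scalar version that doubles as the parity oracle, and an aarch64 version with eight `float32x4_t` accumulators so consecutive `vfmaq_f32` calls do not depend on each other. This is an illustration written from the description above, not the `dot_256_f32` in the PR; the function names are assumptions, and the NEON variant only compiles on aarch64.

```rust
/// Portable reference: 8 independent scalar accumulators.
pub fn dot_256_scalar(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        for ((s, x), y) in acc.iter_mut().zip(ca).zip(cb) {
            *s += x * y;
        }
    }
    acc.iter().sum()
}

/// aarch64: 8 vector accumulators of 4 lanes each, 32 floats per outer step.
#[cfg(target_arch = "aarch64")]
pub fn dot_256(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    use std::arch::aarch64::*;
    // SAFETY: NEON is part of the aarch64 baseline, and every load reads
    // 4 floats at an offset of at most 252, inside the 256-element arrays.
    unsafe {
        let mut acc = [vdupq_n_f32(0.0); 8];
        for i in (0..256).step_by(32) {
            for k in 0..8 {
                let va = vld1q_f32(a.as_ptr().add(i + 4 * k));
                let vb = vld1q_f32(b.as_ptr().add(i + 4 * k));
                // acc[k] += va * vb, fused; the 8 chains are independent.
                acc[k] = vfmaq_f32(acc[k], va, vb);
            }
        }
        let mut total = acc[0];
        for k in 1..8 {
            total = vaddq_f32(total, acc[k]);
        }
        vaddvq_f32(total) // horizontal sum of the 4 lanes
    }
}

/// Other targets fall back to the scalar reference.
#[cfg(not(target_arch = "aarch64"))]
pub fn dot_256(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    dot_256_scalar(a, b)
}
```

The two versions sum in different orders and the NEON one fuses the multiply-add, so a parity test on arbitrary data should compare within a tolerance rather than bit-for-bit.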
…doc) - bench/baselines/cpu/COMPARISON.md: dated 2026-06-22 update — decode now at/ahead of llama.cpp, prefill 2746ms->233ms (11.8x), gap 55x -> ~3x. - ROADMAP.md: new CPU prefill row in the honest-accounting table. - README.md: CPU-prefill paragraph (q4k/q6k-direct attn+FFN + NEON dot). - kquant_forward/cached.rs: module doc updated — prefill no longer pays the per-layer f32 dequant (the old "follow-up" is done).
…pace The no-shims weights-immutability migration left ~230 methods/functions taking &mut params they no longer mutate (needless_pass_by_ref_mut), plus the resulting unused_mut at the call sites, plus one needless_range_loop in a test. cargo clippy --fix swept them; behavior-preserving (those &mut were already never used mutably). cargo clippy --workspace --tests -- -D warnings now passes (0 warnings). larql-compute 744 + inference 1240 + kv 766 tests pass, fmt clean.
Summary

Completes the no-shims weights-immutability refactor and lands Q4_K/Q6_K-direct CPU prefill, eliminating the per-layer f32 dequant that was the dominant prefill cost. Gemma 3 4B Q4_K CPU prefill drops 2746 ms → 233 ms (11.8×), correct, closing the gap to llama.cpp `pp5` from ~55× → ~3×. Decode stays at/ahead of llama.cpp.

What's here

- No-shims foundation (P-B.1/P-B.1b): every KV engine owns its `DequantScratch`; weights resolve through `WeightsView` and stay immutable on the engine/serving path. Oracle byte-identical, dispatch-parity green.
- Q4_K/Q6_K-direct prefill: Q/K/V/O and gate/up/down now project straight from the vindex's Q4_K/Q6_K bytes via an amortised `q4k_matmul` (+ its Q6_K twin `q6k_matmul`, for the default Q6_K `v_proj`/`down_proj`), with no f32 materialisation. Gated per-component on the stored format tag, with a dequant+sgemm fallback for unsupported shapes.
- Hand-written aarch64 NEON inner dot (`dot_256_f32`, 8×`float32x4` + `vfmaq_f32`): at seq=5 the `q4k_matmul` now beats f32 AMX sgemm while still skipping the dequant. Portable scalar fallback + parity oracle.
- kquant de-duplication: the Q4_K CPU forward existed in two near-identical copies (larql-inference + larql-compute). larql-inference is now a thin `VectorIndex→KvIndex` adapter delegating to the single larql-compute substrate copy (~740 lines removed).

Correctness fix caught in review

The first q4k-direct FFN assumed every component was Q4_K, but the default vindex stores `down_proj`/`v_proj` as Q6_K → garbage output (`"…Paris"` became `" peregr区块лени Everett…"`). Fixed by dispatching on the per-component format tag; added a `project_down_dispatches_on_q6k_format` regression test (the all-Q4_K fixture never exercised it).

Quality

- `fix(bench):` `engine_decode` measured growing-context cost (unbounded K/V in the timed loop), not single-step latency; now `iter_batched`.
- Tests: `q4k_matmul` rows>32 multi-chunk + seq=1, direct `q6k_matmul` parity.
- `make lint` clean: swept ~230 vestigial `&mut` (`needless_pass_by_ref_mut`) left by the no-shims migration + the `unused_mut` cascade.

Validation

- `cargo test --workspace --lib`: 4530 passed, 0 failed.
- `cargo clippy --workspace --tests -- -D warnings`: clean.
- `cargo fmt --all --check`: clean.