Q4_K/Q6_K-direct CPU prefill + no-shims weights-immutability + kquant de-dup by chrishayuk · Pull Request #171 · chrishayuk/larql

chrishayuk · 2026-06-21T23:54:16Z

Summary

Completes the no-shims weights-immutability refactor and lands Q4_K/Q6_K-direct CPU prefill, eliminating the per-layer f32 dequant that was the dominant prefill cost. Gemma 3 4B Q4_K CPU prefill drops from 2746 ms to 233 ms (11.8×) with correct output, narrowing the gap to llama.cpp pp5 from ~55× to ~3×. Decode stays at or ahead of llama.cpp.

What's here

No-shims foundation (P-B.1/P-B.1b) — every KV engine owns its DequantScratch; weights resolve through WeightsView and stay immutable on the engine/serving path. Oracle byte-identical, dispatch-parity green.

Q4_K/Q6_K-direct prefill — Q/K/V/O and gate/up/down now project straight from the vindex's Q4_K/Q6_K bytes via an amortised q4k_matmul (+ its Q6_K twin q6k_matmul, for the default Q6_K v_proj/down_proj) — no f32 materialisation. Gated per-component on the stored format tag, with a dequant+sgemm fallback for unsupported shapes.

Hand-written aarch64 NEON inner dot (dot_256_f32, 8× float32x4 + vfmaq_f32) — at seq=5 the q4k_matmul now beats f32 AMX sgemm while still skipping the dequant. Portable scalar fallback + parity oracle.

kquant de-duplication — the Q4_K CPU forward existed in two near-identical copies (larql-inference + larql-compute). larql-inference is now a thin VectorIndex→KvIndex adapter delegating to the single larql-compute substrate copy (~740 lines removed).

Correctness fix caught in review

The first q4k-direct FFN assumed every component was Q4_K, but the default vindex stores down_proj/v_proj as Q6_K → garbage output ("…Paris" became " peregr区块лени Everett…"). Fixed by dispatching on the per-component format tag; added a project_down_dispatches_on_q6k_format regression test (the all-Q4_K fixture never exercised it).

Quality

fix(bench): engine_decode measured growing-context cost (unbounded K/V in the timed loop), not single-step latency; it now measures one decode step on a fresh prefilled engine via iter_batched_ref.
New kernel coverage: NEON-vs-scalar dot parity, q4k_matmul rows>32 multi-chunk + seq=1, direct q6k_matmul parity.
make lint clean: swept ~230 vestigial &mut (needless_pass_by_ref_mut) left by the no-shims migration + the unused_mut cascade.

Validation

cargo test --workspace --lib: 4530 passed, 0 failed.
cargo clippy --workspace --tests -- -D warnings: clean.
cargo fmt --all --check: clean.
Real-model accuracy: gemma3-4b Q4_K CPU → "The capital of France is Paris."

…try_tag Dedupe the hardcoded tag→QuantFormat match arms in resolve_attn_weights / resolve_ffn_weights against the single source of truth (QuantFormat::from_registry_tag), keeping each surface's supported-subset guard. Behaviour-equivalent; orthogonal to BitNet.

…ward on it Add the int8-activation (A8) ternary matvec — scalar sign-select plus a bit-identical NEON kernel (aarch64) — and wire the larql-inference BitNet forward onto it. Q/K/V and gate/up quantise the shared activation once. Parity-tested vs the f32 reference; validated e2e on microsoft/bitnet-b1.58-2B-4T. x86_64 runs scalar A8; AVX2 twin is follow-up.
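
As a rough illustration of the int8-activation (A8) sign-select idea described above, here is a minimal scalar sketch. It assumes an absmax int8 quantiser, unpacked `i8` ternary weights, and one f32 scale per output row; the function names `quantise_a8` and `ternary_matvec_a8` and the data layout are hypothetical and are not the crate's actual API or packed format.

```rust
/// Quantise an f32 activation vector to int8 with one absmax scale
/// (assumed scheme). Returns the int8 values and the scale that maps
/// them back to f32.
fn quantise_a8(x: &[f32]) -> (Vec<i8>, f32) {
    let absmax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    if absmax == 0.0 {
        return (vec![0; x.len()], 1.0);
    }
    let scale = absmax / 127.0;
    let q = x
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Ternary matvec over weights in {-1, 0, +1}, stored row-major as i8
/// for readability. `row_scales` holds one f32 scale per output row
/// (an assumption of this sketch).
fn ternary_matvec_a8(
    w: &[i8],
    rows: usize,
    cols: usize,
    row_scales: &[f32],
    xq: &[i8],
    x_scale: f32,
) -> Vec<f32> {
    assert_eq!(w.len(), rows * cols);
    assert_eq!(xq.len(), cols);
    assert_eq!(row_scales.len(), rows);
    (0..rows)
        .map(|r| {
            let row = &w[r * cols..(r + 1) * cols];
            let mut acc: i32 = 0;
            for (&wi, &xi) in row.iter().zip(xq) {
                // Sign-select: add, subtract, or skip. No multiply is needed.
                match wi {
                    1 => acc += xi as i32,
                    -1 => acc -= xi as i32,
                    _ => {}
                }
            }
            acc as f32 * x_scale * row_scales[r]
        })
        .collect()
}

fn main() {
    // Quantise the shared activation once, then reuse it for two projections.
    let (xq, x_scale) = quantise_a8(&[127.0, -127.0, 64.0]);
    let w_q = [1i8, -1, 0, 0, 1, 1];
    let w_k = [0i8, 0, 1, -1, -1, -1];
    println!("{:?}", ternary_matvec_a8(&w_q, 2, 3, &[1.0, 0.5], &xq, x_scale));
    println!("{:?}", ternary_matvec_a8(&w_k, 2, 3, &[1.0, 1.0], &xq, x_scale));
}
```

Passing the already-quantised activation into the matvec is what lets several projections share one quantisation pass, which is the reuse the commit message describes for Q/K/V and gate/up.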

Route bitnet-* model_type to a thin named BitnetArch instead of silently collapsing to GenericArch; norm_eps honoured from config. Prevents silent generic-config degradation and gives BitNet a home for first-class overrides. Covered by test_detect_bitnet_is_explicit_not_generic.

… (G1) Make the BitNet ternary kernel reachable through the backend registry, not only by direct call: add QuantFormat::I2S (+ registry_tag/from_registry_tag round-trip, is_ternary), a dedicated QuantMatVec::ternary_matvec method (BitLinearWeight carries the per-channel scales the &[u8] quant_matvec signature can't), and a CpuBackend impl on the best-available A8 kernel. quant_matvec returns None for I2S (loud, like Q8_0); Metal panics (no ternary shader). Foundation for vindex tag unification (G3).
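
The registry-tag round-trip can be pictured with a small sketch. The enum below is a stand-in with only three variants, and the tag strings (`"q4_k"`, `"q6_k"`, `"i2_s"`) are assumptions chosen for illustration; the real `QuantFormat` may use different variants and spellings.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuantFormat {
    Q4K,
    Q6K,
    I2S,
}

impl QuantFormat {
    const ALL: [QuantFormat; 3] = [QuantFormat::Q4K, QuantFormat::Q6K, QuantFormat::I2S];

    /// Tag stored alongside the tensor (hypothetical spellings).
    fn registry_tag(self) -> &'static str {
        match self {
            QuantFormat::Q4K => "q4_k",
            QuantFormat::Q6K => "q6_k",
            QuantFormat::I2S => "i2_s",
        }
    }

    /// Single source of truth for tag parsing: derived from `registry_tag`,
    /// so the two directions cannot drift apart.
    fn from_registry_tag(tag: &str) -> Option<Self> {
        Self::ALL.iter().copied().find(|f| f.registry_tag() == tag)
    }

    fn is_ternary(self) -> bool {
        matches!(self, QuantFormat::I2S)
    }
}

fn main() {
    for f in QuantFormat::ALL {
        let tag = f.registry_tag();
        println!("{:?} <-> {} ternary={}", QuantFormat::from_registry_tag(tag), tag, f.is_ternary());
    }
}
```

Deriving the parser from the tag table is one way to get the de-duplication the earlier commit asks for: a call site that wants a supported subset can still apply its own guard to the parsed value.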

… falsified) Close contact with the code revised the BitNet graduation plan: G1 (QuantFormat ternary + dispatch) landed; G2 (first-class KvEngine) is blocked on KvEngine's dense &ModelWeights signature — needs a breaking trait change, not a type-lie; G3's fold-into-one-registry goal is falsified (bitnet vindexes are mixed dense+ternary, quant:None is correct, two-field design is right); G4 (AVX2 twin) is blocked on an x86 build/test box (can't compile-check the C-FFI crate cross from aarch64, can't validate SIMD).

…productize) Record the staged plan after choosing to make BitNet a served path: P-A wire larql run + server to the ternary path (no trait change, start here); P-B first-class KvEngine (8 engines + ~171 call sites; trait-design fork B1/B2/B3); P-C AVX2 twin on x86 CI.

Detect config.bitnet_layout at run() and route to the ternary forward (load_bitnet_model + generate_streaming_bitnet) instead of the dense engine dispatch, bypassing the walk_cmd path the dense run delegates to. Greedy streaming generation + a simple stdin chat REPL; EOS via EosConfig::from_vindex_dir (same as the dense path). MVP: raw prompt encode (no chat-template) + greedy sampling. Compile-checked + clippy-clean; needs a --keep-quant BitNet vindex to smoke-test. Server route wiring + chat-template parity are the remaining P-A increments.

… caveat Smoke test against ~/larql-vindex/bitnet-2b.vindex passes (Paris, deterministic) → P-A CLI behaviour-verified, output captured as the P-B oracle. B1 locked for P-B; read-only check done — dense/resident paths are read-only but prefill_quant/decode_step_quant take &mut ModelWeights (resident-quant memoization), so B1 bundles a sub-decision: relocate that cache to engine state (rec) or interior-mutability.

…attn_tensors_dequantised Real-code scope of clean B1a: &mut chokepoint is ensure_attn_tensors_dequantised (vindex/dequant.rs:35), which memoizes dequantised Q4K Q/K/V/O into weights.tensors (derivative state). Stages: P-B.1 relocate that cache to engine state + resolver through the forward read path (drops &mut); P-B.2 Arc-owned weights (~171 sites); P-B.3 BitnetEngine + dispatch; P-B.4 validate vs oracle. Run in a worktree.

…), not RwLock-in-ModelWeights Server serializes every generation behind an exclusive write lock (state.rs:186 lock_weights_for_gen, all OpenAI gen routes) specifically because the dequant cache mutates weights. An interior-mut RwLock field can't lift that serialization (transient evicting scratch races across shared-Arc forwards) and taxes the dense path per-resolve. Engine-owned scratch makes ModelWeights truly immutable → Arc shared across concurrent generations. Provisional RwLock impl tried + reverted before the read-site/trait sweep could cement it.
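
To make the concurrency argument concrete, here is a minimal sketch of immutable weights shared through an `Arc` while each generation keeps its own scratch. The `ModelWeights` struct, the `generate` function, and the "dequant" step are simplified stand-ins, not the project's types.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use std::thread;

/// Stand-in for the canonical weights: never mutated after load.
struct ModelWeights {
    tensors: HashMap<String, Vec<f32>>,
}

/// One "generation". It owns its scratch and only reads the shared weights,
/// so no lock is needed around the weights.
fn generate(weights: &ModelWeights, factor: f32) -> f32 {
    let mut scratch: HashMap<String, Vec<f32>> = HashMap::new();
    // Stand-in for a dequant: derive a tensor and keep it in private scratch.
    let base = &weights.tensors["w"];
    scratch.insert(
        "w.dequant".to_string(),
        base.iter().map(|v| v * factor).collect(),
    );
    scratch["w.dequant"].iter().sum()
}

/// Run `n` generations concurrently against one shared weights object.
fn run_concurrent(n: usize) -> Vec<f32> {
    let weights = Arc::new(ModelWeights {
        tensors: HashMap::from([("w".to_string(), vec![1.0, 2.0, 3.0])]),
    });
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let w = Arc::clone(&weights);
            thread::spawn(move || generate(&w, (i + 1) as f32))
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    println!("{:?}", run_concurrent(3));
}
```

Because the derived tensors live in per-call state, there is nothing for concurrent forwards to race on, which is the property an interior `RwLock` field inside the weights could not provide without serialising or cloning.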

…ch) — P-B.1 foundation The correct-shaped scratch resolver for P-B.1: DequantScratch is engine-owned per-forward state (keyed by the same tensor names as ModelWeights::tensors); WeightsView bundles &ModelWeights + &DequantScratch, Derefs to ModelWeights, and tensor() resolves scratch-first-then-canonical returning a borrow (no lock, no clone — dense path pays nothing). Foundation only: nothing threads it yet. Next: relocate the ~13 bulk-path inserts + per-layer insert/evict to the engine scratch, thread WeightsView through the shared quant forward to the 19 read sites, drop &mut from prefill_quant/decode_step_quant; validate vs dense Q4K parity + an evict-identity test (insert L then evict L → gone).

…opy + dense()/with_scratch()/From The shared forward is reached by non-quant callers too, so the resolver must serve them without manufacturing an empty map: scratch is now Option, WeightsView::dense terminates the threading cascade at non-quant call sites, with_scratch carries the engine cache on the quant path. Copy + two words so threading by value is free.
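
The resolver shape described in these two commits can be sketched as follows. This is a simplified reconstruction under stated assumptions: tensors are plain `Vec<f32>`, both maps are `HashMap<String, _>`, and the field names are invented; only the names `WeightsView`, `DequantScratch`, `dense`, `with_scratch`, and `tensor` come from the text.

```rust
use std::collections::HashMap;
use std::ops::Deref;

type Tensor = Vec<f32>;

/// Canonical, immutable weights (stand-in for the real ModelWeights).
struct ModelWeights {
    tensors: HashMap<String, Tensor>,
    num_layers: usize,
}

/// Engine-owned dequant cache, keyed by the same tensor names.
#[derive(Default)]
struct DequantScratch {
    tensors: HashMap<String, Tensor>,
}

/// Two references, so passing the view by value is cheap.
#[derive(Clone, Copy)]
struct WeightsView<'a> {
    weights: &'a ModelWeights,
    scratch: Option<&'a DequantScratch>,
}

impl<'a> WeightsView<'a> {
    /// View for non-quant callers: no scratch, canonical lookups only.
    fn dense(weights: &'a ModelWeights) -> Self {
        WeightsView { weights, scratch: None }
    }

    /// View for the quant path: carries the engine's scratch.
    fn with_scratch(weights: &'a ModelWeights, scratch: &'a DequantScratch) -> Self {
        WeightsView { weights, scratch: Some(scratch) }
    }

    /// Scratch first, then canonical. Returns a borrow: no lock, no clone.
    fn tensor(&self, name: &str) -> Option<&'a Tensor> {
        self.scratch
            .and_then(|s| s.tensors.get(name))
            .or_else(|| self.weights.tensors.get(name))
    }
}

impl<'a> Deref for WeightsView<'a> {
    type Target = ModelWeights;
    fn deref(&self) -> &ModelWeights {
        self.weights
    }
}

fn main() {
    let weights = ModelWeights {
        tensors: HashMap::from([("norm".to_string(), vec![1.0])]),
        num_layers: 2,
    };
    let mut scratch = DequantScratch::default();
    scratch.tensors.insert("layers.0.q_proj".to_string(), vec![2.0]);
    let view = WeightsView::with_scratch(&weights, &scratch);
    println!("{:?}", view.tensor("layers.0.q_proj"));
    println!("{:?} layers={}", view.tensor("norm"), view.num_layers);
}
```

In this sketch a dense view asked for a dequantised key returns `None`, so a reader that was never switched to the scratch-carrying view fails visibly on first use instead of silently reading stale canonical data.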

…allers wrap dense() — P-B.1 Stage 1 Stage 1 of the engine-owned dequant relocation: the attention-side scratch reader (run_attention_with_kv_backend, in larql-compute) now takes a WeightsView and resolves Q/K/V/O via .tensor(); all ~22 callers across larql-compute/inference/kv wrap with WeightsView::dense(weights). Behavior-identical (dense view + .tensor() resolves to weights.tensors exactly as before, and the inserter still populates weights.tensors) — the workspace-spanning signature change lands decoupled from any behavior change. run_ffn unchanged (reads only canonical norms; the FFN-side scratch readers are FfnBackend impls, deferred to Stage 2). Guards: run_attention full-recompute/no-backend parity, 50 kquant_forward tests, dequant parity — all green; workspace --all-targets green; clippy clean. Stage 2 relocates the scratch off weights.tensors into a forward-local DequantScratch and flips the ~18 quant sites dense()→with_scratch.

…Ffn wrap dense() internally — P-B.1 Stage 2a FFN-side analog of Stage 1, and the reason the 326 WeightFfn construction sites stay untouched: dense_ffn_forward / dense_ffn_forward_backend now take a WeightsView and resolve gate/up/down via .tensor(); WeightFfn and BackendFfn wrap WeightsView::dense(self.weights) *internally*, so their ~326 call sites are unaffected. Behavior-identical (dense view → canonical resolve); 15 FFN-weight + 38 sparse tests pass incl. backend-matches-no-backend parity; workspace --all-targets green; clippy clean. Sets up Stage 2b: the ~5-10 quant loops construct a scratch-aware FFN (with_scratch view) so the dequantised FFN reads resolve from the engine scratch.

…n + entry conditions Stage 1 (run_attention→WeightsView) + Stage 2a (dense_ffn_forward→WeightsView) committed behavior-identical. Stage 2b (relocation) reverted to green on a CATEGORICAL distinction: the silent-break reader set is type-system-invisible (Deref reads canonical, wrong only on decode under a real Q4K vindex), unlike the prior 4 compiler-visible cost escalations. Closure is making misses LOUD (leave canonical empty of dequant keys → None → existing panic/?-bail fires on first decode), not grep enumeration (inventory is current, not complete). Entry conditions met: qwen3-0.6b-q4k.vindex exists; capture a multi-token DECODE oracle at Stage 2a as the regression spine.

…htsView — P-B.1 Stage 2b-pre Converts the third (and final) quant-path scratch reader — the DECODE attention reader Stage 1 missed (it only did the prefill reader) — to take WeightsView and resolve attn Q/K/V/O via .tensor(); ~14 callers across compute/inference/kv wrap dense(). Behavior-identical: 18 decode-attention + 50 kquant_forward tests pass, clippy clean, workspace --all-targets green. With all three readers (prefill attn, dense FFN, decode attn) now resolving through the view, the relocation can flip dense()→with_scratch at the quant loops without a silent-break — every reader they reach consults the scratch. Sets up the relocation proper (inserters→scratch, ViewFfn, loops, drop &mut), guarded by the captured Q4K decode oracle + loud-break (canonical empty of dequant keys → None → panic on miss).

…finding + loud-break safety All 3 primary quant-path readers converted to WeightsView (Stage 1/2a/2b-pre, committed behavior-identical) + Q4K decode oracle captured. Relocation reverted to green on the reader-set expanding again: hidden.rs/interventions.rs reach attention via run_layer_with_ffn → run_attention_inner/with_kv_cache → run_attention_block_core/block_gpu — un-converted readers the grep missed. True scope = convert the whole attention-reader family, each a Stage-1-style cascade. Loud-break (canonical empty → None → panic) makes the remaining conversions safe to do incrementally against the oracle.

…sView — P-B.1 Stage 2b-pre2 The remaining quant-path attention readers (run_attention_block_core, run_attention_block_gpu) + the run_layer_with_ffn chain (run_attention/run_attention_inner/run_attention_with_kv_cache/run_layer_with_ffn/run_layer_with_capture[_hooked]/run_attention_public + the block.rs family) now take WeightsView and resolve via .tensor(); ~100 callers across compute/inference/kv/cli/server/examples wrap WeightsView::dense (run_ffn/apply_layer_scalar stay &ModelWeights; the un-converted auto_inplace path stays &ModelWeights). Behavior-identical: 92 attention + 10 layer + 50 kquant_forward tests pass incl. full-recompute parity; workspace --all-targets green; clippy clean. With the ENTIRE reader family now resolving through the view, the relocation's loops can flip dense()→with_scratch with no silent-break — every reader they reach consults the scratch. Oracle confirmation pending the release rebuild.

…1 Stage 2b The production Q4K decode path (predict_kquant_prefill/decode_step + hidden + interventions) now dequantises attn/FFN into a forward-local DequantScratch and resolves it via WeightsView::with_scratch + ViewFfn — weights stay &ModelWeights (immutable, Arc-able) on the decode path, no more &mut. Inserters (insert_q4k_layer_tensors/remove_layer_tensors/ensure_attn_tensors_dequantised) take &mut DequantScratch + &ModelWeights. Bulk f32-fallback + dev/research drivers (KvEngine prefill_quant/decode_step_quant trait defaults, all larql-kv quant-engine overrides, apollo, the ov_rd CLI tooling, the lql relation resolver, the vision/image CLI path, examples) keep their in-weights behaviour via *_resident shims (dequant into a scratch, then merge into weights.tensors). Their &mut-drop needs engine-owned scratch state — a documented follow-up; loud-break (canonical empty → None → panic) guards any reader still on canonical. Validated: workspace --all-targets green; clippy clean (0 warnings); 50 kquant_forward + 13 dequant + resident_identity tests pass; decode against qwen3-0.6b-q4k BYTE-IDENTICAL to the Stage-2a oracle (bench/oracles/q4k_qwen3_history_of_computing.txt).

…nt relocated, oracle byte-identical

…inding (WIP stashed)

… staged refactor

…tch, weights immutable Completes the P-B.1 relocation with ZERO weights.tensors merge shims on the engine/serving path. Each KvEngine (standard, no_cache, markov_residual, markov_residual_codec, boundary_per_layer, boundary_kv, turbo_quant, unlimited_context, apollo) owns a dequant_scratch field; prefill_quant/decode_step_quant dequantise into it and the forward resolves attention/FFN through WeightsView::with_scratch — no &mut ModelWeights, no weights.tensors mutation. Converted to WeightsView across the dispatch + engine layers: KvDispatch trait (5 methods) + cpu/metal/async impls + the 7 kv_*_via_dispatch helpers; coarse_prefill/coarse_decode_step drop &mut (delegate to the relocated predict_kquant_*); the SECOND kquant_forward (larql-compute's, the real oracle/coarse path) relocated to forward-local scratch + ViewFfn; LayerExecutor trait + local_walk; every engine's walk/compute/executor/dispatch/cold_tier modules; recompute_kv + attn_kv_projection_weights; run_attention_block_decode_step_auto/_auto_inplace; kv_prefill_run; forward_raw_logits/forward_from_layer. RetrievalEngine quant defaults → loud error (apollo overrides). The *_resident helpers remain for ~58 dev/research call sites (ov_rd CLI, lql resolver, vision CLI, examples) that own their weights and run one-off forwards. Validated: workspace --all-targets green, clippy 0 warnings, 766 larql-kv + 40 kquant + resident_identity + 4 dispatch_parity (cross-engine bit-parity) tests pass, decode BYTE-IDENTICAL to the oracle, and markov-rs/unlimited/turbo-quant/no-cache engines smoke-tested coherent at runtime.

… to 0, oracle byte-identical

…forward internals Thread WeightsView (engine-owned dequant scratch) through every KvEngine's forward/recompute internals — walk.rs/compute.rs/executor.rs/engine.rs across markov_residual, markov_residual_codec, boundary_per_layer, turbo_quant, unlimited_context, no_cache, standard — plus generation.rs cached loops. Weights stay &ModelWeights on the engine/serving path: 0 &mut ModelWeights and 0 weights.tensors.extend in larql-kv engines. Validated: cargo check --workspace --all-targets green, fmt --all --check clean, larql-kv 766 + 4 dispatch_parity (cross-engine bit-parity), larql-inference 1248 (incl. A8 ternary FFN + streaming), larql-compute/larql-models green.

bench_decode_step reused one engine and called decode_step in b.iter without resetting, so the unbounded `standard` engine (window_size: None) appended to its K/V cache every iteration. Over criterion's 25k+ iterations the per-call cost grew linearly, making the measurement non-stationary and iteration-count-dependent: standard swung 12-25us run-to-run, and no-cache (which re-forwards the full context) read ~40us. Measure one decode step on a fresh prefilled engine via iter_batched_ref (setup untimed). standard now reports true single-step latency (~12.8us, matching the bounded window-4/unlimited engines) and no-cache drops to a deterministic ~18us. Bounded engines were already correct.
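
The non-stationarity is easy to reproduce with a toy model. The sketch below uses only the standard library and a made-up `ToyEngine` whose per-step work equals its cache length; it stands in for the real engine, and the per-iteration constructor plays the role that criterion's untimed `iter_batched_ref` setup plays in the actual benchmark.

```rust
/// Toy engine: each decode step reads the whole K/V cache, then appends to it.
struct ToyEngine {
    kv: Vec<f32>,
}

impl ToyEngine {
    fn prefilled(prompt_len: usize) -> Self {
        ToyEngine { kv: vec![1.0; prompt_len] }
    }

    /// Returns how many cache entries this step had to read (a cost proxy).
    fn decode_step(&mut self) -> usize {
        let read = self.kv.len();
        let mean = self.kv.iter().sum::<f32>() / read.max(1) as f32;
        self.kv.push(mean);
        read
    }
}

/// Old benchmark shape: one engine reused, so the cache grows every iteration.
fn work_reused(prompt_len: usize, iters: usize) -> Vec<usize> {
    let mut engine = ToyEngine::prefilled(prompt_len);
    (0..iters).map(|_| engine.decode_step()).collect()
}

/// Fixed shape: a fresh prefilled engine per measured step.
fn work_fresh(prompt_len: usize, iters: usize) -> Vec<usize> {
    (0..iters)
        .map(|_| {
            let mut engine = ToyEngine::prefilled(prompt_len);
            engine.decode_step()
        })
        .collect()
}

fn main() {
    println!("reused: {:?}", work_reused(5, 4));
    println!("fresh:  {:?}", work_fresh(5, 4));
}
```

With the reused engine the measured cost depends on how many iterations the harness chose to run, which matches the run-to-run swings described above; with a fresh engine every sample measures the same context length.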

Add q4k_matmul_into: decode each Q4_K super-block to f32 once and FMA it across all seq activation columns (8 independent accumulators so LLVM lowers the reduction to NEON), row-parallel, output transposed to the [seq, rows] contract. Wire CpuBackend::q4k_matmul to it — previously only Metal implemented the trait method, so CPU prefill fell back to dequant-whole-layer + sgemm. Reads the Q4_K weight once instead of seq x (per-position matvec) or 4x the bytes (dequant to f32); beats the per-position matvec loop at seq>=5. Parity-tested vs dequant->matmul and vs q4k_matvec row-for-row. Also: attn_prefill_f32_vs_q4k example now benches the real kernel and takes a seq arg to probe the short-prompt prefill regime.
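
A minimal sketch of the decode-once, reuse-across-columns loop follows. It deliberately uses a toy block format (one little-endian f32 scale plus 128 bytes of 4-bit values per 256 weights) rather than the real Q4_K layout, runs single-threaded, and uses hypothetical names (`toy_matmul_into`, `decode_toy_block`, `toy_row`); only the loop structure and the `[seq, rows]` output contract are taken from the description.

```rust
const BLOCK: usize = 256;
/// Toy format, NOT real Q4_K: 4-byte f32 scale + 128 bytes of nibbles.
const TOY_BLOCK_BYTES: usize = 4 + BLOCK / 2;

/// Decode one toy block: value = scale * (nibble - 8).
fn decode_toy_block(bytes: &[u8], out: &mut [f32; BLOCK]) {
    let scale = f32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    for (i, &byte) in bytes[4..TOY_BLOCK_BYTES].iter().enumerate() {
        out[2 * i] = scale * ((byte & 0x0F) as f32 - 8.0);
        out[2 * i + 1] = scale * ((byte >> 4) as f32 - 8.0);
    }
}

/// Build one 256-column row where every nibble pair is `byte`.
fn toy_row(scale: f32, byte: u8) -> Vec<u8> {
    let mut row = scale.to_le_bytes().to_vec();
    row.extend(std::iter::repeat(byte).take(BLOCK / 2));
    row
}

/// w: quantised [rows, cols]; x: f32 [seq, cols]; out: f32 [seq, rows].
fn toy_matmul_into(w: &[u8], rows: usize, cols: usize, x: &[f32], seq: usize, out: &mut [f32]) {
    assert_eq!(cols % BLOCK, 0);
    let blocks_per_row = cols / BLOCK;
    assert_eq!(w.len(), rows * blocks_per_row * TOY_BLOCK_BYTES);
    assert_eq!(x.len(), seq * cols);
    assert_eq!(out.len(), seq * rows);
    out.fill(0.0);
    let mut decoded = [0.0f32; BLOCK];
    for r in 0..rows {
        for b in 0..blocks_per_row {
            let off = (r * blocks_per_row + b) * TOY_BLOCK_BYTES;
            // Decode the block once...
            decode_toy_block(&w[off..off + TOY_BLOCK_BYTES], &mut decoded);
            // ...then reuse it for every sequence position.
            for s in 0..seq {
                let xs = &x[s * cols + b * BLOCK..s * cols + (b + 1) * BLOCK];
                let dot: f32 = decoded.iter().zip(xs).map(|(wv, xv)| wv * xv).sum();
                out[s * rows + r] += dot;
            }
        }
    }
}

fn main() {
    let mut w = toy_row(0.5, 0x9A);
    w.extend(toy_row(1.0, 0x88));
    let mut x = vec![1.0f32; BLOCK];
    x.extend(vec![2.0f32; BLOCK]);
    let mut out = vec![0.0f32; 4];
    toy_matmul_into(&w, 2, BLOCK, &x, 2, &mut out);
    println!("{:?}", out);
}
```

The per-position matvec alternative would run the decode inside the `s` loop, paying it `seq` times per block; hoisting it out is the amortisation the commit refers to.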

Add Q4kMatmulFfn (FfnBackend) that runs gate/up/down straight on the vindex's Q4_K bytes via q4k_matmul, plus insert_q4k_attn_tensors which dequantises only Q/K/V/O. The Q4_K CPU prefill loop (predict_kquant_prefill_with_state) now uses both when the vindex exposes interleaved Q4_K FFN bytes and hidden is a 256-multiple, falling back to the full dequant + dense FFN otherwise. The FFN weights (gate/up/down) are ~4x the attention weights, so skipping their f32 materialisation removes the bulk of prefill's per-layer dequant cost. gemma3-4b Q4_K CPU prefill (standard engine): 2746ms -> 572ms (4.8x) on a 5-token prompt, generation unchanged. Q4kMatmulFfn is parity-tested against dequantising the same bytes + the dense FFN (q4k_matmul_ffn_matches_dequant_dense); the existing kquant prefill tests now exercise the q4k-direct path on the fixture.

The Q4_K CPU forward lived in two near-identical copies — larql-inference's vindex/kquant_forward and larql-compute's substrate copy (ADR-0022). That duplication is why the larql-cpu generate path missed the q4k-direct FFN prefill speedup the standard engine (on larql-compute's copy) already got. Replace the duplicated bodies of predict_kquant_prefill_with_state, predict_kquant_decode_step, and predict_kquant_decode_step_direct_with_state with thin delegations to larql-compute (coerce VectorIndex -> &dyn KvIndex, bridge the per-crate CachedTimings). Public API unchanged; ~150 duplicated lines + 6 now-unused imports removed. Effect: larql-cpu Q4_K prefill 2734ms -> 880ms (now shares the q4k-direct FFN). larql-inference 1248 + larql-kv 766 tests pass. Follow-up: the native decode kernels (attention/ffn_decode_step_native), the fused_* Metal path, and the supports_* predicates are still duplicated and can be delegated the same way.

…gation Finish de-duplicating the Q4_K CPU forward. supports_cached_decode, supports_direct_matvec_decode, fused_prefill, fused_decode_step[_with_state], and attention/ffn_decode_step_native now delegate to larql-compute's substrate copy (coerce VectorIndex -> &dyn KvIndex). cached.rs is now purely the VectorIndex-typed adapter over the substrate (ADR-0022) with no duplicated forward logic; module doc updated to state that role. Removed the test-only matvec_q4k_or_q6k_q8k / layer_supports_direct_matvec helper copies and their unit tests — larql-compute owns those (and tests them). cached.rs: 1640 -> 720 lines. larql-inference 1240 + larql-kv 766 tests pass, workspace clean, 0 warnings.

Q4kMatmulFfn assumed every FFN component was Q4_K, but the default q4k vindex stores down_proj as Q6_K (210 B/block, not 144). Decoding the Q6_K bytes through q4k_matmul produced garbage plus a wrong derived column count, so the q4k-direct FFN prefill emitted nonsense on real models — gemma3-4b "The capital of France is **Paris**" became " peregr...Everett Monroe...". The all-Q4_K test fixture masked it. Dispatch each FFN component on its stored format tag: Q4_K through the amortised q4k_matmul (reads bytes once, no f32 staging — the prefill win), Q6_K via dequantize_matrix + dot_proj. gate/up stay Q4_K (fast path); only the Q6_K down is dequantised. Makes dequantize_matrix / its module pub(crate) so the FFN backend can reach it. Adds project_down_dispatches_on_q6k_format regression test (the all-Q4_K fixture never exercised Q6_K). Real gemma3-4b CPU prefill is correct again ("Paris") at 1646ms vs the 2746ms full-dequant baseline (1.67x); the earlier 572ms was the broken path that skipped the down decode. larql-compute 741 / inference / kv tests pass, fmt clean.
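
The wrong derived column count follows directly from the block sizes. The sketch below assumes 256-weight super-blocks and the 144 B and 210 B block sizes quoted above; the `Fmt` enum and `derived_cols` helper are hypothetical names used only to show the arithmetic.

```rust
const SUPER_BLOCK: usize = 256;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Fmt {
    Q4K,
    Q6K,
}

impl Fmt {
    /// Bytes per 256-weight super-block.
    fn block_bytes(self) -> usize {
        match self {
            Fmt::Q4K => 144,
            Fmt::Q6K => 210,
        }
    }
}

/// Derive the column count of a row-major quantised matrix from its byte
/// length. Returns None when the bytes cannot be a whole number of blocks.
fn derived_cols(fmt: Fmt, byte_len: usize, rows: usize) -> Option<usize> {
    let per_row = byte_len.checked_div(rows)?;
    if byte_len % rows != 0 || per_row % fmt.block_bytes() != 0 {
        return None;
    }
    Some(per_row / fmt.block_bytes() * SUPER_BLOCK)
}

fn main() {
    // A Q6_K matrix with 4 rows and 6144 columns: 24 blocks * 210 B per row.
    let rows = 4;
    let bytes = rows * 24 * Fmt::Q6K.block_bytes();
    println!("tagged Q6_K: {:?}", derived_cols(Fmt::Q6K, bytes, rows));
    // Assuming Q4_K silently yields a different, wrong width.
    println!("assumed Q4_K: {:?}", derived_cols(Fmt::Q4K, bytes, rows));
}
```

In this example the row length happens to be a multiple of both block sizes, so the mistaken reading is not rejected; it simply reports 8960 columns instead of 6144. That is why the dispatch has to key on the stored format tag rather than infer anything from the byte length.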

The per-format fix dequantised the Q6_K down_proj to f32 before matmul, leaving prefill at 1.67x. Add q6k_matmul_into — the Q6_K twin of q4k_matmul_into — so the down also reads its quantised bytes once and skips the f32 materialisation. Factor the shared amortised matmul loop into kquant_matmul_into (decode closure + block_bytes); q4k/q6k differ only in the per-block decode, addressing the kernel duplication the review flagged. Q4kMatmulFfn::matmul now dispatches Q4_K/Q6_K to their kernels and only falls back to dequant for a format without one (none today). All FFN projections are dequant-free. gemma3-4b Q4_K CPU prefill (standard engine): 2746ms full-dequant -> 577ms (4.76x), correct ("Paris") — matching the originally-broken 572ms but with right output. Decode unchanged. larql-compute 741 (incl. project_down_dispatches_on_q6k_format) + inference + kv tests pass, fmt clean.

Project the attention Q/K/V/O straight from the vindex's Q4_K/Q6_K bytes (Q/K/O are Q4_K, V is Q6_K) via the amortised q4k/q6k matmul, skipping the insert_q4k_attn_tensors f32 dequant — the last per-layer dequant on the prefill path. run_attention_with_kv_backend gains an optional `index`: when Some it projects from bytes, else the f32 view path (all existing callers pass None). The prefill gates `use_q4k_attn` (attn bytes present, hidden and q_dim 256-aligned) and skips the attn dequant when both attn and FFN go direct. Factor the Q4_K/Q6_K dispatch into shared quant_matmul / quant_proj (reused by the FFN backend and attention), removing the duplication the review flagged. gemma3-4b Q4_K CPU prefill (standard engine): 2746ms full-dequant -> 291ms (9.4x), correct ("Paris"). vs llama.cpp pp5 the gap is now ~4x (was 39x at the start). Decode unchanged. larql-compute 741 + inference 1240 + kv 766 tests pass, workspace clean.
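
The gating can be summarised in a few lines. This sketch folds the optional index argument and the `use_q4k_attn` alignment check into one hypothetical function, `pick_attn_path`; the real code keeps them as separate concerns, and the shapes in `main` are arbitrary.

```rust
const SUPER_BLOCK: usize = 256;

#[derive(Debug, PartialEq, Eq)]
enum ProjPath {
    /// Project Q/K/V/O straight from the quantised bytes.
    DirectFromBytes,
    /// Dequantise to f32 first, then use the dense path.
    DequantThenF32,
}

/// Go direct only when quantised attention bytes are available and both
/// `hidden` and `q_dim` are multiples of the 256-weight super-block.
fn pick_attn_path(attn_bytes: Option<&[u8]>, hidden: usize, q_dim: usize) -> ProjPath {
    let aligned = hidden % SUPER_BLOCK == 0 && q_dim % SUPER_BLOCK == 0;
    match attn_bytes {
        Some(bytes) if aligned && !bytes.is_empty() => ProjPath::DirectFromBytes,
        _ => ProjPath::DequantThenF32,
    }
}

fn main() {
    let bytes = [0u8; 4];
    println!("{:?}", pick_attn_path(Some(&bytes[..]), 2560, 2048));
    println!("{:?}", pick_attn_path(Some(&bytes[..]), 2560, 2000));
    println!("{:?}", pick_attn_path(None, 2560, 2048));
}
```

Keeping the fallback as the default arm means existing callers that pass no index see unchanged behaviour.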

Replace the portable multi-accumulator dot in kquant_matmul_into with a hand-written aarch64 NEON dot (dot_256_f32): 8 float32x4 accumulators + vfmaq_f32, enough independent accumulators to hide the M-series ~4-cycle FMA latency. Scalar 8-accumulator fallback for other targets + as the NEON parity oracle. q4k/q6k share it via the generic kquant_matmul_into. At seq=5 the q4k_matmul now BEATS f32 AMX sgemm (mm/f32 1.0-1.3x, was 0.4-0.7x) while still skipping the dequant. gemma3-4b Q4_K CPU prefill (standard engine): 291ms -> 233ms (11.8x over the 2746ms baseline), correct ("Paris"). Tests: dot_256_f32 NEON-vs-scalar parity, q4k_matmul rows>32 multi-chunk + seq=1, and a direct q6k_matmul_into-vs-dequant kernel parity — the review-flagged coverage gaps. larql-compute 744 tests pass, the touched code is clippy-clean and fmt clean.
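
For readers unfamiliar with the multi-accumulator trick, here is a portable sketch of the scalar shape only. It is not the NEON kernel: the function name `dot_256_scalar` is hypothetical, it uses 8 scalar accumulators rather than 8 vector registers, and whether a compiler vectorises it is not guaranteed.

```rust
/// Dot product over a 256-element block using 8 independent accumulators.
/// Independent accumulators break the serial dependency chain of a single
/// running sum, so consecutive multiply-adds do not wait on each other.
fn dot_256_scalar(a: &[f32; 256], b: &[f32; 256]) -> f32 {
    let mut acc = [0.0f32; 8];
    for (ca, cb) in a.chunks_exact(8).zip(b.chunks_exact(8)) {
        for lane in 0..8 {
            acc[lane] += ca[lane] * cb[lane];
        }
    }
    acc.iter().sum()
}

/// Single-accumulator reference used as a parity oracle.
fn dot_naive(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    let a: [f32; 256] = std::array::from_fn(|i| i as f32);
    let b = [1.0f32; 256];
    println!("{} {}", dot_256_scalar(&a, &b), dot_naive(&a, &b));
}
```

The two functions add the products in different orders, so on general inputs a parity test should compare within a tolerance; the example input uses small integers, for which every partial sum is exactly representable.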

…doc)
- bench/baselines/cpu/COMPARISON.md: dated 2026-06-22 update: decode now at/ahead of llama.cpp, prefill 2746ms->233ms (11.8x), gap 55x -> ~3x.
- ROADMAP.md: new CPU prefill row in the honest-accounting table.
- README.md: CPU-prefill paragraph (q4k/q6k-direct attn+FFN + NEON dot).
- kquant_forward/cached.rs: module doc updated; prefill no longer pays the per-layer f32 dequant (the old "follow-up" is done).

…pace The no-shims weights-immutability migration left ~230 methods/functions taking &mut params they no longer mutate (needless_pass_by_ref_mut), plus the resulting unused_mut at the call sites, plus one needless_range_loop in a test. cargo clippy --fix swept them; behavior-preserving (those &mut were already never used mutably). cargo clippy --workspace --tests -- -D warnings now passes (0 warnings). larql-compute 744 + inference 1240 + kv 766 tests pass, fmt clean.

chrishayuk added 30 commits June 20, 2026 12:42

docs(roadmap): BitNet b1.58 integration hardening + G1–G4 follow-ups

9520947

docs(roadmap): P-B.1 done — attention-reader family converted + dequa…

e0544a1

…nt relocated, oracle byte-identical

docs(roadmap): scope P-B.1b no-shims sweep + the two-kquant_forward f…

004be69

…inding (WIP stashed)

docs(roadmap): no-shims sweep convergence data (28→45→67 diverging) —…

aaa5ba5

… staged refactor

docs(roadmap): P-B.1b no-shims sweep DONE — diverging count converged…

e38357d

… to 0, oracle byte-identical

chrishayuk added 7 commits June 21, 2026 20:26

chrishayuk merged commit ab7a08c into main Jun 21, 2026
30 of 31 checks passed

chrishayuk deleted the feat/quant-ternary-a8 branch June 22, 2026 00:12

chrishayuk merged 37 commits into main from feat/quant-ternary-a8
