Merged
37 commits
072b198
refactor(compute): source attn/ffn tag→format mapping from from_regis…
chrishayuk Jun 20, 2026
856205d
feat(bitnet): W1.58·A8 ternary matvec (scalar + NEON), run BitNet for…
chrishayuk Jun 20, 2026
0009565
feat(models): recognise bitnet architecture explicitly
chrishayuk Jun 20, 2026
9520947
docs(roadmap): BitNet b1.58 integration hardening + G1–G4 follow-ups
chrishayuk Jun 20, 2026
fdc9f20
feat(compute): add I2_S ternary QuantFormat + ternary_matvec dispatch…
chrishayuk Jun 20, 2026
4adefa0
docs(roadmap): revise G2–G4 after scoping (G1 done; G2/G4 blocked; G3…
chrishayuk Jun 20, 2026
1d319c0
docs(roadmap): productization plan P-A/P-B/P-C for BitNet (decision: …
chrishayuk Jun 20, 2026
c5fa100
feat(cli): serve BitNet vindexes from `larql run` (P-A, CLI half)
chrishayuk Jun 20, 2026
b4eec18
docs(roadmap): P-A CLI behaviour-verified + B1 chosen with quant-path…
chrishayuk Jun 20, 2026
dc65a7d
docs(roadmap): grounded P-B(B1a) execution stages anchored on ensure_…
chrishayuk Jun 20, 2026
9e6c8d6
docs(roadmap): lock P-B.1 scratch home = engine (concurrency evidence…
chrishayuk Jun 20, 2026
77e343e
feat(models): engine-owned dequant resolver (WeightsView/DequantScrat…
chrishayuk Jun 20, 2026
0f22468
refactor(models): WeightsView supports dense (no-scratch) callers — C…
chrishayuk Jun 20, 2026
802fef7
refactor(forward): run_attention_with_kv_backend takes WeightsView; c…
chrishayuk Jun 20, 2026
2f642cb
refactor(ffn): dense_ffn_forward takes WeightsView; WeightFfn/Backend…
chrishayuk Jun 20, 2026
d3b24fd
docs(roadmap): P-B.1 signature stages done; Stage 2b loud-break desig…
chrishayuk Jun 20, 2026
6ad9fc3
refactor(forward): run_attention_block_decode_step_backend takes Weig…
chrishayuk Jun 20, 2026
fc6a87a
docs(roadmap): Stage 2b-pre done (3 readers + oracle); reader-family …
chrishayuk Jun 20, 2026
9650582
refactor(forward): convert the full attention-reader family to Weight…
chrishayuk Jun 20, 2026
f0da87c
refactor(vindex): relocate Q4K dequant to engine-local scratch — P-B.…
chrishayuk Jun 20, 2026
e0544a1
docs(roadmap): P-B.1 done — attention-reader family converted + dequa…
chrishayuk Jun 20, 2026
004be69
docs(roadmap): scope P-B.1b no-shims sweep + the two-kquant_forward f…
chrishayuk Jun 20, 2026
aaa5ba5
docs(roadmap): no-shims sweep convergence data (28→45→67 diverging) —…
chrishayuk Jun 20, 2026
379885e
refactor(kv): no-shims dequant threading — every engine owns its scra…
chrishayuk Jun 21, 2026
e38357d
docs(roadmap): P-B.1b no-shims sweep DONE — diverging count converged…
chrishayuk Jun 21, 2026
2a8af47
refactor(kv): complete no-shims WeightsView threading through engine …
chrishayuk Jun 21, 2026
12ee73c
fix(bench): measure single decode step, not growing-context cost
chrishayuk Jun 21, 2026
e865365
feat(compute): amortised CpuBackend::q4k_matmul kernel
chrishayuk Jun 21, 2026
c1d8931
feat(prefill): q4k-direct FFN — skip the per-layer FFN dequant
chrishayuk Jun 21, 2026
efe9672
refactor(kquant): delegate larql-inference Q4_K forward to larql-compute
chrishayuk Jun 21, 2026
3b5b7b7
refactor(kquant): complete larql-inference -> larql-compute Q4_K dele…
chrishayuk Jun 21, 2026
77b6831
fix(prefill): dispatch q4k-direct FFN per component format (Q6_K down)
chrishayuk Jun 21, 2026
8b12240
perf(prefill): amortised q6k_matmul for the Q6_K down_proj
chrishayuk Jun 21, 2026
8abf385
perf(prefill): q4k-direct attention projections — close the dequant gap
chrishayuk Jun 21, 2026
376b520
perf(kernel): NEON inner dot for the amortised k-quant matmul
chrishayuk Jun 21, 2026
08be59f
docs: record q4k-direct prefill (COMPARISON, ROADMAP, README, module …
chrishayuk Jun 21, 2026
078b269
chore(clippy): make lint clean — drop vestigial &mut across the works…
chrishayuk Jun 21, 2026
2 changes: 2 additions & 0 deletions README.md
@@ -625,6 +625,8 @@ vs ollama gemma3:4b on the same machine: ~103 tok/s steady → **gap 1.17×**, w

**CPU vs llama.cpp** (reconciled 2026-06-02, M3 Max, 8 threads, warm): larql **26.4** (StandardEngine) / 23.5 (legacy `bench --cpu`) vs **llama.cpp `-ngl 0` 43.0** tok/s → **gap ~1.6–1.8×**. The gap is per-core kernel quality — both attention and FFN already run the int8 Q8_K SDOT kernel; closing it is C12 (hand-asm; an opt-in `LARQL_Q4K_ASM=1` v1 lands +~4% isolated). `larql bench --cpu` now reports both the legacy and production-StandardEngine rows; `--ollama-cpu` forces a true CPU ollama baseline (default `--ollama` runs on Metal GPU). The earlier 1.5×/1.9× spread was two measurement confounds (path mismatch + an unwarmed-ollama artifact), not a regression — see `bench/baselines/c10_gemma3-4b_cpu_reconciled.json`.

+ **CPU prefill** (2026-06-22): the per-layer f32 dequant — long the dominant prefill cost (~2.7 s / ~2 tok/s on the 5-token prompt) — is gone. Q/K/V/O **and** gate/up/down now project straight from the Q4_K/Q6_K vindex bytes via amortised `q4k_matmul` / `q6k_matmul` (the Q6_K twin handles the default Q6_K `v_proj` / `down_proj`) with a hand-written aarch64 NEON inner dot. Gemma 3 4B Q4_K CPU prefill: **2746 ms → 233 ms (11.8×)**, closing the gap to llama.cpp `pp5` from ~55× to **~3×**; the NEON `q4k_matmul` at seq=5 beats f32 AMX sgemm while still skipping the dequant. See `bench/baselines/cpu/COMPARISON.md`.
+
**Cross-arch coverage (2026-05-09)**: Gemma 3, Gemma 4 31B dense, Llama 2 7B, Mistral 7B all dispatch correctly through Metal. Gemma 4 E2B currently falls back to CPU (Per-Layer Embeddings not yet in Metal — ROADMAP D-METAL-PLE). See [crates/larql-compute/docs/architecture-shader-map.md](crates/larql-compute/docs/architecture-shader-map.md) for the per-architecture shader dispatch table.

CPU walk breakdown:
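The README paragraph in this hunk credits the prefill speedup to an amortised matmul that works directly on quantised bytes. A minimal sketch of that idea follows, assuming a toy 4-bit block format (`ToyBlock`: one f32 scale plus 32 nibbles) that is far simpler than the real Q4_K layout. All names here are hypothetical and none come from the larql crates. Each weight block is decoded once into a small stack buffer and reused for every prompt position, so a full f32 copy of the matrix is never materialised.

```rust
const BLOCK: usize = 32;

/// Toy 4-bit block (hypothetical layout, NOT the real Q4_K super-block).
struct ToyBlock {
    scale: f32,
    nibbles: [u8; BLOCK / 2],
}

impl ToyBlock {
    /// Quantise 32 weights to signed 4-bit values in [-7, 7], stored with a +8 offset.
    fn quantize(w: &[f32]) -> Self {
        assert_eq!(w.len(), BLOCK);
        let max = w.iter().fold(0.0f32, |m, v| m.max(v.abs()));
        let scale = if max == 0.0 { 1.0 } else { max / 7.0 };
        let mut nibbles = [0u8; BLOCK / 2];
        for (i, v) in w.iter().enumerate() {
            let q = ((v / scale).round().clamp(-7.0, 7.0) as i8 + 8) as u8;
            nibbles[i / 2] |= if i % 2 == 0 { q } else { q << 4 };
        }
        ToyBlock { scale, nibbles }
    }

    /// Decode the block into a caller-provided stack buffer.
    fn decode(&self, out: &mut [f32; BLOCK]) {
        for (i, o) in out.iter_mut().enumerate() {
            let byte = self.nibbles[i / 2];
            let q = if i % 2 == 0 { byte & 0x0F } else { byte >> 4 };
            *o = (q as i32 - 8) as f32 * self.scale;
        }
    }
}

/// Row-major quantisation of a matrix whose length is a multiple of BLOCK.
fn quantize_matrix(w: &[f32]) -> Vec<ToyBlock> {
    w.chunks(BLOCK).map(ToyBlock::quantize).collect()
}

/// y[s * rows + r] = sum_k x[s * cols + k] * W[r][k], straight from the blocks.
fn amortised_matmul(blocks: &[ToyBlock], rows: usize, cols: usize, x: &[f32], seq: usize) -> Vec<f32> {
    let bpr = cols / BLOCK; // blocks per weight row
    let mut y = vec![0.0f32; seq * rows];
    let mut buf = [0.0f32; BLOCK];
    for r in 0..rows {
        for b in 0..bpr {
            blocks[r * bpr + b].decode(&mut buf); // decoded once per block...
            for s in 0..seq {
                // ...and reused for every prompt position.
                let xs = &x[s * cols + b * BLOCK..s * cols + (b + 1) * BLOCK];
                let dot: f32 = xs.iter().zip(buf.iter()).map(|(a, w)| a * w).sum();
                y[s * rows + r] += dot;
            }
        }
    }
    y
}

/// Reference path: dequantise the whole matrix to f32 first, then multiply.
fn dequant_then_matmul(blocks: &[ToyBlock], rows: usize, cols: usize, x: &[f32], seq: usize) -> Vec<f32> {
    let mut w = vec![0.0f32; rows * cols];
    let mut buf = [0.0f32; BLOCK];
    for (i, blk) in blocks.iter().enumerate() {
        blk.decode(&mut buf);
        w[i * BLOCK..(i + 1) * BLOCK].copy_from_slice(&buf);
    }
    let mut y = vec![0.0f32; seq * rows];
    for s in 0..seq {
        for r in 0..rows {
            y[s * rows + r] = (0..cols).map(|k| x[s * cols + k] * w[r * cols + k]).sum();
        }
    }
    y
}

fn main() {
    let (rows, cols, seq) = (4usize, 64usize, 5usize);
    let w: Vec<f32> = (0..rows * cols).map(|i| ((i * 37 % 29) as f32 - 14.0) / 14.0).collect();
    let x: Vec<f32> = (0..seq * cols).map(|i| ((i * 13 % 17) as f32 - 8.0) / 8.0).collect();
    let blocks = quantize_matrix(&w);
    let fast = amortised_matmul(&blocks, rows, cols, &x, seq);
    let slow = dequant_then_matmul(&blocks, rows, cols, &x, seq);
    let max_diff = fast.iter().zip(&slow).map(|(a, b)| (a - b).abs()).fold(0.0f32, f32::max);
    println!("max |amortised - reference| = {max_diff:e}");
}
```

Both paths read the same quantised values, so they agree up to floating-point summation order. The amortised version saves the `rows * cols` f32 buffer and the pass that fills it, which is the cost the paragraph describes as removed.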
415 changes: 415 additions & 0 deletions ROADMAP.md

Large diffs are not rendered by default.

22 changes: 22 additions & 0 deletions bench/baselines/cpu/COMPARISON.md
@@ -1,5 +1,27 @@
# larql vs llama.cpp — CPU decode on Gemma 3 4B Q4_K

+ > **Update 2026-06-22 — prefill gap largely closed.** The q4k-direct prefill
+ > work changed the picture: Q4_K/Q6_K attention (Q/K/V/O) and FFN (gate/up/down)
+ > projections now run straight from the vindex bytes with no per-layer f32
+ > dequant — `q4k_matmul`/`q6k_matmul` (the Q6_K twin, used by the default Q6_K
+ > `down_proj` and `v_proj`), with a hand-written aarch64 NEON inner dot.
+ > Apple M3 Max, CPU only (`-t 8`), same model + prompt as below.
+ >
+ > | Metric | larql (standard) | llama.cpp | Ratio |
+ > |---|---:|---:|---:|
+ > | Decode (tg, tok/s) | ~42 | ~38 | **~1.1× ahead** |
+ > | Prefill (5-tok prompt, ms) | 233 | ~70 | **~3.3× behind** (was 55×) |
+ > | Prefill vs the May full-dequant path | 2746 → 233 ms | | **11.8× faster** |
+ >
+ > Decode is now at/ahead of llama.cpp; prefill went from 55× behind to ~3×. The
+ > NEON `q4k_matmul` at seq=5 actually *beats* f32 AMX sgemm (1.0–1.3×) while
+ > skipping the dequant. The remaining prefill gap is constant-factor kernel work
+ > (our matmul vs llama.cpp's hand-tuned asm) plus batched attention, not dequant.
+ > Numbers are same-session (machine warm from builds) — ratios hold; cold
+ > absolutes run a touch faster. The 2026-05-15 baseline below is kept for history.
+
+ ---
+
Recorded 2026-05-15 on Apple M3 Max, 12 threads, BLAS / Accelerate enabled,
no GPU. Both engines load the same model weights — `output/larql-gemma-3-4b-it.gguf`
quantized to Q4_K_M for llama.cpp, the matching `output/gemma3-4b-q4k-v2.vindex`
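The ratios in the update table above follow directly from the timings it quotes. As a quick check, the arithmetic can be sketched as below. The inputs are the table's own figures, several of which are marked approximate (`~`) in the source, and the helper `ratio` is a hypothetical name introduced only for this example.

```rust
/// Plain quotient, named for readability (hypothetical helper).
fn ratio(numerator: f64, denominator: f64) -> f64 {
    numerator / denominator
}

fn main() {
    // Prefill: May full-dequant path vs the q4k-direct path (ms).
    let prefill_speedup = ratio(2746.0, 233.0);
    // Prefill: larql vs llama.cpp on the 5-token prompt (ms, llama.cpp figure approximate).
    let prefill_gap = ratio(233.0, 70.0);
    // Decode: larql vs llama.cpp (tok/s, both approximate).
    let decode_lead = ratio(42.0, 38.0);

    println!("prefill speedup: {prefill_speedup:.1}x"); // prints 11.8x
    println!("prefill gap:     {prefill_gap:.1}x"); // prints 3.3x
    println!("decode lead:     {decode_lead:.1}x"); // prints 1.1x
}
```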
20 changes: 15 additions & 5 deletions crates/larql-cli/src/commands/dev/ov_rd/basis.rs
@@ -259,8 +259,12 @@ pub(super) fn fit_z_pca_bases(
for layer in 0..weights.num_layers {
let inserted = insert_q4k_layer_tensors(weights, index, layer)?;
if let Some(layer_heads) = heads_by_layer.get(&layer) {
- let (_, pre_o) = run_attention_block_with_pre_o(weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
let head_dim = weights.arch.head_dim_for_layer(layer);
for head in layer_heads {
let basis = bases.get(head).expect("basis pre-created for PCA fit");
@@ -287,9 +291,15 @@ pub(super) fn fit_z_pca_bases(

{
let ffn = WeightFfn { weights };
- if let Some((h_new, _, _)) =
- run_layer_with_ffn(weights, &h, layer, &ffn, false, ple_inputs.get(layer), None)
- {
+ if let Some((h_new, _, _)) = run_layer_with_ffn(
+ larql_inference::WeightsView::dense(weights),
+ &h,
+ layer,
+ &ffn,
+ false,
+ ple_inputs.get(layer),
+ None,
+ ) {
h = h_new;
}
}
10 changes: 7 additions & 3 deletions crates/larql-cli/src/commands/dev/ov_rd/capture.rs
@@ -132,8 +132,12 @@ pub(super) fn run_capture(args: CaptureArgs) -> Result<(), Box<dyn std::error::E
let inserted = insert_q4k_layer_tensors(&mut weights, &index, layer)?;

if capture_layer(layer) {
- let (_, pre_o) = run_attention_block_with_pre_o(&weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(&weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
add_pre_o_stats(
&mut stats[layer],
&pre_o,
@@ -160,7 +164,7 @@ pub(super) fn run_capture(args: CaptureArgs) -> Result<(), Box<dyn std::error::E
{
let ffn = WeightFfn { weights: &weights };
if let Some((h_new, _, _)) = run_layer_with_ffn(
- &weights,
+ larql_inference::WeightsView::dense(&weights),
&h,
layer,
&ffn,
22 changes: 16 additions & 6 deletions crates/larql-cli/src/commands/dev/ov_rd/edit_catalog.rs
@@ -278,7 +278,7 @@ pub(super) fn run_oracle_edit_catalog(
}
let stratum = record.stratum.as_deref().unwrap_or("unknown");
let baseline_hidden =
- larql_inference::vindex::predict_kquant_hidden(&mut weights, &token_ids, &index, None);
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None);
let baseline_logits = final_logits(&weights, &baseline_hidden);
let baseline_logp = log_softmax(&baseline_logits);
let baseline_top1 = argmax(&baseline_logits);
@@ -483,8 +483,12 @@ fn fit_edit_catalogs(
for layer in 0..weights.num_layers {
let inserted = insert_q4k_layer_tensors(weights, index, layer)?;
if let Some(layer_heads) = heads_by_layer.get(&layer) {
- let (_, pre_o) = run_attention_block_with_pre_o(weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
let head_dim = weights.arch.head_dim_for_layer(layer);
for head in layer_heads {
let basis = bases.get(head).expect("basis pre-created for edit catalog");
@@ -537,9 +541,15 @@ fn fit_edit_catalogs(

{
let ffn = WeightFfn { weights };
- if let Some((h_new, _, _)) =
- run_layer_with_ffn(weights, &h, layer, &ffn, false, ple_inputs.get(layer), None)
- {
+ if let Some((h_new, _, _)) = run_layer_with_ffn(
+ larql_inference::WeightsView::dense(weights),
+ &h,
+ layer,
+ &ffn,
+ false,
+ ple_inputs.get(layer),
+ None,
+ ) {
h = h_new;
}
}
9 changes: 2 additions & 7 deletions crates/larql-cli/src/commands/dev/ov_rd/eval_program/mod.rs
@@ -254,15 +254,10 @@ pub(super) fn run_eval_program(args: EvalProgramArgs) -> Result<(), Box<dyn std:
{
h
} else {
- larql_inference::vindex::predict_kquant_hidden(
- &mut weights,
- &token_ids,
- &index,
- None,
- )
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None)
}
} else {
- larql_inference::vindex::predict_kquant_hidden(&mut weights, &token_ids, &index, None)
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None)
};
let baseline_logits = final_logits(&weights, &baseline_h);
let baseline_logp = log_softmax(&baseline_logits);
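The call-site changes in these files follow one pattern: readers that took `weights` directly now take `WeightsView::dense(weights)`, and `predict_kquant_hidden` borrows `&weights` instead of `&mut weights`. The commit list attributes this to an engine-owned dequant scratch. The sketch below illustrates why such a view could let readers borrow the weights immutably. It is an illustration under stated assumptions, not the definition in `larql_models`: `ModelWeights`, `ScratchSketch`, `WeightsViewSketch`, `with_scratch`, and `tensor` are hypothetical names, and only the `dense` constructor name is taken from the diff.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the model's resident f32 tensors.
struct ModelWeights {
    tensors: HashMap<String, Vec<f32>>,
}

/// Hypothetical engine-owned scratch holding tensors dequantised for the current layer.
#[derive(Default)]
struct ScratchSketch {
    tensors: HashMap<String, Vec<f32>>,
}

/// Read-only view: shared weights plus an optional per-engine scratch overlay.
#[derive(Clone, Copy)]
struct WeightsViewSketch<'a> {
    weights: &'a ModelWeights,
    scratch: Option<&'a ScratchSketch>,
}

impl<'a> WeightsViewSketch<'a> {
    /// Dense caller: every tensor is already resident, so no scratch is attached.
    fn dense(weights: &'a ModelWeights) -> Self {
        Self { weights, scratch: None }
    }

    /// Quantised caller: the engine dequantised into its own scratch beforehand.
    fn with_scratch(weights: &'a ModelWeights, scratch: &'a ScratchSketch) -> Self {
        Self { weights, scratch: Some(scratch) }
    }

    /// Look a tensor up in the scratch first, then fall back to the resident weights.
    fn tensor(&self, name: &str) -> Option<&'a [f32]> {
        self.scratch
            .and_then(|s| s.tensors.get(name))
            .or_else(|| self.weights.tensors.get(name))
            .map(|v| v.as_slice())
    }
}

/// A reader takes the view by value and never needs `&mut` access to the weights.
fn first_weight(view: WeightsViewSketch<'_>, name: &str) -> Option<f32> {
    view.tensor(name).and_then(|t| t.first().copied())
}

fn main() {
    let mut weights = ModelWeights { tensors: HashMap::new() };
    weights.tensors.insert("norm".to_string(), vec![1.0, 2.0]);

    let mut scratch = ScratchSketch::default();
    scratch.tensors.insert("q_proj".to_string(), vec![0.25, 0.5]);

    let dense = WeightsViewSketch::dense(&weights);
    let quant = WeightsViewSketch::with_scratch(&weights, &scratch);

    println!("{:?}", first_weight(dense, "norm")); // Some(1.0)
    println!("{:?}", first_weight(dense, "q_proj")); // None
    println!("{:?}", first_weight(quant, "q_proj")); // Some(0.25)
}
```

Because the view only holds shared references, several engines could read the same weights concurrently while each keeps its own scratch. That is consistent with the `&mut` being dropped at the call sites above, though the real types may differ.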
20 changes: 15 additions & 5 deletions crates/larql-cli/src/commands/dev/ov_rd/gamma_address.rs
@@ -733,8 +733,12 @@ fn collect_gamma_code_samples(
let inserted = insert_q4k_layer_tensors(weights, index, layer)?;
if let Some(layer_heads) = heads_by_layer.get(&layer) {
let layer_input = h.clone();
- let (_, pre_o) = run_attention_block_with_pre_o(weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
let head_dim = weights.arch.head_dim_for_layer(layer);
for head in layer_heads {
let basis = bases.get(head).ok_or_else(|| {
@@ -787,9 +791,15 @@ fn collect_gamma_code_samples(

{
let ffn = WeightFfn { weights };
- if let Some((h_new, _, _)) =
- run_layer_with_ffn(weights, &h, layer, &ffn, false, ple_inputs.get(layer), None)
- {
+ if let Some((h_new, _, _)) = run_layer_with_ffn(
+ larql_inference::WeightsView::dense(weights),
+ &h,
+ layer,
+ &ffn,
+ false,
+ ple_inputs.get(layer),
+ None,
+ ) {
h = h_new;
} else {
remove_layer_tensors(weights, inserted);
4 changes: 2 additions & 2 deletions crates/larql-cli/src/commands/dev/ov_rd/oracle.rs
@@ -270,7 +270,7 @@ pub(super) fn run_oracle_roundtrip(
let stratum = record.stratum.as_deref().unwrap_or("unknown");

let baseline_hidden =
- larql_inference::vindex::predict_kquant_hidden(&mut weights, &token_ids, &index, None);
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None);
let baseline_logits = final_logits(&weights, &baseline_hidden);
let baseline_logp = log_softmax(&baseline_logits);

@@ -414,7 +414,7 @@ pub(super) fn run_oracle_lowrank(
let stratum = record.stratum.as_deref().unwrap_or("unknown");

let baseline_hidden =
- larql_inference::vindex::predict_kquant_hidden(&mut weights, &token_ids, &index, None);
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None);
let baseline_logits = final_logits(&weights, &baseline_hidden);
let baseline_logp = log_softmax(&baseline_logits);
let baseline_top1 = argmax(&baseline_logits);
2 changes: 1 addition & 1 deletion crates/larql-cli/src/commands/dev/ov_rd/oracle_pq.rs
@@ -1715,7 +1715,7 @@ pub(super) fn run_oracle_pq(args: OraclePqArgs) -> Result<(), Box<dyn std::error
let stratum = record.stratum.as_deref().unwrap_or("unknown");

let baseline_hidden =
- larql_inference::vindex::predict_kquant_hidden(&mut weights, &token_ids, &index, None);
+ larql_inference::vindex::predict_kquant_hidden(&weights, &token_ids, &index, None);
let baseline_logits = final_logits(&weights, &baseline_hidden);
let baseline_logp = log_softmax(&baseline_logits);
let baseline_top1 = argmax(&baseline_logits);
21 changes: 16 additions & 5 deletions crates/larql-cli/src/commands/dev/ov_rd/oracle_pq_address.rs
@@ -1314,7 +1314,11 @@ where
if let Some(qk_rank) = reduced_qk_rank {
let (_, pre_o, all_weights) =
run_attention_block_with_pre_o_and_reduced_qk_attention_weights(
- weights, &h, layer, None, qk_rank,
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ None,
+ qk_rank,
)
.ok_or_else(|| {
format!(
@@ -1325,16 +1329,23 @@ where
} else {
let (_, pre_o, all_weights) =
run_attention_block_with_pre_o_and_all_attention_weights(
- weights, &h, layer, None,
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ None,
)
.ok_or_else(|| {
format!("pre-W_O/all-attention capture failed at layer {layer}")
})?;
(pre_o, Some(all_weights))
}
} else {
- let (_, pre_o) = run_attention_block_with_pre_o(weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
(pre_o, None)
};
let (pre_o, all_weights) = capture;
@@ -1405,7 +1416,7 @@ where
{
let ffn = WeightFfn { weights };
if let Some((h_new, activation, _)) = run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,
20 changes: 15 additions & 5 deletions crates/larql-cli/src/commands/dev/ov_rd/oracle_pq_fit.rs
@@ -56,8 +56,12 @@ pub(super) fn fit_pq_codebooks(
for layer in 0..weights.num_layers {
let inserted = insert_q4k_layer_tensors(weights, index, layer)?;
if let Some(layer_heads) = heads_by_layer.get(&layer) {
- let (_, pre_o) = run_attention_block_with_pre_o(weights, &h, layer)
- .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
+ let (_, pre_o) = run_attention_block_with_pre_o(
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ )
+ .ok_or_else(|| format!("pre-W_O capture failed at layer {layer}"))?;
let head_dim = weights.arch.head_dim_for_layer(layer);
for head in layer_heads {
let basis = bases.get(head).expect("basis pre-created for PQ fit");
@@ -102,9 +106,15 @@ pub(super) fn fit_pq_codebooks(

{
let ffn = WeightFfn { weights };
- if let Some((h_new, _, _)) =
- run_layer_with_ffn(weights, &h, layer, &ffn, false, ple_inputs.get(layer), None)
- {
+ if let Some((h_new, _, _)) = run_layer_with_ffn(
+ larql_inference::WeightsView::dense(weights),
+ &h,
+ layer,
+ &ffn,
+ false,
+ ple_inputs.get(layer),
+ None,
+ ) {
h = h_new;
}
}
21 changes: 14 additions & 7 deletions crates/larql-cli/src/commands/dev/ov_rd/oracle_pq_forward.rs
@@ -189,7 +189,7 @@ pub(super) fn capture_layer_input_hidden(
.and_then(|src| kv_cache.get(&src));
let ffn = WeightFfn { weights };
run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,
@@ -239,7 +239,7 @@ pub(super) fn capture_prev_ffn_feature_keys(
.and_then(|src| kv_cache.get(&src));
let ffn = WeightFfn { weights };
run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,
@@ -305,7 +305,7 @@ pub(super) fn capture_ffn_first_feature_keys(
.and_then(|src| kv_cache.get(&src));
let ffn = WeightFfn { weights };
run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,
@@ -348,7 +348,10 @@ pub(super) fn capture_attention_relation_rows(
.kv_shared_source_layer(layer)
.and_then(|src| kv_cache.get(&src));
let (_, _, all_weights) = run_attention_block_with_pre_o_and_all_attention_weights(
- weights, &h, layer, shared_kv,
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ shared_kv,
)
.ok_or_else(|| {
format!(
@@ -369,7 +372,7 @@ pub(super) fn capture_attention_relation_rows(
.and_then(|src| kv_cache.get(&src));
let ffn = WeightFfn { weights };
run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,
@@ -414,7 +417,11 @@ pub(super) fn capture_reduced_qk_attention_rows(
.and_then(|src| kv_cache.get(&src));
let (_, _, all_weights) =
run_attention_block_with_pre_o_and_reduced_qk_attention_weights(
- weights, &h, layer, shared_kv, qk_rank,
+ larql_models::WeightsView::dense(weights),
+ &h,
+ layer,
+ shared_kv,
+ qk_rank,
)
.ok_or_else(|| {
format!(
@@ -439,7 +446,7 @@ pub(super) fn capture_reduced_qk_attention_rows(
.and_then(|src| kv_cache.get(&src));
let ffn = WeightFfn { weights };
run_layer_with_ffn(
- weights,
+ larql_inference::WeightsView::dense(weights),
&h,
layer,
&ffn,