cd crates/larql-python
uv sync --no-install-project --group dev # creates .venv + installs dev deps
uv run --no-sync maturin develop --release # builds PyO3 extension into .venv
uv run --no-sync pytest tests/ # run binding tests
MLX attention + Rust sparse FFN. FFN weights mmap'd — only touched pages loaded.
For models that don't fit in memory.
from larql.walk_ffn import load
import mlx_lm

model, tokenizer = load("model.vindex", top_k=4096)
response = mlx_lm.generate(model, tokenizer, prompt="...", max_tokens=20)
# Walk FFN: 7.1 GB FFN weights handled by Rust (not in MLX memory)
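The "only touched pages loaded" behaviour can be illustrated with a generic memory map. The following is a minimal sketch using numpy.memmap on a flat float32 file. It is not the vindex file format or the Rust implementation, and the file name, matrix shape, and selected row indices are made up for illustration.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes for illustration; not the vindex layout.
ROWS, COLS = 1024, 256

path = os.path.join(tempfile.mkdtemp(), "ffn_weights.bin")

# Write a stand-in weight matrix to disk.
weights = np.arange(ROWS * COLS, dtype=np.float32).reshape(ROWS, COLS)
weights.tofile(path)

# Map the file without reading it up front; the OS loads pages
# only when the corresponding bytes are actually accessed.
mapped = np.memmap(path, dtype=np.float32, mode="r", shape=(ROWS, COLS))

# Touch only a sparse subset of rows, as a top-k selection might.
selected_rows = [3, 500, 1000]
x = np.ones(COLS, dtype=np.float32)

# Fancy indexing copies just the selected rows into memory.
activations = mapped[selected_rows] @ x
print(activations.shape)
```

Only the pages backing the three selected rows need to be resident; the rest of the file stays on disk until something reads it.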
4. WalkModel (zero-copy mmap)
Rust inference with mmap'd weights. Load RSS: ~450 MB for a 4B model (vs 18 GB heap).
For 120B models: ~1 GB RSS instead of 220 GB.
wm = larql.WalkModel("model.vindex", top_k=4096)
result = wm.predict("The capital of France is")
# [("Paris", 0.498), ...]
Memory & Performance (Gemma 3 4B, f32)
| Path | Load RSS | Inference | How |
| --- | --- | --- | --- |
| WalkModel / vindex.infer() | +0 MB (mmap) | 19s→13s (warms up) | Zero-copy mmap, OS pages on demand |
| larql.mlx.load() | +22 GB | 0.9s (GPU) | All weights in MLX/GPU memory |
| Native MLX | +8.6 GB | 0.9s (GPU) | Safetensors in GPU memory |
vindex.infer() uses mmap'd weights (lazy-loaded on first call, reused after).
The OS page cache warms up across calls: the second call is faster, the third faster still.
For 120B models: WalkModel ~1 GB load RSS vs native 220 GB.
With madvise prefetching, steady state is ~200-500 ms/token once the cache is warm.
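One way to observe the warm-up is to time several consecutive calls. This is a rough sketch with a stand-in workload so it runs without a model; `time_calls` and `tokens_per_second` are hypothetical helpers, not part of larql.

```python
import time


def time_calls(fn, n_calls=3):
    """Call fn() n_calls times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(n_calls):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def tokens_per_second(ms_per_token):
    """Convert a per-token latency in milliseconds to throughput."""
    return 1000.0 / ms_per_token


# Stand-in workload. With a real vindex this could be something like
# lambda: vindex.infer("The capital of France is")
durations = time_calls(lambda: sum(range(100_000)))
for i, seconds in enumerate(durations, start=1):
    print(f"call {i}: {seconds:.4f}s")

# 200-500 ms/token corresponds to roughly 5 down to 2 tokens per second.
print(tokens_per_second(200), tokens_per_second(500))
```

With mmap'd weights, the first timed call would be expected to include page faults from disk, while later calls hit the OS page cache.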
LQL Session
session = larql.session("model.vindex")
session.query("DESCRIBE 'France'")
session.query("WALK 'The capital of France is' TOP 10")
session.vindex.gate_vectors(layer=26) # numpy access on same session
API Reference
Loading
| Function | Description |
| --- | --- |
| larql.load(path) | Load vindex, returns Vindex |
| larql.session(path) | LQL session with .query() and .vindex |
| larql.mlx.load(path) | MLX model from vindex (all weights in MLX) |
| larql.walk_ffn.load(path, top_k) | MLX attention + Rust FFN (mmap'd) |
| larql.WalkModel(path, top_k) | Rust inference with mmap'd weights |
Vindex — Inference
| Method | Description |
| --- | --- |
| infer(prompt, top_k_predictions=5) | Full Rust forward pass, returns [(token, prob)]. Routes through larql_inference::infer_patched for byte-identical parity with LQL SELECT ... INFER (ADR 0001) |
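Since infer() returns a list of (token, prob) pairs, the result can be post-processed with plain Python. This is a small sketch: `confident` is a hypothetical helper, and only the ("Paris", 0.498) pair comes from the example output above; the other entries are placeholders.

```python
from typing import List, Tuple

Prediction = Tuple[str, float]


def confident(predictions: List[Prediction], min_prob: float = 0.05) -> List[Prediction]:
    """Keep predictions at or above min_prob, highest probability first."""
    kept = [(token, prob) for token, prob in predictions if prob >= min_prob]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Placeholder data shaped like the documented return value.
sample = [("placeholder_b", 0.01), ("Paris", 0.498), ("placeholder_a", 0.10)]
top = confident(sample)
for token, prob in top:
    print(f"{token}: {prob:.3f}")
```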
# Synthetic tests (run anywhere, no model files)
pytest crates/larql-python/tests/ -v
# With real vindex (integration tests for infer, WalkModel, MLX)
REAL_VINDEX_PATH=output/gemma3-4b-v2.vindex pytest crates/larql-python/tests/ -v
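The integration tests depend on REAL_VINDEX_PATH pointing at an extracted vindex. The sketch below shows one way a test helper could resolve that variable; how the project's own tests gate on it is not shown here, so `real_vindex_path` is an assumption for illustration only.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def real_vindex_path(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return the path from REAL_VINDEX_PATH if it is set and exists, else None."""
    env = os.environ if env is None else env
    raw = env.get("REAL_VINDEX_PATH", "").strip()
    if not raw:
        return None
    path = Path(raw)
    return path if path.exists() else None


resolved = real_vindex_path()
print("integration tests enabled" if resolved else "synthetic tests only")
```

A test module could call such a helper once and skip the model-dependent cases when it returns None.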
Extracting a Vindex
# Browse level (knowledge queries only)
larql extract-index "google/gemma-3-4b-it" -o model.vindex
# All weights (for inference + MLX)
larql extract-index "google/gemma-3-4b-it" -o model.vindex --level all
# Half precision (recommended for MLX)
larql extract-index "google/gemma-3-4b-it" -o model.vindex --level all --f16
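As a back-of-the-envelope check on why half precision helps, raw weight size scales with bytes per parameter. The sketch below assumes a nominal 4 billion parameters; the real parameter count differs, and a vindex may store additional data, so treat the numbers as rough estimates rather than actual file sizes.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAM = {"f32": 4, "f16": 2}


def estimate_weight_gb(n_params: int, dtype: str) -> float:
    """Estimate raw weight size in GB (10^9 bytes) for a parameter count and dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


NOMINAL_PARAMS = 4_000_000_000  # assumed round figure for a "4B" model

f32_gb = estimate_weight_gb(NOMINAL_PARAMS, "f32")
f16_gb = estimate_weight_gb(NOMINAL_PARAMS, "f16")
print(f"f32: ~{f32_gb:.0f} GB, f16: ~{f16_gb:.0f} GB")
```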