CPU-only MoE expert servers. No GPU, no VRAM. The laptop runs the hot path (attention + routing); fly.io machines serve the expert bank from memory-mapped vindex shards.
Each performance-8x (16 GB) machine serves one 64-expert shard cleanly:
- ~6.2 GB: expert pages (64 experts × 30 layers × 421 MB / 128)
- ~1.8 GB: embeddings + dense FFN + norms (shared overhead)
- ~8 GB headroom (no thrashing)
--warmup-walk-ffn pre-faults owned expert pages at startup. Pages for
other shards' experts are never accessed (rejected by --experts filter),
so they never consume physical RAM.
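The expert-page figure in the budget above can be reproduced with a little arithmetic. The sketch below is illustrative only: the helper name expert_pages_gb is hypothetical, and it assumes "GB" means 1024 MB, which is the reading that matches the ~6.2 GB figure.

```shell
#!/bin/sh
# Expert-page footprint for one machine, from the figures quoted above:
# 421 MB per layer across all 128 experts, 30 layers.
# Assumption: 1 GB = 1024 MB (this reproduces the ~6.2 GB figure).
expert_pages_gb() {
  owned="$1"   # number of experts this machine owns
  awk -v owned="$owned" 'BEGIN { printf "%.1f\n", owned * 30 * 421 / 128 / 1024 }'
}

expert_pages_gb 64    # one 64-expert shard
expert_pages_gb 128   # a single machine owning every expert
```

Under that assumption the two calls print 6.2 and 12.3. Adding the ~1.8 GB of shared overhead to the second figure gives about 14.1 GB, which is why a single 16 GB machine holding all 128 experts has so little headroom.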
- fly CLI installed and authenticated
- HuggingFace account (to host the expert-server slice)
- Vindex extracted locally:
output/gemma4-26b-a4b-q4k.vindex
The expert-server preset includes everything the server needs: embeddings,
norms, dense FFN (interleaved_q4k.bin), per-layer expert weights (layers/),
and tokenizer. Total: ~14.1 GB.
larql slice output/gemma4-26b-a4b-q4k.vindex \
-o /tmp/gemma4-26b-expert-server.vindex \
--preset expert-server
larql publish /tmp/gemma4-26b-expert-server.vindex \
--repo chrishayuk/gemma-4-26b-a4b-it-vindex-expert-server \
--slices none
The live slice is already published at
hf://chrishayuk/gemma-4-26b-a4b-it-vindex-expert-server.
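If you build your own slice, a quick sanity check before publishing can save a failed upload. The sketch below is not part of larql: check_slice is a hypothetical helper, and it assumes interleaved_q4k.bin and layers/ sit at the top level of the sliced vindex directory, as the preset description suggests.

```shell
#!/bin/sh
# Sketch: confirm a sliced vindex contains the pieces named in the preset
# description, then report its size on disk.
# Assumption: both entries live directly under the vindex directory.
check_slice() {
  dir="$1"
  [ -f "$dir/interleaved_q4k.bin" ] || { echo "missing interleaved_q4k.bin" >&2; return 1; }
  [ -d "$dir/layers" ] || { echo "missing layers/" >&2; return 1; }
  du -sh "$dir" | cut -f1   # size only; a full slice should be around 14 GB
}

check_slice /tmp/gemma4-26b-expert-server.vindex || echo "slice is incomplete; not publishing"
```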
Each shard serves half the expert bank. Pages for the owned half are pre-faulted at startup; the other half is never touched.
Shard A — experts 0–63:
fly apps create larql-expert-server-a
fly volumes create expert_data --size 25 --app larql-expert-server-a --region lhr --yes
fly secrets set HF_TOKEN=hf_... EXPERTS="0-63" WARMUP="1" --app larql-expert-server-a
fly deploy --app larql-expert-server-a --config deploy/fly/fly.toml --remote-only
Shard B — experts 64–127:
fly apps create larql-expert-server-b
fly volumes create expert_data --size 25 --app larql-expert-server-b --region lhr --yes
fly secrets set HF_TOKEN=hf_... EXPERTS="64-127" WARMUP="1" --app larql-expert-server-b
fly deploy --app larql-expert-server-b --config deploy/fly/fly.toml --remote-only
Each machine downloads the full vindex on first boot (~2 min on fly's LHR
network). The --experts filter ensures only the owned half's pages are
ever faulted into RAM.
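The two shard deployments differ only in the app suffix and the expert range, so they can be wrapped in one helper. This is a minimal sketch: deploy_shard and the RUN dry-run switch are illustrative names, not part of larql or the fly CLI, and the fly commands are copied from the steps above.

```shell
#!/bin/sh
# Sketch: create, configure, and deploy one expert shard on fly.io.
# RUN defaults to "echo", so the commands are only printed (including the
# token). Invoke with RUN= (empty) to execute them for real.
RUN="${RUN-echo}"

deploy_shard() {
  suffix="$1"    # app suffix: a or b
  experts="$2"   # expert range owned by this shard, e.g. 0-63
  app="larql-expert-server-$suffix"
  if [ -z "${HF_TOKEN:-}" ]; then
    echo "HF_TOKEN is not set" >&2
    return 1
  fi
  $RUN fly apps create "$app" &&
    $RUN fly volumes create expert_data --size 25 --app "$app" --region lhr --yes &&
    $RUN fly secrets set "HF_TOKEN=$HF_TOKEN" "EXPERTS=$experts" WARMUP=1 --app "$app" &&
    $RUN fly deploy --app "$app" --config deploy/fly/fly.toml --remote-only
}
```

With HF_TOKEN exported, calling deploy_shard a 0-63 and then deploy_shard b 64-127 reproduces the two command sequences above. Keeping the dry run as the default makes it easy to review the commands before anything is created.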
larql run output/gemma4-26b-a4b-q4k.vindex --max-tokens 20 \
--moe-shards "0-63=https://larql-expert-server-a.fly.dev,\
64-127=https://larql-expert-server-b.fly.dev" \
"The capital of France is"
One machine can also serve all 128 experts. This requires performance-8x (16 GB); with the full ~14.1 GB slice resident there is little headroom, so expect some page pressure under sustained load.
fly apps create larql-expert-server
fly volumes create expert_data --size 25 --app larql-expert-server --region lhr --yes
fly secrets set HF_TOKEN=hf_... --app larql-expert-server
fly deploy --app larql-expert-server --config deploy/fly/fly.toml --remote-only
Test:
larql run output/gemma4-26b-a4b-q4k.vindex --max-tokens 1 \
--moe-shards "0-127=https://larql-expert-server.fly.dev" \
"The capital of France is"

| Variable | Default | Description |
|---|---|---|
| EXPERTS | "" | Expert range for this shard, e.g. "0-63". Empty = all experts. |
| WARMUP | "0" | Set to "1" to pre-fault owned expert pages at startup. |
| LAYERS | "" | Layer range, e.g. "0-14". Empty = all layers. |
| HF_REPO | chrishayuk/... | HuggingFace repo to download the vindex from. |
| VINDEX_PATH | /data/vindex | Local path for the vindex on the persistent volume. |
| PORT | 8080 | HTTP listen port. |
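Because each shard's EXPERTS range is set by hand, it is worth checking that the ranges tile the expert bank with no gaps or overlaps. The helper below is a hypothetical sketch, not a larql feature. It assumes ranges are inclusive (as the 0-63 / 64-127 split implies) and are passed in ascending order.

```shell
#!/bin/sh
# Sketch: verify that inclusive "lo-hi" ranges, given in ascending order,
# cover experts 0..total-1 exactly once.
covers_all() {
  total="$1"; shift
  next=0                      # first expert not yet covered
  for r in "$@"; do
    lo="${r%-*}"              # text before the dash
    hi="${r#*-}"              # text after the dash
    [ "$lo" -eq "$next" ] || return 1   # gap or overlap
    next=$((hi + 1))
  done
  [ "$next" -eq "$total" ]    # last range must end at total-1
}

covers_all 128 0-63 64-127 && echo "shards cover all 128 experts"
```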
Public internet (UK ↔ fly LHR): ~0.7 tok/s (30 serial RTTs × 45 ms each ≈ 1.35 s per token).
LAN or same-datacenter: ~19 tok/s.
Batch dispatch (1 RTT per token; approximate but usable): larql run ... --moe-dispatch batch.
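The public-internet figure follows directly from the round-trip count. The sketch below computes the network-only ceiling on throughput; tok_per_s is a hypothetical helper, and it ignores expert compute time, so real numbers will be lower.

```shell
#!/bin/sh
# Upper bound on tokens/s when each token costs N serial round trips.
# Network latency only; expert compute time is not modelled.
tok_per_s() {
  rtts="$1"     # serial round trips per token
  rtt_ms="$2"   # round-trip time in milliseconds
  awk -v n="$rtts" -v rtt="$rtt_ms" 'BEGIN { printf "%.2f\n", 1000 / (n * rtt) }'
}

tok_per_s 30 45   # one round trip per layer over the public internet
tok_per_s 1 45    # batch dispatch: one round trip per token
```

The first call prints 0.74, matching the ~0.7 tok/s above. The second prints 22.22, which is only the latency ceiling for batch dispatch at a 45 ms RTT, not a measured result.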