Distributed inference produces gibberish output — are the published vindex files correct? #105

MELDApps · 2026-05-17T06:42:45Z


Hi Chris,

I've spent nearly a week trying to get the decoupled inference setup working as shown in your video, and I'm stuck on a consistent issue: the output is always gibberish (multilingual nonsense), regardless of configuration. I did get a dense model working after making some changes to the code: I found and fixed two norm bugs in grid.rs (a missing pre-FFN norm before sending to the remote server, and a missing post-FFN norm on the server response). However, I can't get the MoE model to work at all.
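For reference, the shape of the two norm fixes can be sketched as follows. This is a Python illustration only, not the Rust code in grid.rs. It assumes a Gemma-style block where the FFN input is RMS-normalised before the remote call and the FFN output is RMS-normalised again before the residual add. The names rms_norm, ffn_block and remote_ffn are hypothetical, and the plain weight scaling is an assumption about the norm variant.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm: scale x to unit root-mean-square, then apply a per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def ffn_block(residual, pre_w, post_w, remote_ffn):
    """One FFN sub-block with a norm on each side of the remote call (assumed layout)."""
    h = rms_norm(residual, pre_w)   # pre-FFN norm, applied before sending to the server
    out = remote_ffn(h)             # hypothetical stand-in for the server round trip
    out = rms_norm(out, post_w)     # post-FFN norm, applied to the server response
    return [r + o for r, o in zip(residual, out)]  # residual add on the client

# Toy usage with an identity function in place of the server
residual = [3.0, 4.0]
ones = [1.0, 1.0]
result = ffn_block(residual, ones, ones, lambda h: h)
```

If either norm is skipped, the values added back into the residual stream have the wrong scale, which is consistent with output that degrades into nonsense tokens.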

Setup

Client (attention): M1 Mac Mini, 8GB RAM, Metal GPU, running larql run --ffn --metal
Server (FFN experts): HP ProDesk, 16GB RAM, CPU only, running larql-server
Branch: main (commit c0bcd620)
Model: Gemma 4 26B (google/gemma-4-26B-A4B-it)

What works

Client and server connect successfully
All 30 layers complete
Bytes flow correctly (~20790 KB sent and received)
Speed is plausible (~3.3 tok/s)

What doesn't work

Output is always gibberish — multilingual tokens regardless of prompt
Same output with --metal and without (CPU-only client)
Same output with local-only inference (no server)

The vindex files I downloaded from your HuggingFace repos all have "extract_level": "browse" in their index.json, including chrishayuk/gemma-4-26b-a4b-it-vindex-expert-server. Based on the larql slice --help output, am I correct that the preset for a distributed FFN server should be expert-server, not browse?

The browse preset only includes gate + embed + down_meta, with no forward pass weights. That would seem to explain why the server cannot correctly process the residual stream.
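A quick way to confirm what each downloaded vindex reports is to read the field directly. The sketch below assumes that index.json sits at the top level of each vindex directory and that "extract_level" is a top-level key. The helper names are hypothetical and are not part of larql.

```python
import json
from pathlib import Path

def extract_level(vindex_dir):
    """Return the "extract_level" value from index.json in a vindex directory, or None."""
    index_path = Path(vindex_dir) / "index.json"  # assumed location of the index file
    with index_path.open(encoding="utf-8") as f:
        return json.load(f).get("extract_level")  # assumed to be a top-level key

def check_levels(vindex_dirs):
    """Map each vindex directory to the extract level it reports."""
    return {str(d): extract_level(d) for d in vindex_dirs}
```

Calling check_levels on the local paths of the client and server downloads shows at a glance whether either one reports anything other than browse.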

Questions

Are the published HuggingFace vindex files (gemma-4-26b-a4b-it-vindex-expert-server, gemma-4-26b-a4b-client-vindex-client) built against the current main branch and intended for use with --ffn?
If the published files are not correct, what is the recommended way to build the client/server slices from scratch?
Is there a working end-to-end test or example script that demonstrates the decoupled setup?

I've gone through the README, the slice command help, and the route code, and I can't find documentation on which exact HuggingFace files pair with the current codebase.

Thanks for any help. This is an amazing project and I really want to get it working. I've also been investigating larql from the perspective of 'injecting' into a model without training, and of viewing the 'thinking' of a model, both of which are fascinating.

Best wishes,
Robert
(P.S. Much of the technical information is from Claude, which has been helping me get this set up.)
