Hi Chris,

I've spent nearly a week trying to get the decoupled inference setup working as shown in your video, and I'm stuck on a consistent issue: the output is always gibberish (multilingual nonsense), regardless of configuration. I did get it working with a dense model after making some changes to the code: I found and fixed two norm bugs in grid.rs (a missing pre-FFN norm before sending to the remote server, and a missing post-FFN norm on the server response). I can't get the MoE model to work at all.
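To make the two norm fixes concrete, here is a minimal sketch of the ordering I mean. It is not the code from grid.rs: the names `rms_norm`, `ffn_block`, and `remote_ffn` are hypothetical, the norm is a plain RMSNorm (Gemma's own scaling convention is not modeled), and the remote call is a stand-in for the network round trip.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """Plain RMSNorm: scale x to unit root-mean-square, then apply a per-channel weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

def ffn_block(residual, pre_weight, post_weight, remote_ffn):
    """One FFN sub-block with the norm ordering described above (illustrative only)."""
    # Pre-FFN norm: applied before the hidden state is sent to the remote server.
    h = rms_norm(residual, pre_weight)
    # Stand-in for the remote FFN call (hypothetical callable, not a larql API).
    out = remote_ffn(h)
    # Post-FFN norm: applied to the server response before the residual add.
    out = rms_norm(out, post_weight)
    # Residual connection uses the un-normed input.
    return [r + o for r, o in zip(residual, out)]
```

If either norm is skipped, the FFN sees (or returns) activations at the wrong scale, which is the kind of error that showed up as gibberish on the dense model before the fix.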
Setup
Client (attention): M1 Mac Mini, 8GB RAM, Metal GPU, running larql run --ffn --metal
Server (FFN experts): HP ProDesk, 16GB RAM, CPU only, running larql-server
Branch: main (commit c0bcd620)
Model: Gemma 4 26B (google/gemma-4-26B-A4B-it)
What works
Client and server connect successfully
All 30 layers complete
Bytes flow correctly (~20790 KB sent and received)
Speed is plausible (~3.3 tok/s)
What doesn't work
Output is always gibberish — multilingual tokens regardless of prompt
Same output with --metal and without (CPU-only client)
Same output with local-only inference (no server)
The vindex files downloaded from your HuggingFace repos all have "extract_level": "browse" in their index.json, including chrishayuk/gemma-4-26b-a4b-it-vindex-expert-server. Based on the larql slice --help output, am I correct that the preset for a distributed FFN server should be expert-server, not browse?
The browse preset only includes gate + embed + down_meta, with no forward-pass weights. Would that explain why the server cannot correctly process the residual stream?
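For anyone wanting to check their own download, this is a minimal sketch of reading that field. It assumes index.json sits at the root of the vindex directory and that "extract_level" is a top-level key, which is what I see in my copies. The function name is hypothetical and not part of larql.

```python
import json
from pathlib import Path

def read_extract_level(vindex_dir):
    """Return the "extract_level" value from <vindex_dir>/index.json, or None if the key is absent."""
    # Assumption: index.json lives directly inside the vindex directory.
    index_path = Path(vindex_dir) / "index.json"
    with index_path.open(encoding="utf-8") as f:
        index = json.load(f)
    return index.get("extract_level")
```

Calling `read_extract_level` on each downloaded vindex directory shows at a glance whether the slice reports browse or something else.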
Questions
Are the published HuggingFace vindex files (gemma-4-26b-a4b-it-vindex-expert-server, gemma-4-26b-a4b-client-vindex-client) built against the current main branch and intended for use with --ffn?
If the published files are not correct, what is the recommended way to build the client/server slices from scratch?
Is there a working end-to-end test or example script that demonstrates the decoupled setup?
I've gone through the README, the slice command help, and the route code, and I can't find documentation on which exact HuggingFace files pair with the current codebase.
Thanks for any help. This is an amazing project and I really want to get it working. I've also been investigating larql from the perspective of 'injecting' into a model without training, and of viewing the 'thinking' of a model, both of which are fascinating.
Best wishes,
Robert
(P.S. Much of the technical information is from Claude, which has been helping me get this set up.)