I discussed with Gemini the possibility of trying this on my local network #106
nnnarvaez
started this conversation in
Show and tell
Replies: 0 comments
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Based in 1 comment in your YouTube video and my reply (from my poor understanding of what you propose) I asked Gemini to evaluate the feasibility and clarify my understanding of the technology you propose/developed.
I am posting it here so others that will probably misunderstand like me ground their expectations.
Feasibility Report: Distributed Sharding of Gemma 4 via LARQL
1. Executive Summary
The objective was to evaluate whether the LARQL repository could be used to shard the Gemma 4 26B-A4B model across three separate PCs to pool their combined NVIDIA VRAM (30GB total) and run the model at Q6 quantization (~22GB) without spilling to slow system RAM.
Conclusion: While LARQL is architecturally brilliant for sharding logic, it is not feasible for this specific hardware setup today. The repository lacks a CUDA backend, meaning it cannot utilize NVIDIA GPUs for computation or VRAM for weight storage.
2. User Hardware & Strategy
The proposed plan aimed to leverage "scattered" consumer hardware to create a high-performance inference cluster:
3. Technical Analysis of LARQL
Our investigation of the repository (
chrishayuk/larql) revealed the following:ROADMAP.mdandCargo.tomlexplicitly confirm that the CUDA backend is "Planned" but not implemented.CpuBackendusing OpenBLAS.4. Critical Findings & Contradictions
During the discussion, we identified a core conflict between the user's vision and the software's capability:
5. Final Feasibility Score
6. Recommended Alternatives
For the user's specific goal of pooling distributed NVIDIA VRAM, the following tools are better suited:
Technical Report: Distributed CPU Sharding — Practical Trade-offs
1. The Core Value: Capacity over Speed
The primary realistic advantage of sharding across a local CPU network is RAM Pooling.
If a model requires 24GB of RAM (e.g., Gemma 4 Q6) but your available machines only have 16GB of RAM each, sharding is the only way to execute the model without expensive hardware upgrades. It allows you to utilize the "sunken cost" of existing RAM across multiple motherboards.
2. Technical Advantages (Grounded)
A. Memory Bandwidth Parallelism
LLM inference is almost entirely limited by memory bandwidth (how fast data moves from RAM to CPU).
B. MoE Efficiency (Traffic vs. Weights)
Sharding is uniquely suited for Mixture-of-Experts (MoE) models.
C. Thermal Load Distribution
Continuous LLM inference on a CPU can lead to thermal throttling on a single machine. Spreading the workload across three chassis allows for much better heat dissipation, maintaining a steady (if slower) token rate over long sessions.
3. The "Latency Tax" (Realistic Limits)
Sharding is not "free speed." It introduces a Network Latency Tax:
4. Summary: When to use this?
Distributed CPU sharding is a utilitarian solution, not a performance one. It is best used for:
Beta Was this translation helpful? Give feedback.
All reactions