I discussed with Gemini the possibility of trying this on my local network #106

nnnarvaez · 2026-05-17T09:21:50Z


Based on a comment on your YouTube video and my reply to it (written from my poor understanding of what you propose), I asked Gemini to evaluate the feasibility and to clarify my understanding of the technology you proposed and developed.

I am posting the result here so that others who might misunderstand it the way I did can ground their expectations.

Feasibility Report: Distributed Sharding of Gemma 4 via LARQL

1. Executive Summary

The objective was to evaluate whether the LARQL repository could be used to shard the Gemma 4 26B-A4B model across three separate PCs to pool their combined NVIDIA VRAM (30GB total) and run the model at Q6 quantization (~22GB) without spilling to slow system RAM.

Conclusion: While LARQL is architecturally brilliant for sharding logic, it is not feasible for this specific hardware setup today. The repository lacks a CUDA backend, meaning it cannot utilize NVIDIA GPUs for computation or VRAM for weight storage.

2. User Hardware & Strategy

The proposed plan aimed to leverage "scattered" consumer hardware to create a high-performance inference cluster:

PC 1 (Main): RTX 4070 (12GB) — Target: Attention + Context + 30 Experts.
PC 2 (Gaming): RTX 3060 (12GB) — Target: 68 Experts.
PC 3 (Old): RTX 1070 (6GB) — Target: 30 Experts.
Goal: Trade network latency for VRAM speed, keeping 100% of the weights in fast GPU memory.
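The plan's arithmetic can be sanity-checked with a short sketch. The VRAM sizes and expert counts are taken from the list above and the ~22GB figure from the summary; the helper name `check_plan` is hypothetical. It assumes, crudely, that the whole model is expert weight split evenly per expert, ignoring attention weights, KV cache and runtime overhead, so it is a rough check rather than a real memory planner.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    vram_gb: float
    experts: int


# Figures taken from the plan above.
NODES = [
    Node("PC 1 (RTX 4070)", 12.0, 30),
    Node("PC 2 (RTX 3060)", 12.0, 68),
    Node("PC 3 (1070)", 6.0, 30),
]
MODEL_GB = 22.0  # approximate Q6 size quoted in the summary


def check_plan(nodes, model_gb):
    """Return (total VRAM in GB, total experts, estimated GB per node)."""
    total_vram = sum(n.vram_gb for n in nodes)
    total_experts = sum(n.experts for n in nodes)
    # Assumption: all model bytes are expert weights, evenly sized per expert.
    per_node = {n.name: model_gb * n.experts / total_experts for n in nodes}
    return total_vram, total_experts, per_node


total_vram, total_experts, per_node = check_plan(NODES, MODEL_GB)
print(f"total VRAM: {total_vram} GB, total experts: {total_experts}")
for n in NODES:
    print(f"{n.name}: ~{per_node[n.name]:.1f} GB needed of {n.vram_gb} GB")
```

Under these assumptions the 68-expert machine ends up very close to its 12GB limit, which suggests the split would leave little headroom even with a working CUDA backend.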

3. Technical Analysis of LARQL

Our investigation of the repository (chrishayuk/larql) revealed the following:

Platform Specialization: LARQL is a research-first engine optimized for Apple Silicon (Metal) and High-Performance CPU inference.
The CUDA Gap: The ROADMAP.md and Cargo.toml explicitly confirm that the CUDA backend is "Planned" but not implemented.
Inference Path: On Linux and Windows, LARQL defaults to the CpuBackend using OpenBLAS.
Memory Mapping: Because there is no CUDA support, weights are mmap'd into System RAM, not VRAM. The software currently has no mechanism to "see" or utilize the memory on an RTX 3060 or 1070.

4. Critical Findings & Contradictions

During the discussion, we identified a core conflict between the user's vision and the software's capability:

VRAM vs. RAM: Sharding to remote machines in LARQL today only aggregates CPU cycles and System RAM bandwidth. It does not aggregate VRAM.
Network Bottleneck: Sharding across a 1Gbps network for CPU-based inference is likely slower than running a single-machine CPU-offload, as the network latency (RTT) outweighs the parallel CPU gains.
Software Alignment: LARQL's current "Ultimate Aim" is to achieve blazing speeds on CPUs alone (via a technique called WalkFfn), moving away from the requirement for expensive GPUs entirely.

5. Final Feasibility Score

| Category | Score | Note |
| --- | --- | --- |
| Architectural Fit | 9/10 | Supports non-uniform sharding and remote expert dispatch. |
| Memory Fit | 0/10 | Cannot utilize NVIDIA VRAM; restricted to System RAM. |
| Compute Fit | 0/10 | No CUDA kernels; NVIDIA GPUs will sit at 0% usage. |
| Overall Feasibility | Poor | Recommended only for Mac users or CPU-speed research. |

6. Recommended Alternatives

For the user's specific goal of pooling distributed NVIDIA VRAM, the following tools are better suited:

llama.cpp (with RPC): Supports sharding layers and experts across remote NVIDIA GPUs using CUDA.
vLLM + Ray: The industry standard for distributed GPU inference, specifically designed for multi-node VRAM pooling.

Technical Report: Distributed CPU Sharding — Practical Trade-offs

1. The Core Value: Capacity over Speed

The primary realistic advantage of sharding across a local CPU network is RAM Pooling.

If a model requires 24GB of RAM (e.g., Gemma 4 Q6) but your available machines have only 16GB of RAM each, sharding is the only way to run the model without expensive hardware upgrades. It lets you use the "sunk cost" of existing RAM spread across multiple motherboards.

2. Technical Advantages (Grounded)

A. Memory Bandwidth Parallelism

LLM inference is almost entirely limited by memory bandwidth (how fast data moves from RAM to CPU).

A single PC is limited by its DDR4/DDR5 channels (typically ~50-60 GB/s).
By sharding, you are utilizing multiple independent memory controllers. If Shard A and Shard B are both fetching weights from their own RAM, the aggregate bandwidth is higher than a single machine, potentially offsetting the CPU's slower compute speed.
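A back-of-the-envelope bound makes the bandwidth argument concrete: if each generated token has to read every active weight once, tokens per second cannot exceed memory bandwidth divided by active bytes. The sketch below assumes about 4B active parameters (read off the "A4B" in the model name) and about 22GB for 26B parameters at Q6. Both figures are assumptions rather than measurements, and the function name is hypothetical.

```python
def max_tokens_per_second(bandwidth_gb_s: float, active_gb: float) -> float:
    """Upper bound on decode speed when each token reads all active weights once."""
    return bandwidth_gb_s / active_gb


# Assumptions, not measurements:
TOTAL_PARAMS_B = 26.0   # "26B" in the model name
ACTIVE_PARAMS_B = 4.0   # "A4B" read as ~4B active parameters per token
MODEL_GB = 22.0         # approximate Q6 size quoted earlier in the report

bytes_per_param = MODEL_GB / TOTAL_PARAMS_B
active_gb = ACTIVE_PARAMS_B * bytes_per_param

# One machine at ~50 GB/s, the low end of the DDR4/DDR5 range above.
single = max_tokens_per_second(50.0, active_gb)
# Idealised: three machines stream their own shards fully in parallel,
# with no network cost and active experts spread evenly across them.
pooled = max_tokens_per_second(3 * 50.0, active_gb)

print(f"active weights per token: ~{active_gb:.2f} GB")
print(f"single machine ceiling:   ~{single:.1f} tokens/s")
print(f"three machines (ideal):   ~{pooled:.1f} tokens/s")
```

These are ceilings only. Real throughput is lower once compute time, cache behaviour and network round trips are included.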

B. MoE Efficiency (Traffic vs. Weights)

Sharding is uniquely suited for Mixture-of-Experts (MoE) models.

You don't send the model weights over the network.
You only send the hidden state vector (approx. 5KB for Gemma 4).
Because only a few experts fire per token, the network traffic is minimal, making 1Gbps Ethernet a viable interconnect for MoE, whereas it would be a major bottleneck for dense models.
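The size of that traffic is easy to estimate. A minimal sketch, assuming a hidden dimension of 2560 stored as 16-bit floats: the dimension is a hypothetical value picked only because it reproduces the ~5KB figure above, not a confirmed Gemma 4 parameter, and the function names are illustrative. The calculation covers raw wire time only and ignores protocol overhead and round-trip latency.

```python
def payload_bytes(hidden_dim: int, bytes_per_value: int) -> int:
    """Size of one hidden state vector in bytes."""
    return hidden_dim * bytes_per_value


def wire_time_us(n_bytes: int, link_bits_per_s: float) -> float:
    """Time to push n_bytes onto a link, in microseconds (no overhead, no latency)."""
    return n_bytes * 8 / link_bits_per_s * 1e6


HIDDEN_DIM = 2560        # hypothetical: chosen so that fp16 gives ~5KB
BYTES_PER_VALUE = 2      # fp16 / bf16
LINK_BITS_PER_S = 1e9    # 1Gbps Ethernet

size = payload_bytes(HIDDEN_DIM, BYTES_PER_VALUE)
t_us = wire_time_us(size, LINK_BITS_PER_S)

print(f"hidden state: {size} bytes")
print(f"wire time on 1Gbps: ~{t_us:.1f} microseconds")
```

On these assumptions the transfer itself takes tens of microseconds, well below a typical LAN round trip, so latency rather than bandwidth is the limiting network cost.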

C. Thermal Load Distribution

Continuous LLM inference on a CPU can lead to thermal throttling on a single machine. Spreading the workload across three chassis allows for much better heat dissipation, maintaining a steady (if slower) token rate over long sessions.

3. The "Latency Tax" (Realistic Limits)

Sharding is not "free speed." It introduces a Network Latency Tax:

Every time a layer needs a remote expert, there is a round-trip time (RTT) over your LAN (typically 0.5ms - 2ms).
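Across 30-60 layers, at one round trip per layer, this adds roughly 15ms - 120ms of network delay per token, and more if a layer needs several sequential round trips or the link is congested.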
Result: You will likely see lower "Tokens per Second" (TPS) than a single-machine GPU setup, but you gain the ability to run models that would otherwise crash due to Out-Of-Memory (OOM) errors.
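The latency tax can be estimated as layers × RTT × round trips per layer. A minimal sketch, assuming strictly sequential round trips and ignoring compute, serialization and queuing time. The function names are hypothetical, and the layer counts and RTT values are the ranges quoted above.

```python
def network_delay_ms(layers: int, rtt_ms: float, round_trips_per_layer: int = 1) -> float:
    """Network delay added to one token, in milliseconds."""
    return layers * rtt_ms * round_trips_per_layer


def tps_ceiling(delay_ms: float) -> float:
    """Tokens per second if network delay were the only cost."""
    return 1000.0 / delay_ms


best = network_delay_ms(layers=30, rtt_ms=0.5)
worst = network_delay_ms(layers=60, rtt_ms=2.0)
worst_two_trips = network_delay_ms(layers=60, rtt_ms=2.0, round_trips_per_layer=2)

for label, delay in [("best", best), ("worst", worst), ("worst, 2 trips/layer", worst_two_trips)]:
    print(f"{label}: {delay:.0f} ms per token, ceiling ~{tps_ceiling(delay):.1f} tokens/s")
```

The ceiling is network-only: actual tokens per second will be lower once each machine's compute time is added on top.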

4. Summary: When to use this?

Distributed CPU sharding is a utilitarian solution, not a performance one. It is best used for:

Batch Tasks: Summarizing 50 documents overnight where speed doesn't matter.
Large-Model Testing: Validating if a 26B+ model is even useful for your task before buying a $1000 GPU.
Hardware Re-use: Making use of an old PC or home server to act as a "permanent" expert bank.
