
Geodesic Attention Engine - Minimum-energy path through transformer attention. Fused Waller Kernel reduces HBM round-trips from 12 to 2. O(N) memory complexity, 23-37% Tok/J improvement, bit-exact determinism. No approximation, no sparsity - just the shortest path.


Geodesic Attention Engine (GAE)

License: AGPL-3.0

GAE computes exact transformer attention with fewer memory operations. The Fused Waller Kernel reduces HBM round-trips from 12 to 2, achieving O(N) memory complexity instead of O(N²).

Full Technical Specification — Deep dive into the math and implementation


Quick Results on NVIDIA H100 80GB

| Sequence Length | Standard Attention | GAE | Memory Reduction |
|----------------:|-------------------:|--------:|:----------------|
| 65,536 | 17.25 GB | 0.62 GB | 99.6% |
| 262,144 | 275 GB (impossible) | 0.82 GB | ✓ Works |
| 1,048,576 | 4.4 TB (impossible) | 1.09 GB | ✓ Works |

GAE enables 1M+ token sequences on hardware that cannot fit 64K with standard attention.
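The "impossible" entries follow directly from the size of the N×N score matrix that standard attention must materialize. A back-of-envelope sketch (illustrative only; fp32 and a single head assumed, not the repo's exact accounting) reproduces the larger figures in the table:

```rust
/// Bytes needed to materialize one fp32 N×N attention score matrix,
/// which standard attention writes out to HBM. Illustrative model:
/// fp32 (4 bytes/element), single head assumed.
fn score_matrix_bytes(n: u64) -> u64 {
    n * n * 4
}

fn main() {
    for n in [65_536u64, 262_144, 1_048_576] {
        let gb = score_matrix_bytes(n) as f64 / 1e9;
        // e.g. 262,144 → 274.9 GB, matching the ~275 GB row above
        println!("N = {n:>9}: {gb:.1} GB");
    }
}
```

At N = 1,048,576 this alone is ~4.4 TB, far beyond any single GPU's HBM, which is why the standard-attention column reads "impossible".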

Throughput: Auto-Tuned Waller Kernel on H100

| Seq Length | TFLOPS | Time (ms) | Mem Reduction |
|-----------:|-------:|----------:|:--------------|
| 8,192 | 119.8 | 0.29 | 46.3% |
| 16,384 | 146.2 | 0.94 | 84.7% |
| 32,768 | 164.4 | 3.34 | 95.3% |
| 65,536 | 170.4 | 12.90 | 98.4% |
| 131,072 | 174.4 | 50.44 | 99.4% |

Peak: 174.4 TFLOPS with auto-tuned tile sizes and cuBLAS tensor core acceleration.


How It Works

Standard attention: 12 HBM round-trips
GAE: 2 HBM round-trips (Load Q,K,V → Compute in Registers → Store Output)

Key techniques:

  • Online Softmax — Single streaming pass, no O(N²) intermediate matrix
  • Register-Level Fusion — Q·Kᵀ, softmax, ×V all in registers
  • Welford Statistics — Numerically stable, bit-exact determinism
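As a rough illustration of the online-softmax idea (a standalone sketch, not the Waller Kernel itself, which fuses this with the Q·Kᵀ and ×V products in registers), a single streaming pass can produce a softmax-weighted sum using only a running max, a running sum, and one accumulator:

```rust
/// Online softmax over one row of scores: a single streaming pass
/// computes the softmax-weighted sum of `values` with O(1) state
/// (running max, normalizer, accumulator), so the probability vector
/// is never materialized. Sketch only; in the fused kernel this state
/// lives in registers alongside the matrix products.
fn online_softmax_weighted_sum(scores: &[f32], values: &[f32]) -> f32 {
    let mut max = f32::NEG_INFINITY; // running max of scores seen so far
    let mut sum = 0.0f32;            // running sum of exp(s - max)
    let mut acc = 0.0f32;            // running weighted sum of values

    for (&s, &v) in scores.iter().zip(values) {
        let new_max = max.max(s);
        let rescale = (max - new_max).exp(); // correct old state for the new max
        let w = (s - new_max).exp();
        sum = sum * rescale + w;
        acc = acc * rescale + w * v;
        max = new_max;
    }
    acc / sum
}

fn main() {
    let scores = [1.0f32, 3.0, 2.0];
    let values = [10.0f32, 20.0, 30.0];
    println!("{}", online_softmax_weighted_sum(&scores, &values));
}
```

The rescale step is what keeps the pass numerically stable: whenever a new maximum appears, the previously accumulated sum and output are scaled down before the new term is added, so no exponential ever overflows.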

Energy Efficiency Benchmarks

Energy ∝ Data Movement — The Waller Kernel eliminates N² memory traffic.

| Benchmark | What It Measures | Command |
|-----------|------------------|---------|
| energy_thesis.rs | Memory traffic analysis, 75% energy reduction proof | `cargo run --bin energy_thesis` |
| energy_benchmark.rs | GFLOPS, efficiency rating across sequence lengths | `cargo run --bin energy_benchmark` |
| gpu_energy_benchmark.rs | CPU vs GPU speedup, GPU GFLOPS | `cargo run --bin gpu_energy_benchmark` |

At 8K+ context: >99% memory reduction, >75% energy savings.
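These figures can be sanity-checked with a first-order data-movement model. The sketch below is hypothetical (fp32, head dimension `d`, dominant terms only; not the repository's measured accounting): standard attention writes and re-reads the N×N score and probability matrices, while a fused kernel only moves Q, K, V, and the output.

```rust
/// First-order HBM traffic model, fp32 (4 bytes/element). Illustrative
/// sketch with assumed constants, not the repo's measured numbers.
fn standard_traffic_bytes(n: u64, d: u64) -> u64 {
    let qkvo = 4 * n * d * 4;   // load Q, K, V; store the output O
    let scores = 2 * n * n * 4; // write, then re-read, the score matrix
    let probs = 2 * n * n * 4;  // write, then re-read, the softmax matrix
    qkvo + scores + probs
}

fn fused_traffic_bytes(n: u64, d: u64) -> u64 {
    4 * n * d * 4 // only Q, K, V in and O out; intermediates stay in registers
}

fn main() {
    let (n, d) = (8_192u64, 64);
    let saved = 1.0
        - fused_traffic_bytes(n, d) as f64 / standard_traffic_bytes(n, d) as f64;
    // ~99% traffic reduction at 8K context under this model,
    // consistent with the >99% memory-reduction figure above
    println!("traffic reduction at N = {n}: {:.1}%", saved * 100.0);
}
```

Because the O(N²) terms dominate past a few thousand tokens, the predicted reduction climbs with sequence length, matching the trend in the throughput table.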

Full benchmark results: benches/BENCHMARK_RESULTS.md


Installation

```bash
git clone https://github.com/RegularJoe-CEO/Geodesic-Attention-Engine-GAE-.git
cd Geodesic-Attention-Engine-GAE-
cargo build --release
```

Requirements: CUDA 11.8+, Rust 1.70+, NVIDIA GPU (Ampere+)

Run Benchmarks

Rust:

```bash
cargo bench
```

CUDA O(1) Memory Kernel:

```bash
cd cuda_src && nvcc -O3 -arch=sm_90 waller_operator.cu -o bench && ./bench
```

CUDA cuBLAS Kernel:

```bash
cd cuda_src && nvcc -O3 -arch=sm_90 waller_v7.cu -lcublas -o bench && ./bench
```
Backends

| Backend | Status | Notes |
|---------|--------|-------|
| CUDA | Production | A100/H100 |
| Rust | Reference | Pure impl |
| WebGPU | Experimental | Browser |
What GAE Is Not

  • Not approximate — Exact attention, bit-for-bit identical results
  • Not sparse — Full attention matrix semantics
  • Not a FlashAttention replacement — Different approach, demonstrates further fusion possibilities
License
AGPL-3.0

Contact
Eric Waller — [email protected] — https://luxiedge.com

© 2026 Eric Waller
