GAE computes exact transformer attention with fewer memory operations. The Fused Waller Kernel reduces HBM round-trips from 12 to 2, achieving O(N) memory complexity instead of O(N²).
Full Technical Specification — Deep dive into the math and implementation
| Sequence Length | Standard Attention | GAE Memory | Reduction |
|---|---|---|---|
| 65,536 | 17.25 GB | 0.62 GB | 99.6% |
| 262,144 | 275 GB (impossible) | 0.82 GB | ✓ Works |
| 1,048,576 | 4.4 TB (impossible) | 1.09 GB | ✓ Works |
GAE enables 1M+ token sequences on hardware that cannot fit 64K with standard attention.
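A back-of-envelope sketch of where the "Standard Attention" column comes from: the figures are consistent with materializing a single fp32 N×N score matrix per head (this accounting is an assumption on our part, not stated in the table).

```rust
/// Bytes needed to materialize the full N x N fp32 score matrix that
/// standard attention allocates; the table's figures match this estimate.
fn score_matrix_bytes(n: u64) -> u64 {
    n * n * 4 // 4 bytes per fp32 score
}

fn main() {
    for n in [65_536u64, 262_144, 1_048_576] {
        let gb = score_matrix_bytes(n) as f64 / 1e9;
        println!("N = {:>9}: {:.1} GB", n, gb); // ~17.2 GB, ~274.9 GB, ~4398 GB
    }
}
```

Because the streaming formulation never stores that matrix, its footprint grows only with N (activations and a few running statistics), which is why the GAE column stays near one gigabyte.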
| Seq Length | TFLOPS | Time (ms) | Mem Reduction |
|---|---|---|---|
| 8,192 | 119.8 | 0.29 | 46.3% |
| 16,384 | 146.2 | 0.94 | 84.7% |
| 32,768 | 164.4 | 3.34 | 95.3% |
| 65,536 | 170.4 | 12.90 | 98.4% |
| 131,072 | 174.4 | 50.44 | 99.4% |
Peak: 174.4 TFLOPS with auto-tuned tile sizes and cuBLAS tensor core acceleration.
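As a sanity check on the throughput numbers: using the standard attention FLOP count of 4·N²·d (2·N²·d each for Q·Kᵀ and P·V) and assuming a head dimension of d = 128 (not stated in the table, but the rows are consistent with it), the reported times reproduce the reported TFLOPS.

```rust
/// Attention FLOPs: 2*N^2*d for Q·K^T plus 2*N^2*d for P·V.
fn attention_flops(n: u64, d: u64) -> u64 {
    4 * n * n * d
}

fn main() {
    // Table row: N = 65_536 ran in 12.90 ms at a reported 170.4 TFLOPS.
    let tflops = attention_flops(65_536, 128) as f64 / 12.90e-3 / 1e12;
    println!("{tflops:.1} TFLOPS"); // ~170, matching the table row
}
```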
Standard attention: 12 HBM round-trips
GAE: 2 HBM round-trips (Load Q,K,V → Compute in Registers → Store Output)
Key techniques:
- Online Softmax — Single streaming pass, no O(N²) intermediate matrix
- Register-Level Fusion — Q·Kᵀ, softmax, ×V all in registers
- Welford Statistics — Numerically stable, bit-exact determinism
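For readers unfamiliar with the last item: Welford's method updates running statistics one sample at a time, avoiding the catastrophic cancellation of the naive sum-of-squares formula. A minimal sketch (this is the textbook algorithm, not GAE's kernel code):

```rust
/// Welford's running mean/variance: each step folds in one sample, so no
/// large intermediate sums are formed and the update is numerically stable.
fn welford(xs: &[f32]) -> (f32, f32) {
    let (mut mean, mut m2, mut n) = (0.0f32, 0.0f32, 0.0f32);
    for &x in xs {
        n += 1.0;
        let delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean); // uses the *updated* mean
    }
    (mean, m2 / n) // (mean, population variance)
}
```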
Energy ∝ Data Movement — The Waller Kernel eliminates N² memory traffic.
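The techniques above compose into a single streaming pass per query row. The sketch below shows the idea in plain Rust for one query (it is an illustration of online softmax fused with the value accumulation, not the CUDA kernel itself): each key/value pair is read once, the softmax normalizer is maintained incrementally, and no N-length score vector is ever stored.

```rust
/// One query row of attention in a single streaming pass over (K, V):
/// online softmax fused with value accumulation, so the only state is a
/// running max `m`, a running normalizer `l`, and the output accumulator.
fn fused_attention_row(q: &[f32], ks: &[&[f32]], vs: &[&[f32]]) -> Vec<f32> {
    let d = q.len() as f32;
    let mut m = f32::NEG_INFINITY; // running max of scores
    let mut l = 0.0f32;            // running sum of exp(score - m)
    let mut acc = vec![0.0f32; vs[0].len()];
    for (k, v) in ks.iter().zip(vs) {
        // score = q . k / sqrt(d)
        let s: f32 = q.iter().zip(*k).map(|(a, b)| a * b).sum::<f32>() / d.sqrt();
        let m_new = m.max(s);
        let scale = (m - m_new).exp(); // rescale the old accumulator
        let p = (s - m_new).exp();
        l = l * scale + p;
        for (a, &vi) in acc.iter_mut().zip(*v) {
            *a = *a * scale + p * vi;
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}

fn main() {
    // With q = 0 every score ties, so the output is the mean of the values.
    let ks: [&[f32]; 2] = [&[1.0, 0.0], &[0.0, 1.0]];
    let vs: [&[f32]; 2] = [&[1.0, 0.0], &[3.0, 2.0]];
    println!("{:?}", fused_attention_row(&[0.0, 0.0], &ks, &vs)); // [2.0, 1.0]
}
```

Keeping `m`, `l`, and `acc` in registers is what allows the two-round-trip pattern: Q, K, V are loaded once and only the finished output is written back.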
| Benchmark | What It Measures | Command |
|---|---|---|
| `energy_thesis.rs` | Memory traffic analysis, 75% energy reduction proof | `cargo run --bin energy_thesis` |
| `energy_benchmark.rs` | GFLOPS, efficiency rating across sequence lengths | `cargo run --bin energy_benchmark` |
| `gpu_energy_benchmark.rs` | CPU vs GPU speedup, GPU GFLOPS | `cargo run --bin gpu_energy_benchmark` |
At 8K+ context: >99% memory reduction, >75% energy savings.
Full benchmark results: `benches/BENCHMARK_RESULTS.md`
```bash
git clone https://github.com/RegularJoe-CEO/Geodesic-Attention-Engine-GAE-.git
cd Geodesic-Attention-Engine-GAE-
cargo build --release
```
Requirements: CUDA 11.8+, Rust 1.70+, NVIDIA GPU (Ampere+)
## Run Benchmarks
Rust:
```bash
cargo bench
```
CUDA O(1) Memory Kernel:
```bash
cd cuda_src && nvcc -O3 -arch=sm_90 waller_operator.cu -o bench && ./bench
```
CUDA cuBLAS Kernel:
```bash
cd cuda_src && nvcc -O3 -arch=sm_90 waller_v7.cu -lcublas -o bench && ./bench
```
## Backends

| Backend | Status | Notes |
|---|---|---|
| CUDA | Production | A100/H100 |
| Rust | Reference | Pure Rust implementation |
| WebGPU | Experimental | Browser |
## What GAE Is Not

- Not approximate — Exact attention, bit-for-bit identical results
- Not sparse — Full attention matrix semantics
- Not a FlashAttention replacement — Different approach, demonstrates further fusion possibilities
## License

AGPL-3.0
## Contact

Eric Waller — [email protected] — https://luxiedge.com

© 2026 Eric Waller