Scottcjn edited this page Mar 7, 2026 · 2 revisions

RAM Coffers - NUMA-Aware Weight Banking for LLM Inference

RAM Coffers is an industry-first approach to CPU-based LLM inference that indexes model weights by NUMA memory bank, enabling selective prefetch and non-bijunctive pruning before data ever reaches the compute pipeline.

Priority Claim

This work was first published on December 16, 2025, predating DeepSeek's Engram paper (arXiv:2601.07372, January 12, 2026) by 27 days. RAM Coffers introduces 15 features that DeepSeek Engram does not implement, including NUMA topology routing, cognitive hemisphere mapping, tetranary logic, and vec_perm single-cycle collapse.

Core Innovation

Standard LLM inference treats all RAM equally. RAM Coffers partitions model weights across NUMA nodes with domain-specific routing, so that a query about language activates different physical memory banks than a query about spatial reasoning.
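As a minimal sketch of the partitioning idea, the following groups weight blocks into one per-NUMA-node bank keyed by semantic domain. The domain names, node mapping, and data layout here are illustrative assumptions, not the project's actual data structures:

```python
# Hypothetical sketch: partition model weight blocks into per-NUMA-node
# "coffers" keyed by semantic domain. Domain names and the node mapping
# are illustrative, not the project's real layout.
from collections import defaultdict

# Assumed mapping of semantic domains to the 4 NUMA nodes of the S824.
DOMAIN_TO_NODE = {"language": 0, "spatial": 1, "logic": 2, "memory": 3}

def build_coffers(weight_blocks):
    """Group (domain, tensor) pairs into one coffer per NUMA node."""
    coffers = defaultdict(list)
    for domain, tensor in weight_blocks:
        coffers[DOMAIN_TO_NODE[domain]].append(tensor)
    return dict(coffers)

blocks = [("language", "wq.0"), ("spatial", "wk.3"), ("language", "wv.1")]
coffers = build_coffers(blocks)
# Node 0 now holds both language tensors; node 1 holds the spatial one.
```

With weights banked this way, a language query only needs to touch the physical memory behind node 0's coffer.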

The system runs on an IBM POWER8 S824 with 512 GB of RAM across 4 NUMA nodes, achieving 147 tokens/second on TinyLlama 1.1B, a 9x improvement over stock llama.cpp on the same hardware.
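The kind of numactl invocation used for NUMA pinning can be built like this. This is a sketch, assuming a llama.cpp-style binary; the binary path, model file, and node choice are hypothetical:

```python
# Sketch: wrap an inference binary with numactl so both CPU scheduling
# and memory allocation are bound to a single NUMA node.
# The binary and model paths below are hypothetical placeholders.
import shlex

def pin_to_node(cmd, node):
    """Prefix a command with numactl flags binding CPUs and memory to `node`."""
    return ["numactl", f"--cpunodebind={node}", f"--membind={node}"] + cmd

argv = pin_to_node(["./llama-cli", "-m", "model.gguf"], node=2)
print(shlex.join(argv))
# prints: numactl --cpunodebind=2 --membind=2 ./llama-cli -m model.gguf
```

Binding both `--cpunodebind` and `--membind` to the same node is what keeps a coffer's weights local to the cores that read them.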

Key Capabilities

  • O(1) coffer routing via cosine similarity on query embeddings
  • NUMA-pinned execution using numactl for memory locality
  • DCBT resident prefetch keeping hot weights in L2/L3 cache
  • vec_perm non-bijunctive collapse pruning weak attention paths in a single POWER8 cycle
  • PSE burst entropy from hardware timebase for behavioral divergence
  • Neuromorphic cognitive routing mapping Brodmann brain areas to NUMA topology
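The first capability above, O(1) coffer routing by cosine similarity, can be sketched as follows. The centroid vectors are toy values and the node labels are assumptions; the real system would derive centroids from query embeddings:

```python
# Sketch of O(1) coffer routing: compare a query embedding against a
# fixed, small set of per-coffer centroids, so the work per query is
# constant. Centroid values here are toy data, not model embeddings.
import math

CENTROIDS = {             # one centroid per NUMA-node coffer (hypothetical)
    0: [1.0, 0.0, 0.0],   # e.g. language
    1: [0.0, 1.0, 0.0],   # e.g. spatial
    2: [0.0, 0.0, 1.0],   # e.g. logic
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def route(query_embedding):
    """Pick the coffer whose centroid is most similar to the query."""
    return max(CENTROIDS, key=lambda n: cosine(query_embedding, CENTROIDS[n]))

# A query embedding close to the "spatial" centroid routes to node 1.
assert route([0.1, 0.9, 0.0]) == 1
```

Because the number of coffers is fixed by the NUMA topology (4 nodes on the S824), the argmax over centroids is constant-time per query.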

Wiki Contents

  • Architecture - NUMA coffer layout, cognitive routing, and vec_perm collapse details
