System: IBM POWER8 S824 (16 cores, 128 threads, 320GB RAM)
Target: llama.cpp with gpt-oss/OptiMind architecture
- OptiMind-SFT 20B Q4_K_M deployed to POWER8
- llama.cpp using scalar fallback (no POWER8 VSX paths)
- Codex attempt: ~1 t/s (all scalar, no SIMD)
- Problem: all VSX code guarded by `__POWER9_VECTOR__`, so POWER8 is excluded
File: ggml/src/ggml-cpu/arch/powerpc/power8-compat.h
Key Discovery: POWER8 has all needed intrinsics EXCEPT vec_xl_len (partial load)
- `vec_xl`, `vec_xst` — native on POWER8
- `vec_perm`, `vec_msum` — native on POWER8
- `vec_extract`, `vec_insert` — native on POWER8
- `vec_xl_len` — POWER9 only → shimmed with memcpy
Solution: Define __POWER9_VECTOR__ when __POWER8_VECTOR__ detected, provide vec_xl_len shim
```c
#if defined(__POWER8_VECTOR__) && !defined(__POWER9_VECTOR__)
#define __POWER9_VECTOR__ 1
#define vec_xl_len(ptr, len) /* memcpy-based shim */
#endif
```

Result: all VSX-optimized quant paths now compile for POWER8.
Location: /opt/ibm/xlmass/9.1.1/
Libraries: libmassvp8.a (POWER8 vector), libmass.a (scalar)
Functions Used:
- `vsexp()` — vectorized exp for softmax
- `vstanh()` — vectorized tanh for activations (available)
- `vssqrt()` — vectorized sqrt (available)
- `vslog()` — vectorized log (available)
CMakeLists.txt addition (line 438+):

```cmake
set(IBM_MASS_ROOT "/opt/ibm/xlmass/9.1.1")
target_include_directories(${GGML_CPU_NAME} PRIVATE ${IBM_MASS_ROOT}/include)
target_link_libraries(${GGML_CPU_NAME} PRIVATE
    ${IBM_MASS_ROOT}/lib/libmassvp8.a
    ${IBM_MASS_ROOT}/lib/libmass.a m)
target_compile_definitions(${GGML_CPU_NAME} PRIVATE GGML_USE_IBM_MASS=1)
```

vec.cpp integration (existing code, now active):

```cpp
#if defined(__powerpc__) && defined(GGML_USE_IBM_MASS)
#include <massv.h>
// ... vsexp used in ggml_vec_soft_max_f32
#endif
```

Library: /usr/lib/libnuma.so
CMakeLists.txt addition:

```cmake
find_library(NUMA_LIBRARY numa)
target_link_libraries(${GGML_CPU_NAME} PRIVATE ${NUMA_LIBRARY})
```

RAM Coffers header: ggml-ram-coffers.h
- Detects 2 NUMA nodes (Node 0: 128GB, Node 1: 192GB)
- Provides `ggml_coffer_alloc()` for NUMA-aware allocation
File: ggml/src/ggml-cpu/arch/powerpc/quants.c
Location: ggml_vec_dot_q4_K_q8_K() function, line ~914
```c
/* POWER8 PSE: Prefetch weight blocks into L2/L3 cache */
#ifdef GGML_POWER8_PSE_ACTIVE
{
    const size_t block_size = sizeof(block_q4_K);
    const char * ptr = (const char *) x;
    for (int i = 0; i < nb && i < 16; i++) {
        __asm__ __volatile__("dcbt 0, %0" : : "r"(ptr + i * block_size) : "memory");
    }
}
#endif
```

| Build | pp128 (t/s) | tg32 (t/s) | Notes |
|---|---|---|---|
| Codex scalar | ~1.0 | ~0.5 | No SIMD, generic fallback |
| PSE Full Stack | 15.54 | 4.58 | vec_perm + MASS + NUMA |
Speedup: ~15.5x prompt, ~9x generation
| Component | Symbol/Evidence | Status |
|---|---|---|
| vec_perm | quants.c:327-328 | ✅ Used in Q4_K dequant |
| vec_msum | quants.c:266-267 | ✅ Used in dot product |
| vsexp (MASS) | `nm libggml-cpu.so \| grep vsexp` | ✅ Linked |
| libnuma | CMake detection | ✅ Linked |
| DCBT | quants.c:914+ | ✅ Compiled in |
- `power8-compat.h` — POWER9 intrinsic shims for POWER8
- `ggml-pse-integration.h` — master PSE integration
- `ggml-intelligent-collapse.h` — Hebbian vec_perm collapse (headers)
- `ggml-topk-collapse-vsx.h` — Top-K attention collapse (headers)
- `ggml-ram-coffers.h` — NUMA coffer memory banking
- `ggml-dcbt-resident.h` — L3 cache prefetch macros
- `pse-entropy-burst.h` — hardware entropy injection
- `ggml-mass-integration.h` — IBM MASS wrapper
- `quants.c` — added PSE includes, DCBT prefetch
- `ggml-cpu.c` — added PSE init call
- `vec.cpp` — added MASS include
- `CMakeLists.txt` — added NUMA + MASS linkage
- Currently: headers defined but not called
- Need: hook into `ggml_compute_forward_flash_attn` or equivalent
- Goal: non-bijunctive attention via vec_perm collapse
- Symbolic-neural bridge
- Tetranary logic (4-state confidence)
- Production rules with neural fallback
- Currently: Library linked, nodes detected
- Need: Hook into model loading for NUMA-aware weight placement
- Goal: Route layers to optimal NUMA nodes
```bash
# SSH to POWER8
ssh sophia@192.168.0.50  # or 100.94.28.32 via Tailscale

# Build with full PSE
cd ~/llama.cpp/build-pse-gptoss
cmake .. \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_OPENMP=ON \
    -DCMAKE_C_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -DGGML_POWER8_PSE_ACTIVE=1" \
    -DCMAKE_CXX_FLAGS="-mcpu=power8 -mvsx -maltivec -O3 -DGGML_POWER8_PSE_ACTIVE=1"
make -j32 llama-cli llama-bench

# Run with NUMA interleave
export OMP_NUM_THREADS=64
export OMP_PROC_BIND=spread
numactl --interleave=all ./bin/llama-cli -m /mnt/nvme/models/OptiMind-SFT-Q4_K_M.gguf -t 64
```

Document created: 2025-01-23
Last updated: 2025-01-23 ~16:45 UTC
Author: Claude + Scott (Elyan Labs)
- ✅ Collapse header created: `ggml-attn-collapse-vsx.h`
- ✅ Symbolic gate header created: `ggml-pse-symbolic-gate.h`
- ✅ Integration point identified: in `ops.cpp`
⚠️ HOT PATH OVERHEAD: Direct integration causes performance regression
Problem: Calling collapse (or even gate check) in the softmax hot path causes massive overhead:
- A 20B model calls softmax 720,000+ times per benchmark
- Even a simple gate check adds noticeable overhead at this scale
- Symbolic gate reduced collapse calls 10x but still hurt performance
Performance Results:
| Configuration | pp128 t/s | tg32 t/s |
|---|---|---|
| Baseline (no collapse) | 14.76 | 0.88 |
| With collapse (every call) | 14.79 | 0.93 |
| With symbolic gate (sparse) | 9.74 | 0.80 |
- Compile-time flag: use `-DPSE_COLLAPSE_ENABLED=1` to enable collapse
- Model metadata: store which layers/heads benefit from collapse in GGUF
- Separate attention path: create a PSE-specific attention op (do not modify the generic softmax)
- Flash attention hook: integrate into the flash_attn path, not the generic softmax
- Collapse header: `ggml/src/ggml-cpu/arch/powerpc/ggml-attn-collapse-vsx.h`
- Symbolic gate: `ggml/src/ggml-cpu/arch/powerpc/ggml-pse-symbolic-gate.h`
- Integration point: `ops.cpp:5255-5265` (commented out for now)
The symbolic layer (PowerLISP) should decide:
- Which models benefit from collapse (via model metadata)
- Which layers to collapse (early layers for representation, not all)
- When to collapse (prompt phase, not generation)
This decision should happen at model load time, not per-op call.