
Non-record: 30 experiments across 13 architectures (MLA, Pause Tokens, Eigenweight, 9 exotic ideas)#1589

Open
nnm2602 wants to merge 3 commits into openai:main from
nnm2602:submission/30-experiments-mla-pause-eigenweight

Conversation


nnm2602 commented on Apr 13, 2026

Summary

A systematic exploration of 30 experiments across 13 distinct architectural ideas for parameter-efficient language modeling, each run on a single H100 with a 10-minute wallclock cap.

Best result: MLA latent=128 achieves 1.3223 BPB (only +0.013 behind baseline) with 4x KV compression.

Key findings:

  • MLA (Multi-Head Latent Attention): Compressing K,V through a shared latent bottleneck nearly matches baseline at zero step-time overhead. Validates DeepSeek V2's insight at small scale.
  • Pause tokens: Inserting 4 learned dummy tokens every 64 positions beats baseline at matched steps (1.3424 vs 1.3465 at step 1200). Zero architectural change, 2,048 extra params.
  • Eigenweight rank sweep: Clean Pareto curve — rank 64 (4.3x compression, 1.50 BPB) → rank 256 (1.1x, 1.36 BPB). MLP rank matters more than attention rank.
  • Depth recurrence: Less is more on 1 GPU — 2-layer x2 (1.3226) beats 3-layer x2 (1.3324) and 2-layer x3 (1.3399).
  • 9 exotic architectures: SIREN weight generation, neural cellular automata, Hopfield energy LMs, hypernetworks, tensor MPS, seed models, communicating agents, basis sharing, universal transformer + ACT — all interesting failures with documented lessons.
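To make the headline MLA result concrete, here is a minimal PyTorch sketch of the latent-KV idea, assuming d_model=512 and 8 heads; the module and parameter names (`LatentKV`, `down`, `up_k`, `up_v`) are illustrative, not the submission's actual code. K and V are both reconstructed from one shared low-rank latent, so at inference only the latent needs to be cached rather than full K and V.

```python
import torch
import torch.nn as nn

class LatentKV(nn.Module):
    """Illustrative MLA-style KV compression: K, V share a low-rank latent."""
    def __init__(self, d_model=512, d_latent=128, n_heads=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent, bias=False)  # shared bottleneck
        self.up_k = nn.Linear(d_latent, d_model, bias=False)  # K up-projection
        self.up_v = nn.Linear(d_latent, d_model, bias=False)  # V up-projection
        self.n_heads = n_heads
        self.d_head = d_model // n_heads

    def forward(self, x):
        c = self.down(x)                 # (B, T, d_latent): the only KV state to cache
        B, T, _ = c.shape
        k = self.up_k(c).view(B, T, self.n_heads, self.d_head)
        v = self.up_v(c).view(B, T, self.n_heads, self.d_head)
        return c, k, v

x = torch.randn(2, 16, 512)
c, k, v = LatentKV()(x)
```

The exact compression ratio depends on how the latent is split between K and V and on any uncompressed (e.g. positional) components, which the PR does not detail; the dimensions above are only for shape illustration.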

Submission track

Non-record (1x H100, not 8x). Emphasis on breadth of exploration and architectural insights rather than SOTA BPB.

Test plan

  • All 16 experiment scripts parse without errors
  • Results are reproducible on H100 with documented run commands
  • submission.json matches reported metrics
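The pause-token scheme from the findings above (4 learned dummy tokens every 64 positions) is a pure data-pipeline change. A minimal sketch, assuming a single reserved `pause_id` past the normal vocabulary (the function name and ID value are my own placeholders):

```python
import torch

def insert_pause_tokens(ids, pause_id, every=64, n_pause=4):
    """Append n_pause copies of pause_id after every `every` real tokens."""
    chunks = []
    for start in range(0, ids.size(0), every):
        chunks.append(ids[start:start + every])
        chunks.append(torch.full((n_pause,), pause_id, dtype=ids.dtype))
    return torch.cat(chunks)

ids = torch.arange(128)                       # 128 "real" tokens
out = insert_pause_tokens(ids, pause_id=50257)
# 2 chunks of 64, each followed by 4 pauses -> length 136
```

The "2,048 extra params" figure in the PR is consistent with 4 learned token embeddings at d_model=512, though the embedding dimension is not stated explicitly.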

🤖 Generated with Claude Code

Beige-gc and others added 3 commits April 12, 2026 23:17
Systematic exploration of parameter-efficient LM architectures on 1xH100:
- MLA (Multi-Head Latent Attention): 1.3223 BPB with 4x KV compression
- Pause tokens: 1.3318 BPB with zero arch change, just dummy thinking tokens
- Eigenweight rank sweep: clean Pareto curve from r=64 to r=256
- Depth recurrence sweep: 3 configs, validates SOTA technique
- 9 exotic architectures: SIREN, NCA, Hopfield, HyperNetwork, Tensor MPS,
  Seed Model, Communicating Agents, Basis Sharing, Universal Transformer+ACT

Key findings: targeted KV compression (MLA) >> uniform compression (eigenweight),
MLP rank matters as much as attention rank, step throughput dominates on 1 GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
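The eigenweight rank sweep above amounts to replacing dense weight matrices with low-rank factorizations. A minimal sketch of the per-layer arithmetic, assuming square d=512 layers (class and attribute names are illustrative, not the submission's code):

```python
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Illustrative low-rank factorization: y = B(A(x)), i.e. W ~= B @ A."""
    def __init__(self, d_in, d_out, rank):
        super().__init__()
        self.A = nn.Linear(d_in, rank, bias=False)   # d_in -> rank
        self.B = nn.Linear(rank, d_out, bias=False)  # rank -> d_out

    def forward(self, x):
        return self.B(self.A(x))

# Per-layer params: rank * (d_in + d_out) instead of d_in * d_out.
low = LowRankLinear(512, 512, rank=64)
n_low = sum(p.numel() for p in low.parameters())   # 64 * 1024 = 65,536
n_full = 512 * 512                                 # 262,144 -> 4x fewer here
```

The PR's 4.3x figure at rank 64 is a whole-model compression ratio, so it will not match this single-layer arithmetic exactly.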
Novel architecture where transformer W_O evolves during forward pass via a
recurrent meta-network. 18 experiments across 4 batches on 1xH100.

Key findings:
- Original per-token GRUCell loop was 20x slower than baseline — unusable.
- Vectorized nn.GRU fix gave 7.6x speedup, made architecture viable.
- Zero-init u/v heads create an unbreakable chicken-egg: no gradient signal
  to move u/v away from zero, so M never activates.
- Breaking symmetry with UV_INIT_STD > 0 lets M actually learn useful updates.
- Single-layer SURGE (layer 4 only) beats multi-layer SURGE.
- Best result: 1.5608 BPB with UV=1.0 on layer 4 (vs 1.5910 vanilla-ablated
  control = 0.030 BPB improvement).

SURGE-M is not competitive with top submissions on raw BPB (our best 1.56 vs
SOTA 1.081), but the four-batch debug-and-breakthrough story and the architectural
insight about zero-init symmetry failure in meta-network weight updates are
a genuine creative-track contribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
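The zero-init chicken-and-egg described in the commit above can be reproduced in a few lines. For a rank-1 multiplicative update dW = u v^T (a simplified stand-in for SURGE-M's u/v heads), the gradient of any loss w.r.t. u is G @ v and w.r.t. v is G^T @ u, so with u = v = 0 both gradients are identically zero and gradient descent can never move them off zero:

```python
import torch

u = torch.zeros(8, requires_grad=True)   # zero-init "u head"
v = torch.zeros(8, requires_grad=True)   # zero-init "v head"
W0 = torch.randn(8, 8)
x = torch.randn(8)

y = (W0 + torch.outer(u, v)) @ x         # base weights plus rank-1 update
loss = y.pow(2).sum()
loss.backward()

# dL/du_i = sum_j G_ij * v_j = 0 since v = 0, and symmetrically for v:
print(u.grad.abs().max(), v.grad.abs().max())  # both exactly 0
```

Any nonzero init on either head (the UV_INIT_STD > 0 fix above) breaks this symmetry and restores a gradient signal.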
After exhaustive tuning of the single-pass v2, refactored to a two-pass
forward: first pass with W_0 for base logits + pre-W_O outputs, compute
WY correction, second pass re-runs layers after SURGE starting from
(x_after_surge + attn_scale * correction).

Result:
- d=256 baseline: 1.4351 BPB
- SURGE-M v2 two-pass: 1.4313 BPB (-0.004, beats baseline)

The two-pass architecture lets the correction propagate through all
downstream layers' MLP+attention+lm_head, rather than merely shifting
final logits. This makes the multiplicative weight update actually
modify the function the transformer computes, as the spec intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
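The two-pass control flow described in this commit can be sketched schematically as follows. Everything here is a placeholder under my own assumptions (toy linear layers, a dummy `compute_wy_correction`, hard-coded `surge_idx` and `attn_scale`); only the pass structure mirrors the commit message:

```python
import torch
import torch.nn as nn

layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(4)])  # toy stack
surge_idx = 1          # index of the SURGE layer
attn_scale = 0.1

def compute_wy_correction(h):
    # Placeholder for the WY correction computed from pass-1 outputs.
    return torch.tanh(h)

def two_pass_forward(x):
    h = x
    for i, layer in enumerate(layers):             # pass 1: base forward with W_0
        h = layer(h)
        if i == surge_idx:
            x_after_surge = h                      # stash the restart point
    correction = compute_wy_correction(h)
    h = x_after_surge + attn_scale * correction    # inject the correction
    for layer in layers[surge_idx + 1:]:           # pass 2: re-run downstream layers
        h = layer(h)
    return h

out = two_pass_forward(torch.randn(2, 16))
```

The point of the structure, as the commit notes, is that the correction flows through every downstream layer rather than only shifting the final logits.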
