Non-record: 30 experiments across 13 architectures (MLA, Pause Tokens, Eigenweight, 9 exotic ideas)#1589
Open
nnm2602 wants to merge 3 commits into openai:main from
Conversation
Systematic exploration of parameter-efficient LM architectures on 1x H100:

- MLA (Multi-Head Latent Attention): 1.3223 BPB with 4x KV compression
- Pause tokens: 1.3318 BPB with zero arch change, just dummy thinking tokens
- Eigenweight rank sweep: clean Pareto curve from r=64 to r=256
- Depth recurrence sweep: 3 configs, validates SOTA technique
- 9 exotic architectures: SIREN, NCA, Hopfield, HyperNetwork, Tensor MPS, Seed Model, Communicating Agents, Basis Sharing, Universal Transformer+ACT

Key findings: targeted KV compression (MLA) >> uniform compression (eigenweight); MLP rank matters as much as attention rank; step throughput dominates on 1 GPU.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
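The pause-token idea above needs no architectural change at all: extra dummy "thinking" tokens are inserted into the sequence so the model gets more serial compute before it has to commit to an answer. A minimal sketch of the insertion step, where `PAUSE_ID` and `N_PAUSE` are hypothetical values not taken from the submission:

```python
# Pause-token insertion sketch. At training time the loss at pause positions
# is masked out; at inference the outputs at those positions are discarded.
PAUSE_ID = 50257   # hypothetical id for a dedicated <pause> token
N_PAUSE = 4        # hypothetical number of pause tokens inserted

def insert_pause_tokens(prompt_ids, n_pause=N_PAUSE, pause_id=PAUSE_ID):
    """Return the prompt followed by n_pause copies of the pause token."""
    return list(prompt_ids) + [pause_id] * n_pause

seq = insert_pause_tokens([11, 22, 33])
# seq is now the original prompt with 4 trailing pause tokens appended
```

Because only the token stream changes, the same transformer weights and forward pass are reused unmodified, which is why this variant costs zero architecture work.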
Novel architecture in which the transformer's W_O evolves during the forward pass via a recurrent meta-network. 18 experiments across 4 batches on 1x H100.

Key findings:

- The original per-token GRUCell loop was 20x slower than baseline and therefore unusable.
- Vectorizing it with nn.GRU gave a 7.6x speedup, making the architecture viable.
- Zero-init u/v heads create an unbreakable chicken-and-egg problem: there is no gradient signal to move u/v away from zero, so M never activates.
- Breaking the symmetry with UV_INIT_STD > 0 lets M actually learn useful updates.
- Single-layer SURGE (layer 4 only) beats multi-layer SURGE.
- Best result: 1.5608 BPB with UV=1.0 on layer 4, vs 1.5910 for the vanilla-ablated control, a 0.030 BPB improvement.

SURGE-M is not competitive with the top submissions on raw BPB (our best 1.56 vs SOTA 1.081), but the 4-batch debug-and-breakthrough story and the architectural insight about zero-init symmetry failure in meta-network weight updates are a genuine creative-track contribution.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
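The zero-init chicken-and-egg failure above can be seen in the chain rule itself: for a rank-1 style update w' = w + u*v, the gradient with respect to u is proportional to v and vice versa, so u = v = 0 is a stationary point that gradient descent can never leave. A scalar toy demonstration (not the submission's actual code):

```python
# Chain rule through the multiplicative update w' = w + u*v:
#   dL/du = (dL/dw') * v   and   dL/dv = (dL/dw') * u
# so with zero-init (u = v = 0) both gradients vanish regardless of the
# upstream signal, and the meta-network's update never activates.
def grads(u, v, g):
    """g = dL/dw' (upstream gradient). Returns (dL/du, dL/dv)."""
    return g * v, g * u

du0, dv0 = grads(0.0, 0.0, g=1.7)      # zero-init: both gradients are 0
du1, dv1 = grads(0.01, 0.01, g=1.7)    # small symmetry-breaking init: nonzero
```

This is exactly why a nonzero UV_INIT_STD fixes the problem: any small perturbation gives each factor a nonzero partner, so gradients flow and the update can grow.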
After exhaustive tuning of the single-pass v2, refactored to a two-pass forward: the first pass runs with W_0 to produce base logits and pre-W_O outputs and computes the WY correction; the second pass re-runs the layers after SURGE starting from (x_after_surge + attn_scale * correction).

Result:

- d=256 baseline: 1.4351 BPB
- SURGE-M v2 two-pass: 1.4313 BPB (-0.004, beats baseline)

The two-pass architecture lets the correction propagate through all downstream layers' MLP + attention + lm_head, rather than merely shifting the final logits. This makes the multiplicative weight update actually modify the function the transformer computes, as the spec intended.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
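The control-flow of the two-pass forward described above can be sketched with scalar stand-ins. The point is that applying the correction before the downstream layers is not the same as adding it to the final logits, because it passes through the downstream nonlinearities. All function names here are hypothetical stand-ins, not the submission's code:

```python
# Toy two-pass forward: pass 1 produces base logits (and, in the real model,
# the WY correction); pass 2 re-runs the post-SURGE layers from the corrected
# hidden state so the correction flows through downstream nonlinearities.
def downstream(h):
    # stand-in for the post-SURGE layers + lm_head; nonlinear on purpose
    return h * h + 1.0

def two_pass(x_after_surge, correction, attn_scale=0.5):
    base = downstream(x_after_surge)                           # pass 1
    out = downstream(x_after_surge + attn_scale * correction)  # pass 2
    return base, out

base, out = two_pass(2.0, 1.0)
# A logit-shift scheme would give base + attn_scale * correction = 5.5;
# the two-pass output is 7.25 because the correction goes through the square.
```

The gap between the two numbers is exactly the effect the commit message claims: the weight update changes the function the stack computes, not just its final output.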
Summary
Systematic exploration of 30 experiments across 13 distinct architectural ideas for parameter-efficient language modeling, run on 1x H100 with 10-min wallclock cap each.
Best result: MLA latent=128 achieves 1.3223 BPB (only +0.013 behind baseline) with 4x KV compression.
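Back-of-envelope accounting shows where the 4x KV compression comes from: vanilla attention caches full K and V vectors per token per layer, while MLA caches a single compressed latent that K and V are up-projected from. The `d_model = 256` value below is a hypothetical assumption; only latent = 128 and the 4x ratio come from the PR:

```python
# KV-cache size per token per layer, in floats.
d_model = 256    # hypothetical model width, chosen to match the 4x ratio
d_latent = 128   # MLA latent dimension from the PR

vanilla_per_token = 2 * d_model   # caches both K and V
mla_per_token = d_latent          # caches one shared latent vector
compression = vanilla_per_token / mla_per_token   # 4.0
```

Because only the cache is compressed while attention still operates on full-width up-projected keys and values, this targets memory without uniformly shrinking the model, which matches the key finding that targeted compression beats uniform compression.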
Key findings:

- Targeted KV compression (MLA) beats uniform compression (eigenweight).
- MLP rank matters as much as attention rank.
- Step throughput dominates on 1 GPU.
Submission track
Non-record (1x H100, not 8x). Emphasis on breadth of exploration and architectural insights rather than SOTA BPB.
Test plan
🤖 Generated with Claude Code