Record: Per-Layer Adaptive GPTQ Clip + int7 Embeddings + MLR 0.026 — val_bpb 1.07493 (3-seed mean) #1586

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/adaptive-clip-emb7-mlr026

Conversation

@dexhunter
Contributor

Summary

val_bpb = 1.07493 (3-seed mean, std 0.00078) | 2.77666 nats | ~15.93 MB | 8xH100 SXM, 600s

| Seed | Pre-TTT BPB | Post-TTT BPB | TTT Gain | Artifact (bytes) |
|------|-------------|--------------|----------|------------------|
| 42   | 1.08275     | 1.07437      | -0.00838 | 15,934,100       |
| 0    | 1.08270     | 1.07460      | -0.00810 | 15,937,217       |
| 1337 | 1.08449     | 1.07582      | -0.00867 | 15,928,721       |
| Mean | 1.08331     | 1.07493      | -0.00838 | 15,933,346       |

Merged SOTA (PR #1493): 2.78932 nats. Delta: -0.01266 nats (clears 0.005 bar by 2.5x).

Key Innovation: Per-Layer Adaptive GPTQ Clip

Different GPTQ clip_sigmas for MLP vs attention weights — a novel quantization approach not used by any other submission:

  • MLP layers (blocks.*.mlp.*): clip_sigmas = 12.0 — tighter clipping for higher quantization precision on the largest parameter group
  • Attention layers (blocks.*.attn.*): clip_sigmas = 13.0 — looser clipping for better compressibility
  • Embeddings (tok_emb): EMBED_BITS = 7 (int7) with clip_sigmas = 15.0 — int7 saves ~530 KB vs int8 while preserving quality

This per-layer approach captures most of the BPB gain of tighter uniform clipping (which would push the artifact over 16 MB) while keeping the artifact under budget.
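A minimal sketch of how such a per-layer lookup could be wired. The constant names follow the PR summary; `clip_sigmas_for` and the exact parameter-name patterns are illustrative assumptions, not the actual train_gpt.py code:

```python
# Hypothetical per-layer clip-sigma lookup. Constant values come from the PR
# summary; the function and name patterns are assumptions for illustration.
MLP_CLIP_SIGMAS = 12.0    # tighter: higher quantization precision for MLP weights
ATTN_CLIP_SIGMAS = 13.0   # looser: better compressibility for attention weights
EMBED_CLIP_SIGMAS = 15.0  # embeddings, quantized at int7 with their own clip

def clip_sigmas_for(param_name: str) -> float:
    """Pick a GPTQ clip threshold by parameter group, most specific match first."""
    if "tok_emb" in param_name:
        return EMBED_CLIP_SIGMAS
    if ".mlp." in param_name:
        return MLP_CLIP_SIGMAS
    if ".attn." in param_name:
        return ATTN_CLIP_SIGMAS
    return ATTN_CLIP_SIGMAS  # fallback for any remaining weight matrices

print(clip_sigmas_for("blocks.3.mlp.c_fc.weight"))   # 12.0
print(clip_sigmas_for("blocks.3.attn.qkv.weight"))   # 13.0
```

A routing table like this is the whole mechanism: the GPTQ quantizer itself is unchanged, only the clip threshold it receives varies by parameter group.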

Additional tuning

  • MATRIX_LR = 0.026 (vs 0.022 default) — sharp optimum found via systematic 6-point sweep
  • WARMDOWN_FRAC = 0.75 and TTT_CHUNK_SIZE = 48
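Collected as a config fragment for reference. The names and values are from the PR summary; representing them as a flat dict (rather than module-level constants in train_gpt.py) is purely illustrative:

```python
# Tuned training-side constants from the PR summary; the dict layout is
# illustrative, not how train_gpt.py actually declares them.
TUNED = {
    "MATRIX_LR": 0.026,     # up from the 0.022 default; optimum of a 6-point sweep
    "WARMDOWN_FRAC": 0.75,  # fraction of training at peak LR before warmdown
    "TTT_CHUNK_SIZE": 48,   # tokens scored per test-time-training chunk
}
for name, value in TUNED.items():
    print(f"{name} = {value}")
```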

Rule Compliance (Issue #1017)

  • Condition 1 (Causality): VarLen attention with per-document cu_seqlens, strict causal masking
  • Condition 2 (Normalized): Standard softmax over full vocabulary
  • Condition 3 (Score before update): TTT chunks scored under torch.no_grad() BEFORE LoRA gradient update
  • Condition 4 (Single pass): Each token scored exactly once
  • No SLOT, no pre-quant TTT, no n-gram cache
  • All artifacts < 16 MB, train < 600s, eval < 220s
  • Compile warmup uses random tokens (not val data)
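Condition 3 (score each chunk strictly before the TTT update sees it) can be sketched as the loop below. `DummyModel` is a stand-in for the real LoRA-adapted model, not the PR's code; only the score-then-update ordering is the point:

```python
# Illustrative "score before update" TTT ordering (Condition 3).
# DummyModel stands in for the real model; only the ordering matters here.
class DummyModel:
    def __init__(self):
        self.version = 0  # proxy for the adaptation state

    def score(self, chunk):
        # Pretend the loss depends on the adaptation state at scoring time.
        return len(chunk) / (1 + self.version)

    def ttt_update(self, chunk):
        # Adaptation happens only AFTER the chunk has been scored.
        self.version += 1

def evaluate(model, chunks):
    total = 0.0
    for chunk in chunks:            # single pass: each token scored exactly once
        total += model.score(chunk)  # scored under the pre-update weights
        model.ttt_update(chunk)      # update affects LATER chunks only
    return total

print(evaluate(DummyModel(), [[1, 2, 3], [4, 5, 6]]))  # 3.0 + 1.5 = 4.5
```

In the real pipeline the `score` call would additionally run under `torch.no_grad()` so scoring itself cannot leak gradients into the update.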

Test Plan

  • 3-seed verification (seeds 42, 0, 1337)
  • All artifacts under 16,000,000 bytes
  • Train under 600s on all seeds (~587s)
  • Eval under 220s on all seeds

Credits

…b 1.07493 (3-seed mean)

Novel per-layer GPTQ quantization: MLP_CLIP_SIGMAS=12.0 + ATTN_CLIP_SIGMAS=13.0
+ EMBED_BITS=7 + EMBED_CLIP_SIGMAS=15.0 + MATRIX_LR=0.026.
3-seed mean: 1.07493 (std 0.00078), 2.77666 nats.
Delta vs merged SOTA (openai#1493): -0.01266 nats (clears 0.005 bar by 2.5x).
All artifacts < 16 MB (~15.93 MB), train < 600s, eval < 220s.
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The span-wide loop-embedding family produced a cleaner and more consistent three-seed story, but its mean still trails the older trusted W2 candidate. At this point the remaining loss still looks dominated by quantization rather than training or TTT adaptation. This variant ports the lightest promising part of the new upstream openai#1586 result: per-layer adaptive GPTQ clip sigmas (MLP tighter, attention looser) while explicitly avoiding the larger int7-embedding change. The goal is to test whether clip allocation alone improves the W6 quantization landing without introducing a new byte-risk axis.

Constraint: We need a new source of gain that attacks the quantization gap directly while preserving the current W6 artifact budget and runtime profile
Rejected: Continue 3-seed validating W14 | The mean is already behind the trusted baseline, so more seed spend there is not justified
Rejected: Import int7 embeddings at the same time | That would confound the first read on whether adaptive clip itself carries signal
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: If adaptive clip also fails to improve the best W6 seed, stop treating GPTQ microstructure as the likely missing lever for round 22
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for W15 adaptive clip
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The first adaptive-clip run reached one of the best single-seed scores of the round, but blew the artifact limit by about 238 KB. That size failure is exactly the hole the upstream adaptive-clip submission closes with int7 embeddings and a tighter embedding clip. This variant keeps the per-layer adaptive GPTQ clip that already showed score signal here and adds only the embedding-side quantization change, so the next run answers the narrowest remaining question: can the adaptive-clip lane become compliant without giving back too much BPB.

Constraint: The adaptive-clip signal is already strong enough to justify a byte-recovery follow-up, but the recovery path should add as few new variables as possible
Rejected: Start by changing more matrix-side quantization knobs too | Would make the byte recovery result harder to attribute
Rejected: Abandon the adaptive-clip lane immediately after the byte failure | The score was too strong to drop without trying the obvious byte fix
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If int7 embeddings do not pull this lane back under 16 MB or if they erase most of the score gain, stop chasing the full openai#1586 quantization stack on this base
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for adaptive clip + int7 embeddings
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The adaptive-clip family produced one of the strongest single-seed scores of the round but failed bytes, and the immediate int7 rescue gave back too much quality. This next step tests the training-side half of the new upstream recipe without changing the adaptive quantization settings: keep the byte-saving int7 embeddings and adaptive layer clips from W16, but adopt the higher matrix LR, slightly lower embed weight decay, slower EMA, and longer warmdown that upstream found on a closely related stack. The goal is to see whether the stronger training trajectory can recover the score we lost when making the quantized artifact compliant.

Constraint: We need to improve the compliant adaptive-clip lane without introducing another large byte or runtime variable
Rejected: Keep rerunning the uncommitted draft | The first accidental launch was on the old commit and cannot answer the intended question
Rejected: Change both training and quantization rescue paths again | That would compound uncertainty after the last two adaptive-clip runs diverged on different axes
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If these training-side shifts still do not recover the W15 score, stop trying to reconstruct openai#1586 piecemeal on this base
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for W18 training-side adaptive-clip recovery
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 13, 2026
…ai#1586 per-layer GPTQ highest-EV

- PR openai#758 n-gram effectively dead: MatoTeziTanka (Apr 12) flagged XOR hash
  includes target token, same illegality as openai#727/openai#741
- GDN-Hybrid BPB bug confirmed: PR openai#1576 space-token double-count inflates
  denominator ~14%; actual score ~1.16-1.18, not 1.01671
- PR openai#1586 (dexhunter, 1.07493): Per-Layer Adaptive GPTQ MLP=12σ/Attn=13σ +
  int7 Emb (saves 530KB) + MLR=0.026; -0.0127 nats vs SOTA; implement now
- PR openai#1584: systems-only (fused Muon, batched EMA, loader prealloc) ~+20 steps
- Casefold Tokenizer (openai#1578/openai#1585): legality debated; await organizer ruling
- New paper: arXiv:2604.06169 In-Place TTT (Apr 7) NTP-aligned score-first TTT
- Merged SOTA 1.0810 unchanged (4-day stable streak); target ≤1.0760; 17 days

https://claude.ai/code/session_01BE8wc8zxvZAo52QBXSNiL8
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
The adaptive-clip training-recovery lane is currently the strongest fully compliant direction we have, but its novelty story still leans heavily on the open openai#1586 quantization recipe. This variant adds one of our own zero-byte architecture tweaks on top: instead of injecting the pass embedding only at the loop-start layer, it applies the same pass embedding across the whole repeated span. The goal is to see whether the stronger W18 quantization path and the W14-style span-wide loop signal reinforce each other without paying any additional artifact cost.

Constraint: We need a stronger candidate that is not just a thinner repackaging of the open adaptive-clip line, and the next change should not consume more bytes
Rejected: Submit the plain W18 lane immediately | Strong and compliant, but its novelty story is still too close to the open openai#1586 recipe
Rejected: Return to broader TTT or chunk/context sweeps | Those knobs already underperformed on this family
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If this zero-byte architecture add-on does not improve W18, stop treating loop-embedding placement as a likely differentiator for the adaptive-clip family
Tested: python3 -m py_compile evaluate.py train_gpt.py; bundle code-size estimate remains ~24.2 KB
Not-tested: Full Lepton run for adaptive clip + span-wide loop embeddings
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 13, 2026
Port the W18 training and quantization defaults onto the older W2 pass-conditioned modulation lane so the next probe tests our own round-22 mechanism under a compliant, already-validated artifact path instead of re-running another near-1586 reproduction. The harness sync keeps local monitoring reliable while preserving the worker-facing launcher contract.

Constraint: The next lane must preserve W2's pass-conditioned modulation story while staying under the 16 MB cap and using the fixed local evaluator path.
Rejected: Re-run W19 with more seeds | single-seed result already underperformed W18 and still leaned on thin novelty
Rejected: More W18-family quantization tuning | stronger score story but too close to open PR openai#1586 to solve the submission problem
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Treat this branch as a W2-on-W18 hybrid; if the score improves, review novelty framing against both openai#1518 and openai#1586 before escalating to 3 seeds
Tested: python3 -m py_compile train_gpt.py evaluate.py auto_resume_watch.py; python3 evaluate.py --list
Not-tested: Live GPU eval on this hybrid lane
Related: c0c2d68
Related: 7d435d2
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…is regressive on our SP8192 + depth recurrence stack

Three configs tested at seed 42 on 8xH100 SXM:
- VarLen + Fused MLP: 1.93 pre-quant val_bpb, 1440 steps, 2.3M tok/s (3.4x slower)
- Fused MLP only: 1.110 pre-quant val_bpb, 2581 steps, 3.4M tok/s (2.3x slower)
- Pure baseline reproduction: pod terminated mid-run before completion

Root cause: VarLen + depth recurrence + fullgraph torch.compile triggers cascading
shape recompilations (combinatorial explosion of loop_iter x cu_seqlens shape)
that overflow even a 64-entry compile cache. Fused MLP Triton kernel has per-call
TensorDescriptor allocation overhead that doesn't amortize for our hidden_dim=2048.

Conclusion: do not ship this port. PR openai#1572 (1.07974) remains best submission.
Move 2 (per-layer GPTQ from PR openai#1586) and Move 3 (LoRA TTT from PR openai#1530, eval-only
so no torch.compile recompile concern) are still viable next directions.
anthony-maio added a commit to anthony-maio/parameter-golf that referenced this pull request Apr 14, 2026
…192 stack

Config-level changes only, no kernel/compile changes that could interact with
our depth recurrence stack (unlike VarLen port in submission/sp8192-varlen-frontier):

- MLP_CLIP_SIGMAS 12.0 (tight, preserve MLP precision)
- ATTN_CLIP_SIGMAS 13.0 (looser, save bytes on attention weights)
- EMBED_BITS 8 -> 7 with EMBED_CLIP_SIGMAS 20.0 -> 15.0 (~530 KB artifact savings)
- MATRIX_LR 0.022 -> 0.026 (dexhunter 6-point sweep optimum)
- WARMDOWN_FRAC 0.72 -> 0.75 (longer peak LR window)

Dexhunter measured 1.07493 BPB (3-seed mean) applying these against PR openai#1530 base.
Against our 1.07974 SP8192 baseline the expected delta is in the 0.003-0.005 BPB
range; the adaptive clip is stack-independent and the embed-bits + LR tweaks are
universal. Fresh branch from upstream/main per PR hygiene (PR openai#1572 untouched).
yuitokyouni added a commit to yuitokyouni/parameter-golf that referenced this pull request Apr 14, 2026
…Clip + int7 Embeddings)

Verbatim copy of records/track_10min_16mb/2026-04-13_AdaptiveClip_Emb7_MLR026/train_gpt.py
from openai/parameter-golf PR openai#1586. val_bpb 1.07493 (3-seed mean).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 14, 2026
…; PRISM + Ouroboros papers; Session 13

- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe