
Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean)#1610

Open
romeerp wants to merge 4 commits into openai:main from romeerp:codex/phased-ttt-2000

Conversation

@romeerp

@romeerp romeerp commented Apr 14, 2026

This builds directly on PR #1530. Training is unchanged; the only change is in evaluation.

Results:

| Seed | val_loss   | val_bpb    | eval_time | artifact_size |
|------|------------|------------|-----------|---------------|
| 0    | 2.76951521 | 1.07216564 | 500.104 s | 15,996,697 B  |
| 1    | 2.77167493 | 1.07300174 | 515.324 s | 15,995,985 B  |
| 2    | 2.77232000 | 1.07325147 | 504.949 s | 15,988,805 B  |
| avg  | 2.77117005 | 1.07280628 | 506.792 s | 15,993,829 B  |

All 3 seeds are under the 600s eval budget and under the 16 MB artifact cap.
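For reference, val_bpb is just the nat-valued loss rescaled by log 2 and by the tokenizer's tokens-per-byte ratio. A minimal sketch of the conversion (the ~0.268 tokens/byte ratio here is back-derived from the reported mean, not taken from the eval code):

```python
import math

def nats_per_token_to_bpb(val_loss_nats, tokens_per_byte):
    """Convert mean loss in nats/token to bits/byte."""
    bits_per_token = val_loss_nats / math.log(2)
    return bits_per_token * tokens_per_byte

# Ratio back-derived from this PR's reported means (assumption, not measured):
TOKENS_PER_BYTE = 1.07280628 / (2.77117005 / math.log(2))

bpb = nats_per_token_to_bpb(2.77117005, TOKENS_PER_BYTE)  # matches the reported 1.0728 mean
```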

Compared to the original PR #1530 submission mean:

| Metric   | PR #1530   | This submission | Delta       |
|----------|------------|-----------------|-------------|
| val_loss | 2.77261037 | 2.77117005      | -0.00144032 |
| val_bpb  | 1.07336388 | 1.07280628      | -0.00055760 |

Method:

  1. Run the stock PR #1530 (Varlen attention + fused MLP + doc-independent TTT, 1.07336) LoRA TTT evaluator on its single global length-sorted queue.
  2. After 2000 queue-completed documents have been fully scored, pause once.
  3. Gather exactly those already-scored documents in queue order.
  4. Run distributed global SGD on that scored prefix.
  5. Resume the same queue with the updated base model.
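The five steps above can be sketched as one loop (hypothetical names; the real logic lives in train_gpt.py behind PHASED_TTT_ENABLED / PHASED_TTT_PREFIX_DOCS):

```python
PREFIX_DOCS = 2000  # PHASED_TTT_PREFIX_DOCS

def phased_eval(docs, score_doc, global_sgd_step, prefix_docs=PREFIX_DOCS):
    """Score every document once, in queue order. After `prefix_docs`
    documents have been fully scored, pause once, run global SGD on exactly
    that already-scored prefix, then resume scoring with the updated base."""
    scores, scored_prefix = [], []
    paused = False
    for doc in docs:
        scores.append(score_doc(doc))       # LoRA TTT scoring, counted first
        if not paused:
            scored_prefix.append(doc)
            if len(scored_prefix) == prefix_docs:
                global_sgd_step(scored_prefix)  # trains only on scored docs
                paused = True                   # single pause, then resume
    return scores
```

Note the ordering: `score_doc` always runs before a document can enter `scored_prefix`, which is what makes the SGD phase score-first by construction.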

Legality:

  • LoRA scoring happens before LoRA updates on those chunks.
  • Global SGD only trains on documents that have already been fully scored.
  • After the pause, evaluation resumes on future queue items only.
  • So no token is used for adaptation before its score has already been counted.
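One way to state the legality claim precisely: treat the run as an event log of (op, doc) pairs and require that no document is ever trained on before it has been scored. A small checker sketch (illustrative only, not part of the PR):

```python
def score_first_ok(events):
    """Check the score-first invariant over an (op, doc_id) event log:
    every ("train", d) must come strictly after ("score", d)."""
    scored = set()
    for op, doc in events:
        if op == "score":
            scored.add(doc)
        elif op == "train" and doc not in scored:
            return False  # trained on a document before its score was counted
    return True
```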

Intuition:

  • PR #1530's LoRA TTT is a local adaptation mechanism. It lets the model fit the current document quickly, but that adaptation is discarded when the document ends.
  • The added global SGD phase is meant to improve the shared base model itself on a score-first prefix, so later documents can benefit from a slightly better base model before local LoRA adaptation is applied.
  • In that sense, LoRA handles fast document-local adaptation, while global SGD tries to capture reusable cross-document adaptation.
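A toy scalar sketch of this split (illustrative only; names are made up): the shared base persists and is moved only by the phased global SGD step, while the LoRA delta is reset at every document boundary.

```python
class TTTState:
    """Toy model of the two adaptation paths: a persistent base weight
    plus a per-document LoRA delta that is discarded at each doc boundary."""

    def __init__(self):
        self.base = 0.0   # shared base model (global SGD target, persists)
        self.lora = 0.0   # per-document delta (LoRA target, transient)

    def effective(self):
        return self.base + self.lora   # weights used while scoring a doc

    def local_step(self, grad):
        self.lora -= 0.1 * grad        # fast document-local adaptation

    def end_of_doc(self):
        self.lora = 0.0                # LoRA adaptation is discarded

    def global_sgd_step(self, grad):
        self.base -= 0.01 * grad       # persists into later documents
```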

Implementation note:

  • I initially tried a more continuous hybrid scheme where local and global updates happened throughout eval.
  • That version was harder to make run well in distributed form without introducing too much synchronization overhead.
  • I simplified the final implementation into a phased process because it is much easier to reason about, clearly score-first, and still fits within the 600s eval budget.
  • I do not think this implementation is especially optimized yet; the main goal here was to get a clean legal baseline for combining local LoRA TTT with global base-model adaptation.

Run instructions:

Train + quantize + phased eval for one seed:

```bash
SEED=0 ARTIFACT_DIR="runs/varlen0" \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Eval-only on an existing checkpoint:

```bash
SEED=0 EVAL_ONLY_PATH="runs/varlen0/final_model.pt" \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@romeerp romeerp marked this pull request as ready for review April 14, 2026 05:30
@romeerp romeerp changed the title Add phased global SGD TTT prefix submission Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean) Apr 14, 2026
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Bring AGENTS.md, AGENT_SYNC.md, project-state.md, decisions.md,
and next-session.md to the openai#1610-direct strategy. Add locked
execution plan (PLAN_PR1610_CORRECTOR.md Rev 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Exact copy from PR openai#1610 at SHA ca19195.
MD5: 57cfda2047b2c2a63ec10b99d704bfb0. 3379 lines, 139831 bytes.
This is the unmodified source base; corrector will be added in later commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Setup, seed-0 (Gate A), seed-1/2 (Gate B) subcommands with
published BPB verification targets and kill criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 14, 2026
…; PRISM + Ouroboros papers; Session 13

- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 14, 2026
…al_bpb 1.07193 (3-seed mean)

Novel multi-phase global SGD during phased TTT evaluation.
Builds on PR openai#1530 (@samacqua) + PR openai#1610 (@romeerp) phased TTT concept.
3-seed mean: 1.07193 BPB (2.76890 nats), std 0.00063.
Seeds: 42, 0, 1234. All artifacts <16 MB.