Record: VarLen Attention + Fused MLP + Multi-Phase Global SGD TTT — val_bpb 1.07193 (3-seed mean)#1626

Open
dexhunter wants to merge 1 commit into openai:main from dexhunter:dexhunter/multiphase-sgd-ttt

@dexhunter

Summary

Results

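| Seed | Post-TTT BPB | val_loss (nats) | Artifact (bytes) |
|------|--------------|-----------------|------------------|
| 42   | 1.07280      | 2.77116         | 15,932,897       |
| 0    | 1.07134      | 2.76739         | 15,939,841       |
| 1234 | 1.07164      | 2.76815         | 15,932,419       |
| Mean | 1.07193      | 2.76890         |                  |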
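
The two metric columns can be cross-checked against each other. The sketch below assumes that `val_loss` is a per-token mean cross-entropy in nats and that BPB follows the usual definition, loss in bits divided by average bytes per token. Neither assumption is stated in this PR, and the helper name `implied_bytes_per_token` is hypothetical. Under those assumptions every seed should imply the same bytes-per-token ratio, since all seeds share one tokenizer and one validation set.

```python
import math

# (Post-TTT BPB, val_loss in nats) per seed, copied from the results table.
rows = {
    42: (1.07280, 2.77116),
    0: (1.07134, 2.76739),
    1234: (1.07164, 2.76815),
}


def implied_bytes_per_token(bpb: float, loss_nats: float) -> float:
    """Solve bpb = (loss_nats / ln 2) / bytes_per_token for bytes_per_token.

    Assumes loss_nats is a per-token mean; this is an assumption, not
    something the PR states.
    """
    return loss_nats / (math.log(2) * bpb)


implied = {seed: implied_bytes_per_token(b, l) for seed, (b, l) in rows.items()}
spread = max(implied.values()) - min(implied.values())

for seed, ratio in implied.items():
    print(f"seed {seed}: implied bytes/token = {ratio:.4f}")
print(f"spread across seeds = {spread:.2e}")
```

If the assumptions hold, the three ratios agree to within rounding of the reported figures (roughly 3.73 bytes per token), which is a quick sanity check that the BPB and nats columns describe the same runs.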

Key Innovation

Multi-phase global SGD: instead of a single SGD round on prefix docs (PR #1610), we split evaluation into 3 phases. Each phase scores a chunk and then runs SGD, so the next chunk is scored with the improved model. This progressively adapts the base model while maintaining strict score-before-update legality. The 3-phase variant lowers BPB by 0.0008 relative to single-phase.
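
The score-before-update ordering described above can be illustrated with a toy sketch. This is not the code in this PR: the one-parameter model, the squared-error loss, the equal-size chunking, and the names `multiphase_eval`, `loss`, and `grad` are all assumptions made for illustration. The only property it aims to show is that every document is scored with weights that have never been trained on it.

```python
from typing import Sequence, Tuple


def loss(w: float, x: float) -> float:
    """Toy squared-error loss of a constant predictor w on target x."""
    return (w - x) ** 2


def grad(w: float, x: float) -> float:
    """Gradient of the toy loss with respect to w."""
    return 2.0 * (w - x)


def multiphase_eval(
    w: float, docs: Sequence[float], num_phases: int = 3, lr: float = 0.1
) -> Tuple[float, float]:
    """Return (mean scored loss, final weight) under score-before-update.

    Hypothetical layout: docs are split into num_phases equal chunks.
    """
    chunk = -(-len(docs) // num_phases)  # ceiling division
    total = 0.0
    for phase in range(num_phases):
        batch = docs[phase * chunk : (phase + 1) * chunk]
        # 1) Score first: these docs have not influenced w yet.
        total += sum(loss(w, x) for x in batch)
        # 2) Only then adapt on the docs that were just scored.
        #    The last phase skips SGD because nothing is left to score.
        if phase < num_phases - 1:
            for x in batch:
                w -= lr * grad(w, x)
    return total / len(docs), w


docs = [1.0] * 6
no_adapt_loss, _ = multiphase_eval(0.0, docs, num_phases=1)
three_phase_loss, w_final = multiphase_eval(0.0, docs, num_phases=3)
print(no_adapt_loss, three_phase_loss, w_final)
```

In this toy setting, later chunks are scored by a model that has already adapted to earlier ones, so the three-phase mean loss falls below the no-adaptation baseline (`num_phases=1`, which performs no SGD at all in this sketch and is not the single-phase setup of PR #1610).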

Test plan

  • Verify 3-seed mean and std
  • Check artifact sizes < 16 MB
  • Verify score-before-update ordering in TTT logs
  • Check code consistency across seeds
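
The first two checks in the test plan can be reproduced from the results table alone. The sketch below is a minimal check, assuming the 16 MB limit means 16,000,000 bytes (the decimal reading; the PR does not say which convention applies) and comparing both the population and the sample standard deviation, since the PR does not say which one it reports.

```python
import statistics

# seed -> (Post-TTT BPB, val_loss in nats, artifact size in bytes), from the table.
results = {
    42: (1.07280, 2.77116, 15_932_897),
    0: (1.07134, 2.76739, 15_939_841),
    1234: (1.07164, 2.76815, 15_932_419),
}

SIZE_LIMIT_BYTES = 16_000_000  # assumption: decimal megabytes

bpb = [r[0] for r in results.values()]
nats = [r[1] for r in results.values()]
sizes = [r[2] for r in results.values()]

mean_bpb = statistics.fmean(bpb)
mean_nats = statistics.fmean(nats)
std_population = statistics.pstdev(bpb)  # divides by n
std_sample = statistics.stdev(bpb)       # divides by n - 1
all_under_limit = all(size < SIZE_LIMIT_BYTES for size in sizes)

print(f"mean BPB  = {mean_bpb:.5f}")
print(f"mean nats = {mean_nats:.5f}")
print(f"std (population) = {std_population:.5f}")
print(f"std (sample)     = {std_sample:.5f}")
print(f"all artifacts under limit: {all_under_limit}")
```

With the table values, the population standard deviation rounds to the reported 0.00063, while the sample standard deviation is about 0.00077, so the reported figure appears to use the population form.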

…al_bpb 1.07193 (3-seed mean)

Novel multi-phase global SGD during phased TTT evaluation.
Builds on PR openai#1530 (@samacqua) + PR openai#1610 (@romeerp) phased TTT concept.
3-seed mean: 1.07193 BPB (2.76890 nats), std 0.00063.
Seeds: 42, 0, 1234. All artifacts <16 MB.
@romeerp

romeerp commented Apr 15, 2026

Wanted to implement this multi-phased strategy but didn't have compute to run tests for it. Glad you were able to do it and show improvement!
