
Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean)#1610

Open
romeerp wants to merge 4 commits into openai:main from romeerp:codex/phased-ttt-2000

Conversation

@romeerp

@romeerp romeerp commented Apr 14, 2026

This builds directly on PR #1530. Training is unchanged; the only change is in evaluation.

Results:

| Seed | val_loss   | val_bpb    | eval_time | artifact_size |
|------|------------|------------|-----------|---------------|
| 0    | 2.76951521 | 1.07216564 | 500.104 s | 15,996,697 B  |
| 1    | 2.77167493 | 1.07300174 | 515.324 s | 15,995,985 B  |
| 2    | 2.77232000 | 1.07325147 | 504.949 s | 15,988,805 B  |
| avg  | 2.77117005 | 1.07280628 | 506.792 s | 15,993,829 B  |

All 3 seeds are under the 600s eval budget and under the 16 MB artifact cap.
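For reference, val_bpb is just the nat-valued loss rescaled by log 2 and by the tokenizer's tokens-per-byte ratio. A minimal sketch of the conversion (the ~0.268 tokens/byte ratio here is back-derived from the reported mean, not taken from the eval code):

```python
import math

def nats_per_token_to_bpb(val_loss_nats, tokens_per_byte):
    """Convert mean loss in nats/token to bits/byte."""
    bits_per_token = val_loss_nats / math.log(2)
    return bits_per_token * tokens_per_byte

# Ratio back-derived from this PR's reported means (assumption, not measured):
TOKENS_PER_BYTE = 1.07280628 / (2.77117005 / math.log(2))

bpb = nats_per_token_to_bpb(2.77117005, TOKENS_PER_BYTE)  # matches the reported 1.0728 mean
```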

Compared to the original PR #1530 submission mean:

| Metric   | PR #1530   | This submission | Delta       |
|----------|------------|-----------------|-------------|
| val_loss | 2.77261037 | 2.77117005      | -0.00144032 |
| val_bpb  | 1.07336388 | 1.07280628      | -0.00055760 |

Method:

  1. Run the stock PR #1530 (Varlen attention + fused MLP + doc-independent TTT, 1.07336) LoRA TTT evaluator on its single global length-sorted queue.
  2. After 2000 queue-completed documents have been fully scored, pause once.
  3. Gather exactly those already-scored documents in queue order.
  4. Run distributed global SGD on that scored prefix.
  5. Resume the same queue with the updated base model.
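The five steps above can be sketched as one loop (hypothetical names; the real logic lives in train_gpt.py behind PHASED_TTT_ENABLED / PHASED_TTT_PREFIX_DOCS):

```python
PREFIX_DOCS = 2000  # PHASED_TTT_PREFIX_DOCS

def phased_eval(docs, score_doc, global_sgd_step, prefix_docs=PREFIX_DOCS):
    """Score every document once, in queue order. After `prefix_docs`
    documents have been fully scored, pause once, run global SGD on exactly
    that already-scored prefix, then resume scoring with the updated base."""
    scores, scored_prefix = [], []
    paused = False
    for doc in docs:
        scores.append(score_doc(doc))       # LoRA TTT scoring, counted first
        if not paused:
            scored_prefix.append(doc)
            if len(scored_prefix) == prefix_docs:
                global_sgd_step(scored_prefix)  # trains only on scored docs
                paused = True                   # single pause, then resume
    return scores
```

Note the ordering: `score_doc` always runs before a document can enter `scored_prefix`, which is what makes the SGD phase score-first by construction.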

Legality:

  • LoRA scoring happens before LoRA updates on those chunks.
  • Global SGD only trains on documents that have already been fully scored.
  • After the pause, evaluation resumes on future queue items only.
  • So no token is used for adaptation before its score has already been counted.
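One way to state the legality claim precisely: treat the run as an event log of (op, doc) pairs and require that no document is ever trained on before it has been scored. A small checker sketch (illustrative only, not part of the PR):

```python
def score_first_ok(events):
    """Check the score-first invariant over an (op, doc_id) event log:
    every ("train", d) must come strictly after ("score", d)."""
    scored = set()
    for op, doc in events:
        if op == "score":
            scored.add(doc)
        elif op == "train" and doc not in scored:
            return False  # trained on a document before its score was counted
    return True
```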

Intuition:

  • PR #1530's LoRA TTT is a local adaptation mechanism. It lets the model fit the current document quickly, but that adaptation is discarded when the document ends.
  • The added global SGD phase is meant to improve the shared base model itself on a score-first prefix, so later documents can benefit from a slightly better base model before local LoRA adaptation is applied.
  • In that sense, LoRA handles fast document-local adaptation, while global SGD tries to capture reusable cross-document adaptation.
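A toy scalar sketch of this split (illustrative only; names are made up): the shared base persists and is moved only by the phased global SGD step, while the LoRA delta is reset at every document boundary.

```python
class TTTState:
    """Toy model of the two adaptation paths: a persistent base weight
    plus a per-document LoRA delta that is discarded at each doc boundary."""

    def __init__(self):
        self.base = 0.0   # shared base model (global SGD target, persists)
        self.lora = 0.0   # per-document delta (LoRA target, transient)

    def effective(self):
        return self.base + self.lora   # weights used while scoring a doc

    def local_step(self, grad):
        self.lora -= 0.1 * grad        # fast document-local adaptation

    def end_of_doc(self):
        self.lora = 0.0                # LoRA adaptation is discarded

    def global_sgd_step(self, grad):
        self.base -= 0.01 * grad       # persists into later documents
```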

Implementation note:

  • I initially tried a more continuous hybrid scheme where local and global updates happened throughout eval.
  • That version was harder to make run well in distributed form without introducing too much synchronization overhead.
  • I simplified the final implementation into a phased process because it is much easier to reason about, clearly score-first, and still fits within the 600s eval budget.
  • I do not think this implementation is especially optimized yet; the main goal here was to get a clean legal baseline for combining local LoRA TTT with global base-model adaptation.

Run instructions:

Train + quantize + phased eval for one seed:

```bash
SEED=0 ARTIFACT_DIR="runs/varlen0" \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Eval-only on an existing checkpoint:

```bash
SEED=0 EVAL_ONLY_PATH="runs/varlen0/final_model.pt" \
PHASED_TTT_ENABLED=1 PHASED_TTT_PREFIX_DOCS=2000 \
  torchrun --standalone --nproc_per_node=8 train_gpt.py
```

@romeerp romeerp marked this pull request as ready for review April 14, 2026 05:30
@romeerp romeerp changed the title Add phased global SGD TTT prefix submission Record: VarLenAttn + PhasingTTT - val_bpb 1.0728 (3-seed mean) Apr 14, 2026
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Bring AGENTS.md, AGENT_SYNC.md, project-state.md, decisions.md,
and next-session.md to the openai#1610-direct strategy. Add locked
execution plan (PLAN_PR1610_CORRECTOR.md Rev 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Exact copy from PR openai#1610 at SHA ca19195.
MD5: 57cfda2047b2c2a63ec10b99d704bfb0. 3379 lines, 139831 bytes.
This is the unmodified source base; corrector will be added in later commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
amrayach added a commit to amrayach/parameter-golf that referenced this pull request Apr 14, 2026
Setup, seed-0 (Gate A), seed-1/2 (Gate B) subcommands with
published BPB verification targets and kill criteria.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 14, 2026
…; PRISM + Ouroboros papers; Session 13

- Merged SOTA 1.0810 unchanged (5-day plateau; 16 days to deadline)
- PR openai#1610 (romeerp, 1.0728): VarLenAttn + PhasingTTT — legal, score-first compliant, but low EV (-0.0006 bpb)
- PR openai#1619 flagged likely illegal (AdamW TTT — same pattern as rejected PR openai#771)
- PRISM (arXiv:2602.10796, Feb 2026): Parallel Residual Iterative Sequence Model, 174x throughput — read before next recurrence architecture decision
- Ouroboros (arXiv:2604.02051, Apr 2026): input-conditioned LoRA modulation for recursive transformers — watch
- Session 13 added to CLAUDE.md; no strategy change (PR openai#1586 per-layer GPTQ still #1 priority)
- daily_research.md Apr 14 entry added at top

https://claude.ai/code/session_01GLn4VtS8D1uehRZnfb4dRe
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 14, 2026
…al_bpb 1.07193 (3-seed mean)

Novel multi-phase global SGD during phased TTT evaluation.
Builds on PR openai#1530 (@samacqua) + PR openai#1610 (@romeerp) phased TTT concept.
3-seed mean: 1.07193 BPB (2.76890 nats), std 0.00063.
Seeds: 42, 0, 1234. All artifacts <16 MB.