Add optional pre-verification to skip doomed experiments#90

Open
gtsbahamas wants to merge 1 commit into karpathy:master from gtsbahamas:feat/pre-verification
Conversation

@gtsbahamas

What

Adds an optional pre-verification step to program.md. Before spending 5 minutes on a training run, the agent can run a ~75-second check that extracts implicit claims from the code (shape expectations, import requirements, API contracts) and flags changes likely to crash.

Why

Autonomous overnight runs produce ~100 experiments. Some crash immediately (OOM, shape mismatch, missing import). Each crash still burns 5 minutes of GPU time plus agent context window on stack trace diagnosis.

Pre-verification catches these before training starts.
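
The PR's actual check uses an LLM to extract implicit claims; as a hedged, non-LLM sketch of the same pre-filter idea, the two cheapest classes of doomed run (syntax errors and unresolvable imports) can be caught in well under a second with only the standard library. The function name `pre_verify` is illustrative, not from the PR:

```python
# Minimal sketch of the pre-verification idea, WITHOUT the LLM component
# the PR actually uses. Catches syntax errors and missing imports only;
# shape expectations and API contracts are out of scope here.
import ast
import importlib.util
import py_compile


def pre_verify(path: str) -> list[str]:
    """Return a list of problems that would doom a training run."""
    problems = []
    # 1. Syntax: does the file even byte-compile?
    try:
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as exc:
        problems.append(f"syntax error: {exc.msg}")
        return problems  # nothing else is checkable past a syntax error
    # 2. Imports: is every absolute top-level package resolvable right now?
    with open(path) as f:
        tree = ast.parse(f.read())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]
            if importlib.util.find_spec(root) is None:
                problems.append(f"unresolvable import: {name}")
    return problems
```

Running this against a modified train.py before launching training is the same shape of decision the PR proposes: a non-empty problem list means skip the run.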

Evidence

Ran 12 LLM-generated architectural modifications on an A10G GPU. For each, Claude generated a complete modified train.py, verification ran, then training ran regardless to compare predictions vs reality.

| Modification | Verification Score | Bugs Found | Predicted Crash | Actually Crashed |
|---|---|---|---|---|
| GELU → SiLU activation | 85 | 6 | Yes | Yes |
| Add dropout p=0.1 | 60 | 16 | Yes | Yes |
| Double MLP width (8x) | 58 | 10 | Yes | Yes |
| Sliding window attention | 47 | 16 | Yes | Yes |
| RMSNorm replaces LayerNorm | 55 | 17 | Yes | Yes |
| 4 layers, n_embd=1024 | 71 | 5 | Yes | Yes |
| MoE 4 experts, top-2 | 50 | 17 | Yes | Yes |

Verification correctly predicted the crash in all 7 runs where it executed. The other 5 runs errored during code generation (Claude API failures), before verification could run.

Limitations (being honest)

  • Every modification in this test crashed. We have 7 true positives but 0 true negatives (no cases where verification correctly passed a good modification). The 100% crash rate likely reflects overly aggressive LLM-generated changes.
  • Verification uses an LLM internally, so it can also hallucinate. It's a pre-filter, not a guarantee.
  • Requires Node.js 20+ and an ANTHROPIC_API_KEY, which won't be available in all environments. The step is marked optional for this reason.
  • Adds ~75 seconds per experiment. At a baseline of 12 experiments/hour (5 minutes each), this reduces throughput to ~10/hour. It pays off only when the crash rate exceeds ~25% (75 s check vs. 300 s wasted per crash).
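
The break-even figure in the last bullet follows from the run lengths above (300 s training, 75 s verification). A small sketch of the arithmetic, assuming verification catches every doomed run, as it did in the 7/7 sample:

```python
# Throughput arithmetic behind the ~25% break-even claim.
# Assumed constants from the PR: 5 min training run, ~75 s verification.
TRAIN_S = 300.0
VERIFY_S = 75.0


def useful_per_hour(crash_rate: float, verify: bool) -> float:
    """Expected *completed* (non-crashing) experiments per hour.

    Assumes verification catches every doomed run, so with it enabled a
    doomed run costs only the 75 s check instead of the full 5 minutes.
    """
    if verify:
        avg_cost = VERIFY_S + (1.0 - crash_rate) * TRAIN_S
    else:
        avg_cost = TRAIN_S  # crashes still burn the full run
    return (1.0 - crash_rate) * 3600.0 / avg_cost


# Break-even: verification pays off once crash_rate > VERIFY_S / TRAIN_S = 25%.
```

At a 0% crash rate this gives 3600/375 = 9.6 completed experiments/hour, matching the ~10/hour figure above; at 50% it is 8/hour with verification vs. 6/hour without.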

The change

One paragraph added to program.md under "Crashes", marked as optional. No new files, no changes to train.py or prepare.py.

npx -y tryassay assess . --no-publish --no-review -y

Full experiment data: results TSV

Add an optional Assay verification step before training. Takes ~60-90s
to check for shape mismatches, missing imports, and API violations.
If the code looks broken, skip training instead of wasting 5 min GPU time.

Tested with 12 LLM-generated architectural modifications on A10G:
- 7/7 runs where verification executed: correctly predicted all crashes
- Modifications tested: SiLU swap, dropout, MoE, sliding window attention,
  RMSNorm, doubled MLP width, ALiBi, architecture reshape
- Average verification time: ~75 seconds

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
