
Save pre-eval checkpoint to prevent training loss on eval crash#107

Open
sunilp wants to merge 1 commit into karpathy:master from sunilp:fix/pre-eval-checkpoint

Conversation


@sunilp sunilp commented Mar 10, 2026

Summary

Saves a lightweight checkpoint (model state dict) after the training loop completes and before evaluate_bpb() runs. If evaluation crashes (OOM, CUDA error), the trained weights survive for inspection or metric recovery. On successful eval, the checkpoint is cleaned up automatically.

  • What: torch.save(model.state_dict(), "pre_eval_checkpoint.pt") between training and eval
  • Why: The agent loop runs experiments overnight — a crash during eval wastes a full 5-minute training cycle with no recovery path
  • Impact: +5 lines, no change to training loop or eval logic
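The save/eval/cleanup sequence described above can be sketched as follows. This is an illustration, not the PR's diff: in the actual change the save is `torch.save(model.state_dict(), "pre_eval_checkpoint.pt")` and the eval is `evaluate_bpb()`; here both are passed in as stand-in callables so the sketch is self-contained and runnable.

```python
import os
import tempfile


def eval_with_pre_eval_checkpoint(save_fn, eval_fn, ckpt_path):
    """Save a checkpoint before eval; delete it only if eval succeeds.

    save_fn(path)  -- stand-in for torch.save(model.state_dict(), path)
    eval_fn()      -- stand-in for evaluate_bpb()
    """
    save_fn(ckpt_path)
    result = eval_fn()    # if this raises (OOM, CUDA error), the checkpoint survives
    os.remove(ckpt_path)  # successful eval: clean up automatically
    return result
```

The key design point is that cleanup happens only on the success path, so any exception raised during evaluation leaves the trained weights on disk for inspection or metric recovery.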

Fixes #7

After the training loop completes and before evaluation begins, save
a lightweight checkpoint (model state dict). If evaluation crashes
(e.g. OOM when the agent has increased model size), the checkpoint
survives for inspection or metric recovery. On successful eval, the
checkpoint is cleaned up automatically.

Fixes karpathy#7
sunnypatneedi added a commit to sunnypatneedi/autoresearch-muon that referenced this pull request Mar 10, 2026
Fixes adopted from karpathy/autoresearch PRs:

- karpathy#84: NaN loss bypasses fast-fail (IEEE 754: NaN > 100 is False).
  Fix: `not x <= 100`. Applied to both train.py and train_mlx.py.
- karpathy#83: ParquetFile handles never closed, causing FD exhaustion on
  multi-epoch training. Fix: try/finally with pf.close().
- karpathy#107: Save pre-eval checkpoint so eval OOM/crash doesn't lose
  the entire training run. Removed on successful eval.
- karpathy#93: MFU off-by-one: warmup skips 11 steps (0-10), not 10.
- karpathy#70: Loss only reported last microstep, not average across grad
  accumulation. Fix: accumulate loss += detach() / grad_accum_steps.
- karpathy#53: Debug checkpoint on loss explosion with step/loss metadata
  for post-mortem analysis (train.py only, merged into karpathy#84 fix).
- karpathy#62: Input validation for --num-shards and --download-workers.
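The NaN fast-fail item (karpathy#84) above rests on IEEE 754 comparison semantics: NaN compares False against everything, so a `loss > 100` guard never trips on a NaN loss. A minimal demonstration of the bug and the fix:

```python
loss = float("nan")

# Buggy fast-fail: NaN > 100 evaluates to False, so training continues.
buggy_trips = loss > 100

# Fixed check: `not loss <= 100` is True for NaN as well as for any
# genuinely exploded loss, because NaN <= 100 is also False.
fixed_trips = not loss <= 100
```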

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
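The loss-reporting item (karpathy#70) in the list above can be illustrated with plain numbers; the per-microbatch losses here are hypothetical, not from any real run:

```python
grad_accum_steps = 4
microstep_losses = [2.0, 1.5, 1.0, 0.5]  # hypothetical per-microbatch losses

# Buggy reporting: only the last microstep's loss is shown.
last_only = microstep_losses[-1]

# Fix: accumulate loss += detach() / grad_accum_steps each microstep,
# yielding the average over the accumulation window.
avg = sum(l / grad_accum_steps for l in microstep_losses)
```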

Development

Successfully merging this pull request may close these issues.

Training results lost if evaluation crashes (no pre-eval checkpoint)