
Save pre-eval checkpoint to prevent training loss on eval crash#107

Open
sunilp wants to merge 1 commit into karpathy:master from sunilp:fix/pre-eval-checkpoint

Conversation


@sunilp sunilp commented Mar 10, 2026

Summary

Saves a lightweight checkpoint (model state dict) after the training loop completes and before evaluate_bpb() runs. If evaluation crashes (OOM, CUDA error), the trained weights survive for inspection or metric recovery. On successful eval, the checkpoint is cleaned up automatically.

  • What: torch.save(model.state_dict(), "pre_eval_checkpoint.pt") between training and eval
  • Why: The agent loop runs experiments overnight — a crash during eval wastes a full 5-minute training cycle with no recovery path
  • Impact: +5 lines, no change to training loop or eval logic
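The save/eval/cleanup sequence described above can be sketched as follows. This is an illustration, not the PR's diff: in the actual change the save is `torch.save(model.state_dict(), "pre_eval_checkpoint.pt")` and the eval is `evaluate_bpb()`; here both are passed in as stand-in callables so the sketch is self-contained and runnable.

```python
import os
import tempfile


def eval_with_pre_eval_checkpoint(save_fn, eval_fn, ckpt_path):
    """Save a checkpoint before eval; delete it only if eval succeeds.

    save_fn(path)  -- stand-in for torch.save(model.state_dict(), path)
    eval_fn()      -- stand-in for evaluate_bpb()
    """
    save_fn(ckpt_path)
    result = eval_fn()    # if this raises (OOM, CUDA error), the checkpoint survives
    os.remove(ckpt_path)  # successful eval: clean up automatically
    return result
```

The key design point is that cleanup happens only on the success path, so any exception raised during evaluation leaves the trained weights on disk for inspection or metric recovery.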

Fixes #7

After the training loop completes and before evaluation begins, save
a lightweight checkpoint (model state dict). If evaluation crashes
(e.g. OOM when the agent has increased model size), the checkpoint
survives for inspection or metric recovery. On successful eval, the
checkpoint is cleaned up automatically.

Fixes karpathy#7
sunnypatneedi added a commit to sunnypatneedi/autoresearch-muon that referenced this pull request Mar 10, 2026
Fixes adopted from karpathy/autoresearch PRs:

- karpathy#84: NaN loss bypasses fast-fail (IEEE 754: NaN > 100 is False).
  Fix: `not x <= 100`. Applied to both train.py and train_mlx.py.
- karpathy#83: ParquetFile handles never closed, causing FD exhaustion on
  multi-epoch training. Fix: try/finally with pf.close().
- karpathy#107: Save pre-eval checkpoint so eval OOM/crash doesn't lose
  the entire training run. Removed on successful eval.
- karpathy#93: MFU off-by-one: warmup skips 11 steps (0-10), not 10.
- karpathy#70: Loss only reported last microstep, not average across grad
  accumulation. Fix: accumulate loss += detach() / grad_accum_steps.
- karpathy#53: Debug checkpoint on loss explosion with step/loss metadata
  for post-mortem analysis (train.py only, merged into karpathy#84 fix).
- karpathy#62: Input validation for --num-shards and --download-workers.
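The NaN fast-fail item (karpathy#84) above rests on IEEE 754 comparison semantics: NaN compares False against everything, so a `loss > 100` guard never trips on a NaN loss. A minimal demonstration of the bug and the fix:

```python
loss = float("nan")

# Buggy fast-fail: NaN > 100 evaluates to False, so training continues.
buggy_trips = loss > 100

# Fixed check: `not loss <= 100` is True for NaN as well as for any
# genuinely exploded loss, because NaN <= 100 is also False.
fixed_trips = not loss <= 100
```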

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
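The loss-reporting item (karpathy#70) in the list above can be illustrated with plain numbers; the per-microbatch losses here are hypothetical, not from any real run:

```python
grad_accum_steps = 4
microstep_losses = [2.0, 1.5, 1.0, 0.5]  # hypothetical per-microbatch losses

# Buggy reporting: only the last microstep's loss is shown.
last_only = microstep_losses[-1]

# Fix: accumulate loss += detach() / grad_accum_steps each microstep,
# yielding the average over the accumulation window.
avg = sum(l / grad_accum_steps for l in microstep_losses)
```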

Development

Successfully merging this pull request may close these issues.

Training results lost if evaluation crashes (no pre-eval checkpoint)