Skip to content

fix NaN loss not caught by fast-fail check#84

Merged
karpathy merged 2 commits intokarpathy:masterfrom
haosenwang1018:fix/nan-loss-fast-fail
Mar 11, 2026
Merged

fix NaN loss not caught by fast-fail check#84
karpathy merged 2 commits intokarpathy:masterfrom
haosenwang1018:fix/nan-loss-fast-fail

Conversation

@haosenwang1018
Copy link
Contributor

The fast-fail check train_loss_f > 100 silently passes on NaN because IEEE 754 NaN comparisons always return False. When an agent experiment produces NaN (e.g. from an aggressive LR change), the run wastes the full 5-minute budget training on garbage instead of aborting immediately.

not (x <= 100) is the standard negated-comparison idiom that catches both >100 and NaN, with no added complexity.

`train_loss_f > 100` silently passes on NaN because IEEE 754 NaN
comparisons always return False. When an agent experiment produces
NaN (e.g. from an aggressive LR change), the run wastes the full
5-minute budget instead of failing fast.

`not (x <= 100)` catches both >100 and NaN with no added complexity.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
sunnypatneedi added a commit to sunnypatneedi/autoresearch-muon that referenced this pull request Mar 10, 2026
Fixes adopted from karpathy/autoresearch PRs:

- karpathy#84: NaN loss bypasses fast-fail (IEEE 754: NaN > 100 is False).
  Fix: `not x <= 100`. Applied to both train.py and train_mlx.py.
- karpathy#83: ParquetFile handles never closed, causing FD exhaustion on
  multi-epoch training. Fix: try/finally with pf.close().
- karpathy#107: Save pre-eval checkpoint so eval OOM/crash doesn't lose
  the entire training run. Removed on successful eval.
- karpathy#93: MFU off-by-one: warmup skips 11 steps (0-10), not 10.
- karpathy#70: Loss only reported last microstep, not average across grad
  accumulation. Fix: accumulate loss += detach() / grad_accum_steps.
- karpathy#53: Debug checkpoint on loss explosion with step/loss metadata
  for post-mortem analysis (train.py only, merged into karpathy#84 fix).
- karpathy#62: Input validation for --num-shards and --download-workers.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
sunnypatneedi added a commit to sunnypatneedi/autoresearch-muon that referenced this pull request Mar 10, 2026
Fixes adopted from karpathy/autoresearch PRs:

- karpathy#84: NaN loss bypasses fast-fail (IEEE 754: NaN > 100 is False).
  Fix: `not x <= 100`. Applied to both train.py and train_mlx.py.
- karpathy#83: ParquetFile handles never closed, causing FD exhaustion on
  multi-epoch training. Fix: try/finally with pf.close().
- karpathy#107: Save pre-eval checkpoint so eval OOM/crash doesn't lose
  the entire training run. Removed on successful eval.
- karpathy#93: MFU off-by-one: warmup skips 11 steps (0-10), not 10.
- karpathy#70: Loss only reported last microstep, not average across grad
  accumulation. Fix: accumulate loss += detach() / grad_accum_steps.
- karpathy#53: Debug checkpoint on loss explosion with step/loss metadata
  for post-mortem analysis (train.py only, merged into karpathy#84 fix).
- karpathy#62: Input validation for --num-shards and --download-workers.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Copy link
Collaborator

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh interesting, I didn't know this about direct nan comparisons! But I can confirm that if not math.nan <= 100 would hit the fast fail, while if math.nan > 100 wouldn't. So this fix is valid.

But wouldn't it be nicer & less error-prone to just check explicitely?

if train_loss is math.nan or train_loss > 100:

@haosenwang1018
Copy link
Contributor Author

Great suggestion — updated to an explicit check now:\n\n\n\nAlso added accordingly.

@haosenwang1018
Copy link
Contributor Author

(Follow-up) Code snippet got stripped in previous comment by shell quoting. The update is now: if math.isnan(train_loss_f) or train_loss_f > 100.

@karpathy karpathy merged commit 0be1e4f into karpathy:master Mar 11, 2026
@svlandeg svlandeg added the bug Something isn't working label Mar 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants