fix NaN loss not caught by fast-fail check by haosenwang1018 · Pull Request #84 · karpathy/autoresearch

haosenwang1018 · 2026-03-09T15:51:16Z

The fast-fail check train_loss_f > 100 silently passes on NaN because IEEE 754 NaN comparisons always return False. When an agent experiment produces NaN (e.g. from an aggressive LR change), the run wastes the full 5-minute budget training on garbage instead of aborting immediately.

not (x <= 100) is the standard negated-comparison idiom that catches both >100 and NaN, with no added complexity.

`train_loss_f > 100` silently passes on NaN because IEEE 754 NaN comparisons always return False. When an agent experiment produces NaN (e.g. from an aggressive LR change), the run wastes the full 5-minute budget instead of failing fast. `not (x <= 100)` catches both >100 and NaN with no added complexity. Co-Authored-By: Claude Opus 4.6 <[email protected]>

Fixes adopted from karpathy/autoresearch PRs: - karpathy#84: NaN loss bypasses fast-fail (IEEE 754: NaN > 100 is False). Fix: `not x <= 100`. Applied to both train.py and train_mlx.py. - karpathy#83: ParquetFile handles never closed, causing FD exhaustion on multi-epoch training. Fix: try/finally with pf.close(). - karpathy#107: Save pre-eval checkpoint so eval OOM/crash doesn't lose the entire training run. Removed on successful eval. - karpathy#93: MFU off-by-one: warmup skips 11 steps (0-10), not 10. - karpathy#70: Loss only reported last microstep, not average across grad accumulation. Fix: accumulate loss += detach() / grad_accum_steps. - karpathy#53: Debug checkpoint on loss explosion with step/loss metadata for post-mortem analysis (train.py only, merged into karpathy#84 fix). - karpathy#62: Input validation for --num-shards and --download-workers. Co-Authored-By: Claude Opus 4.6 <[email protected]>

svlandeg

Oh interesting, I didn't know this about direct nan comparisons! But I can confirm that if not math.nan <= 100 would hit the fast fail, while if math.nan > 100 wouldn't. So this fix is valid.

But wouldn't it be nicer & less error-prone to just check explicitely?

if train_loss is math.nan or train_loss > 100:

haosenwang1018 · 2026-03-11T04:28:17Z

Great suggestion — updated to an explicit check now:\n\n\n\nAlso added accordingly.

haosenwang1018 · 2026-03-11T04:28:38Z

(Follow-up) Code snippet got stripped in previous comment by shell quoting. The update is now: if math.isnan(train_loss_f) or train_loss_f > 100.

svlandeg approved these changes Mar 10, 2026

View reviewed changes

fix(train): make NaN fast-fail check explicit

ebf3578

karpathy merged commit 0be1e4f into karpathy:master Mar 11, 2026

svlandeg added the bug Something isn't working label Mar 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix NaN loss not caught by fast-fail check#84

fix NaN loss not caught by fast-fail check#84
karpathy merged 2 commits intokarpathy:masterfrom
haosenwang1018:fix/nan-loss-fast-fail

haosenwang1018 commented Mar 9, 2026

Uh oh!

svlandeg left a comment

Uh oh!

haosenwang1018 commented Mar 11, 2026

Uh oh!

haosenwang1018 commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

haosenwang1018 commented Mar 9, 2026

Uh oh!

svlandeg left a comment

Choose a reason for hiding this comment

Uh oh!

haosenwang1018 commented Mar 11, 2026

Uh oh!

haosenwang1018 commented Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants