Fix Issue #7: Save checkpoint before evaluation to prevent training loss by teee32 · Pull Request #112 · karpathy/autoresearch

teee32 · 2026-03-10T05:47:42Z

Summary

Fixes Issue #7: Training results lost if evaluation crashes (no pre-eval checkpoint)

Problem

When running unattended overnight experiments, if evaluation crashes (OOM, CUDA errors, etc.) after the full 5-minute training budget, the entire training run is lost.

Solution

Save a checkpoint before each evaluation
Retry failed evaluations up to 3 times with a smaller batch size
Wait progressively longer (2s, 4s, 6s) between retries so GPU memory can be released
Log GPU memory diagnostics for debugging
Automatically clean up the checkpoint after a successful evaluation
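The steps listed above could be sketched roughly as follows. This is a minimal illustration and not the code in train.py: the function name, its parameters, the halving of the batch size, and the use of pickle as a stand-in serializer are all assumptions made for the example. The evaluation routine is passed in as a callable so the control flow can be read and run without a GPU.

```python
import os
import pickle
import tempfile
import time


def evaluate_with_checkpoint(state, evaluate, ckpt_path, batch_size,
                             max_retries=3, base_wait=2.0, sleep=time.sleep):
    """Hypothetical helper: checkpoint `state`, then evaluate with retries."""
    # Write the checkpoint atomically (temp file, then rename) so a crash
    # during the write cannot leave a truncated file behind.
    ckpt_dir = os.path.dirname(os.path.abspath(ckpt_path))
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)  # assumption: a real run might use torch.save
    os.replace(tmp_path, ckpt_path)

    last_error = None
    for attempt in range(max_retries + 1):  # first try plus up to 3 retries
        try:
            result = evaluate(batch_size)
        except (RuntimeError, MemoryError) as err:
            # Assumption: OOM and CUDA failures surface as RuntimeError.
            last_error = err
            if attempt == max_retries:
                break
            sleep(base_wait * (attempt + 1))      # 2s, 4s, 6s
            batch_size = max(1, batch_size // 2)  # assumption: halve each time
        else:
            os.remove(ckpt_path)  # evaluation succeeded, checkpoint not needed
            return result

    # Every attempt failed: keep the checkpoint so training is not lost.
    raise RuntimeError(
        f"evaluation failed; checkpoint kept at {ckpt_path}"
    ) from last_error
```

In this sketch the checkpoint is deleted only on the success path, so any failure, including one that exhausts all retries, leaves the trained state on disk for a later manual evaluation.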

Changes

train.py: +217 lines (checkpoint utilities, retry logic, memory management)
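For readers unfamiliar with the memory-management side, the sketch below shows one way such diagnostics and cleanup between retries might look. It is an illustration only, not the code added to train.py: the function names and the dictionary keys are hypothetical, and the PyTorch import is guarded so the snippet still runs on a machine without PyTorch or CUDA.

```python
import gc

try:
    import torch
except ImportError:  # keep the sketch importable without PyTorch
    torch = None


def gpu_memory_report():
    """Hypothetical helper: return memory figures in MiB, or None without CUDA."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    return {
        "allocated_mib": torch.cuda.memory_allocated() / mib,
        "reserved_mib": torch.cuda.memory_reserved() / mib,
        "peak_allocated_mib": torch.cuda.max_memory_allocated() / mib,
    }


def release_gpu_memory():
    """Hypothetical helper: free cached memory and reset peak statistics."""
    gc.collect()  # drop unreachable Python objects that still hold tensors
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()              # return cached blocks to the driver
        torch.cuda.reset_peak_memory_stats()  # start a fresh peak measurement
    return gpu_memory_report()
```

Calling release_gpu_memory() before each retry and logging the returned dictionary would show whether a failed attempt actually released its memory.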

Test plan

Run an evaluation that triggers OOM and verify that the checkpoint is saved
Verify that the retry mechanism works with a smaller batch size
Verify that the checkpoint is cleaned up after a successful evaluation

🤖 Generated with Claude Code

Standardize train.py and prepare.py to fail with actionable CLI messages for unsupported CUDA environments and missing or corrupted setup artifacts, so common setup mistakes no longer surface as raw tracebacks or low-level file errors.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…aining loss
- Save pre-evaluation checkpoint before each evaluation
- Retry failed evaluations up to 3 times with exponential backoff
- Use smaller batch size on retry to avoid OOM
- Add GPU memory diagnostics for debugging
- Clear GPU memory and reset stats between retry attempts
- Clean up checkpoint file after successful evaluation
This prevents 5 minutes of training from being lost when evaluation crashes (OOM, CUDA errors, etc.) during unattended overnight experiments.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

sdfgafe and others added 2 commits March 9, 2026 23:10

teee32 wants to merge 2 commits into karpathy:master from teee32:fix/issue-7-pre-eval-checkpoint

2 participants
