Fix Issue #7: Save checkpoint before evaluation to prevent training loss #112

Open
teee32 wants to merge 2 commits into karpathy:master from teee32:fix/issue-7-pre-eval-checkpoint

teee32 commented Mar 10, 2026

Summary

Problem

During unattended overnight experiments, if evaluation crashes (OOM, CUDA errors, etc.) after the full 5-minute training budget has been spent, the entire training run is lost.

Solution

  • Save a checkpoint before each evaluation
  • Retry failed evaluations up to 3 times, each time with a smaller batch size
  • Wait progressively longer between retries (2s, 4s, 6s) so GPU memory can be released
  • Add GPU memory diagnostics for debugging
  • Automatically delete the checkpoint after a successful evaluation
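
The flow described in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code in this PR: the name `evaluate_with_retry`, its parameters, the use of `pickle` for the checkpoint, and halving the batch size on each retry are all assumptions (the description only says "smaller batch size"). A real `train.py` would presumably save the model with its framework's own serializer and also clear GPU memory between attempts.

```python
import os
import pickle
import time


def evaluate_with_retry(state, evaluate, batch_size, ckpt_path,
                        max_retries=3, wait_step_s=2.0, sleep=time.sleep):
    """Checkpoint `state`, then evaluate, retrying with smaller batches on failure.

    Hypothetical helper: `evaluate(state, batch_size)` is any callable that
    returns a metric or raises MemoryError / RuntimeError when it fails.
    """
    # Save first, so a crash during evaluation cannot lose the training result.
    # Write to a temp file and rename, so a partial write never looks complete.
    tmp_path = ckpt_path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, ckpt_path)

    last_error = None
    for attempt in range(max_retries + 1):  # one initial try plus the retries
        if attempt > 0:
            sleep(wait_step_s * attempt)          # 2s, 4s, 6s with the defaults
            batch_size = max(1, batch_size // 2)  # assumption: halve each retry
        try:
            result = evaluate(state, batch_size)
        except (MemoryError, RuntimeError) as err:
            last_error = err
            continue
        # Success: the checkpoint is no longer needed.
        os.remove(ckpt_path)
        return result

    # Every attempt failed: keep the checkpoint so the run can be recovered.
    raise RuntimeError(
        f"evaluation failed after {max_retries + 1} attempts; "
        f"checkpoint kept at {ckpt_path}"
    ) from last_error
```

Passing `sleep` as a parameter keeps the wait schedule testable without real delays. The checkpoint is removed only on the success path, so a run whose evaluation never succeeds still leaves a recoverable file behind.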

Changes

  • train.py: +217 lines (checkpoint utilities, retry logic, memory management)

Test plan

  • Run evaluation that triggers OOM and verify checkpoint is saved
  • Verify retry mechanism works with smaller batch size
  • Verify checkpoint is cleaned up after successful evaluation
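
One way to exercise this test plan without a GPU is to inject a fake evaluator that fails above a chosen batch size. The sketch below is an assumption-laden illustration rather than part of this PR: `SimulatedOOM`, `make_flaky_eval`, `check_test_plan`, and `toy_run_eval` are hypothetical names, and `toy_run_eval` is only a stand-in for the real retry logic so the harness has something to run against.

```python
import os
import tempfile


class SimulatedOOM(RuntimeError):
    """Stand-in for a CUDA out-of-memory error (hypothetical name)."""


def make_flaky_eval(max_ok_batch_size, ckpt_path, log):
    """Return an evaluator that fails above `max_ok_batch_size` and logs calls."""
    def evaluate(batch_size):
        # Record the batch size and whether a checkpoint existed at call time.
        log.append((batch_size, os.path.exists(ckpt_path)))
        if batch_size > max_ok_batch_size:
            raise SimulatedOOM(f"batch_size={batch_size} is too large")
        return 1.0  # dummy metric
    return evaluate


def check_test_plan(run_eval, ckpt_path):
    """Check the three test-plan items against `run_eval(evaluate, batch_size)`."""
    log = []
    metric = run_eval(make_flaky_eval(16, ckpt_path, log), 64)
    sizes = [bs for bs, _ in log]
    # 1. A checkpoint was on disk every time evaluation ran (including the OOM).
    assert all(saved for _, saved in log), "checkpoint missing during evaluation"
    # 2. At least one retry happened, and every retry used a smaller batch.
    assert len(sizes) > 1, "no retry happened"
    assert all(a > b for a, b in zip(sizes, sizes[1:])), "batch did not shrink"
    # 3. The checkpoint was cleaned up after the successful evaluation.
    assert not os.path.exists(ckpt_path), "checkpoint not cleaned up"
    return metric, sizes


def toy_run_eval(ckpt_path, max_retries=3):
    """Toy implementation under test, used only to show the harness passing."""
    def run_eval(evaluate, batch_size):
        with open(ckpt_path, "w") as f:
            f.write("checkpoint")
        for _ in range(max_retries + 1):
            try:
                metric = evaluate(batch_size)
            except SimulatedOOM:
                batch_size = max(1, batch_size // 2)
                continue
            os.remove(ckpt_path)
            return metric
        raise RuntimeError("evaluation kept failing; checkpoint kept")
    return run_eval


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "pre_eval.ckpt")
    print(check_test_plan(toy_run_eval(path), path))
```

Logging whether the checkpoint exists at each evaluator call checks the "saved before evaluation" item directly, instead of inferring it after the fact.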

🤖 Generated with Claude Code

sdfgafe and others added 2 commits March 9, 2026 23:10
Standardize train.py and prepare.py to fail with actionable CLI messages for unsupported CUDA environments and missing or corrupted setup artifacts, so common setup mistakes no longer surface as raw tracebacks or low-level file errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…aining loss

- Save pre-evaluation checkpoint before each evaluation
- Retry failed evaluations up to 3 times with exponential backoff
- Use smaller batch size on retry to avoid OOM
- Add GPU memory diagnostics for debugging
- Clear GPU memory and reset stats between retry attempts
- Clean up checkpoint file after successful evaluation

This prevents 5 minutes of training from being lost when evaluation
crashes (OOM, CUDA errors, etc.) during unattended overnight experiments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

2 participants