@finbarrtimbers
Collaborator

@finbarrtimbers finbarrtimbers commented Nov 15, 2025

Note

Extracts checkpoint saving into maybe_save_checkpoint and runs checkpointing and evaluation asynchronously after each step; updates docs with code style rules.

  • Training pipeline:
    • Extracts checkpoint logic into maybe_save_checkpoint (with Timer) and replaces inline checkpointing.
    • Runs maybe_save_checkpoint and maybe_evaluate via a thread pool, waits for both, then triggers weight sync.
    • Minor reordering: the weight sync trigger now fires only after both async tasks complete.
  • Docs:
    • Adds code style guidelines in CLAUDE.md (requires type annotations and Google-style docstrings).
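
A minimal sketch of the post-step flow described above, not the PR's actual code. The names maybe_save_checkpoint and maybe_evaluate come from the summary; their signatures, the shared log list, the save_freq/eval_freq parameters, and the helpers trigger_weight_sync and run_post_step are hypothetical, chosen only to make the ordering visible.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def maybe_save_checkpoint(step: int, save_freq: int, log: list[str]) -> bool:
    """Records a checkpoint every `save_freq` steps (hypothetical signature)."""
    if step % save_freq != 0:
        return False
    time.sleep(0.01)  # stand-in for slow checkpoint I/O
    log.append(f"checkpoint@{step}")
    return True


def maybe_evaluate(step: int, eval_freq: int, log: list[str]) -> bool:
    """Records an evaluation every `eval_freq` steps (hypothetical signature)."""
    if step % eval_freq != 0:
        return False
    time.sleep(0.01)  # stand-in for evaluation bookkeeping
    log.append(f"eval@{step}")
    return True


def trigger_weight_sync(step: int, log: list[str]) -> None:
    """Hypothetical placeholder for kicking off the weight sync."""
    log.append(f"weight_sync@{step}")


def run_post_step(
    executor: ThreadPoolExecutor,
    step: int,
    log: list[str],
    save_freq: int = 2,
    eval_freq: int = 2,
) -> None:
    """Runs checkpointing and evaluation concurrently, then triggers weight sync."""
    futures = [
        executor.submit(maybe_save_checkpoint, step, save_freq, log),
        executor.submit(maybe_evaluate, step, eval_freq, log),
    ]
    wait(futures)  # block until both tasks have finished
    for future in futures:
        future.result()  # re-raise any exception from a worker thread
    trigger_weight_sync(step, log)  # only after both tasks are done


if __name__ == "__main__":
    events: list[str] = []
    with ThreadPoolExecutor(max_workers=2) as pool:
        for step in range(1, 5):
            run_post_step(pool, step, events)
    print(events)
```

Waiting on both futures before the sync keeps the weight sync from overlapping with a checkpoint that is still reading parameters, which is the ordering the summary describes.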

Written by Cursor Bugbot for commit 74ff4e5.

@finbarrtimbers finbarrtimbers changed the base branch from main to pad-out-32b November 15, 2025 04:21
@finbarrtimbers finbarrtimbers marked this pull request as ready for review November 15, 2025 04:34
@finbarrtimbers finbarrtimbers changed the base branch from pad-out-32b to main November 15, 2025 04:43
Contributor

@mnoukhov mnoukhov left a comment

lgtm

@finbarrtimbers finbarrtimbers changed the title from "Adds a lock around accessing the params" to "Fixes the deadlock we get running the weight sync and checkpointing simultaneously" Nov 17, 2025
@finbarrtimbers finbarrtimbers changed the title from "Fixes the deadlock we get running the weight sync and checkpointing simultaneously" to "Makes checkpointing and evaluation bookkeeping run asynchronously" Nov 18, 2025
@finbarrtimbers finbarrtimbers marked this pull request as draft November 18, 2025 03:30