Summary:
- we don't increase `max_step` when a node is catching up because we don't call `should_commit`
- this can lead to the node always being behind and getting stuck in an infinite recovery loop
- note: this can result in the global parameters falling out of sync; the diff includes an RFC on how to fix that if we need to
- document another case where `should_commit` can return `True` when it shouldn't because the allreduce failed (this is also relevant only when there can be a pending in-flight allreduce)
- add an assert based on the fragment sync schedule to make sure we don't run into this
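
A toy model of the recovery loop described above, as a rough sketch only. `ToyNode`, `run_round`, and `rounds_spent_healing` are hypothetical names made up for illustration and are not the manager's actual API. The sketch assumes that a node counts as "catching up" whenever its step is below the largest step in the group, and that recovery copies that step from the most advanced peer:

```python
from dataclasses import dataclass


@dataclass
class ToyNode:
    """Hypothetical stand-in for a replica; not the real manager."""
    step: int = 0


def run_round(nodes, bump_step_while_healing):
    """Run one training round; return the nodes that had to recover."""
    max_step = max(n.step for n in nodes)
    healing = [n for n in nodes if n.step < max_step]
    for n in nodes:
        if n.step < max_step:
            # Assumed recovery behavior: copy the step from the most advanced peer.
            n.step = max_step
            if bump_step_while_healing:
                # The fix: also advance the step during the catch-up round.
                n.step += 1
        else:
            # Healthy node: the commit succeeds and the step advances.
            n.step += 1
    return healing


def rounds_spent_healing(bump_step_while_healing, rounds=5):
    """Count how many rounds the recovered node spends catching up."""
    healthy, recovered = ToyNode(step=10), ToyNode(step=0)
    count = 0
    for _ in range(rounds):
        if recovered in run_round([healthy, recovered], bump_step_while_healing):
            count += 1
    return count


if __name__ == "__main__":
    print("without bump:", rounds_spent_healing(False))
    print("with bump:", rounds_spent_healing(True))
```

Under these assumptions, the recovered node stays one step behind in every round unless its step is also advanced while it is catching up, which is the loop the summary describes.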
Test Plan:
- tested on a cluster of 3 nodes by removing and adding a node
- the `max_step` and `local_step` increase in the manager's state dict after both failure and recovery
metrics from the healthy node
<img width="1103" alt="Screenshot 2025-06-15 at 10 53 28 PM copy" src="https://github.com/user-attachments/assets/8640780c-fd20-4266-aa3c-3116776a9c68" />
metrics from the failed and recovered node
<img width="1101" alt="Screenshot 2025-06-15 at 10 56 49 PM copy" src="https://github.com/user-attachments/assets/cc2a1c57-715f-4e0a-8e00-7c62da525dc3" />