-
Notifications
You must be signed in to change notification settings - Fork 198
Avoid save_to_memory running too fast then causing save_to_storage to fail. #1585
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid save_to_memory running too fast then causing save_to_storage to fail. #1585
Conversation
Codecov Report❌ Patch coverage is ❌ Your patch status has failed because the patch coverage (13.04%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.
Additional details and impacted files@@ Coverage Diff @@
## master #1585 +/- ##
===========================================
- Coverage 84.03% 19.86% -64.17%
===========================================
Files 285 285
Lines 29847 29870 +23
===========================================
- Hits 25082 5934 -19148
- Misses 4765 23936 +19171 ☔ View full report in Codecov by Sentry. |
| create=False, | ||
| ) | ||
| self._notify_queue = SharedQueue( | ||
| name=CheckpointSharedObjPrefix.SAVE_STEP_QNAME |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
implement this magic concatenation as a method of ckpt_saver
| if self._local_rank != self.local_shard_id: | ||
| return False | ||
|
|
||
| if self._checkpoint_event_step > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this part into parent class(engine)
| """ | ||
| if self._checkpoint_event_step > 0: | ||
| notify_event = self._notify_queue.get() | ||
| assert notify_event.step == self._checkpoint_event_step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's ur intention if assert is false?
In my personal opinion, this situation should be handled softly, such as skipping this step and providing a log explanation. So the previous implementation that directly failed didn't have too many issues. At most, it was just not easy to understand why it was 'not equal' the log express.
|
If this part involves code changes, it is still necessary to make all the changes comprehensively, even though #1529. |
| self._event_queue.put(event) | ||
| self._checkpoint_event_step = step | ||
| if success: | ||
| self.latest_step = step |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All ranks should expect a notify to drain their local shard queue
if self._local_rank == 0 and success:
event = CheckpointEvent(type=CheckpointEventType.SAVE, step=step)
self._event_queue.put(event)
if success:
self._checkpoint_event_step = step
self.latest_step = step
What changes were proposed in this pull request?
If there is a save_to_storage operation before, save_to_men needs to wait for the last save_to_storage to complete.
Why are the changes needed?
If the frequency of save_to_memory is high enough that the next save_to_memory is triggered before save_to_stoarge is completed, save_to_stoarge will fail. The error is as follows:
Since it is recommended to use megatron checkpoint directly in #1529, the modifications related to megatron checkpoint are ignored here.
Does this PR introduce any user-facing change?
No
How was this patch tested?
unittest, real ddp job.