Track rollout-engine connection staleness on the weight updater #1444
Adds is_rollout_engines_fresh / mark_engine_connection_stale to both weight updaters (UpdateWeightFromTensor and UpdateWeightFromDistributed), backed by a _connection_stale flag that connect_rollout_engines clears. The Megatron actor re-runs the connect step when the engines are new or the connection has gone stale (an indep-dp reconfigure marks it stale), so weight updates are never broadcast over a dead engine group.
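For reviewers, the flag lifecycle described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: `SketchWeightUpdater`, `update_weights`, and the `connect_calls` counter are hypothetical names, and the signature and exact semantics of `is_rollout_engines_fresh` (here: "same engine list as last connect, and not marked stale") are assumptions.

```python
class SketchWeightUpdater:
    """Hypothetical stand-in for a weight updater; not the PR's actual class."""

    def __init__(self):
        self.rollout_engines = None     # engines from the last connect, if any
        self._connection_stale = False  # set by a reconfigure, cleared on connect
        self.connect_calls = 0          # illustration only: counts reconnects

    def connect_rollout_engines(self, rollout_engines):
        # (Re)build the engine group; a successful connect clears staleness.
        self.rollout_engines = list(rollout_engines)
        self._connection_stale = False
        self.connect_calls += 1

    def mark_engine_connection_stale(self):
        # Called after something invalidates the group (e.g. a reconfigure).
        self._connection_stale = True

    def is_rollout_engines_fresh(self, rollout_engines):
        # Assumed semantics: fresh means the same engines and not marked stale.
        return (
            not self._connection_stale
            and self.rollout_engines == list(rollout_engines)
        )


def update_weights(updater, rollout_engines):
    """Hypothetical actor-side step: reconnect first if the group is not fresh."""
    if not updater.is_rollout_engines_fresh(rollout_engines):
        updater.connect_rollout_engines(rollout_engines)
    # The actual weight broadcast would happen here, over a live group.
    return updater.connect_calls
```

Under these assumptions, calling `update_weights` twice with the same engines connects once; calling `mark_engine_connection_stale()` in between, or passing a different engine list, forces a reconnect before the next broadcast.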