Track rollout-engine connection staleness on the weight updater #1444

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure from tom/pr_chain/trainer_ft/dev_revert_reversed/track-rollout-engine-connection-staleness-on-the-weight-updater

Conversation

@fzyzcjy fzyzcjy commented Jun 22, 2026

Collaborator

Adds is_rollout_engines_fresh / mark_engine_connection_stale to both weight updaters (UpdateWeightFromTensor and UpdateWeightFromDistributed), tracking a _connection_stale flag cleared on connect_rollout_engines. The Megatron actor re-runs the connect step when engines are new or the connection went stale (after an indep-dp reconfigure marks it stale), so weight updates never broadcast over a dead engine group.
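
The flag lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the class `StalenessTrackingUpdater`, the helper `update_weights`, the `engines_are_new` parameter, and the `connect_calls` counter are hypothetical, and the method signatures are assumptions, since the diff is not shown here. It also assumes `is_rollout_engines_fresh` simply reports the negation of `_connection_stale`.

```python
class StalenessTrackingUpdater:
    """Hypothetical stand-in for a weight updater; not the PR's actual class."""

    def __init__(self) -> None:
        # Assumption: "stale" describes a connection that was invalidated
        # after being established, so a new updater starts out not stale.
        self._connection_stale = False
        self.rollout_engines = []
        self.connect_calls = 0  # illustration only: counts (re)connects

    def connect_rollout_engines(self, rollout_engines) -> None:
        # (Re)establish the engine group, then clear the stale flag.
        self.rollout_engines = list(rollout_engines)
        self.connect_calls += 1
        self._connection_stale = False

    def mark_engine_connection_stale(self) -> None:
        # Called when something (e.g. a reconfigure) invalidates the group.
        self._connection_stale = True

    def is_rollout_engines_fresh(self) -> bool:
        return not self._connection_stale


def update_weights(updater, rollout_engines, engines_are_new: bool) -> None:
    """Hypothetical actor-side step: reconnect before updating if needed."""
    if engines_are_new or not updater.is_rollout_engines_fresh():
        updater.connect_rollout_engines(rollout_engines)
    # The actual weight broadcast would happen here, over a live group.


updater = StalenessTrackingUpdater()
update_weights(updater, ["engine-0"], engines_are_new=True)   # connects
update_weights(updater, ["engine-0"], engines_are_new=False)  # reuses connection
updater.mark_engine_connection_stale()
update_weights(updater, ["engine-0"], engines_are_new=False)  # reconnects
```

Keeping the flag on the updater means the actor only needs one check before each update, and whatever invalidates the group only needs to call `mark_engine_connection_stale`.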


@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from 2dc18f2 to b7af66c Compare June 23, 2026 07:51
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-rollout-engine-connection-staleness-on-the-weight-updater branch from 7ea403c to 262aafe Compare June 23, 2026 07:51
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from b7af66c to 05b41e1 Compare June 23, 2026 09:29
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-rollout-engine-connection-staleness-on-the-weight-updater branch from 262aafe to 9026454 Compare June 23, 2026 09:29
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from 05b41e1 to ef74253 Compare June 23, 2026 13:33
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-rollout-engine-connection-staleness-on-the-weight-updater branch from 9026454 to 5a01762 Compare June 23, 2026 13:33