Add mini FT controller #1451

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/wire-ft-event-logging-and-component-gating-into-rolloutmanager from tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller

Conversation

fzyzcjy (Collaborator) commented Jun 22, 2026

Adds the in-process mini fault-tolerance controller (mini_ft_controller.py: _MiniFTController / _MiniFTControllerRunner / maybe_start_mini_ft_controller) that periodically polls per-cell health snapshots and drives cell healing/restart decisions without an external operator. Includes its fast unit test. Wiring into the train entrypoints and the args validation are handled separately.
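
For readers who have not opened the diff, the poll-then-heal pattern described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in mini_ft_controller.py: `CellHealth`, `PollingControllerSketch`, `RunnerSketch`, the `get_snapshots` / `heal_cell` callbacks, and the consecutive-failure threshold are all hypothetical names and choices made up for this example.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CellHealth:
    """Hypothetical per-cell health snapshot (not the PR's actual type)."""
    cell_id: int
    healthy: bool


class PollingControllerSketch:
    """Illustrative stand-in for a mini FT controller; all names are assumptions."""

    def __init__(
        self,
        get_snapshots: Callable[[], List[CellHealth]],
        heal_cell: Callable[[int], None],
        unhealthy_threshold: int = 2,  # assumed policy: heal after N bad polls in a row
    ) -> None:
        self._get_snapshots = get_snapshots
        self._heal_cell = heal_cell
        self._threshold = unhealthy_threshold
        self._strikes: Dict[int, int] = {}

    def tick(self) -> List[int]:
        """Run one poll and return the ids of cells healed during it."""
        healed: List[int] = []
        for snap in self._get_snapshots():
            if snap.healthy:
                # A healthy poll resets the consecutive-failure count.
                self._strikes.pop(snap.cell_id, None)
                continue
            strikes = self._strikes.get(snap.cell_id, 0) + 1
            if strikes >= self._threshold:
                self._heal_cell(snap.cell_id)
                healed.append(snap.cell_id)
                self._strikes.pop(snap.cell_id, None)
            else:
                self._strikes[snap.cell_id] = strikes
        return healed


class RunnerSketch:
    """Calls controller.tick() on a background thread at a fixed interval."""

    def __init__(self, controller: PollingControllerSketch, interval_s: float) -> None:
        self._controller = controller
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _loop(self) -> None:
        # Event.wait returns False on timeout, so the loop ticks once per
        # interval and exits promptly once stop() sets the event.
        while not self._stop.wait(self._interval_s):
            self._controller.tick()
```

Keeping the decision logic in a synchronous `tick()` and the timing in a separate runner is one way to make such a controller unit-testable without sleeping; whether the PR splits `_MiniFTController` and `_MiniFTControllerRunner` for the same reason is a guess.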

gemini-code-assist (Contributor) commented

Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-ft-event-logging-and-component-gating-into-rolloutmanager branch from d2a436a to 96eaae8 on June 23, 2026 07:51
fzyzcjy requested a review from yushengsu-thu as a code owner on June 23, 2026 07:51
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from ff00e39 to 238314a on June 23, 2026 07:52
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-ft-event-logging-and-component-gating-into-rolloutmanager branch from 96eaae8 to 7b4a598 on June 23, 2026 09:30
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from 238314a to aca0030 on June 23, 2026 09:30
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-ft-event-logging-and-component-gating-into-rolloutmanager branch from 7b4a598 to 15244be on June 23, 2026 13:34
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from aca0030 to 8432589 on June 23, 2026 13:34