Wire per-cell heartbeat health monitoring into the train group#1442

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/track-a-per-actor-heartbeat-and-expose-it-via-rpc from tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group

Conversation

@fzyzcjy fzyzcjy commented Jun 22, 2026

Collaborator

When running multiple independent-DP cells, the train group attaches a real SimpleHealthChecker (built from trainer_heartbeat_checker_* args via create_trainer_cell_health_checker) to each cell instead of the NoopHealthChecker, and exposes cell.cell_status() so an external FT controller can observe per-cell liveness. Single-cell runs keep the noop checker. Includes the heartbeat monitor tests. The cell-status/health-checker helpers and the actor-side heartbeat RPC are added separately.
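The selection rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the two checker classes are simplified stand-ins that only share names with the real ones, and `Cell`, `build_cells`, `is_healthy`, the `timeout_s` parameter, and the string returned by `cell_status()` are all hypothetical. The actual `create_trainer_cell_health_checker` helper and the `trainer_heartbeat_checker_*` arguments are not reproduced here.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Protocol


class HealthChecker(Protocol):
    def is_healthy(self) -> bool: ...


class NoopHealthChecker:
    """Stand-in for the noop checker: always reports healthy."""

    def is_healthy(self) -> bool:
        return True


class SimpleHealthChecker:
    """Stand-in: healthy while the last heartbeat is at most timeout_s old."""

    def __init__(
        self,
        last_heartbeat: Callable[[], float],  # assumed: returns last heartbeat time
        timeout_s: float,                     # assumed: staleness threshold
        now: Callable[[], float] = time.monotonic,
    ) -> None:
        self._last_heartbeat = last_heartbeat
        self._timeout_s = timeout_s
        self._now = now

    def is_healthy(self) -> bool:
        return self._now() - self._last_heartbeat() <= self._timeout_s


@dataclass
class Cell:
    """Hypothetical cell wrapper; the real cell type is not shown in the PR text."""

    index: int
    health_checker: HealthChecker

    def cell_status(self) -> str:
        # Assumed return shape: a plain string an external controller can poll.
        return "healthy" if self.health_checker.is_healthy() else "unhealthy"


def build_cells(
    last_heartbeats: List[Callable[[], float]],
    timeout_s: float,
    now: Callable[[], float] = time.monotonic,
) -> List[Cell]:
    """Attach a real checker per cell only when there is more than one cell."""
    multi_cell = len(last_heartbeats) > 1
    cells: List[Cell] = []
    for i, heartbeat in enumerate(last_heartbeats):
        checker: HealthChecker = (
            SimpleHealthChecker(heartbeat, timeout_s, now)
            if multi_cell
            else NoopHealthChecker()
        )
        cells.append(Cell(i, checker))
    return cells
```

With an injected clock, a controller-side poll would amount to calling `cell.cell_status()` on each element of `build_cells(...)`. Injecting `now` is a choice made for this sketch so the staleness check can be exercised without sleeping.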

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-a-per-actor-heartbeat-and-expose-it-via-rpc branch from 4c77ff5 to f5c3ed4 (June 23, 2026 07:51)
fzyzcjy requested a review from yushengsu-thu as a code owner (June 23, 2026 07:51)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from 67f82c7 to 9ab1b55 (June 23, 2026 07:51)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-a-per-actor-heartbeat-and-expose-it-via-rpc branch from f5c3ed4 to ea48128 (June 23, 2026 09:29)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from 9ab1b55 to fe607f9 (June 23, 2026 09:29)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/track-a-per-actor-heartbeat-and-expose-it-via-rpc branch from ea48128 to ad91bd9 (June 23, 2026 13:33)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from fe607f9 to 5cd96c4 (June 23, 2026 13:33)