Wire per-cell heartbeat health monitoring into the train group #1442
When running multiple independent-DP cells, the train group now attaches a real SimpleHealthChecker (built from the trainer_heartbeat_checker_* args via create_trainer_cell_health_checker) to each cell in place of the NoopHealthChecker, and exposes cell.cell_status() so that an external FT controller can observe per-cell liveness. Single-cell runs keep the noop checker. This PR also includes the heartbeat monitor tests. The cell-status/health-checker helpers and the actor-side heartbeat RPC are added separately.
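For reviewers who want a mental model of the wiring, here is a minimal, self-contained sketch. It is not the code in this PR: the class and function names are taken from the description above, but every signature, the `trainer_heartbeat_checker_timeout_s` field, the `record_heartbeat`/`is_healthy` methods, the `build_cells` helper, and the shape of the `cell_status()` result are assumptions made for illustration only.

```python
import time
from dataclasses import dataclass
from typing import Callable


class NoopHealthChecker:
    """Always reports healthy; used here for single-cell runs."""

    def record_heartbeat(self) -> None:
        pass

    def is_healthy(self) -> bool:
        return True


class SimpleHealthChecker:
    """Hypothetical checker: healthy while the last heartbeat is recent enough."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()  # treat construction time as the first beat

    def record_heartbeat(self) -> None:
        self._last_beat = self._clock()

    def is_healthy(self) -> bool:
        return (self._clock() - self._last_beat) <= self._timeout_s


@dataclass
class Args:
    # Assumed stand-in for the trainer_heartbeat_checker_* args.
    trainer_heartbeat_checker_timeout_s: float = 30.0


def create_trainer_cell_health_checker(args: Args, clock=time.monotonic):
    # Assumed signature; the real helper is added in a separate change.
    return SimpleHealthChecker(args.trainer_heartbeat_checker_timeout_s, clock)


class Cell:
    def __init__(self, index: int, health_checker):
        self.index = index
        self.health_checker = health_checker

    def cell_status(self) -> dict:
        # Assumed return shape; an external FT controller would poll this.
        return {"cell": self.index, "healthy": self.health_checker.is_healthy()}


def build_cells(num_cells: int, args: Args, clock=time.monotonic) -> list:
    """Attach a real checker per cell only when there are multiple cells."""
    cells = []
    for i in range(num_cells):
        if num_cells > 1:
            checker = create_trainer_cell_health_checker(args, clock)
        else:
            checker = NoopHealthChecker()
        cells.append(Cell(i, checker))
    return cells
```

In this sketch the clock is injectable so that a test can advance time deterministically instead of sleeping. Whether the actual heartbeat monitor tests do the same is not stated in the description.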