Kill failed cells immediately on execute failure #1443

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group from tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure


Conversation

fzyzcjy (Collaborator) commented Jun 22, 2026

Changes the per-cell failure contract: when an actor's execute() raises, the cell is not only marked errored but also stopped and confirmed dead (ray.kill + probe) via the kill_on_failure path in RayTrainCell._execute_raw. Renames mark_errored_on_failure to kill_on_failure, makes stop_and_confirm_dead private, and drops the group's separate _kill_errored_cells_and_confirm_dead pre-reconfigure step, which is now redundant because failed cells kill themselves. Health-check probes pass kill_on_failure=False so that a stale heartbeat does not kill the cell. Updates the error-isolation tests accordingly.
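
For readers without the diff open, here is a minimal sketch of the contract described above. It is not the code from this PR: `FakeActor`, `Cell`, `CellState`, and the `"ping"` probe are hypothetical stand-ins, and the real implementation uses ray.kill on a Ray actor handle. It also assumes that kill_on_failure=False leaves the cell state untouched, which the description implies but does not state.

```python
from enum import Enum


class CellState(Enum):
    RUNNING = "running"
    ERRORED = "errored"
    DEAD = "dead"


class FakeActor:
    """Hypothetical stand-in for a Ray actor handle."""

    def __init__(self, fail=False):
        self.fail = fail
        self.alive = True

    def execute(self, method):
        if not self.alive:
            raise RuntimeError("actor is dead")
        if self.fail:
            raise RuntimeError(f"{method} failed")
        return f"{method} ok"


class Cell:
    """Hypothetical cell wrapper illustrating the kill_on_failure contract."""

    def __init__(self, actor):
        self.actor = actor
        self.state = CellState.RUNNING

    def _stop_and_confirm_dead(self):
        # Stands in for ray.kill on the real actor handle.
        self.actor.alive = False
        # Probe: the cell counts as dead only once the actor stops responding.
        try:
            self.actor.execute("ping")
        except RuntimeError:
            self.state = CellState.DEAD
        else:
            raise AssertionError("actor still responds after kill")

    def execute(self, method, kill_on_failure=True):
        try:
            return self.actor.execute(method)
        except Exception:
            if kill_on_failure:
                # Mark errored, then stop and confirm dead before re-raising.
                self.state = CellState.ERRORED
                self._stop_and_confirm_dead()
            raise


# A failing execute() kills the cell; a failing health probe does not.
cell = Cell(FakeActor(fail=True))
try:
    cell.execute("train_step")
except RuntimeError:
    pass

probe_cell = Cell(FakeActor(fail=True))
try:
    probe_cell.execute("heartbeat", kill_on_failure=False)
except RuntimeError:
    pass
```

Under this shape, any cell that has failed is already dead by the time the group reconfigures, which is why a separate group-level kill pass would have nothing left to do.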


fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from 67f82c7 to 9ab1b55 on June 23, 2026 07:51
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from 2dc18f2 to b7af66c on June 23, 2026 07:51
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from 9ab1b55 to fe607f9 on June 23, 2026 09:29
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from b7af66c to 05b41e1 on June 23, 2026 09:29
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/wire-per-cell-heartbeat-health-monitoring-into-the-train-group branch from fe607f9 to 5cd96c4 on June 23, 2026 13:33
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/kill-failed-cells-immediately-on-execute-failure branch from 05b41e1 to ef74253 on June 23, 2026 13:33