Add HTTP control server for cell suspend/resume and fault injection #1452

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller from tom/pr_chain/trainer_ft/dev_revert_reversed/add-http-control-server-for-cell-suspend-resume-and-fault-injection

@fzyzcjy (Collaborator) commented Jun 22, 2026

Adds the FastAPI control server (server.py: _create_control_app / start_control_server), which exposes a k8s-style cell API. The server is backed by a cell registry (registry.py: _CellRegistry) and per-cell handles (handles.py: _ActorCellHandle / _RolloutCellHandle, which wrap the RayTrainGroup and the rollout manager respectively), so that an external operator (or the e2e fault injector) can list cells, suspend and resume them, and inject faults. Includes fast unit tests for the server, registry, and handles. The Pydantic API models live in the separate control_server/models module.
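
For readers who have not seen the diff, the registry-plus-handles layering described above might look roughly like the sketch below. This is not the PR's code: SketchCellRegistry, SketchCellHandle, CellPhase, the method names, and the "kill" fault string are all hypothetical stand-ins chosen for illustration, and the real handles delegate to the RayTrainGroup and rollout manager instead of flipping an in-memory flag. The HTTP layer is omitted; a route handler would presumably look up the handle in the registry and call one of these methods.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class CellPhase(str, Enum):
    """Hypothetical lifecycle phases for a cell."""
    RUNNING = "Running"
    SUSPENDED = "Suspended"


@dataclass
class SketchCellHandle:
    """Hypothetical per-cell handle; a real one would wrap a training
    group or rollout manager rather than just record state."""
    name: str
    kind: str  # assumed values: "actor" or "rollout"
    phase: CellPhase = CellPhase.RUNNING
    injected_faults: list[str] = field(default_factory=list)

    def suspend(self) -> None:
        self.phase = CellPhase.SUSPENDED

    def resume(self) -> None:
        self.phase = CellPhase.RUNNING

    def inject_fault(self, fault: str) -> None:
        # Only records the request; a real handle would trigger the fault.
        self.injected_faults.append(fault)


class SketchCellRegistry:
    """Hypothetical name-to-handle registry the server would consult."""

    def __init__(self) -> None:
        self._cells: dict[str, SketchCellHandle] = {}

    def register(self, handle: SketchCellHandle) -> None:
        if handle.name in self._cells:
            raise ValueError(f"cell already registered: {handle.name}")
        self._cells[handle.name] = handle

    def get(self, name: str) -> SketchCellHandle:
        # An HTTP layer would likely translate this KeyError into a 404.
        if name not in self._cells:
            raise KeyError(f"unknown cell: {name}")
        return self._cells[name]

    def list_cells(self) -> list[dict[str, str]]:
        # Cells are returned in registration order.
        return [
            {"name": h.name, "kind": h.kind, "phase": h.phase.value}
            for h in self._cells.values()
        ]


# Usage: register two cells, suspend one, inject a fault into the other.
registry = SketchCellRegistry()
registry.register(SketchCellHandle(name="actor-0", kind="actor"))
registry.register(SketchCellHandle(name="rollout-0", kind="rollout"))
registry.get("actor-0").suspend()
registry.get("rollout-0").inject_fault("kill")
cells = registry.list_cells()
```

Keeping the registry free of any HTTP concerns is one plausible reason the PR can cover it with fast unit tests that never start a server.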

@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from ff00e39 to 238314a on June 23, 2026 07:52
@fzyzcjy requested a review from yushengsu-thu as a code owner on June 23, 2026 07:52
@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-http-control-server-for-cell-suspend-resume-and-fault-injection branch from 4ec76f7 to 393270c on June 23, 2026 07:52
@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from 238314a to aca0030 on June 23, 2026 09:30
@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-http-control-server-for-cell-suspend-resume-and-fault-injection branch from 393270c to 15e1254 on June 23, 2026 09:30
@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-mini-ft-controller branch from aca0030 to 8432589 on June 23, 2026 13:34
@fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-http-control-server-for-cell-suspend-resume-and-fault-injection branch from 15e1254 to 96f28d9 on June 23, 2026 13:34