Log train-group step-end and analysis events #1447

Open
fzyzcjy wants to merge 1 commit into
tom/pr_chain/trainer_ft/dev_revert_reversed/inject-witness-ids-into-the-megatron-forward-and-train-step
from
tom/pr_chain/trainer_ft/dev_revert_reversed/log-train-group-step-end-and-analysis-events

Conversation

@fzyzcjy fzyzcjy commented Jun 22, 2026

Collaborator

Adds observability to RayTrainGroup: run the event analyzer at the start of each train() rollout, emit a TrainGroupStepEndEvent with per-cell outcomes after each attempt, and emit an InferenceEngineWeightChecksumEvent after update_weights (collected via rollout_manager.check_weights). Includes the matching unit tests. The witness-id and cell-reconfigure events are not part of this change; they are added together with their own features.
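
To make the event flow described above concrete, here is a minimal, self-contained sketch. It is not the code from this PR: only the names RayTrainGroup, TrainGroupStepEndEvent, InferenceEngineWeightChecksumEvent, train(), update_weights and rollout_manager.check_weights come from the description. The event fields, the `EventLog` class, the `SketchTrainGroup` class, the analyzer callback signature, and the retry loop are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class TrainGroupStepEndEvent:
    rollout_id: int
    attempt: int
    cell_outcomes: dict  # assumed shape: cell index -> "ok" or "failed: <error type>"


@dataclass
class InferenceEngineWeightChecksumEvent:
    rollout_id: int
    checksums: dict  # assumed shape: engine name -> checksum string


class EventLog:
    """Hypothetical in-memory sink; the real logger is not shown in the PR text."""

    def __init__(self) -> None:
        self.events: list = []

    def emit(self, event: Any) -> None:
        self.events.append(event)


class SketchTrainGroup:
    """Hypothetical stand-in for RayTrainGroup, reduced to the logging flow."""

    def __init__(self, cells, rollout_manager, event_log, analyzer) -> None:
        self.cells = cells                      # callables standing in for train cells
        self.rollout_manager = rollout_manager  # must expose check_weights()
        self.event_log = event_log
        self.analyzer = analyzer                # callable taking the list of past events

    def train(self, rollout_id: int, max_attempts: int = 2) -> bool:
        # 1. Run the event analyzer at the start of the rollout.
        self.analyzer(list(self.event_log.events))

        # 2. After each attempt, emit a step-end event with per-cell outcomes.
        for attempt in range(max_attempts):
            outcomes = {}
            for index, cell in enumerate(self.cells):
                try:
                    cell(rollout_id)
                    outcomes[index] = "ok"
                except Exception as exc:
                    outcomes[index] = f"failed: {type(exc).__name__}"
            self.event_log.emit(TrainGroupStepEndEvent(rollout_id, attempt, outcomes))
            if all(outcome == "ok" for outcome in outcomes.values()):
                return True
        return False

    def update_weights(self, rollout_id: int) -> None:
        # The actual weight transfer is elided; only the logging step is sketched.
        checksums = self.rollout_manager.check_weights()
        self.event_log.emit(InferenceEngineWeightChecksumEvent(rollout_id, checksums))
```

In this sketch a failed attempt still produces a step-end event, so the log records which cell failed before any retry. Whether the real implementation retries in this way is not stated in the description.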

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/inject-witness-ids-into-the-megatron-forward-and-train-step branch from 5f53d76 to 3452825 on June 23, 2026 07:51
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/log-train-group-step-end-and-analysis-events branch from 3ea44a1 to 5e07ca5 on June 23, 2026 07:51
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/inject-witness-ids-into-the-megatron-forward-and-train-step branch from 3452825 to fde57ba on June 23, 2026 09:29
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/log-train-group-step-end-and-analysis-events branch from 5e07ca5 to 2afae2a on June 23, 2026 09:30
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/inject-witness-ids-into-the-megatron-forward-and-train-step branch from fde57ba to a19935b on June 23, 2026 13:34
@fzyzcjy fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/log-train-group-step-end-and-analysis-events branch from 2afae2a to f8190b4 on June 23, 2026 13:34