Add opt-in periodic py-spy dumper for hang debugging #1461

Open
fzyzcjy wants to merge 1 commit into tom/pr_chain/trainer_ft/dev_revert_reversed/add-ft-random-and-realistic-gsm8k-e2e-scenarios-with-periodic-fault from tom/pr_chain/trainer_ft/dev_revert_reversed/add-opt-in-periodic-py-spy-dumper-for-hang-debugging

fzyzcjy (Collaborator) commented Jun 22, 2026

Add miles/utils/debug_utils/periodic_py_spy.py and start it from the train
entrypoints (train.py / train_async.py). Opt-in via MILES_DEBUG_PYSPY_DUMP_INTERVAL
(no-op when unset); a background thread periodically py-spy-dumps the
python/ray/sglang processes. Independent of fault tolerance.
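
The opt-in behaviour described above could be sketched roughly as follows. This is an illustration only, not the contents of periodic_py_spy.py: the helper names (`read_interval`, `dump_once`, `start_periodic_dumper`), the `list_pids` callback, and the choice to treat the interval as seconds are all assumptions. Only the environment variable name and the use of `py-spy dump` come from the description.

```python
import os
import subprocess
import threading

# Variable name taken from the PR description; treating its value as seconds is an assumption.
ENV_VAR = "MILES_DEBUG_PYSPY_DUMP_INTERVAL"


def read_interval(environ=os.environ):
    """Return the dump interval, or None when the variable is unset (feature off)."""
    raw = environ.get(ENV_VAR, "").strip()
    if not raw:
        return None
    interval = float(raw)
    if interval <= 0:
        raise ValueError(f"{ENV_VAR} must be positive, got {raw!r}")
    return interval


def dump_once(pids, run=subprocess.run):
    """Run `py-spy dump --pid <pid>` for each pid and collect the text output."""
    outputs = {}
    for pid in pids:
        try:
            result = run(
                ["py-spy", "dump", "--pid", str(pid)],
                capture_output=True,
                text=True,
                timeout=30,
            )
            outputs[pid] = result.stdout or result.stderr
        except (OSError, subprocess.SubprocessError) as exc:
            # A missing py-spy binary or a timeout should never break training.
            outputs[pid] = f"py-spy failed: {exc}"
    return outputs


def start_periodic_dumper(list_pids, environ=os.environ, sink=print):
    """Start a daemon thread that dumps periodically; return None when disabled.

    `list_pids` is a hypothetical callback returning the pids to inspect.
    """
    interval = read_interval(environ)
    if interval is None:
        return None  # no-op when the variable is unset

    stop = threading.Event()

    def loop():
        # Event.wait returns False on timeout, so this runs once per interval
        # until stop.set() is called.
        while not stop.wait(interval):
            for pid, text in dump_once(list_pids()).items():
                sink(f"[py-spy pid={pid}]\n{text}")

    threading.Thread(target=loop, name="periodic-py-spy", daemon=True).start()
    return stop
```

In this sketch an entrypoint would call `start_periodic_dumper(...)` once at startup. Because the thread is a daemon and every py-spy failure is caught, a hung or missing py-spy cannot keep the training process alive or crash it.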

fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-ft-random-and-realistic-gsm8k-e2e-scenarios-with-periodic-fault branch from b7e6ecd to 2d279bb (June 23, 2026 07:52)
fzyzcjy requested a review from yushengsu-thu as a code owner (June 23, 2026 07:52)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-opt-in-periodic-py-spy-dumper-for-hang-debugging branch from 3fa4cfe to 581446a (June 23, 2026 07:52)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-ft-random-and-realistic-gsm8k-e2e-scenarios-with-periodic-fault branch from 2d279bb to d360f8b (June 23, 2026 09:31)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-opt-in-periodic-py-spy-dumper-for-hang-debugging branch from 581446a to eac9c88 (June 23, 2026 09:31)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-ft-random-and-realistic-gsm8k-e2e-scenarios-with-periodic-fault branch from d360f8b to 505352b (June 23, 2026 13:35)
fzyzcjy force-pushed the tom/pr_chain/trainer_ft/dev_revert_reversed/add-opt-in-periodic-py-spy-dumper-for-hang-debugging branch from eac9c88 to b05a02e (June 23, 2026 13:35)