-
Notifications
You must be signed in to change notification settings - Fork 30
Add timeout to distributed torch tests to fail on hang #596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 1 commit
c3bcdd9
143a50e
d91dafe
7a883da
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,60 @@ | ||||||||||||||||||||||||||||||
| # Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. | ||||||||||||||||||||||||||||||
| # | ||||||||||||||||||||||||||||||
| # See LICENSE for license information. | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| import os, signal, subprocess | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| def run_proctree_with_timeout(cmd, timeout, **kwargs): | ||||||||||||||||||||||||||||||
| """Run a command in a subprocess and check for errors.""" | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| if timeout is None: | ||||||||||||||||||||||||||||||
| return subprocess.run(cmd, **kwargs) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| if "timeout" in kwargs: | ||||||||||||||||||||||||||||||
| raise ValueError("Timeout should be passed as a separate argument, not in kwargs") | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| capture_output = kwargs.pop("capture_output", False) | ||||||||||||||||||||||||||||||
| if capture_output: | ||||||||||||||||||||||||||||||
| kwargs["stdout"] = subprocess.PIPE | ||||||||||||||||||||||||||||||
| kwargs["stderr"] = subprocess.PIPE | ||||||||||||||||||||||||||||||
| else: | ||||||||||||||||||||||||||||||
| stdout, stderr = None, None | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| check = kwargs.pop("check", False) | ||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| kwargs["start_new_session"] = True # To use killpg as termination fallback | ||||||||||||||||||||||||||||||
| p = subprocess.Popen(cmd, **kwargs) | ||||||||||||||||||||||||||||||
| try: | ||||||||||||||||||||||||||||||
| if capture_output: | ||||||||||||||||||||||||||||||
| stdout, stderr = p.communicate(timeout=timeout) | ||||||||||||||||||||||||||||||
| else: | ||||||||||||||||||||||||||||||
| p.wait(timeout=timeout) | ||||||||||||||||||||||||||||||
| except subprocess.TimeoutExpired: | ||||||||||||||||||||||||||||||
| p.terminate() | ||||||||||||||||||||||||||||||
| try: | ||||||||||||||||||||||||||||||
| # Give the process time to terminate gracefully | ||||||||||||||||||||||||||||||
| if capture_output: | ||||||||||||||||||||||||||||||
| stdout, stderr = p.communicate(timeout=timeout) | ||||||||||||||||||||||||||||||
| else: | ||||||||||||||||||||||||||||||
| p.wait(timeout=timeout) | ||||||||||||||||||||||||||||||
| except subprocess.TimeoutExpired: | ||||||||||||||||||||||||||||||
| os.killpg(p.pid, signal.SIGKILL) | ||||||||||||||||||||||||||||||
| if capture_output: | ||||||||||||||||||||||||||||||
| stdout, stderr = p.communicate() | ||||||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The PR description states the goal is to "kill all child processes on timeout", but Two suggestions:
Secondary concern: swallowing |
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| # Handle check=True | ||||||||||||||||||||||||||||||
| if check and p.returncode != 0: | ||||||||||||||||||||||||||||||
| raise subprocess.CalledProcessError( | ||||||||||||||||||||||||||||||
| cmd, | ||||||||||||||||||||||||||||||
| kwargs.get("args", None), | ||||||||||||||||||||||||||||||
| output=stdout, | ||||||||||||||||||||||||||||||
| stderr=stderr | ||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The
Suggested change
|
||||||||||||||||||||||||||||||
|
|
||||||||||||||||||||||||||||||
| return subprocess.CompletedProcess( | ||||||||||||||||||||||||||||||
| cmd, | ||||||||||||||||||||||||||||||
| p.returncode, | ||||||||||||||||||||||||||||||
| stdout, | ||||||||||||||||||||||||||||||
| stderr | ||||||||||||||||||||||||||||||
| ) | ||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of using
run_proctree_with_timeout, could we just use coreutils'timeoutcommand?This could perhaps even become a pytest fixture:
This would have a few advantages:
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is brilliant idea. It can be done even more simple by modifying command line right in tests. I've modified the PR. The first CI run is at https://github.com/ROCm/TransformerEngine/actions/runs/26611894136/job/78425262492
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
There are no timeout-related messages in the distributed-test logs at https://github.com/ROCm/TransformerEngine/actions/runs/26611894136/job/78425262492 as far as I can tell, and the distributed tests passed, so the mechanism was not exercised in that run. Not sure if you want to test this further within this PR, but either way, good to go from my side.