Allow using fresh interpreter besides fork() in Edge Worker #65943

Open · diogosilva30 wants to merge 13 commits into apache:main from diogosilva30:fix/edge3-fork-deadlock-subprocess
+333
−66
13 commits
27bb264 fix(edge3): replace fork() with subprocess.Popen to prevent deadlocks…
fbaa062 Fix test_worker.py to use multiprocessing.Process instead of subproce…
b8a7ab3 Honor fresh interpreter mode in Edge worker
7378a39 Clarify Edge worker task process handling
1cf3899 Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
72a25a1 Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
dcb4f2b Improve Edge worker subprocess failure handling
617b713 Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
3527b7d Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
29e017c Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
0bceddf fix: rollback supervise changes & fix tests related to display_name
2ca14c9 Merge branch 'main' into fix/edge3-fork-deadlock-subprocess
cff85d2 Merge branch 'main' into fix/edge3-fork-deadlock-subprocess

All commits by diogosilva30.
Why not redirect stderr to the normal logger/stdout?
Good question. I used a temp file because stderr is the only parent-visible diagnostic channel for the fresh-interpreter path, and we want those diagnostics attached to the task that failed.

In the fork path, the child can return an exception object through the multiprocessing result queue. In the subprocess path, the child is a separate Python interpreter running `execute_workload`, so it cannot send that Python exception object back to the Edge worker. If something fails early (especially during workload parsing, supervisor startup, plugin import, or DAG import), stderr is what preserves the traceback.

We could pass `sys.__stderr__` like Celery does, but then output from all concurrently running task subprocesses would share the Edge worker's stderr. That means a traceback could end up only in the worker/container log, potentially interleaved with other task subprocesses and worker logs, and not attached to the failed task's log.

The temp file is a per-task spool: it avoids `subprocess.PIPE` (which can deadlock if the parent does not continuously drain it), keeps stderr attributable to the specific task subprocess, and lets us push those startup diagnostics into the task log via `logs_push` after the process exits.
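The per-task spool described above can be sketched as follows. This is a minimal illustration, not the actual Edge worker code: `run_task_subprocess` is a hypothetical helper, and the comment about `logs_push` only mirrors the discussion.

```python
import subprocess
import sys
import tempfile


def run_task_subprocess(argv: list[str]) -> tuple[int, str]:
    """Run a task in a fresh interpreter, spooling its stderr to a temp file."""
    with tempfile.NamedTemporaryFile(mode="w+", suffix=".stderr") as spool:
        # Pointing stderr at a regular file avoids subprocess.PIPE, which can
        # deadlock once the OS pipe buffer fills if the parent never drains it.
        proc = subprocess.Popen(argv, stderr=spool)
        returncode = proc.wait()
        spool.seek(0)
        # In the worker, these diagnostics would be pushed into the task log
        # (e.g. via logs_push) after the process exits.
        diagnostics = spool.read()
    return returncode, diagnostics
```

Because the file is per-task, concurrent task subprocesses never interleave their output, and an early bootstrap traceback is still captured even if the child dies before any richer channel is set up.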
Okay, sounds reasonable.
Still, could the stderr then be sent to the message queue, just as plain text? The Edge Worker checks for an Exception, but otherwise it should also be able to accept text as a string (like the "OK" that is being sent?).
The Queue approach works in the fork path because the child inherits the multiprocessing state, including the Queue itself.

With `subprocess.Popen(...)` we start a completely fresh Python interpreter, so there is no shared Queue unless we build a separate IPC layer (pipe/socket/fd passing/etc.). We could do that, but it adds quite a bit of complexity compared to the current tempfile approach. The tempfile also avoids PIPE deadlocks and still captures early bootstrap/import failures before any IPC channel would be initialized.
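The inheritance point can be demonstrated in isolation. This is a hedged sketch, not Edge worker code: a forked child inherits the parent's process state, including the pipe underlying a `multiprocessing.Queue`, so it can hand data straight back; a `subprocess.Popen` child is a brand-new interpreter with no such inheritance (it assumes a fork-capable platform such as Linux).

```python
import multiprocessing as mp


def _child(queue):
    # The forked child inherited the Queue's underlying pipe from the parent,
    # so this put() is visible to the parent with no extra IPC setup.
    queue.put("traceback text")


def roundtrip_via_fork() -> str:
    ctx = mp.get_context("fork")  # assumption: fork start method is available
    queue = ctx.Queue()
    proc = ctx.Process(target=_child, args=(queue,))
    proc.start()
    message = queue.get()
    proc.join()
    return message
```

No equivalent exists for a fresh interpreter started with `subprocess.Popen`: the new process shares no Python objects with its parent, which is why the discussion turns to files as the common transport.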
Having slept on it and looked at the code... I do not actually want to insist on the queue :-D I mainly want error details from the supervisor passed back into the task logs if something failed. So the "text" content should be passed over.

For me it would also be okay to step away from the Queue in general and transport the error details via a text file in both branches. Then we have one technical backend for both execution options. The main thing I want to achieve is to have the "text" transferred: instead of passing the exception to the queue, the text can also be written to a file and picked up. That would make it leaner?

(Including: if all is OK, we do not need to pass an "OK" text; we just use the file for passing any error text?)
Thanks for the suggestion! Before I push I just wanted to confirm the approach I've taken is what you had in mind:

- Remove `multiprocessing.Queue` from the fork path entirely.
- Use a per-task error file (a `Path`) stored as `Job.stderr_file_path`.
- `Job.failure_details()` reads from that file for both paths, no arguments needed, one code path.
- The file is removed in `Job.cleanup()`.

Does that match what you had in mind, or would you like any adjustments?