[WIP] Move pysam index to external process #11558
base: release_21.01
    f"import pysam; pysam.set_verbosity(0); pysam.index('{index_flag}', '{file_name}', '{index_name}')"]
if stderr:
    with open(stderr, 'w') as stderr:
        subprocess.check_call(cmd, stderr=stderr, shell=False)
Can you use
galaxy/lib/galaxy/util/commands.py
Line 88 in ca44259
def execute(cmds, input=None):
Thanks for reviewing. The original code has a specific comment saying that stderr needs to be discarded:
galaxy/lib/galaxy/datatypes/binary.py
Line 490 in e5a9524
# we start another process and discard stderr.
and execute doesn't seem to support stderr redirection?
That's what stderr=subprocess.PIPE does (not exactly, but it's good enough). The only important thing is that the stderr of the externalized pysam call doesn't end up in the outer stderr, which was (I think) a failure reason for the metadata script.
Aah I see, the piped stderr is being ignored. Sure, seems fine, can do.
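For illustration, here is a minimal sketch of the behaviour discussed above: piping the child's stderr keeps it out of the parent's stderr. This is not Galaxy's execute; run_quietly is a hypothetical helper, and the child command is a stand-in for the pysam call.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd with stderr piped, so nothing reaches the parent's stderr.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, check=False)
    # The captured text is returned here; a caller is free to ignore it.
    return proc.returncode, proc.stderr.decode()


# Stand-in for a noisy child process: it writes a warning to its own stderr.
returncode, captured = run_quietly(
    [sys.executable, "-c", "import sys; sys.stderr.write('noisy warning\\n')"]
)
```

The warning ends up in captured rather than in the outer process's stderr, which is the property that matters for the metadata script.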
Can you make this optional? For traditional Galaxy job runners this already runs as an external process as part of the metadata script. When creating many small files, spawning a subprocess for each one is going to add significant overhead.
I'm now wondering whether this is worth doing at all. I guess threads being suspended is only really a problem for the heartbeat thread, but the heartbeat doesn't seem to be a good proxy for liveness anyway, for a number of reasons. For one, all it says is that a particular thread in the handler is alive, which k8s already knows since the overall handler process is alive. It has a low probability of failure and doesn't indicate much about the actual health of the handler. So it seems more effective to redo or simply drop the liveness probe. If this heartbeat blocking issue is not a problem elsewhere, should we consider just doing that instead?
Maybe. Another way to look at "liveness" could be to monitor the main thing each handler is supposed to do. For workflow handlers this might be creating new jobs, and for job handlers it might be dispatching jobs in the job loop. It's a bit harder to do for web handlers, but I guess if they're responding to requests that might be fine?
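A rough sketch of that idea, assuming the handler's main loop calls beat() once per iteration and the probe endpoint calls is_alive(). LoopLiveness and max_idle_seconds are hypothetical names, not existing Galaxy code.

```python
import time


class LoopLiveness:
    """Tracks when a handler's main loop last completed an iteration."""

    def __init__(self, max_idle_seconds, clock=time.monotonic):
        self._max_idle = max_idle_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._last_beat = clock()

    def beat(self):
        # Called by the main loop (e.g. after each job dispatch pass).
        self._last_beat = self._clock()

    def is_alive(self):
        # Healthy only if the loop has made progress recently.
        return (self._clock() - self._last_beat) <= self._max_idle
```

Unlike a dedicated heartbeat thread, this reports on the work the handler is supposed to be doing, so a stalled job loop would eventually fail the probe.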
Might #13411 be an alternative?
What did you do?
This PR moves all calls to pysam.index to an external process. This had previously been done in one place in the code:
galaxy/lib/galaxy/datatypes/binary.py
Line 493 in e5a9524
Why did you make this change?
We ran into an issue in the k8s chart where the job handler would abruptly restart while inside pysam.index. The proximate cause was a health check failure. The underlying reason was that pysam.index can take a long time and, being an external C extension, appears not to release the GIL, which prevents the heartbeat thread from running. When the heartbeat thread fails to report liveness, k8s restarts the job handler.
By externalizing the call, we prevent it from blocking the job handler threads. This has the additional benefit of generally preventing pysam failures from crashing the handler.
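A minimal sketch of the externalization, under stated assumptions: build_index_cmd and index_externally are hypothetical names, not the functions in this PR, and the arguments are passed through sys.argv instead of being interpolated into the code string so that file names containing quotes cannot break the snippet. It assumes pysam is importable by the child interpreter.

```python
import subprocess
import sys

# Code executed by the child interpreter; arguments arrive via sys.argv.
INDEX_SNIPPET = "import sys, pysam; pysam.set_verbosity(0); pysam.index(*sys.argv[1:])"


def build_index_cmd(file_name, index_name, index_flag):
    """Build the argv list for running pysam.index in a separate interpreter."""
    return [sys.executable, "-c", INDEX_SNIPPET, index_flag, file_name, index_name]


def index_externally(file_name, index_name, index_flag):
    """Run the indexing call in a child process and discard its stderr."""
    cmd = build_index_cmd(file_name, index_name, index_flag)
    # check=True raises CalledProcessError if the child fails, so a pysam
    # crash surfaces as an exception in the handler, not a handler crash.
    subprocess.run(cmd, stderr=subprocess.DEVNULL, check=True, shell=False)
```

Because the work happens in another process, the handler's own threads (including the heartbeat) keep running while the index is built.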
How to test the changes?
(select the most appropriate option; if the latter, provide steps for testing below)