
[WIP] Move pysam index to external process #11558

Draft · wants to merge 1 commit into base: release_21.01

Conversation

@nuwang (Member) commented Mar 6, 2021

What did you do?

This PR moves all calls to pysam.index to an external process. This had previously been done in one place in the code:

```python
cmd = ['python', '-c', f"import pysam; pysam.set_verbosity(0); pysam.index('{file_name}', '{index_name}')"]
```
but this PR extends that to all use cases.
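For illustration, a minimal sketch of what such an externalized call can look like; the helper name and the stderr handling here are assumptions for this sketch, not the exact code in the PR:

```python
import subprocess


def pysam_index_external(file_name, index_name, stderr_path=None):
    """Run pysam.index in a child Python process so a long-running C-extension
    call cannot tie up the handler process (hypothetical helper, for illustration)."""
    cmd = [
        'python', '-c',
        f"import pysam; pysam.set_verbosity(0); pysam.index('{file_name}', '{index_name}')",
    ]
    if stderr_path:
        # Redirect the child's stderr to a file so it does not leak into the handler's output.
        with open(stderr_path, 'w') as stderr_fh:
            subprocess.check_call(cmd, stderr=stderr_fh, shell=False)
    else:
        # Discard the child's stderr entirely.
        subprocess.check_call(cmd, stderr=subprocess.DEVNULL, shell=False)
```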

Why did you make this change?

We ran into an issue in the k8s chart where the job handler would abruptly restart while inside pysam.index. The proximate cause was a health check failure. The underlying reason was that pysam.index can take a long time and, being a call into a C extension, appears not to release the GIL, which prevents the heartbeat thread from running. When the heartbeat thread fails to report liveness, k8s restarts the job handler.

By running the call in an external process, we prevent it from blocking the job handler's threads; this also has the general benefit of preventing pysam failures from crashing the handler.

How to test the changes?

(select the most appropriate option; if the latter, provide steps for testing below)

  • This is a refactoring of components with existing test coverage.

f"import pysam; pysam.set_verbosity(0); pysam.index('{index_flag}', '{file_name}', '{index_name}')"]
if stderr:
with open(stderr, 'w') as stderr:
subprocess.check_call(cmd, stderr=stderr, shell=False)
Contributor:

Can you use `def execute(cmds, input=None)`?

@nuwang (Member, Author) Mar 6, 2021:

Thanks for reviewing. The original code has a specific comment saying that stderr needs to be discarded:

# we start another process and discard stderr.

and execute doesn't seem to support stderr redirection?

@mvdbeek (Member) Mar 6, 2021:

That's what stderr=subprocess.PIPE does (not exactly, but it's good enough here; the only important thing is that the stderr of the externalized pysam call doesn't end up in the outer stderr, which, if I remember correctly, was a failure reason for the metadata script).
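For context, a small sketch of discarding the child's stderr. This is illustrative only: the subprocess docs discourage stderr=subprocess.PIPE with check_call (an unread pipe can fill up and block the child), so subprocess.run is shown instead, and the BAM filename is a placeholder.

```python
import subprocess

cmd = ['python', '-c', "import pysam; pysam.set_verbosity(0); pysam.index('input.bam')"]

# Capture the child's stderr so it never reaches the outer process's stderr,
# then simply ignore it (subprocess.run reads the pipe for us).
subprocess.run(cmd, stderr=subprocess.PIPE, check=True)

# Or drop it outright.
subprocess.run(cmd, stderr=subprocess.DEVNULL, check=True)
```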

@nuwang (Member, Author):

Aah I see, the piped stderr is being ignored. Sure, seems fine, can do.

@mvdbeek (Member) commented Mar 6, 2021

Can you make this optional? For traditional Galaxy job runners this already runs as an external process as part of the metadata script. When creating many small files, spawning a subprocess for each one is going to be significant overhead.
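A possible shape for making it optional; the function name, flag, and default below are hypothetical, just to illustrate the trade-off between handler responsiveness and per-file subprocess overhead:

```python
import subprocess

import pysam


def index_bam(file_name, index_name, external=False):
    # Hypothetical toggle, for illustration only.
    if external:
        # Keep the potentially long, GIL-holding call out of the handler process
        # (useful where a liveness probe must keep ticking).
        cmd = ['python', '-c',
               f"import pysam; pysam.set_verbosity(0); pysam.index('{file_name}', '{index_name}')"]
        subprocess.check_call(cmd, stderr=subprocess.DEVNULL, shell=False)
    else:
        # Traditional runners already execute this inside the metadata script,
        # so the in-process call avoids spawning a subprocess per file.
        pysam.set_verbosity(0)
        pysam.index(file_name, index_name)
```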

@nuwang (Member, Author) commented Mar 7, 2021

I'm now wondering whether this is worth doing at all. I guess threads being suspended is only really a problem for the heartbeat thread, but it seems like the heartbeat is not really a good proxy for liveness anyway for a number of reasons.

a. All it’s saying is that a particular thread in the handler is alive, which k8s already knows since the overall handler process is alive. It has a low probability of failure and really doesn’t indicate a lot about the actual health of the handler.
b. The only additional bit of information we know is that the process is communicating with the database, but again, that has never really been a problem, and we’d know about it pretty quickly through other means anyway.
c. The liveness probe also introduces risks like two job handlers with the same name being alive simultaneously for a brief period of time.
d. We presumably don't know for sure how many other code paths like this one could block the heartbeat thread.

So it seems more effective to redo or simply drop the liveness probe. If this heartbeat-blocking issue is not a problem elsewhere, should we consider just doing that instead?

@mvdbeek (Member) commented Mar 7, 2021

Maybe. Another way to look at "liveness" could be to monitor the main thing each handler is supposed to do. For workflow handlers this might be creating new jobs, and for job handlers it might be dispatching jobs in the job loop. It's a bit harder for web handlers, but I guess if they're responding to requests that might be fine?
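As a loose illustration of that idea (entirely hypothetical, not Galaxy code): the dispatch loop records a timestamp on each iteration and the probe only passes while the loop keeps ticking:

```python
import time

_last_tick = time.monotonic()


def job_dispatch_loop():
    global _last_tick
    while True:
        # ... dispatch ready jobs here ...
        _last_tick = time.monotonic()
        time.sleep(1)


def liveness_check(max_stall_seconds=60):
    # Passes only if the handler's main loop has run recently, i.e. the handler
    # is doing its actual work rather than merely having a live thread.
    return (time.monotonic() - _last_tick) < max_stall_seconds
```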

@bernt-matthias (Contributor):
May #13411 be an alternative?

@nsoranzo marked this pull request as draft September 16, 2022 00:18