
Handlers restart while migrations happen in Job container #172

Closed
almahmoud opened this issue Oct 21, 2020 · 6 comments

@almahmoud (Member) commented Oct 21, 2020

Web and workflow handlers usually restart 1-2 times while migrations run in the job handler's init container.

I think running the migrations as a Job would fix this; I want to follow up on that PR.

@almahmoud self-assigned this Oct 21, 2020
@almahmoud changed the title from "Handlers restarts while migrations happen in Job container" to "Handlers restart while migrations happen in Job container" Oct 21, 2020
@ksuderman self-assigned this Mar 17, 2021
@ksuderman (Contributor)

After our discussion during the March 16 call, I would like to take a stab at this. There are at least two issues to address:

  1. The job handler gets restarted during startup due to failed liveness and readiness probes. Kubernetes 1.16 added a startupProbe for just this use case and Kubernetes won't start the other probes until startupProbe succeeds.

    • Add an init container after the db migration that does something like `touch /db_migration_complete`. It could also be included at the end of the db migration container if desired.
    • Add a startupProbe that checks `[ -f /db_migration_complete ]`.
      Once the database migration has completed, it should be safe to start the liveness and readiness probes.
  2. Modify the liveness and readiness probes to monitor the state of the job runner rather than the status of the jobs the job runner is running. This would also close [WIP] Move pysam index to external process galaxy#11558; the JobRunner should manage the jobs and the Kubernetes jobHandler health probes should monitor the JobRunner

The second item will require input and guidance from the core team as I suspect the proper solution is to have the AsynchronousJobRunner or galaxy.util.monitor.Monitors class record a heartbeat somewhere (not in a database) that the Kubernetes probes can inspect.
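
For concreteness, a minimal sketch of item 1 as it might look in a handler Deployment, assuming the marker file is shared through an emptyDir volume; the container, volume, and image names here are illustrative, not the chart's actual values:

```yaml
# Illustrative sketch only: names, images, and paths are not the chart's actual values.
spec:
  volumes:
    - name: migration-flag
      emptyDir: {}
  initContainers:
    # ... the existing db migration init container(s) run first ...
    - name: mark-db-migration-complete
      image: busybox:1.32
      command: ["sh", "-c", "touch /flags/db_migration_complete"]
      volumeMounts:
        - name: migration-flag
          mountPath: /flags
  containers:
    - name: galaxy-job-handler
      image: galaxy/galaxy-min:20.09   # placeholder image
      volumeMounts:
        - name: migration-flag
          mountPath: /flags
      startupProbe:
        exec:
          command: ["sh", "-c", "[ -f /flags/db_migration_complete ]"]
        periodSeconds: 10
        failureThreshold: 60   # allow up to ~10 minutes before giving up
      # liveness and readiness probes only start once the startupProbe succeeds
```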

@almahmoud (Member, Author) commented Mar 25, 2021

> After our discussion during the March 16 call, I would like to take a stab at this. There are at least two issues to address:
>
> 1. The job handler gets restarted during startup due to failed liveness and readiness probes. Kubernetes 1.16 added a startupProbe for just this use case and Kubernetes won't start the other probes until startupProbe succeeds.

I think this can be true but generally is not. The restarts on the other handlers happen because the main process errors out when Galaxy attempts to start up before the database is ready. The most common case is that the migrations have started but are still underway, so the other handlers error out due to a mismatch between the database version they expect and the one currently in Postgres.

Startup probes are a good feature nonetheless, but I don't think they're the solution here, as the handlers will restart even with the startup probe. On another note, as mentioned at the meeting this week, the startup probes had already been added in the past and were subsequently removed when we were asked to keep backwards compatibility. I've reverted that removal in #266.

> • Add an init container after the db migration that does something like `touch /db_migration_complete`. It could also be included at the end of the db migration container if desired.

This is a great thought, but unfortunately it's been a bit more complicated than that. I attempted this in the fall (master...almahmoud:init-containers-2), but the problem was that the migrations expected the Galaxy handler to touch the DB and restart. Dannon and I spent a few days trying different variations, but it seemed there was a bug in Galaxy or the DB script. We also tried switching to `create_db.sh` instead of `manage_db.sh` to get over that hurdle, but that created other problems, e.g. `sqlalchemy.exc.ProgrammingError: (psycopg2.errors.UndefinedTable) relation "tool_shed_repository" does not exist`. Removing the script altogether and letting the handlers initialize the DB also didn't seem to be enough, hence why we punted. I am hoping it was due to a bug in the scripts/Galaxy at the time that has since been fixed, and it is definitely time to revisit this, but it might be worth starting from where we left off in the fall rather than from scratch.

> • Add a startupProbe that checks `[ -f /db_migration_complete ]`.
>   Once the database migration has completed, it should be safe to start the liveness and readiness probes.

I think an init container is the way to go, not a startup probe. The init container should block the main container from ever coming up, so the probe will never even start to run. If the main container has started, probes or not, it will fail and restart because the main process is failing, so we want to block the main container altogether, not just alter its probes.
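
For reference, a rough sketch of that blocking approach, assuming the migration step drops a marker file on a volume the handler pods also mount (the image, volume name, and marker path are illustrative):

```yaml
# Sketch only: the handler container is not created until the migration marker exists.
initContainers:
  - name: wait-for-db-migration
    image: busybox:1.32
    command:
      - sh
      - -c
      - |
        until [ -f /galaxy/data/db_migration_complete ]; do
          echo "waiting for database migration to finish..."
          sleep 5
        done
    volumeMounts:
      - name: galaxy-data          # shared volume written by the migration step
        mountPath: /galaxy/data
```

Because init containers must exit successfully before the main container is created, no probe ever runs against a handler that is still waiting on the schema.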

> 2. Modify the liveness and readiness probes to monitor the state of the job runner rather than the status of the jobs the job runner is running. This would also close galaxyproject/galaxy/pull/11558; the JobRunner should manage the jobs and the Kubernetes jobHandler health probes should monitor the JobRunner.

I'm a little confused about what you mean here. The current probes monitor whether the job handler is reporting its heartbeat, not whether jobs are being run. In the meeting we were suggesting switching to running a job as the check instead of a heartbeat, but I don't think the probe does, or was ever planned to, monitor jobs themselves. I think the issue is deciding how we monitor the job runner. Monitoring just whether the Python process is up is useless, as Kubernetes monitors that by default (if the main process errors out, the pod is restarted regardless of probes).

We originally had the heartbeat in a different thread, which was a problem because the handler would sometimes be looping while failing to clean up (a) failed job(s), yet the heartbeat was still being reported, giving the impression that the handler was up while it had been unable to run a job for over a day. To solve this, we moved the heartbeat into the main thread, to catch the situation where the process doesn't error out but is also not responding to new requests. That has proven problematic because of scenarios (like the blocking indexing that Nuwan was mentioning) where the blocking is expected and the non-responsiveness is not an issue.

I think we still need to figure out a good way to address the latter, but I don't think the nature of probes (startup vs readiness vs liveness) is the issue here; I think it's that we don't know what the most sensible, fast, low-overhead probe should be running to begin with (the heartbeat doesn't seem to give us the robustness we hoped for).

> The second item will require input and guidance from the core team as I suspect the proper solution is to have the AsynchronousJobRunner or galaxy.util.monitor.Monitors class record a heartbeat somewhere (not in a database) that the Kubernetes probes can inspect.

I think this is what Luke already leveraged to implement the current heartbeat monitoring, but it hasn't been enough to accurately indicate whether the job handler can run jobs.

@ksuderman (Contributor)

> I'm a little confused about what you mean here.

That is because I was confused; I thought the heartbeat was from the running job, not the job handler itself.

> I don't think the nature of probes (startup vs readiness vs liveness) is the issue here; I think it's that we don't know what the most sensible, fast, low-overhead probe should be running to begin with (the heartbeat doesn't seem to give us the robustness we hoped for).

I think the heartbeat is fine for the readiness probe; if the job runner is not writing its heartbeat because something is holding the GIL, then Kubernetes likely shouldn't keep routing jobs to it. However, the heartbeat should not be used in a liveness probe, as we don't want Kubernetes restarting the pod just because the job handler is dealing with a long-running, misbehaving job. In fact, we can likely get rid of the liveness probe if we ensure the job runner crashes when it is supposed to.
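
As a sketch of that split, the readiness probe could simply check that the heartbeat file has been written recently; the file path and thresholds below are assumptions, not the chart's current configuration:

```yaml
# Sketch only: mark the handler unready when the heartbeat file goes stale
# (i.e. not modified within the last 2 minutes).
readinessProbe:
  exec:
    command:
      - sh
      - -c
      - test -n "$(find /galaxy/server/database/heartbeat.log -mmin -2)"
  periodSeconds: 30
  failureThreshold: 2   # unready after two consecutive stale checks
```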

@almahmoud (Member, Author)

> > I'm a little confused about what you mean here.
>
> That is because I was confused; I thought the heartbeat was from the running job, not the job handler itself.
>
> > I don't think the nature of probes (startup vs readiness vs liveness) is the issue here; I think it's that we don't know what the most sensible, fast, low-overhead probe should be running to begin with (the heartbeat doesn't seem to give us the robustness we hoped for).
>
> I think the heartbeat is fine for the readiness probe; if the job runner is not writing its heartbeat because something is holding the GIL, then Kubernetes likely shouldn't keep routing jobs to it. However, the heartbeat should not be used in a liveness probe, as we don't want Kubernetes restarting the pod just because the job handler is dealing with a long-running, misbehaving job. In fact, we can likely get rid of the liveness probe if we ensure the job runner crashes when it is supposed to.

I agree. The heartbeat is a good measure for readiness, and in a production scenario with horizontal scaling it's the most useful bit.

For the liveness probe, I agree in theory, but the history is that it's been helpful in practice. The main problem was that sometimes the job handler would error out in a loop, not because the process itself was crashing, but because, for example, the cleanup function was failing, catching the exception, and then re-adding the job to the queue to try again. That was a bug, and the real solution is that if cleanup fails because the files don't exist, the handler should assume the job has already been cleaned up and move on. But it wasn't only the cleanup function; the same general behavior (handler erroring out in a loop on caught errors) came up in different handler methods, especially when running tool tests in bulk.

The restarts from the liveness probes fixed the issue... -ish. In some cases the restart wasn't enough, and there are still some niche scenarios that make the handler loop and become unrecoverable without manual intervention in the DB or on the NFS, depending on the error. In retrospect, I don't know that we ever got a clear answer as to what was happening in a lot of scenarios. I guess if the old handler was killed after the job was removed from the queue but before it was re-added, that might be why a restart would fix a job being processed in a loop?

In any case, it was a practical addition that made things slightly more reliable, until the indexing turned it into something that also makes things less reliable in other scenarios. I wholeheartedly agree, though, that conceptually it's the wrong thing to check for liveness. I'm not sure what a good solution is for a decent liveness probe. I'm wondering if, instead of removing it, we should just give it a much higher threshold on the heartbeat? If the handler is unresponsive for a few minutes, that's expected and should be handled by horizontal scaling, i.e. the readiness probe, with the expectation it will come back soon; but if it's unresponsive for, say, an hour, then restart it, because that's not a reasonable amount of time to be continuously unresponsive. It's still not great, but it might be a good interim solution while we think of a better liveness probe.
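
A liveness probe expressing that interim idea might reuse the same heartbeat check with a much more forgiving budget; again, the heartbeat path and the roughly one-hour threshold are illustrative, not settled values:

```yaml
# Sketch only: restart the pod only after about an hour of continuously stale heartbeats.
livenessProbe:
  exec:
    command:
      - sh
      - -c
      - test -n "$(find /galaxy/server/database/heartbeat.log -mmin -2)"
  periodSeconds: 60
  failureThreshold: 60   # ~60 consecutive minutes of failures before a restart
```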

@ksuderman (Contributor)

> I'm not sure what a good solution is for a decent liveness probe. I'm wondering if, instead of removing it, we should just give it a much higher threshold on the heartbeat?

That might be the best option for now. Once I can reliably launch instances on a cluster, I will try to nail down some of those problematic scenarios. I believe Nuwan had identified one particular tool that caused problems as well.

@almahmoud (Member, Author) commented May 7, 2021

This has been resolved with the new startup jobs in 4.0.
