Handlers restart while migrations happen in Job container #172
After our discussion during the March 16 call, I would like to take a stab at this. There are at least two issues to address:
The second item will require input and guidance from the core team, as I suspect the proper solution is to have the
I think this can be true but is generally not. The restarts on the other handlers happen because the main process errors out when Galaxy attempts to start up before the database is ready. The most common case is that the migrations have started but are still underway, so the other handlers error out due to a mismatch between the database version Galaxy expects and the one currently found in Postgres.
This is a great thought, but unfortunately it's been a bit more complicated than that. I attempted this in the fall (master...almahmoud:init-containers-2), but the problem was that the migrations expected the galaxy handler to touch the DB and restart. Dannon and I spent a few days trying different variations, but it seemed there was a bug in Galaxy or the DB script. We also tried to switch to
I think an init container is the way to go, not a startup probe. The init container should block the main container from ever coming up, so the probe will not even start to run. If the main container has started, probes or not, it will fail and restart because the main process is failing, so we want to block the main container altogether, not just alter its probes.
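For illustration, a minimal sketch of that gating behavior (not the chart's actual manifest: the images, the `galaxy-galaxy-postgres` host name, and the `pg_isready` check are placeholder assumptions, and in practice the gate would need to wait for the migrations to finish rather than just for the database to accept connections):

```yaml
spec:
  initContainers:
    # The main container below (and therefore its probes) cannot start until
    # this init container has exited successfully.
    - name: wait-for-db
      image: postgres:13-alpine
      command:
        - sh
        - -c
        - until pg_isready -h galaxy-galaxy-postgres -p 5432; do echo waiting for db; sleep 3; done
  containers:
    - name: workflow-handler
      image: galaxy/galaxy:latest   # placeholder image
      # any startup/readiness/liveness probes attached here never run while
      # the init container above is still blocking
```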
I'm a little confused about what you mean here. The current probes monitor whether the job handler is reporting its heartbeat, not whether jobs are being run. In the meeting we suggested switching to running a job as the check instead of a heartbeat, but I don't think the probe does, or was ever planned to, monitor jobs themselves.

I think the real issue is deciding how we monitor the Job Runner. Monitoring just whether the Python process is up is useless, as Kubernetes already does that by default (if the main process errors out, the pod is restarted regardless of probes). We originally had the heartbeat in a separate thread, which was a problem because the handler would sometimes be looping, failing to clean up one or more failed jobs, while the heartbeat was still being reported, giving the impression that the handler was up even though it had been unable to run a job for over a day. To solve this we moved the heartbeat into the main thread, to catch the case where the process doesn't error out but is also not responding to new requests. That has proven problematic in scenarios (like the blocking indexing Nuwan was mentioning) where the blocking is expected and the non-responsiveness is not actually an issue.

I think we still need to figure out a good way to address the latter, but I don't think the nature of probes (startup vs. readiness vs. liveness) is the issue here; it's that we don't know what the most sensible, fast, low-overhead check to run is in the first place (the heartbeat doesn't seem to give us the robustness we hoped for).
I think this is what Luke already leveraged to implement the current heartbeat monitoring, but it hasn't been enough to accurately indicate whether the job handler can run jobs.
That is because I was confused; I thought the heartbeat was from the running job, not the job handler itself.
I think the heartbeat is fine for the readiness probe; if the job runner is not writing its heartbeat because something is holding the GIL, then Kubernetes likely shouldn't keep routing jobs to it. However, the heartbeat should not be used in a liveness probe, as we don't want Kubernetes restarting the pod just because the job handler is dealing with a long-running, misbehaving job. In fact, we can likely get rid of the liveness probe if we ensure the job runner crashes when it is supposed to.
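A minimal sketch of that split, assuming the handler writes a heartbeat file; the image, file path, and staleness window below are hypothetical, not what the chart currently ships:

```yaml
containers:
  - name: job-handler
    image: galaxy/galaxy:latest   # placeholder image
    readinessProbe:
      exec:
        command:
          - sh
          - -c
          # ready only if the (hypothetical) heartbeat file was touched within the last 2 minutes
          - test -n "$(find /galaxy/server/database/heartbeat -mmin -2 2>/dev/null)"
      initialDelaySeconds: 60
      periodSeconds: 30
    # no livenessProbe: if the main process crashes, Kubernetes restarts the
    # container anyway; a stuck-but-alive handler is only marked unready by the
    # check above, never forcibly restarted
```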
I agree. The heartbeat is a good measure for readiness, and in a production scenario with horizontal scaling it's the most useful bit.
That might be the best option for now. Once I can reliably launch instances on a cluster, I will try to nail down some of those problematic scenarios. I believe Nuwan had identified one particular tool that caused problems as well.
This has been resolved with the new startup jobs in 4.0.
Web and workflow handlers usually restart 1-2 times while migrations happen in the Job handler init container.
I think running the migrations as a Job would fix this; I want to follow up on that PR.
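A rough sketch of what "migrations as a job" could look like, assuming a Helm pre-install/pre-upgrade hook and Galaxy's `manage_db.sh` script; the names, image, and paths are illustrative and not taken from that PR:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: galaxy-db-migrations
  annotations:
    "helm.sh/hook": pre-install,pre-upgrade
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 3
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: galaxy/galaxy:latest   # placeholder image
          # run the schema upgrade once, before the handler deployments roll out,
          # so web/workflow/job handlers never race the migration
          command: ["sh", "-c", "cd /galaxy/server && sh manage_db.sh upgrade"]
```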