Add a GRPC HealthCheck endpoint #2933
Conversation
Quick draft of the idea; I'm going to plumb this together with an operator that can keep tabs on it. The point here is that the readiness probe should survive after the one-shot pipeline is complete, so that we get a positive signal about it.
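The mechanism in question is the standard gRPC health-checking protocol (`grpc.health.v1.Health`). A minimal Go sketch of the idea, not this PR's actual code: the port is illustrative, and `pipelineDone` is a hypothetical stand-in for however the one-shot pipeline signals completion.

```go
package main

import (
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// pipelineDone stands in for however the one-shot pipeline signals that it
// has run to completion; it is hypothetical, not part of this PR.
var pipelineDone = make(chan struct{})

func main() {
	// Serve the standard grpc.health.v1.Health service on a dedicated port
	// (the port number here is purely illustrative).
	lis, err := net.Listen("tcp", ":4195")
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	hs := health.NewServer()
	healthpb.RegisterHealthServer(srv, hs)

	// Report SERVING while the pipeline is running.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	go srv.Serve(lis)

	<-pipelineDone

	// After completion, keep serving the endpoint but report NOT_SERVING:
	// the kubelet sees the pod go unready and stay there, which is the
	// positive signal the readiness probe is meant to give.
	hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	select {}
}
```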
Hey @jan-g, thank you for raising this! Just a quick reaction for now:
Would you not be able to leverage shutdown_delay for this purpose?
> Would you not be able to leverage shutdown_delay for this purpose?

I'm not 100% convinced that I can, at least not as reliably. If I've a bunch of pods running then I don't necessarily have an upper bound on how long it'll take to poll them all. By wiring this into the gRPC readiness signal I distribute my monitoring over a bunch of kubelets, and use the k8s control plane as an event collector. I've marked the PR as a draft because I'd like to get all the moving parts lined up to see it working end-to-end.
jan-g force-pushed the branch from abe5caa to b5e6e69
Right, I've run this up with the container configured with a readiness probe along the lines sketched below, and I'm seeing completed pipelines move to unready (and remain there). I have a second PR knocking around to spot those and sweep them up, rather than having them restart continually.
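For concreteness, a container spec along these lines would exercise the endpoint. This assumes Kubernetes' native gRPC probe type (available since v1.24); the image name, port, and probe timings are placeholders, not the configuration actually used here.

```yaml
# Illustrative pod spec fragment, assuming Kubernetes' native gRPC probe
# type (v1.24+); the image, port, and timings are placeholders.
containers:
  - name: pipeline
    image: example/pipeline:latest
    readinessProbe:
      grpc:
        port: 4195
      periodSeconds: 5
      failureThreshold: 1
```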
Question here is mostly stylistic: do we want to drive something like this (including port selection) via config/env var or something? Also, "can we achieve the same result via the existing machinery?" It looks like that's quite geared around cleaning up within a fixed period of pipeline completion. LMKWYT.
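On the port-selection question, one hedged illustration of the env-var route; the variable name and default value are invented for the example, not taken from this PR.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// healthListener resolves the health port from an environment variable,
// falling back to a default; HEALTH_GRPC_PORT and the default value are
// hypothetical, shown only to illustrate env-var-driven port selection.
func healthListener() (net.Listener, error) {
	port := os.Getenv("HEALTH_GRPC_PORT")
	if port == "" {
		port = "4195"
	}
	return net.Listen("tcp", ":"+port)
}

func main() {
	lis, err := healthListener()
	if err != nil {
		panic(err)
	}
	fmt.Println("health endpoint listening on", lis.Addr())
}
```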
jan-g force-pushed the branch from b5e6e69 to 807e18c
Thanks @jan-g, LGTM overall, though there's an error breaking the tests.
jan-g force-pushed the branch from 18aa3ce to 0d62f01
This can be used by the kubelet to monitor the pod for readiness, and as a backup by the operator to catch pipelines that have run to completion. This means that -cloud and -ai pods that conclude a pipeline run will not exit unless explicitly terminated.
jan-g force-pushed the branch from 0d62f01 to 59e2750
LGTM, thanks @jan-g! I made one small alteration so that this only applies when running the run command, which allows us to continue using create, list, and so on without issues.