Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a GRPC HealthCheck endpoint #2933

Merged
merged 4 commits into from
Nov 4, 2024
Merged

Add a GRPC HealthCheck endpoint #2933

merged 4 commits into from
Nov 4, 2024

Conversation

jan-g
Copy link
Contributor

@jan-g jan-g commented Oct 14, 2024

This can be used by kubelet to monitor the pod for readiness, and as a backup by the operator to catch pipelines that've run to completion.

@jan-g jan-g marked this pull request as draft October 14, 2024 10:55
@jan-g
Copy link
Contributor Author

jan-g commented Oct 14, 2024

Quick draft of the idea, going to plumb this together with an operator that can keep tabs on it.

The point here is that the readiness probe should really survive after the one-shot pipeline is complete, so that we get a positive signal about it.

@mihaitodor
Copy link
Collaborator

Hey @jan-g, thank you for raising this! Just a quick reaction for now:

The point here is that the readiness probe should really survive after the one-shot pipeline is complete, so that we get a positive signal about it.

Would you not be able to leverage shutdown_delay for this purpose?

@jan-g
Copy link
Contributor Author

jan-g commented Oct 14, 2024

| Would you not be able to leverage shutdown_delay for this purpose?

I'm not 100% convinced that I can, at least not as reliably. If I've a bunch of pods running then I don't necessarily have an upper bound for how long it'll take to poll these. By wiring this into the grpc readiness signal I distribute my monitoring over a bunch of kubelets, and use the k8s cp as an event collector. I've marked the PR as a draft because I'd like to get all the moving parts lined up to see it working end-to-end.

@jan-g jan-g force-pushed the pause-on-completion branch 5 times, most recently from abe5caa to b5e6e69 Compare October 16, 2024 10:47
@jan-g jan-g marked this pull request as ready for review October 17, 2024 11:59
@jan-g
Copy link
Contributor Author

jan-g commented Oct 17, 2024

Right, I've run this up with the contianer configured with

readinessProbe:
  grpc:
    port: 2999

and I'm seeing completed pipelines move to unready (and remain there). I have a second PR knocking around to spot those and sweep them up, rather than having them restart continually.

@jan-g jan-g requested a review from Jeffail October 17, 2024 12:55
@jan-g
Copy link
Contributor Author

jan-g commented Oct 17, 2024

Question here is mostly styllistic: do we want to drive something like this (including port selection) via convi/env var or something?

Also, "can we achieve the same result via the existing machinery?" but it looks like that's quite geared around cleaning up within a fixed period of pipeline completion. LMKWYT.

@jan-g jan-g force-pushed the pause-on-completion branch from b5e6e69 to 807e18c Compare October 18, 2024 12:53
@Jeffail
Copy link
Collaborator

Jeffail commented Oct 30, 2024

Thanks @jan-g, LGTM overall, there's an error breaking tests:

cmd/redpanda-connect-cloud/main.go:27:2: "github.com/redpanda-data/benthos/v4/public/service" imported and not used

@jan-g jan-g force-pushed the pause-on-completion branch 2 times, most recently from 18aa3ce to 0d62f01 Compare October 31, 2024 09:23
This can be used by kubelet to monitor the pod for
readiness, and as a backup by the operator to catch
pipelines that've run to completion.

This means that -cloud and -ai pods that conclude
a pipeline run will not exit unless explicitly
terminated.
@jan-g jan-g force-pushed the pause-on-completion branch from 0d62f01 to 59e2750 Compare October 31, 2024 11:34
Copy link
Collaborator

@Jeffail Jeffail left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks @jan-g! I made one small alteration so that this only applies when running the run command, that allows us to continue using create and list and so on without issues.

@Jeffail Jeffail merged commit c09208b into main Nov 4, 2024
3 checks passed
@Jeffail Jeffail deleted the pause-on-completion branch November 4, 2024 16:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants