Add a GRPC HealthCheck endpoint #2933

jan-g · 2024-10-14T10:55:43Z

This can be used by kubelet to monitor the pod for readiness, and as a backup by the operator to catch pipelines that've run to completion.

jan-g · 2024-10-14T10:56:43Z

Quick draft of the idea, going to plumb this together with an operator that can keep tabs on it.

The point here is that the readiness probe should really survive after the one-shot pipeline is complete, so that we get a positive signal about it.

mihaitodor · 2024-10-14T10:59:11Z

Hey @jan-g, thank you for raising this! Just a quick reaction for now:

The point here is that the readiness probe should really survive after the one-shot pipeline is complete, so that we get a positive signal about it.

Would you not be able to leverage shutdown_delay for this purpose?

jan-g · 2024-10-14T11:17:30Z

| Would you not be able to leverage shutdown_delay for this purpose?

I'm not 100% convinced that I can, at least not as reliably. If I've a bunch of pods running then I don't necessarily have an upper bound for how long it'll take to poll these. By wiring this into the grpc readiness signal I distribute my monitoring over a bunch of kubelets, and use the k8s cp as an event collector. I've marked the PR as a draft because I'd like to get all the moving parts lined up to see it working end-to-end.

jan-g · 2024-10-17T12:00:28Z

Right, I've run this up with the contianer configured with

readinessProbe:
  grpc:
    port: 2999

and I'm seeing completed pipelines move to unready (and remain there). I have a second PR knocking around to spot those and sweep them up, rather than having them restart continually.

jan-g · 2024-10-17T12:56:28Z

Question here is mostly styllistic: do we want to drive something like this (including port selection) via convi/env var or something?

Also, "can we achieve the same result via the existing machinery?" but it looks like that's quite geared around cleaning up within a fixed period of pipeline completion. LMKWYT.

Jeffail · 2024-10-30T14:32:31Z

Thanks @jan-g, LGTM overall, there's an error breaking tests:

cmd/redpanda-connect-cloud/main.go:27:2: "github.com/redpanda-data/benthos/v4/public/service" imported and not used

This can be used by kubelet to monitor the pod for readiness, and as a backup by the operator to catch pipelines that've run to completion. This means that -cloud and -ai pods that conclude a pipeline run will not exit unless explicitly terminated.

Jeffail

LGTM, thanks @jan-g! I made one small alteration so that this only applies when running the run command, that allows us to continue using create and list and so on without issues.

jan-g marked this pull request as draft October 14, 2024 10:55

jan-g force-pushed the pause-on-completion branch 5 times, most recently from abe5caa to b5e6e69 Compare October 16, 2024 10:47

jan-g marked this pull request as ready for review October 17, 2024 11:59

jan-g requested a review from Jeffail October 17, 2024 12:55

jan-g force-pushed the pause-on-completion branch from b5e6e69 to 807e18c Compare October 18, 2024 12:53

jan-g force-pushed the pause-on-completion branch 2 times, most recently from 18aa3ce to 0d62f01 Compare October 31, 2024 09:23

Add a GRPC HealthCheck endpoint

59e2750

This can be used by kubelet to monitor the pod for readiness, and as a backup by the operator to catch pipelines that've run to completion. This means that -cloud and -ai pods that conclude a pipeline run will not exit unless explicitly terminated.

jan-g force-pushed the pause-on-completion branch from 0d62f01 to 59e2750 Compare October 31, 2024 11:34

Jeffail added 2 commits November 4, 2024 15:18

Merge branch 'main' into pause-on-completion

4f09a08

Only spin up grpc server on the run command

8140bd6

Jeffail approved these changes Nov 4, 2024

View reviewed changes

Add check for skinny-run

f020aa5

Jeffail merged commit c09208b into main Nov 4, 2024
3 checks passed

Jeffail deleted the pause-on-completion branch November 4, 2024 16:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a GRPC HealthCheck endpoint #2933

Add a GRPC HealthCheck endpoint #2933

jan-g commented Oct 14, 2024

jan-g commented Oct 14, 2024

mihaitodor commented Oct 14, 2024

jan-g commented Oct 14, 2024

jan-g commented Oct 17, 2024

jan-g commented Oct 17, 2024

Jeffail commented Oct 30, 2024

Jeffail left a comment

Add a GRPC HealthCheck endpoint #2933

Add a GRPC HealthCheck endpoint #2933

Conversation

jan-g commented Oct 14, 2024

jan-g commented Oct 14, 2024

mihaitodor commented Oct 14, 2024

jan-g commented Oct 14, 2024

jan-g commented Oct 17, 2024

jan-g commented Oct 17, 2024

Jeffail commented Oct 30, 2024

Jeffail left a comment

Choose a reason for hiding this comment