
Conversation

@hkailantzis
Contributor

fixes #2966

  • Added the liveness probe on the 'common' scope (readiness was already there) and added the relevant probes where they were missing (a sketch of the shared config follows this list).

  • Tested on a local k8s cluster (minikube), additionally enabling the metrics-generator and gateway components to exercise their probes as well. All relevant components started as expected.
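A minimal sketch of what the shared probe definition in values.yaml might look like. The `tempo.livenessProbe` / `tempo.readinessProbe` keys match the merged template excerpt further down; the endpoint, port name, and timing values here are assumptions:

```yaml
tempo:
  # Shared by every component that renders its probes from the common scope.
  livenessProbe:
    httpGet:
      path: /ready        # Tempo components expose a readiness endpoint on /ready
      port: http-metrics  # port name is an assumption
    initialDelaySeconds: 30
    timeoutSeconds: 1
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 30
    timeoutSeconds: 1
```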

@hkailantzis hkailantzis changed the title add missing probes [tempo-distributed] add missing probes Sep 22, 2025
@KyriosGN0
Contributor

I don't think having those components share the same probe config is a good idea; I think they should each have separate probe configs.

@hkailantzis
Contributor Author

hkailantzis commented Sep 25, 2025

Hi @KyriosGN0, acked, thanks for the feedback.
For the liveness probe, however, I just followed the pre-existing readiness probe pattern that was already applied to most of the components. Apart from query-frontend and compactor, which didn't have probes at all, the rest of the components were already using the common/global readiness probe, which was defined a long time ago (the gateway uses its own probe config).
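For context, a minimal sketch of the two patterns described here. Only `tempo.readinessProbe` is confirmed by this PR's diff; the `gateway.readinessProbe` key and all probe values are assumptions based on the description:

```yaml
# Most components render the shared probe from the common scope:
#   readinessProbe:
#     {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
# The gateway instead carries its own probe config, e.g.:
gateway:
  readinessProbe:
    httpGet:
      path: /          # the nginx gateway serves plain HTTP at the root
      port: http       # port name is an assumption
    initialDelaySeconds: 15
    timeoutSeconds: 1
```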

@hkailantzis
Contributor Author

hkailantzis commented Sep 29, 2025

@KyriosGN0, could you please advise how I should proceed? This PR was meant to reuse the pre-existing probes as much as possible and only add the missing ones. See my comment above.

@KyriosGN0
Contributor

@hkailantzis if they are all using the same probe then yeah, it's OK I guess.
Do note that I'm not a maintainer for this chart, I'm just a user who contributes.

@hkailantzis
Contributor Author

@Sheikh-Abubaker, @swartz-k, @BitProcessor, or @faustodavid could you please review? Thanks in advance!

Signed-off-by: hkailantzis <[email protected]>
@QuentinBisson
Collaborator

It does look fine to me

@hkailantzis
Contributor Author

hkailantzis commented Oct 24, 2025

If 2 reviewers with write access could approve this, it would be great 🙏

@Sheikh-Abubaker Sheikh-Abubaker left a comment
Collaborator

LGTM!

@QuentinBisson
Collaborator

@hkailantzis can you fix the conflicts?

@QuentinBisson QuentinBisson merged commit e563e52 into grafana:main Oct 26, 2025
10 checks passed
```yaml
{{- end }}
livenessProbe:
{{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
```

@kppullin

I believe the readinessProbe here broke the deployment, due to the bi-directional dependency between the querier and query-frontend pods: the newly added readiness probe requires at least one querier to connect before the frontend reports 'ready'.

Scenario:

  • From a stable cluster on chart version 1.50.0, upgrade the chart to 1.51.0.
  • The query-frontend pod restarts and all connections from querier pods are dropped.
  • The query-frontend pod starts not ready, and the tempo-query-frontend k8s Service stops routing traffic to the frontend pod.
  • Querier pods are now unable to connect to the query-frontend.
  • The query-frontend readiness check requires that at least one querier pod is connected. However, as the k8s Service is "down", querier pods can never connect.
  • Since querier pods cannot connect, the query-frontend readiness probe always fails.
  • The query-frontend pod enters a restart loop, as the liveness probe also fails.
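Not discussed in the thread, but one possible stopgap at the values level would be to override the shared readiness probe with a plain TCP connect check, so the frontend no longer gates readiness on a connected querier. A sketch, assuming the `tempo.readinessProbe` key from the merged diff and an illustrative port name:

```yaml
tempo:
  # Sketch: replace the HTTP /ready check (which, on the query-frontend,
  # only passes once a querier has connected) with a TCP connect check.
  readinessProbe:
    tcpSocket:
      port: http-metrics  # port name is an assumption
    initialDelaySeconds: 10
    periodSeconds: 10
```

Note that this relaxes the check for every component consuming the shared config, which is part of why the per-component or conditional probe discussed below is the cleaner fix.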

Collaborator

Hey @kppullin, thank you for your message and the scenario. I am not available to try to fix it today, but I will work on this as soon as I can find the time.

Contributor Author

@hkailantzis hkailantzis Oct 28, 2025


Hi @kppullin, I'm curious: this scenario didn't occur for me during testing on local minikube, nor after upgrading to the latest version on a test and a production k8s cluster, with HPA enabled and 2–3 minReplicas per component respectively. Is your case triggering this restart loop with 1 pod per component?

@kppullin

In this case there are two querier pods and a single query-frontend pod. It's a nonproduction cluster, scaled down a bit, though that's not to say there aren't scenarios where multiple pods could fail or restart at the same time and defeat any HPA.

This isn't critical for us at the moment; dropping down to 1.50 works fine. I was able to consistently recreate the failure mode and recovery by flipping back and forth between 1.50 and 1.51.

Collaborator

Hey @kppullin, thank you again. I took the time to check today. Do you think that enabling the readiness probe for the query-frontend (commented, of course) only if the query-frontend replicas are > 1 would fix your issue?

If yes, would you be open to providing the change?
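A minimal sketch of what that conditional could look like in the query-frontend template. The probe rendering matches the merged diff above; the `queryFrontend.replicas` value name is an assumption:

```yaml
{{- /* Render the readiness probe only when more than one replica is requested,
       so a lone frontend pod is not deadlocked waiting for querier connections. */}}
{{- if gt (int .Values.queryFrontend.replicas) 1 }}
readinessProbe:
  {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
{{- end }}
```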

@kppullin

I briefly tried testing multiple replicas and upgrading from 1.50 to 1.51. While the frontend pods rolled out, even with one valid and active pod, the new pod with the readiness probe defined never reported ready and entered a restart loop.

If I find time I'll look further and attempt to replicate locally.



Development

Successfully merging this pull request may close these issues.

[tempo-distributed] Container is missing readiness or liveness probe
