
Conversation

@hkailantzis
Contributor

fixes #2966

  • Added the liveness probe on the 'common' scope (readiness was already there) and added the relevant probes where they were missing (a sketch of the shared config follows this list).

  • Tested on a local k8s cluster (minikube), additionally enabling the metrics-generator and gateway components to exercise their probes as well. All relevant components started as expected.
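A minimal sketch of what the shared probe definition in values.yaml might look like. The `tempo.livenessProbe` / `tempo.readinessProbe` keys match the merged template excerpt further down; the endpoint, port name, and timing values here are assumptions:

```yaml
tempo:
  # Shared by every component that renders its probes from the common scope.
  livenessProbe:
    httpGet:
      path: /ready        # Tempo components expose a readiness endpoint on /ready
      port: http-metrics  # port name is an assumption
    initialDelaySeconds: 30
    timeoutSeconds: 1
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 30
    timeoutSeconds: 1
```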

@hkailantzis hkailantzis changed the title add missing probes [tempo-distributed] add missing probes Sep 22, 2025
@KyriosGN0
Contributor

I don't think having those components share the same probe config is a good idea; I think they should each have separate probe configs.

@hkailantzis
Contributor Author

hkailantzis commented Sep 25, 2025

Hi @KyriosGN0, acked, thanks for the feedback.
For the liveness probe, however, I just followed the pre-existing readiness probe pattern that was already applied to most of the components. Apart from query-frontend and compactor, which didn't have probes at all, the rest of the components were already using the common/global readiness probe, which was defined a long time ago (the gateway uses its own probe config).
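For context, a minimal sketch of the two patterns described here. Only `tempo.readinessProbe` is confirmed by this PR's diff; the `gateway.readinessProbe` key and all probe values are assumptions based on the description:

```yaml
# Most components render the shared probe from the common scope:
#   readinessProbe:
#     {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
# The gateway instead carries its own probe config, e.g.:
gateway:
  readinessProbe:
    httpGet:
      path: /          # the nginx gateway serves plain HTTP at the root
      port: http       # port name is an assumption
    initialDelaySeconds: 15
    timeoutSeconds: 1
```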

@hkailantzis
Contributor Author

hkailantzis commented Sep 29, 2025

@KyriosGN0, could you please advise how I should proceed? This PR was meant to reuse the pre-existing probes as much as possible and only add the missing ones. See my comment above.

@KyriosGN0
Contributor

@hkailantzis if they are all using the same probe then yeah, it's OK I guess.
Do note that I'm not a maintainer for this chart, I'm just a user who contributes.

@hkailantzis
Contributor Author

@Sheikh-Abubaker, @swartz-k, @BitProcessor, or @faustodavid could you please review? Thanks in advance!

Signed-off-by: hkailantzis <[email protected]>
@QuentinBisson
Collaborator

It does look fine to me

@hkailantzis
Contributor Author

hkailantzis commented Oct 24, 2025

If 2 reviewers with write access could approve this, it would be great 🙏

@Sheikh-Abubaker Sheikh-Abubaker left a comment
Collaborator

LGTM!

@QuentinBisson
Collaborator

@hkailantzis can you fix the conflicts?

@QuentinBisson QuentinBisson merged commit e563e52 into grafana:main Oct 26, 2025
10 checks passed
```yaml
{{- end }}
livenessProbe:
{{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
```

@kppullin

I believe the readinessProbe here broke the deployment, due to the bi-directional dependency between the querier and query-frontend pods: the newly added readiness probe requires at least one querier to connect before the frontend reports 'ready'.

Scenario:

  • From a stable cluster on chart version 1.50.0, upgrade the chart to 1.51.0.
  • The query-frontend pod restarts and all connections from querier pods are dropped.
  • The query-frontend pod starts not ready, and the tempo-query-frontend k8s Service stops routing traffic to the frontend pod.
  • Querier pods are now unable to connect to the query-frontend.
  • The query-frontend readiness check requires that at least one querier pod is connected. However, as the k8s Service is "down", querier pods can never connect.
  • Since querier pods cannot connect, the query-frontend readiness probe always fails.
  • The query-frontend pod enters a restart loop, as the liveness probe also fails.
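Not discussed in the thread, but one possible stopgap at the values level would be to override the shared readiness probe with a plain TCP connect check, so the frontend no longer gates readiness on a connected querier. A sketch, assuming the `tempo.readinessProbe` key from the merged diff and an illustrative port name:

```yaml
tempo:
  # Sketch: replace the HTTP /ready check (which, on the query-frontend,
  # only passes once a querier has connected) with a TCP connect check.
  readinessProbe:
    tcpSocket:
      port: http-metrics  # port name is an assumption
    initialDelaySeconds: 10
    periodSeconds: 10
```

Note that this relaxes the check for every component consuming the shared config, which is part of why the per-component or conditional probe discussed below is the cleaner fix.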

Collaborator

Hey @kppullin, thank you for your message and the scenario. I am not available to try to fix it today, but I will work on this as soon as I can find the time.

Contributor Author

@hkailantzis hkailantzis Oct 28, 2025


Hi @kppullin, I'm curious: this scenario didn't occur for me during testing on local minikube, nor after upgrading to the latest version on a test and a production k8s cluster, with HPA enabled and 2–3 minReplicas per component respectively. Is your case triggering this restart loop with 1 pod per component?

@kppullin

In this case there are two querier pods and a single query-frontend pod. It's a nonproduction cluster, scaled down a bit, though that's not to say there aren't scenarios where multiple pods could fail or restart at the same time and defeat any HPA.

This isn't critical for us at the moment; dropping down to 1.50 works fine. I was able to consistently recreate the failure mode and recovery by flipping back and forth between 1.50 and 1.51.

Collaborator

Hey @kppullin, thank you again. I took the time to check today. Do you think that enabling the readiness probe for the query-frontend (commented, of course) only if the query-frontend replicas are > 1 would fix your issue?

If yes, would you be open to providing the change?
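A minimal sketch of what that conditional could look like in the query-frontend template. The probe rendering matches the merged diff above; the `queryFrontend.replicas` value name is an assumption:

```yaml
{{- /* Render the readiness probe only when more than one replica is requested,
       so a lone frontend pod is not deadlocked waiting for querier connections. */}}
{{- if gt (int .Values.queryFrontend.replicas) 1 }}
readinessProbe:
  {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
{{- end }}
```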

@kppullin

I briefly tried testing multiple replicas and upgrading from 1.50 to 1.51. While the frontend pods rolled out, even with one valid and active pod, the new pod with the readiness probe defined never reported ready and entered a restart loop.

If I find time I'll look further and attempt to replicate locally.



Development

Successfully merging this pull request may close these issues.

[tempo-distributed] Container is missing readiness or liveness probe
