[tempo-distributed] add missing probes #3910
Conversation
I don't think that having those components share the same probe config is a good idea; I think they should all have separate probe configs.
Hi @KyriosGN0, acked. Thanks for the feedback.
@KyriosGN0, could you please advise how I should proceed? This PR was meant to re-use the pre-existing probes as much as possible and just add the ones that are missing. See my comment above.
@hkailantzis if they are all using the same probe then yeah, it's OK I guess.
@Sheikh-Abubaker, @swartz-k, @BitProcessor or @faustodavid, could you please review? Thanks in advance!
Signed-off-by: hkailantzis <[email protected]>
Force-pushed from 8a74aa7 to 2494777
It does look fine to me.
If 2 reviewers with write access could approve this, it would be great 🙏
LGTM!
@hkailantzis can you fix the conflicts?
Signed-off-by: Sheikh-Abubaker <[email protected]>
```yaml
{{- end }}
livenessProbe:
  {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
```
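For context, the probe blocks this template consumes are plain Kubernetes probe specs shared by all components under the `tempo` scope in values.yaml. A minimal sketch of what they might look like (the path, port name and timings here are assumptions, not copied from the chart):

```yaml
tempo:
  livenessProbe:
    httpGet:
      path: /ready        # assumed endpoint; Tempo serves readiness on its HTTP port
      port: http-metrics  # assumed port name
    initialDelaySeconds: 30
    timeoutSeconds: 1
  readinessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 30
    timeoutSeconds: 1
```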
I believe the readinessProbe here broke the deployment, due to the bi-directional dependency between the querier and query-frontend pods: with this newly added readiness probe config, the frontend pod requires at least one querier pod to connect before it reports 'ready'.
Scenario:
- From a stable cluster on chart version 1.50.0, upgrade the chart from 1.50.0 to 1.51.0.
- The `query-frontend` pod restarts and all connections from `querier` pods are dropped.
- The `query-frontend` pod starts NOT ready and the `tempo-query-frontend` k8s service stops routing traffic to the frontend pod. `querier` pods are now unable to connect to the `query-frontend`.
- The `query-frontend` readiness check requires that at least one querier pod is connected. However, as the k8s service is "down", `querier` pods can never connect.
- Since `querier` pods cannot connect, the `query-frontend` readiness probe always fails.
- The `query-frontend` pod enters a restart loop, as the liveness probe also fails.
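For what it's worth, the usual Kubernetes way to break this kind of registration-vs-readiness deadlock is to let queriers discover the frontend through a headless Service with `publishNotReadyAddresses: true`, so worker registration does not depend on the frontend already being ready. A generic sketch only; the name, labels and port are illustrative and not taken from this chart's manifests:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: tempo-query-frontend-discovery   # illustrative name
spec:
  clusterIP: None                 # headless: queriers resolve pod IPs directly via DNS
  publishNotReadyAddresses: true  # keep endpoints published even while the pod is not ready
  selector:
    app.kubernetes.io/component: query-frontend
  ports:
    - name: grpc
      port: 9095        # assumed Tempo gRPC port
      targetPort: 9095
```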
Hey @kppullin, thank you for your message and the scenario. I am not available to try and fix it today, but I will work on this as soon as I can find the time.
Hi @kppullin, I'm curious: this scenario didn't occur for me during testing on local minikube, nor after upgrading to the latest version on a test and a production k8s cluster, with HPA on and 2-3 minReplicas for each component, respectively. Is your case triggering this restart loop with 1 pod per component?
In this case there are two querier pods and a single query-frontend pod. It's a non-production cluster and scaled down a bit, though that's not to say there aren't scenarios where multiple pods could fail or restart at the same time and undermine any HPA.
This isn't critical for us at the moment; dropping down to 1.50 works fine. I was able to consistently recreate the failure mode and recovery by flipping back and forth between 1.50 and 1.51.
Hey @kppullin, thank you again. I took the time to check today. Do you think that enabling the readiness probe for the query-frontend (commented, of course) only if the query-frontend replicas are > 1 would fix your issue?
If yes, would you be open to providing the change?
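A minimal sketch of what such a guard could look like in the query-frontend deployment template, assuming a `queryFrontend.replicas` value exists; illustrative only, not the actual change:

```yaml
{{- if gt (int .Values.queryFrontend.replicas) 1 }}
readinessProbe:
  {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
{{- end }}
```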
I briefly tried testing with multiple replicas and upgrading from 1.50 to 1.51. As the frontend pods rolled out, even with one valid and active pod, the new pod with the readiness probe defined never reported ready and entered a restart loop.
If I find time I'll try to look further & attempt to replicate locally.
fixes #2966
Added the liveness probe to the 'common' scope (readiness was already there) and added the relevant probes where they were missing.
Tested on a local k8s cluster (minikube) by enabling the metrics-generator and gateway components on top, to test their probes as well. All relevant components started as expected.
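For anyone reproducing the test setup, the extra components were enabled via values along these lines (a sketch; the key names are assumed to follow the chart's component layout):

```yaml
metricsGenerator:
  enabled: true
gateway:
  enabled: true
```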