-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[tempo-distributed] add missing probes #3910
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the
readinessProbehere broke the deployment, due to bi-directional dependencies of thequerierandquery-frontendpods, and this newly added readiness probe config whereby the frontend pod requires at least one querier pod to connect prior to reporting 'ready'.Scenario:
query-frontendpod restarts and all connections fromquerierpods are dropped.query-frontendpod starts NOT ready and thetempo-query-frontendk8s service stops routing traffic to the frontend pod.querierpods are now unable to connect to thequery-frontend.query-frontendreadiness check requires that there is at least one querier pod is connected. However, as the k8s service is "down",querierpods can never connect.querierpods cannot connect, thequery-frontendreadiness probe always fails.query-frontendpod enters a restart loop, as the liveness probe also fails.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kppullin thank you for your message and the scenario. I am not available to try and fix it today but I will try to work on this as soon as I can find the time
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @kppullin, I'm curious, since this scenario didn't occur for me during testing on local minikube nor even after upgrading to latest version on a test and production k8s cluster, with HPA on and 2 - 3 minReplicas enabled for each component, respectively. Is your case triggering this restart-loop, using 1 pod for each component ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case there are two querier pods and a single query-frontend pod. It's a nonproduction cluster and scaled down a bit, though not to say there aren't scenarios where multiple pods could simultaneously fail or restart at the same time and violate any HPA.
This isn't critical for us at the moment; dropping down to 1.50 works fine. I was able to consistently recreate the failure mode and recovery by flipping back and forth between 1.50 and 1.51.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @kppullin, thank you again. I took the time to check today. Do you think that enabling the readiness probe for the query front-end (commented of course) only if the query front-end replicas are > 1 would fix your issue?
If yes, would you be open to providing the change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly tried testing out multiple replicas and upgrading from 1.50 to 1.51. While the frontend pods rolled out, even with one valid and active pod, the new pod with the readiness probe defined never reported ready and entered a restart loop.
If I find time I'll try to look further & attempt to replicate locally.