charts/tempo-distributed/Chart.yaml (1 addition, 1 deletion)
@@ -2,7 +2,7 @@ apiVersion: v2
name: tempo-distributed
description: Grafana Tempo in MicroService mode
type: application
- version: 1.50.0
+ version: 1.51.0
appVersion: 2.9.0
engine: gotpl
home: https://grafana.com/docs/tempo/latest/
charts/tempo-distributed/README.md (9 additions, 1 deletion)
@@ -1,6 +1,6 @@
# tempo-distributed

- ![Version: 1.50.0](https://img.shields.io/badge/Version-1.50.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.9.0](https://img.shields.io/badge/AppVersion-2.9.0-informational?style=flat-square)
+ ![Version: 1.51.0](https://img.shields.io/badge/Version-1.51.0-informational?style=flat-square) ![Type: application](https://img.shields.io/badge/Type-application-informational?style=flat-square) ![AppVersion: 2.9.0](https://img.shields.io/badge/AppVersion-2.9.0-informational?style=flat-square)

Grafana Tempo in MicroService mode

@@ -532,6 +532,10 @@ The memcached default args are removed and should be provided manually. The sett
| gateway.ingress.hosts | list | `[{"host":"gateway.tempo.example.com","paths":[{"path":"/"}]}]` | Hosts configuration for the gateway ingress |
| gateway.ingress.labels | object | `{}` | Labels for the gateway ingress |
| gateway.ingress.tls | list | `[{"hosts":["gateway.tempo.example.com"],"secretName":"tempo-gateway-tls"}]` | TLS configuration for the gateway ingress |
+ | gateway.livenessProbe.httpGet.path | string | `"/"` | |
+ | gateway.livenessProbe.httpGet.port | string | `"http-metrics"` | |
+ | gateway.livenessProbe.initialDelaySeconds | int | `30` | |
+ | gateway.livenessProbe.timeoutSeconds | int | `5` | |
| gateway.maxUnavailable | int | `1` | Pod Disruption Budget maxUnavailable |
| gateway.minReadySeconds | int | `10` | Minimum number of seconds for which a newly created Pod should be ready without any of its containers crashing/terminating |
| gateway.nginxConfig.file | string | See values.yaml | Config file contents for Nginx. Passed through the `tpl` function to allow templating |
@@ -967,6 +971,10 @@ The memcached default args are removed and should be provided manually. The sett
| tempo.image.registry | string | `"docker.io"` | The Docker registry |
| tempo.image.repository | string | `"grafana/tempo"` | Docker image repository |
| tempo.image.tag | string | `nil` | Overrides the image tag whose default is the chart's appVersion |
+ | tempo.livenessProbe.httpGet.path | string | `"/ready"` | |
+ | tempo.livenessProbe.httpGet.port | string | `"http-metrics"` | |
+ | tempo.livenessProbe.initialDelaySeconds | int | `60` | |
+ | tempo.livenessProbe.timeoutSeconds | int | `5` | |
| tempo.memberlist | object | `{"appProtocol":null,"service":{"annotations":{}}}` | Memberlist service configuration. |
| tempo.memberlist.appProtocol | string | `nil` | Adds the appProtocol field to the memberlist service. This allows memberlist to work with istio protocol selection. Set the optional service protocol. Ex: "tcp", "http" or "https". |
| tempo.memberlist.service | object | `{"annotations":{}}` | Adds the service field to the memberlist service |
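These probe parameters can be overridden like any other chart value. Since Helm deep-merges user-supplied values over the chart defaults, a minimal override can change a single field; a hypothetical sketch (the timing value is arbitrary, not a recommendation):

# my-values.yaml (hypothetical)
gateway:
  livenessProbe:
    initialDelaySeconds: 60   # example only: raise from the chart default of 30

applied with something like `helm upgrade <release> grafana/tempo-distributed -f my-values.yaml`.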
(compactor deployment template, file path not captured)
@@ -95,6 +95,10 @@ spec:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
+ readinessProbe:
+ {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
resources:
{{- toYaml .Values.compactor.resources | nindent 12 }}
{{- with .Values.tempo.securityContext }}
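Given the defaults added to values.yaml (the last file in this diff), the liveness line above should render into the container spec roughly as follows; a sketch of the expected output for the liveness probe only:

livenessProbe:
  httpGet:
    path: /ready
    port: http-metrics
  initialDelaySeconds: 60
  timeoutSeconds: 5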
(deployment template, file path not captured)
@@ -138,6 +138,8 @@ spec:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
resources:
(gateway deployment template, file path not captured)
@@ -85,6 +85,8 @@ spec:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.gateway.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.gateway.readinessProbe | nindent 12 }}
volumeMounts:
(deployment template, file path not captured)
@@ -110,6 +110,8 @@ spec:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
resources:
(deployment template, file path not captured)
@@ -82,6 +82,8 @@ spec:
envFrom:
{{- toYaml . | nindent 12 }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
readinessProbe:
{{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
resources:
(deployment template, file path not captured)
@@ -102,6 +102,10 @@ spec:
securityContext:
{{- toYaml . | nindent 12 }}
{{- end }}
+ {{- with .Values.tempo.livenessProbe }}
+ livenessProbe:
+ {{- toYaml . | nindent 12 }}
+ {{- end }}
{{- with .Values.tempo.readinessProbe }}
readinessProbe:
{{- toYaml . | nindent 12 }}
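Unlike the templates above, this one wraps the probes in `{{- with }}`, so an empty value omits the block entirely (Go templates treat a zero-length map as false). A hypothetical override illustrating the difference:

tempo:
  livenessProbe: {}   # empty map is falsy for `with`, so this template skips the block

Note that the unguarded `toYaml` templates elsewhere in the chart would still emit `livenessProbe:` with an empty body for the same override, which would likely fail Kubernetes validation.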
(query-frontend deployment template, file path not captured)
@@ -92,6 +92,10 @@ spec:
{{- toYaml . | nindent 12 }}
{{- end }}
{{- end }}
+ livenessProbe:
+ {{- toYaml .Values.tempo.livenessProbe | nindent 12 }}
+ readinessProbe:

[Review thread attached to this line]

@kppullin:
I believe the readinessProbe here broke the deployment, due to the bidirectional dependency between the querier and query-frontend pods: with this newly added readiness probe config, the frontend pod requires at least one querier pod to be connected before it reports 'ready'.

Scenario:

  • From a stable cluster on chart version 1.50.0, upgrade the chart from 1.50.0 to 1.51.0.
  • The query-frontend pod restarts and all connections from querier pods are dropped.
  • The query-frontend pod starts NOT ready and the tempo-query-frontend k8s service stops routing traffic to the frontend pod.
  • querier pods are now unable to connect to the query-frontend.
  • The query-frontend readiness check requires that at least one querier pod is connected. However, since the k8s service is "down", querier pods can never connect.
  • Since querier pods cannot connect, the query-frontend readiness probe always fails.
  • The query-frontend pod enters a restart loop, as the liveness probe also fails.
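For context, a common way to break this kind of readiness/discovery deadlock in Kubernetes is a headless discovery Service with `publishNotReadyAddresses: true`, so peers can reach a pod before it reports ready. A hand-written sketch, not taken from this chart (the name and selector are illustrative):

apiVersion: v1
kind: Service
metadata:
  name: query-frontend-discovery   # illustrative name
spec:
  clusterIP: None                  # headless: DNS resolves directly to pod IPs
  publishNotReadyAddresses: true   # include endpoints for not-yet-ready pods
  selector:
    app.kubernetes.io/component: query-frontend
  ports:
    - name: grpc
      port: 9095                   # Tempo's default gRPC port
      targetPort: 9095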

Collaborator:

Hey @kppullin, thank you for your message and the scenario. I am not available to try and fix it today, but I will work on this as soon as I can find the time.

@hkailantzis (Contributor, Author), Oct 28, 2025:

Hi @kppullin, I'm curious: this scenario didn't occur for me during testing on a local minikube, nor after upgrading to the latest version on test and production k8s clusters, with HPA on and 2-3 minReplicas for each component. Is your case triggering this restart loop with 1 pod for each component?

@kppullin:

In this case there are two querier pods and a single query-frontend pod. It's a nonproduction cluster, scaled down a bit, though that's not to say there aren't scenarios where multiple pods could fail or restart at the same time and defeat any HPA.

This isn't critical for us at the moment; dropping down to 1.50 works fine. I was able to consistently recreate the failure mode and recovery by flipping back and forth between 1.50 and 1.51.

Collaborator:

Hey @kppullin, thank you again. I took the time to check today. Do you think that enabling the readiness probe for the query-frontend only when the query-frontend replica count is > 1 (with an explanatory comment, of course) would fix your issue?

If yes, would you be open to providing the change?
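A sketch of what that guard might look like in the query-frontend template, assuming the replica count is exposed as `.Values.queryFrontend.replicas` (untested, purely to illustrate the suggestion):

{{- if gt (int .Values.queryFrontend.replicas) 1 }}
{{/* Only gate readiness on querier connections when other replicas can keep serving */}}
readinessProbe:
  {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
{{- end }}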

@kppullin:

I briefly tried testing multiple replicas and upgrading from 1.50 to 1.51. While the frontend pods rolled out, even with one valid and active pod remaining, the new pod with the readiness probe defined never reported ready and entered a restart loop.

If I find time I'll try to look further & attempt to replicate locally.

[End of review thread]

+ {{- toYaml .Values.tempo.readinessProbe | nindent 12 }}
resources:
{{- toYaml .Values.queryFrontend.resources | nindent 12 }}
{{- with .Values.tempo.securityContext }}
charts/tempo-distributed/values.yaml (13 additions)
@@ -64,6 +64,12 @@ tempo:
# -- Overrides the image tag whose default is the chart's appVersion
tag: null
pullPolicy: IfNotPresent
+ livenessProbe:
+ httpGet:
+ path: /ready
+ port: http-metrics
+ initialDelaySeconds: 60
+ timeoutSeconds: 5
readinessProbe:
httpGet:
path: /ready
@@ -2102,6 +2108,13 @@ gateway:
{{ htpasswd (required "'gateway.basicAuth.username' is required" .Values.gateway.basicAuth.username) (required "'gateway.basicAuth.password' is required" .Values.gateway.basicAuth.password) }}
# -- Existing basic auth secret to use. Must contain '.htpasswd'
existingSecret: null
+ # Configures the liveness probe for the gateway
+ livenessProbe:
+ httpGet:
+ path: /
+ port: http-metrics
+ initialDelaySeconds: 30
+ timeoutSeconds: 5
# Configures the readiness probe for the gateway
readinessProbe:
httpGet:
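If the new defaults are too aggressive for slow-starting components, they can be tuned per deployment through a values override; a hypothetical example (timings are arbitrary, not recommendations):

# overrides.yaml (hypothetical)
tempo:
  livenessProbe:
    httpGet:
      path: /ready
      port: http-metrics
    initialDelaySeconds: 120   # example only: more startup headroom than the 60s default
    timeoutSeconds: 10

applied with `helm upgrade <release> grafana/tempo-distributed -f overrides.yaml`.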