Latency in publish increases after node restart #14542
Community Support Policy
RabbitMQ version used: 4.1.2
Erlang version used: 27.3.x
Operating system (distribution) used: Ubuntu
How is RabbitMQ deployed? Debian package
rabbitmq-diagnostics status output: See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics
Logs from node 1 (with sensitive values edited out): See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 2 (if applicable, with sensitive values edited out): See https://www.rabbitmq.com/docs/logging to learn how to collect logs
Logs from node 3 (if applicable, with sensitive values edited out): See https://www.rabbitmq.com/docs/logging to learn how to collect logs
rabbitmq.conf: See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location
Steps to deploy RabbitMQ cluster: We deploy via an Ansible playbook.
Steps to reproduce the behavior in question: As outlined below, during OS updates nodes are restarted and applications reconnect to another node in the cluster.
advanced.config: See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location
Application code: # PASTE CODE HERE, BETWEEN BACKTICKS
Kubernetes deployment file: # Relevant parts of K8S deployment that demonstrate how RabbitMQ is deployed # PASTE YAML HERE, BETWEEN BACKTICKS
What problem are you trying to solve? We are running a RabbitMQ cluster (3 nodes) on EC2 instances. We do OS updates on these once per month. Once the application is restarted, the latency is good again.
Replies: 2 comments 7 replies
The most obvious explanation would be a change in connection/queue (or connection/queue leader) locality. By default we use the client-local strategy, so if the application declares a classic queue, it will be hosted on the node the connection is on. If it declares a quorum queue, its leader will be on that node. When you restart the server, a quorum queue leader will likely move to another node. A classic queue won't move, but the connection will, so either way there will be an additional network hop between the queue and the connection. If you can consistently fix this by restarting the app, I guess your app declares a new queue on startup, so it's local again. If that's not the case, then I'm not sure why an app restart would change anything, except by a lucky coincidence (this time it connects to the node where the queue or its leader is).
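To make the locality argument concrete, here is a minimal Python sketch that flags connection/queue pairs that are no longer on the same node. It is only a model: the function name `find_remote_pairs`, the node names, the queue name and the connection name are all hypothetical, and in practice the three mappings would have to be filled in from whatever you use to inspect the cluster (CLI tools or the management UI).

```python
from typing import Dict, List, Tuple


def find_remote_pairs(
    queue_host: Dict[str, str],
    connection_node: Dict[str, str],
    connection_queues: Dict[str, List[str]],
) -> List[Tuple[str, str, str, str]]:
    """Return (connection, queue, connection node, queue node) for every
    pair where the queue (or its leader) lives on a different node than
    the connection, i.e. where there is an extra network hop."""
    remote = []
    for conn, queues in connection_queues.items():
        conn_node = connection_node[conn]
        for queue in queues:
            host = queue_host[queue]
            if host != conn_node:
                remote.append((conn, queue, conn_node, host))
    return remote


# Hypothetical state before the restart: the app and its queue share node1.
before = find_remote_pairs(
    {"orders": "rabbit@node1"},
    {"app-1": "rabbit@node1"},
    {"app-1": ["orders"]},
)

# Hypothetical state after node1 restarts: the app reconnected to node2
# and the quorum queue leader moved to node3.
after = find_remote_pairs(
    {"orders": "rabbit@node3"},
    {"app-1": "rabbit@node2"},
    {"app-1": ["orders"]},
)

print(before)  # []
print(after)   # [('app-1', 'orders', 'rabbit@node2', 'rabbit@node3')]
```

An empty result means every publish is handled on the node the client is connected to. Any entry in the result is a path with one more hop than before the restart.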
As already mentioned by @kura, a node restart likely changes the queue leader distribution across cluster nodes. Given a constant amount of resources available to a node, that can affect latency. So can an extra network hop. We cannot suggest much else from a three-sentence description without any workload or reproduction details or metrics, including the many advanced metrics exposed via Prometheus and Grafana.
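As an illustration of the kind of client-side numbers that would help here, the following Python sketch times a publish call and summarizes the latencies as percentiles. It assumes you supply your own `publish_and_wait` callable (a hypothetical name) that publishes one message and blocks until the broker confirms it. The `time.sleep` stand-in below only keeps the example runnable without a broker.

```python
import math
import time
from typing import Callable, Dict, List


def percentile(samples: List[float], pct: float) -> float:
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(publish_and_wait: Callable[[], None], count: int) -> List[float]:
    """Call publish_and_wait `count` times; return latencies in milliseconds."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        publish_and_wait()
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def summarize(samples: List[float]) -> Dict[str, float]:
    """Reduce raw latencies to a few numbers worth comparing across runs."""
    return {
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in for a real publish-with-confirm call (assumption, not real client code).
    fake_publish = lambda: time.sleep(0.001)
    print(summarize(measure(fake_publish, 50)))
```

Running the same measurement before the OS update, right after the node restart, and again after the application restart would give three summaries that can be compared directly and shared alongside the server-side metrics.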
I don't see anything obviously wrong here, although there are quite a few errors in the logs you should look into (TLS errors, missed heartbeats). I think the latency issue is more client-side. I can think of two ways to debug further: