Latency in publish increases after node restart #14542

noelmcgrath · 2025-09-15T06:57:56Z

noelmcgrath
Sep 15, 2025

Community Support Policy

I have read RabbitMQ's Community Support Policy
I run RabbitMQ 4.x, the only series currently covered by community support
I promise to provide all relevant information (versions, logs from all nodes, rabbitmq-diagnostics output, detailed reproduction steps)

RabbitMQ version used

4.1.2

Erlang version used

27.3.x

Operating system (distribution) used

Ubuntu

How is RabbitMQ deployed?

Debian package

rabbitmq-diagnostics status output

See https://www.rabbitmq.com/docs/cli to learn how to use rabbitmq-diagnostics

# PASTE OUTPUT HERE, BETWEEN BACKTICKS

Logs from node 1 (with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

Logs from node 2 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

Logs from node 3 (if applicable, with sensitive values edited out)

See https://www.rabbitmq.com/docs/logging to learn how to collect logs

# PASTE LOG HERE, BETWEEN BACKTICKS

rabbitmq.conf

See https://www.rabbitmq.com/docs/configure#config-location to learn how to find rabbitmq.conf file location

# PASTE rabbitmq.conf HERE, BETWEEN BACKTICKS

Steps to deploy RabbitMQ cluster

We deploy via an Ansible playbook

Steps to reproduce the behavior in question

As outlined below: during OS updates the nodes are rebooted one at a time, and applications reconnect to another node in the cluster

advanced.config

See https://www.rabbitmq.com/docs/configure#config-location to learn how to find advanced.config file location

# PASTE advanced.config HERE, BETWEEN BACKTICKS

Application code

# PASTE CODE HERE, BETWEEN BACKTICKS

Kubernetes deployment file

# Relevant parts of K8S deployment that demonstrate how RabbitMQ is deployed
# PASTE YAML HERE, BETWEEN BACKTICKS

What problem are you trying to solve?

We are running a RabbitMQ cluster (3 nodes) on EC2 instances. We do OS updates on these once per month.
The process is done one node at a time: put the node into maintenance, then do the updates (generally there is a reboot).
We have automatic recovery, so if an application is connected to node 1 it reconnects to another node.
What we are seeing is that publish latency increases, and it remains this way until we restart the applications.

Once restarted the latency is good again.
Any idea what the issue is here?

mkuratczyk · 2025-09-15T08:45:57Z

mkuratczyk
Sep 15, 2025
Maintainer

What queue type?
What are the before/after latencies? What kind of difference are we talking about?
And how consistent is this behaviour?

The most obvious explanation would be a change in connection-queue/queue leader locality. By default we use the client-local strategy, so if the application declares a classic queue, it'll be hosted by the same node where the connection is. If it declares a quorum queue, its leader will be where the connection is. When you restart the server, a quorum queue leader location will likely change. A classic queue won't move, but the connection will, so either way - there'll be an additional network hop between the queue and the connection. If you can consistently fix this by restarting the app, I guess your app declares a new queue on startup, so it's local again. If that's not the case, then I'm not sure why an app restart would change anything, except by a lucky coincidence (this time it connects to where the queue/leader is).
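To check whether leader locality has changed after a restart, one could compare each quorum queue's leader node with the node the application is connected to. The following is a minimal sketch, not an official tool: it assumes queue records shaped like dictionaries with `name`, `type` and `leader` keys (similar to what the management HTTP API reports for queues, but verify the field names against your version). The sample data and node names are hypothetical.

```python
from typing import Iterable


def queues_with_remote_leader(queues: Iterable[dict], connection_node: str) -> list[str]:
    """Return names of quorum queues whose leader is NOT on connection_node.

    Every publish to such a queue needs an extra hop between cluster nodes.
    """
    remote = []
    for q in queues:
        if q.get("type") != "quorum":
            continue  # only quorum queues have a Raft leader
        leader = q.get("leader")
        if leader is not None and leader != connection_node:
            remote.append(q["name"])
    return remote


# Hypothetical sample data; the key names are an assumption about the record shape.
sample = [
    {"name": "orders", "type": "quorum", "leader": "rabbit@node1"},
    {"name": "events", "type": "quorum", "leader": "rabbit@node2"},
    {"name": "legacy", "type": "classic", "node": "rabbit@node1"},
]

print(queues_with_remote_leader(sample, "rabbit@node1"))
```

Running this before the maintenance window and again after it would show whether the set of queues with a remote leader grew for a given connection.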

5 replies

noelmcgrath Sep 15, 2025
Author

**What queue type?**
All queues are quorum

**What are the before/after latencies? What kind of difference are we talking about?**
See image, the differences are significant, and only after an app restart does latency go back down
We have a number of .NET applications and they all behave the same

**And how consistent is this behaviour?**
Happens every time

The apps declare the queues, but this is idempotent: if a queue is already there it is not created again, and they are all already there.

kjnilsson Sep 15, 2025
Maintainer

Please can you share log files at debug level with us capturing before, during and after the latency issues occur.

How many quorum queues are in your system?

kjnilsson Sep 15, 2025
Maintainer

@noelmcgrath do you perchance create a new channel for every message? How exactly is the latency measured?

noelmcgrath Sep 16, 2025
Author

@kjnilsson no, we don't create new channels or connections per message, only on startup. Latency is measured via .NET OpenTelemetry around the call to channel.BasicPublishAsync

kjnilsson Sep 16, 2025
Maintainer

Measuring around channel.BasicPublishAsync will only tell you how long it took to make that particular call and send the message on the wire, not how long it takes for the message to be accepted by the broker. You need to wait for the publisher confirmation (you are using publisher confirms, right?) to get the true publishing latency.

This sounds more like a client issue if channel.BasicPublishAsync is what takes up to 1 second. Which version of the client are you using?
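The difference between the two measurements can be illustrated with a small simulation. This is only a sketch: `FakeChannel` is a hypothetical stand-in and not any real client API, and the 50 ms confirm delay is an arbitrary value chosen for the demonstration. It shows that timing only the publish call captures the hand-off to the client, while timing until the confirmation arrives captures the round trip to the broker.

```python
import asyncio
import time


class FakeChannel:
    """Hypothetical stand-in for a client channel; NOT a real RabbitMQ API."""

    def __init__(self, confirm_delay: float):
        self.confirm_delay = confirm_delay  # simulated broker round trip, seconds

    async def publish(self, body: bytes) -> asyncio.Future:
        # Simulates writing the frame to the socket: returns almost immediately.
        loop = asyncio.get_running_loop()
        confirmed = loop.create_future()
        # The simulated broker confirms the message some time later.
        loop.call_later(self.confirm_delay, confirmed.set_result, True)
        return confirmed


async def measure(channel: FakeChannel) -> tuple[float, float]:
    start = time.perf_counter()
    confirmed = await channel.publish(b"hello")
    send_latency = time.perf_counter() - start      # what timing the call alone sees
    await confirmed
    confirm_latency = time.perf_counter() - start   # time until the broker accepted it
    return send_latency, confirm_latency


send_s, confirm_s = asyncio.run(measure(FakeChannel(confirm_delay=0.05)))
print(f"send call: {send_s * 1000:.2f} ms, until confirm: {confirm_s * 1000:.2f} ms")
```

If the first number is the one that grows after a node restart, the time is being spent inside the client before the message reaches the network, which points to the client side rather than the broker.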

michaelklishin · 2025-09-15T14:17:34Z

michaelklishin
Sep 15, 2025
Maintainer

As already mentioned by @kura, a node restart likely changes the queue leader distribution across cluster nodes. Given a constant number of resources available to this node, that can affect latency, as can the extra network hop between the connection and the queue leader.

We cannot suggest much else with a three-sentence-long description without any workload/reproduction details or metrics, including multiple advanced metrics exposed via Prometheus and Grafana.

2 replies

noelmcgrath Sep 16, 2025
Author

Attached are the log files at debug level for each node in the cluster

These are the metrics
RabbitMQ-Overview
Erlang-Distribution
Erlang-Memory-Allocators
RabbitMQ-Quorum-Queues-Raft

The use case was putting each node into maintenance, rebooting it, and then repeating with the next node once the previous one was back in the cluster.
It is after we do this that we see an increase in latency when publishing our events.
The only way we can get the latency back down is by restarting the applications.

nonprodmq0.ccsnonprod.net.log
nonprodmq1.ccsnonprod.net.log
nonprodmq2.ccsnonprod.net.log

mkuratczyk Sep 16, 2025
Maintainer

I don't see anything obviously wrong here, although there are quite a few errors in the logs you should look into (TLS errors, missed heartbeats). I think the latency issue is more client-side. I can think of two ways to debug further:

Capture the network traffic with tcpdump. Perhaps the latency is not really between the moment the message is published and confirmed, but rather between when your app thinks it's published and when it is actually published on the network.
Write a simple app that reproduces the issue that you could share with us (based on what you are saying, it should be basically a tutorial-level app which just connects and publishes messages, while measuring latency).
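A reproduction app along the lines of the second suggestion mostly needs a loop that publishes, waits, and records timings. Below is a minimal sketch of such a harness. `publish_and_wait` is a hypothetical placeholder: in a real test it would be replaced with the client library's call that publishes one message and blocks until the broker confirms it. Here a short sleep stands in for it so the harness runs on its own.

```python
import statistics
import time
from typing import Callable


def measure_publish_latency(
    publish_and_wait: Callable[[bytes], None], count: int = 100
) -> list[float]:
    """Call publish_and_wait `count` times and return per-call latency in ms."""
    latencies_ms = []
    for i in range(count):
        body = f"msg-{i}".encode()
        start = time.perf_counter()
        publish_and_wait(body)  # placeholder: publish and wait for the confirm
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: list[float]) -> dict:
    """Summarize latencies so before/after-restart runs can be compared."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "count": len(latencies_ms),
        "p50": statistics.median(latencies_ms),
        "p99": cuts[98],  # 99th percentile cut point
        "max": max(latencies_ms),
    }


def fake_publish(body: bytes) -> None:
    # Stand-in for a real publish + confirm; sleeps roughly 1 ms.
    time.sleep(0.001)


print(summarize(measure_publish_latency(fake_publish, count=20)))
```

Running the same harness before the rolling restart, after it, and again after an application restart would give comparable numbers to share alongside a tcpdump capture.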

Answer selected by michaelklishin

Replies: 2 comments 7 replies
