
Ever increasing memory usage and nats_stream_total_messages resulting in OOMKill [v2.10.16] #6302

Open
BrentChesny opened this issue Dec 26, 2024 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments


BrentChesny commented Dec 26, 2024

Observed behavior

We are running a 3-node NATS cluster on Kubernetes. One issue we have been struggling with quite a bit recently is that some NATS pods get OOMKilled, and it is not clear why this happens or how we could resolve it. These are some of our observations from the last time it happened.

It seems to have started with one of our NATS consumers receiving a nats: no heartbeat received error.
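
For context, the affected consumers run a fetch loop roughly like the sketch below. This is a minimal pull-consumer example using the legacy nats.go JetStream API; the subject, durable name, and the string match on the heartbeat error are illustrative assumptions rather than our exact code.

```go
package main

import (
	"errors"
	"log"
	"strings"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer; subject and durable name are placeholders.
	sub, err := js.PullSubscribe("orders.>", "orders-worker")
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
		if errors.Is(err, nats.ErrTimeout) {
			continue // no messages available within the wait window
		}
		if err != nil {
			// This is where "nats: no heartbeat received" shows up for us;
			// matching on the message text is an assumption, as the exact
			// sentinel error may differ between client versions.
			if strings.Contains(err.Error(), "no heartbeat received") {
				log.Printf("heartbeat lost, retrying: %v", err)
			} else {
				log.Printf("fetch error: %v", err)
			}
			time.Sleep(time.Second)
			continue
		}
		for _, m := range msgs {
			// ... process the message ...
			if err := m.Ack(); err != nil {
				log.Printf("ack failed: %v", err)
			}
		}
	}
}
```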

At this point we noticed that the nats_stream_total_messages metric started rising, whereas it is normally stable at around 10-20k messages.
[Screenshot from 2024-12-26 10:29: nats_stream_total_messages steadily rising]

After a while we also noticed that the First Sequence number of the stream was stuck at a value from around the time the no heartbeat received error was seen. Stream info:
[Screenshot from 2024-12-24 15:38: stream info output]

In addition, the Ack Floor count of the consumer that experienced the error seemed stuck very close to this First Sequence number:
[Screenshot from 2024-12-24 15:38: consumer info showing the Ack Floor]

We don't quite understand why this is happening, as the nats_consumer_num_pending metric did not show a large number of pending messages for any of our consumers.
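
One thing we want to rule out is a consumer pinning retention: with interest or work-queue retention, messages are only removed once every bound consumer has acknowledged them, so a consumer whose Ack Floor stops advancing would pin the First Sequence. The sketch below is roughly how we compare the stream state against each consumer's ack floor (legacy nats.go API; the stream name is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	const stream = "ORDERS" // placeholder for our stream name

	si, err := js.StreamInfo(stream)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stream %s: msgs=%d first_seq=%d last_seq=%d\n",
		stream, si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)

	// Compare every consumer's ack floor against the stream's first
	// sequence; an ack floor stuck just below first_seq points at the
	// consumer that is holding messages back.
	for ci := range js.ConsumersInfo(stream) {
		fmt.Printf("consumer %s: ack_floor(stream)=%d ack_pending=%d num_pending=%d\n",
			ci.Name, ci.AckFloor.Stream, ci.NumAckPending, ci.NumPending)
	}
}
```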

In an attempt to resolve this, we tried purging the stream. While that did clear the messages out of the stream, they immediately started building up again. Restarting the NATS pods did not help either.
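
For reference, the purge we performed is roughly equivalent to the following client call (a sketch with a placeholder stream name):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Drop all messages from the stream ("ORDERS" is a placeholder name).
	if err := js.PurgeStream("ORDERS"); err != nil {
		log.Fatal(err)
	}

	// Check the state right afterwards; in our case Msgs dropped to zero
	// but started climbing again almost immediately.
	si, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("after purge: msgs=%d first_seq=%d last_seq=%d",
		si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)
}
```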

In the end, the only thing that helped stabilize the system again was to:

  • Remove the persistent volumes that are used as file storage
  • Restart the NATS pods
  • Recreate the streams and consumers (see the sketch after this list)
  • Reconnect client applications that depend on NATS
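
The stream and consumer recreation step can be scripted roughly as in the sketch below (legacy nats.go API; the names, retention settings, and limits are placeholders, not our real production configuration):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Recreate the stream; name, subjects, storage, and limits below are
	// placeholders, not our exact production configuration.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
		MaxAge:   24 * time.Hour,
	}); err != nil {
		log.Fatal(err)
	}

	// Recreate the durable consumer that the client applications attach to.
	if _, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
		Durable:       "orders-worker",
		AckPolicy:     nats.AckExplicitPolicy,
		DeliverPolicy: nats.DeliverAllPolicy,
		MaxAckPending: 1000,
	}); err != nil {
		log.Fatal(err)
	}

	log.Println("stream and consumer recreated")
}
```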

Expected behavior

NATS should not keep messages in memory unnecessarily; doing so leads to pods being OOMKilled and to an unstable environment.

If there is a good reason for this behavior, we would appreciate help figuring out what it is and how we can prevent it from happening, as it seriously affects the stability of our system at this point.

Server and client version

server - 2.10.16

Host environment

3-node Kubernetes cluster on GCP (GKE)
4 CPU/16GB RAM each

Steps to reproduce

We have not been able to reproduce this issue reliably.

@BrentChesny added the defect label on Dec 26, 2024
@AllenZMC commented:

Server 2.10.24 has the same problem.
