
Ever increasing memory usage and nats_stream_total_messages resulting in OOMKill [v2.10.16] #6302

Open
BrentChesny opened this issue Dec 26, 2024 · 1 comment
Labels
defect Suspected defect such as a bug or regression

Comments


BrentChesny commented Dec 26, 2024

Observed behavior

We are running a 3-node NATS cluster on Kubernetes. One issue we have been struggling with quite a bit recently is that some NATS pods get OOMKilled, and it is not clear why this happens or how we could resolve it. These are some of our observations from the last time it happened.

It seems to have started with one of our NATS consumers receiving a nats: no heartbeat received error.
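
For context, the affected consumers run a fetch loop roughly like the sketch below. This is a minimal pull-consumer example using the legacy nats.go JetStream API; the subject, durable name, and the string match on the heartbeat error are illustrative assumptions rather than our exact code.

```go
package main

import (
	"errors"
	"log"
	"strings"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Durable pull consumer; subject and durable name are placeholders.
	sub, err := js.PullSubscribe("orders.>", "orders-worker")
	if err != nil {
		log.Fatal(err)
	}

	for {
		msgs, err := sub.Fetch(100, nats.MaxWait(5*time.Second))
		if errors.Is(err, nats.ErrTimeout) {
			continue // no messages available within the wait window
		}
		if err != nil {
			// This is where "nats: no heartbeat received" shows up for us;
			// matching on the message text is an assumption, as the exact
			// sentinel error may differ between client versions.
			if strings.Contains(err.Error(), "no heartbeat received") {
				log.Printf("heartbeat lost, retrying: %v", err)
			} else {
				log.Printf("fetch error: %v", err)
			}
			time.Sleep(time.Second)
			continue
		}
		for _, m := range msgs {
			// ... process the message ...
			if err := m.Ack(); err != nil {
				log.Printf("ack failed: %v", err)
			}
		}
	}
}
```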

At this point we noticed that the nats_stream_total_messages metric started rising, whereas it is normally stable at around 10-20k messages.
[Screenshot from 2024-12-26 10:29: nats_stream_total_messages steadily rising]

After a while we also noticed that the First Sequence number of the stream was stuck at a value from around the time the no heartbeat received error was seen. Stream info:
[Screenshot from 2024-12-24 15:38: stream info output]

In addition, the Ack Floor count of the consumer that experienced the error seemed stuck very close to this First Sequence number:
[Screenshot from 2024-12-24 15:38: consumer info showing the Ack Floor]

We don't quite understand why this is happening, as the nats_consumer_num_pending metric did not show a large number of pending messages for any of our consumers.
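
One thing we want to rule out is a consumer pinning retention: with interest or work-queue retention, messages are only removed once every bound consumer has acknowledged them, so a consumer whose Ack Floor stops advancing would pin the First Sequence. The sketch below is roughly how we compare the stream state against each consumer's ack floor (legacy nats.go API; the stream name is a placeholder):

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	const stream = "ORDERS" // placeholder for our stream name

	si, err := js.StreamInfo(stream)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("stream %s: msgs=%d first_seq=%d last_seq=%d\n",
		stream, si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)

	// Compare every consumer's ack floor against the stream's first
	// sequence; an ack floor stuck just below first_seq points at the
	// consumer that is holding messages back.
	for ci := range js.ConsumersInfo(stream) {
		fmt.Printf("consumer %s: ack_floor(stream)=%d ack_pending=%d num_pending=%d\n",
			ci.Name, ci.AckFloor.Stream, ci.NumAckPending, ci.NumPending)
	}
}
```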

In an attempt to resolve this, we tried purging the stream. While that did clear the messages out of the stream, they immediately started building up again. Restarting the NATS pods did not help either.
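
For reference, the purge we performed is roughly equivalent to the following client call (a sketch with a placeholder stream name):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Drop all messages from the stream ("ORDERS" is a placeholder name).
	if err := js.PurgeStream("ORDERS"); err != nil {
		log.Fatal(err)
	}

	// Check the state right afterwards; in our case Msgs dropped to zero
	// but started climbing again almost immediately.
	si, err := js.StreamInfo("ORDERS")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("after purge: msgs=%d first_seq=%d last_seq=%d",
		si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)
}
```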

In the end, the only thing that helped stabilize the system again was to:

  • Remove the persistent volumes that are used as file storage
  • Restart the NATS pods
  • Recreate the streams and consumers (see the sketch after this list)
  • Reconnect client applications that depend on NATS
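
The stream and consumer recreation step can be scripted roughly as in the sketch below (legacy nats.go API; the names, retention settings, and limits are placeholders, not our real production configuration):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Recreate the stream; name, subjects, storage, and limits below are
	// placeholders, not our exact production configuration.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
		Storage:  nats.FileStorage,
		Replicas: 3,
		MaxAge:   24 * time.Hour,
	}); err != nil {
		log.Fatal(err)
	}

	// Recreate the durable consumer that the client applications attach to.
	if _, err := js.AddConsumer("ORDERS", &nats.ConsumerConfig{
		Durable:       "orders-worker",
		AckPolicy:     nats.AckExplicitPolicy,
		DeliverPolicy: nats.DeliverAllPolicy,
		MaxAckPending: 1000,
	}); err != nil {
		log.Fatal(err)
	}

	log.Println("stream and consumer recreated")
}
```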

Expected behavior

NATS should not keep messages in memory unnecessarily; doing so leads to pods being OOMKilled and to an unstable environment.

If there is a good reason for this behavior, we would appreciate help figuring out what it is and how we can prevent it from happening, as it seriously affects the stability of our system at this point.

Server and client version

server - 2.10.16

Host environment

3-node Kubernetes cluster on GCP (GKE)
4 CPU/16GB RAM each

Steps to reproduce

We have not been able to reproduce this issue reliably.

@BrentChesny added the defect label on Dec 26, 2024
@AllenZMC commented:

Server 2.10.24 has the same problem.
