Observed behavior
We are running a 3-node NATS cluster on Kubernetes. One issue we have been struggling with quite a bit recently is that some NATS pods get OOMKilled, and it is not clear why this happens or how we can resolve it. These are some of our observations from the last time it happened.
It seems to have started with one of our NATS consumers receiving a `nats: no heartbeat received` error.
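For reference, here is a minimal sketch of how a durable pull consumer can be wired up with the nats.go `jetstream` package (the URL, stream name `EVENTS`, and consumer name `worker` are placeholders, not our actual configuration). `Consume` uses idle heartbeats internally, and missed heartbeats are reported through its error callback, which is one place this error shows up:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/signal"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	// Look up the existing durable pull consumer on the stream.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	cons, err := js.Consumer(ctx, "EVENTS", "worker")
	if err != nil {
		log.Fatal(err)
	}

	// Consume keeps pull requests open using idle heartbeats; when the server
	// stops sending heartbeats, the error callback is invoked with
	// "nats: no heartbeat received".
	cc, err := cons.Consume(func(msg jetstream.Msg) {
		// ... process the message ...
		msg.Ack()
	}, jetstream.ConsumeErrHandler(func(_ jetstream.ConsumeContext, err error) {
		fmt.Println("consume error:", err)
	}))
	if err != nil {
		log.Fatal(err)
	}

	// Run until interrupted.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
	cc.Stop()
}
```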
At around the time this error appeared, we noticed that the `nats_stream_total_messages` metric started rising, whereas it is normally stable at around 10-20k messages.
We also noticed after a while that the stream's First Sequence number had been stuck since around the time the `no heartbeat received` error was seen. Stream info:
Another thing we noticed was that the Ack Floor of the consumer that experienced the error seemed stuck very close to this First Sequence number:
We don't quite understand why this is happening, as `nats_consumer_num_pending` did not show a large number of pending events for any of our consumers.
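To make clear which fields we are referring to, here is a minimal sketch of how these numbers can be read programmatically with the nats.go `jetstream` package (same placeholder names as above):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	stream, err := js.Stream(ctx, "EVENTS")
	if err != nil {
		log.Fatal(err)
	}
	si, err := stream.Info(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Total messages and the first/last sequence currently held by the stream.
	fmt.Printf("stream: msgs=%d first_seq=%d last_seq=%d\n",
		si.State.Msgs, si.State.FirstSeq, si.State.LastSeq)

	cons, err := stream.Consumer(ctx, "worker")
	if err != nil {
		log.Fatal(err)
	}
	ci, err := cons.Info(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// AckFloor.Stream is the highest stream sequence below which everything has
	// been acknowledged; NumAckPending counts delivered-but-unacked messages,
	// NumPending counts messages not yet delivered to this consumer.
	fmt.Printf("consumer: ack_floor=%d ack_pending=%d num_pending=%d\n",
		ci.AckFloor.Stream, ci.NumAckPending, ci.NumPending)
}
```

On an interest- or work-queue-retention stream, a stuck Ack Floor like this would explain why the First Sequence cannot advance, since messages are only removed once acknowledged.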
In an attempt to resolve this, we tried purging the stream. While that indeed cleared the messages out of the stream, they immediately started to build up again. Restarting the NATS pods did not help either.
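For clarity, by purging we mean the standard stream purge, i.e. `nats stream purge EVENTS` or the programmatic equivalent (a minimal sketch reusing the `jetstream` handle, imports, and placeholder names from the sketches above):

```go
// purgeStream drops all messages from the stream (placeholder name "EVENTS");
// equivalent to `nats stream purge EVENTS`. js is a jetstream.JetStream handle
// obtained as in the earlier sketches.
func purgeStream(ctx context.Context, js jetstream.JetStream) error {
	stream, err := js.Stream(ctx, "EVENTS")
	if err != nil {
		return err
	}
	return stream.Purge(ctx)
}
```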
In the end, the only thing that helped stabilize the system again was to:
1. Remove the persistent volumes used as file storage
2. Restart the NATS pods
3. Recreate the streams and consumers (a sketch of this step follows the list)
4. Reconnect the client applications that depend on NATS
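A minimal sketch of what the recreation in step 3 looks like with the nats.go `jetstream` package (stream name, subjects, replica count, and consumer settings are placeholders, not our exact definitions):

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/nats-io/nats.go"
	"github.com/nats-io/nats.go/jetstream"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()

	js, err := jetstream.New(nc)
	if err != nil {
		log.Fatal(err)
	}

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Recreate the stream with file storage replicated across the 3 nodes.
	stream, err := js.CreateStream(ctx, jetstream.StreamConfig{
		Name:     "EVENTS",
		Subjects: []string{"events.>"},
		Storage:  jetstream.FileStorage,
		Replicas: 3,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Recreate the durable consumer with explicit acks.
	_, err = stream.CreateOrUpdateConsumer(ctx, jetstream.ConsumerConfig{
		Durable:   "worker",
		AckPolicy: jetstream.AckExplicitPolicy,
		AckWait:   30 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```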
Expected behavior
NATS should not keep messages in memory unnecessarily, leading to pods getting OOMKilled and an unstable environment.
If it has a good reason to do so, we would appreciate help figuring out what that reason is and how we can prevent this from happening, as it seriously affects the stability of our system at this point.
Server and client version
server - 2.10.16
Host environment
3-node Kubernetes cluster on GCP (GKE)
4 CPU/16GB RAM each
Steps to reproduce
We have not been able to reproduce this issue reliably.