feat: group memory.stats sock metric #3642

Closed
cafkafk wants to merge 1 commit into google:master from cafkafk:cgroup_socket_mem_pr

cafkafk commented Jan 6, 2025

Notice: I've been unable to figure out how to regenerate the snapshot tests.
I've opened issue #3632 for this, but have yet to receive any replies.

I'm hoping this PR will bring more attention to the change, so it can
receive feedback.

This adds the `sock` stat from the cgroup `memory.stat` file as a metric in
cAdvisor.
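
For readers unfamiliar with the stat, here is a minimal sketch of where the value comes from. It assumes cgroup v2, where `memory.stat` is a flat list of `key value` lines and `sock` is reported in bytes. The helper name `parseSockBytes` and the sample values are hypothetical; this is not the code from this PR, and it does not reflect how cAdvisor actually reads cgroup stats.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseSockBytes is a hypothetical helper: it scans text in the
// memory.stat layout ("key value" per line) and returns the value
// of the "sock" key, in bytes.
func parseSockBytes(memoryStat string) (uint64, error) {
	scanner := bufio.NewScanner(strings.NewReader(memoryStat))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) == 2 && fields[0] == "sock" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return 0, errors.New("sock key not found")
}

func main() {
	// Made-up sample content; on a real host this would be read from
	// the container's memory.stat file.
	sample := "anon 1048576\nfile 2097152\nsock 409600\nshmem 0\n"

	sockBytes, err := parseSockBytes(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("sock bytes:", sockBytes)
}
```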

The motivation is that we've seen numerous examples at DBC Digital of
application developers creating applications that exhaust socket memory,
e.g. by accidentally creating too many TCP connections and not closing
them, or keeping around a few large allocations, or many other such
issues.

Because cAdvisor currently doesn't report socket memory usage, this has
been hard to monitor, and is typically only noticed once the OOM killer
is invoked.

By adding this metric, it will be possible to proactively handle socket
memory exhaustion (which is really kernel memory exhaustion), before it
becomes a potential incident, and to create alerting and enhance
observability of this failure mode.
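
As an illustration of the kind of proactive check this would enable, the sketch below compares socket memory against a container's memory limit and flags it once it crosses a threshold. The function names, the 25% threshold, and the sample numbers are all assumptions chosen for the example; the PR itself does not define any alerting logic.

```go
package main

import "fmt"

// sockShare returns the fraction of the memory limit used by socket
// buffers. A limit of 0 is treated as "no limit set" and yields 0.
func sockShare(sockBytes, limitBytes uint64) float64 {
	if limitBytes == 0 {
		return 0
	}
	return float64(sockBytes) / float64(limitBytes)
}

// shouldAlert reports whether socket memory has reached the given
// fraction of the limit. The threshold is an arbitrary example value.
func shouldAlert(sockBytes, limitBytes uint64, threshold float64) bool {
	return sockShare(sockBytes, limitBytes) >= threshold
}

func main() {
	const mib = 1 << 20

	// Hypothetical container: 1 GiB limit, 300 MiB held in socket buffers.
	sockBytes := uint64(300 * mib)
	limitBytes := uint64(1024 * mib)

	fmt.Printf("socket share of limit: %.2f\n", sockShare(sockBytes, limitBytes))
	fmt.Println("alert:", shouldAlert(sockBytes, limitBytes, 0.25))
}
```

In practice a check like this would more likely live in an alerting rule over the exported metric than in application code.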

Signed-off-by: Christina Sørensen <[email protected]>

cafkafk commented Feb 26, 2025

I'll fix conflicts upon request, in case this does actually get a review.

cafkafk commented Jul 5, 2025

I'm no longer available to work on this.

cafkafk closed this Jul 5, 2025