version-checker seemingly leaks memory and gets oom-killed #76

Open
nadiamoe opened this issue Feb 20, 2021 · 5 comments
Labels
enhancement (New feature or request), help wanted (Extra attention is needed)

Comments

@nadiamoe

nadiamoe commented Feb 20, 2021

I am running version-checker on a fairly small, single-node cluster with ~60 pods. So far it is working nicely, but I do not understand its memory behavior.

I'm basically running the sample deployment file, plus the --test-all-containers flag and some cpu limits:

        resources:
          requests:
            cpu: 10m
            memory: 32M
          limits:
            cpu: 50m
            memory: 128M

kubectl get pod -o yaml

Over time, I see that version-checker approaches the memory limit and then stays near ~99% for a while. After some time, the kernel OOM-kills the container and k8s restarts the pod.

Memory chart

However, I do not see anything alarming in the logs, other than some failures and expected permission errors.

This doesn't seem to have any functional impact, but does fire some alerts and doesn't look good on my dashboards :)

Is this behavior intended, and/or is there any way to prevent it?

@trastle

trastle commented Feb 24, 2021

We are seeing similar behaviour while running Version Checker. We would be interested to know whether there are recommended values for the limits.

@Trede1983

Also seeing something similar with Version Checker getting OOM killed fairly frequently.

davidcollom added the enhancement (New feature or request) and help wanted (Extra attention is needed) labels on Jul 12, 2023
@davidcollom
Collaborator

Hey @Trede1983 @trastle @roobre,

Sorry it's taken so long to get back to you on this issue... I have noted that, since this was raised, there have been some other issues around version-checker attempting to reduce the memory footprint.

Things like this are extremely challenging to debug and replicate. It would be amazing to know how many nodes/pods you had in the cluster at the time of this issue, along with the memory/cpu limits/requests you had/have set.

I appreciate that this may be some time ago, and that you may no longer be using version-checker, however this information could be really helpful for us to further understand the memory footprint in larger installations.

In terms of tuning and changes, the main one that comes to mind is #160, along with the already mentioned #69.

  • Disabling test-all-containers and adding the enable.version-checker.io/*my-container* annotation to the pods that you care about
  • Reducing or increasing the image cache timeout (default: 30 minutes) via the --image-cache-timeout CLI argument
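
As a rough sketch of the two suggestions above: the first document is only a fragment of the version-checker Deployment with --test-all-containers left out and a shorter cache timeout, and the second is a pod opted in per container. The pod name, container name, image, and the 15m value are placeholders chosen for illustration, not recommendations.

```yaml
# Fragment of the version-checker Deployment (not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: version-checker
          args:
            # --test-all-containers is intentionally omitted, so only
            # annotated containers are checked.
            - --image-cache-timeout=15m   # illustrative; the default is 30 minutes
---
# Hypothetical pod that opts one container in via the annotation.
apiVersion: v1
kind: Pod
metadata:
  name: example-app
  annotations:
    # Annotation values must be strings, hence the quotes.
    enable.version-checker.io/my-container: "true"
spec:
  containers:
    - name: my-container                  # must match the annotation suffix
      image: example.org/example-app:1.2.3
```

Checking fewer containers should mean fewer images to look up and cache, which is why this may reduce the footprint; whether it is enough to avoid the OOM kills is something to measure.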

@erwanval
Contributor

erwanval commented Jul 22, 2024

Hello @davidcollom

I'm also encountering this issue. My test cluster is pretty small:

  • 8 nodes (4cpu / 16GB ram)
  • 170 pods
  • 307 containers (208 containers and 99 init containers)
  • 67 distinct images (from docker.io, ghcr.io, quay.io, and registry.k8s.io)

The --test-all-containers flag is set, and only two pods have the enable.version-checker.io/*my-container*: false annotation to disable verification (their images come from a private registry I haven't configured yet).
I also defined use-sha.version-checker.io, match-regex.version-checker.io and override-url.version-checker.io on a bunch of pods, as some images come from a registry proxy or have "fake" versions (like grafana).

Version checker is the latest (0.7.0) and installed using helm with the following values:

replicaCount: 1
versionChecker:
  imageCacheTimeout: 30m
  testAllContainers: true

resources:
  # limits:
  #   memory: 128Mi
  requests:
    cpu: 10m
    memory: 128Mi

# This is a temporary fix until the following PR is merged:
# https://github.com/jetstack/version-checker/pull/227
ghcr:
  token: xxxx

securityContext:
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
  readOnlyRootFilesystem: true
  runAsNonRoot: true
  runAsUser: 65534
  seccompProfile:
    type: RuntimeDefault

serviceMonitor:
  enabled: true

If I set resources.limits.memory, version checker is OOM-killed every ~6h. I haven't tried running it for more than a day without the limit, but I assume memory usage would keep growing.
Here is a graph showing the memory usage over time:

image

@erwanval
Contributor

erwanval commented Sep 9, 2024

Hello,

With version 0.8.2, the issue still persists.
I tried adding the following to the values:

  env:
    - name: GOMEMLIMIT
      valueFrom:
        resourceFieldRef:
          divisor: "0"
          resource: limits.memory

It reduces the frequency of OOM kills to about one per day instead of one every 6h, but doesn't solve the issue.
image
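
One variation that might be worth trying, since GOMEMLIMIT is a soft limit that the Go runtime can exceed: set it explicitly a little below the container limit rather than equal to it, so the garbage collector starts working harder before the kernel's limit is reached. The sketch below assumes a 128Mi limit and the same env key as above (place it wherever env already sits in your values); the 100MiB figure is illustrative, not a tested recommendation.

```yaml
resources:
  limits:
    memory: 128Mi        # container limit enforced by the kernel
env:
  - name: GOMEMLIMIT
    # Go accepts unit suffixes such as MiB here (not the Kubernetes "Mi" form).
    # Illustrative value chosen to leave headroom below the 128Mi limit.
    value: 100MiB
```

If the live heap itself keeps growing, this would only delay the OOM kill further rather than prevent it.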

5 participants