
G1GC OOM #1016

Open
ahus1 opened this issue Oct 21, 2024 · 17 comments

Comments

@ahus1
Contributor

ahus1 commented Oct 21, 2024

The community is reporting that with G1GC there is no out-of-memory exception / JVM shutdown when the Keycloak heap fills up - originally tested with Keycloak 24.
The scenario used a lot of user sessions. For KC 26, creating a lot of authentication sessions or setting the heap size to a very small value might serve as a reproducer.
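
A minimal sketch of the small-heap variant for a local reproduction - assuming the distribution honors the JAVA_OPTS_KC_HEAP override documented for recent Keycloak releases; the sizes are placeholders:

# tiny fixed heap to hit the limit quickly (values are illustrative only)
JAVA_OPTS_KC_HEAP="-Xms64m -Xmx64m" bin/kc.sh start-dev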

Originally reported by @sschu

@tkyjovsk
Contributor

tkyjovsk commented Nov 4, 2024

I think I was able to reproduce the issue. I tested this with a 3-pod KC 26.0.5 cluster with 8 vCPUs and 2GB RAM per pod.

I increased memory pressure by (see the configuration sketch after the list):

  • disabling persistent-user-sessions feature (keeping sessions in memory)
  • increasing session expiration from the default 30m (to e.g. 1d)
  • running the AuthorizationCode test scenario with 0% logouts, leading to gradual accumulation of sessions
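
Roughly, the first two settings translate to something like the following - a sketch, assuming the standard Keycloak CLI feature flag and a kcadm realm update; the realm name and the 1d value are placeholders:

bin/kc.sh start --features-disabled=persistent-user-sessions ...
bin/kcadm.sh update realms/myrealm -s ssoSessionIdleTimeout=86400 -s ssoSessionMaxLifespan=86400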

This resulted in the whole KC cluster being restarted due to health probe failures.

GC overhead reached peaks of about 2.6 %, JVM memory after GC was around 40-50%, and pod memory reached over 90% utilization.


However, there still weren't any "G1 Evacuation Pause" to-space exhausted or to-space overflow messages in the logs, which would indicate that the GC is running out of memory (as mentioned in the G1 GC tuning guide).
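
In case it helps others checking their logs, this is how more detailed GC logging can be enabled - a sketch, assuming JDK unified logging and using JAVA_OPTS_APPEND to pass extra JVM flags to the Keycloak distribution:

JAVA_OPTS_APPEND="-Xlog:gc*:file=/tmp/gc.log:time,uptime,level,tags" bin/kc.sh start ...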

@tkyjovsk
Contributor

tkyjovsk commented Nov 4, 2024

I haven't found any straightforward way to trigger an OOME. I need to read up a bit on how G1 handles these situations.

@sschu

sschu commented Nov 5, 2024

This is pretty much what we also saw. @ahus1 mentioned settings that tell the GC to throw an OOME if it is not able to free enough memory or spends too much time collecting (-XX:GCTimeLimit and -XX:GCHeapFreeLimit), but it seems they apply to the Parallel GC only.
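
For reference, this is roughly what that overhead limit looks like with the Parallel collector - a sketch; the values shown are the HotSpot defaults (throw an OOME when more than 98% of the time is spent in GC while less than 2% of the heap is recovered):

java -XX:+UseParallelGC -XX:+UseGCOverheadLimit -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 ...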

@ahus1
Contributor Author

ahus1 commented Nov 6, 2024

@tkyjovsk - the "connection refused" might be the message after the Pod restart, when the monitoring port is not yet available. Did you also see a "request timeout" or similar, indicating a slow response caused by high GC overhead? Please also confirm that there was no OOM kill from Kubernetes.
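
One way to check for an OOM kill - a sketch, the pod name is a placeholder:

kubectl get pod <keycloak-pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'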

To decide on the next steps, it would be good to collect all information about the behavior we see in this issue. If the behavior is good enough (Pods eventually restart and the service becomes available again), there might be no action items. If we think it is not good enough (recovery takes too long, all Pods restart, etc.), then we can discuss what next steps we want to take.

Can you sum up this information for one of our next team meetings? Today might be a bit short notice - does Monday sound feasible?

@tkyjovsk
Contributor

tkyjovsk commented Nov 6, 2024

@ahus1 Looks like there actually have been OOM kills from Kubernetes. I can see these for all KC pods:

status:
  ...
  containerStatuses:
  - containerID: cri-o://37aefee3497283caafe46d2e145bfc284d1ad0992f49a4d9506fc7a685365ae0
    image: quay.io/keycloak/keycloak:26.0.5
    imageID: quay.io/keycloak/keycloak@sha256:089b5898cb0ba151224c83aef3806538582880029eb5ea71c2afd00b627d4907
    lastState:
      terminated:
        containerID: cri-o://ff70d7b7174a6b676df209d46e790c8d1246a64fd1b6c5b6f0c9378e3567eed8
        exitCode: 137
        finishedAt: "2024-11-05T17:13:01Z"
        reason: OOMKilled
        startedAt: "2024-11-05T16:47:04Z"

This would mean that the JVM is not keeping within the OS/container memory limit, and the probe failures are only a secondary symptom.
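
To see where the JVM's non-heap memory actually goes, Native Memory Tracking might help - a sketch, assuming JAVA_OPTS_APPEND is used to pass the flag and <pid> is the Keycloak JVM process inside the pod:

  # start the JVM with native memory tracking enabled (adds a small overhead)
  JAVA_OPTS_APPEND="-XX:NativeMemoryTracking=summary" bin/kc.sh start ...
  # then, inside the pod, print a breakdown of native memory usage
  jcmd <pid> VM.native_memory summary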

@pruivo
Contributor

pruivo commented Nov 6, 2024

@tkyjovsk are you setting the memory request/limit in the Keycloak CR?

@sschu

sschu commented Nov 6, 2024

@tkyjovsk How much non-heap, non-metaspace memory do you give to the JVM, i.e. how big is the difference between the pod memory limit and the Java heap+metaspace max?

@ahus1
Contributor Author

ahus1 commented Nov 7, 2024

@tkyjovsk - when running in a container, we set

-XX:MaxRAMPercentage=70 -XX:MinRAMPercentage=70 -XX:InitialRAMPercentage=50

which will allocate a percentage of the container's RAM to the heap, see this code for the details.

The "RAM percentage" options size the JVM heap as a percentage of the container memory, but the JVM will continue to use additional memory for the Java byte code, threads and other non-heap memory elements.

The more memory you allocate to your Pod, the more memory the JVM will have left over from the remaining 30% of the setting above. I remember we tested this, but it might either have changed over time, or we didn't push the test to its limits.

I suggest you set the memory request/limit to a higher value, configure -XmsXXm -XmxXXm explicitly, and then re-run the test. I'm curious how much non-heap memory Keycloak then uses in your test.
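
For example - a sketch only, assuming a 4G pod limit while keeping the current 1.4G heap, and assuming JAVA_OPTS_KC_HEAP is available to override the default percentage-based sizing; the exact values are placeholders:

JAVA_OPTS_KC_HEAP="-Xms1434m -Xmx1434m"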

@pruivo
Contributor

pruivo commented Nov 7, 2024

@tkyjovsk are you setting the memory request/limit in the Keycloak CR?

My theory when I wrote this comment was that the pod has no memory limit set. The JVM then computes a high max heap size, which leads to Kubernetes killing the pod when the worker node runs out of memory.

You can check the max heap size computed by running the following command in the pod:

java -XX:+PrintFlagsFinal -XX:MaxRAMPercentage=70 -XX:MinRAMPercentage=70 -XX:InitialRAMPercentage=50 -version | grep HeapSize

In our deployment, with the following limits:

  containers:
    - resources:
        limits:
          memory: 4G
        requests:
          cpu: '6'
          memory: 3G

It returns the following (max heap size of ~2.8GB):

   size_t ErgoHeapSizeLimit                        = 0                                         {product} {default}
   size_t HeapSizePerGCThread                      = 43620760                                  {product} {default}
   size_t InitialHeapSize                          = 2000683008                                {product} {ergonomic}
   size_t LargePageHeapSizeThreshold               = 134217728                                 {product} {default}
   size_t MaxHeapSize                              = 2801795072                                {product} {ergonomic}
   size_t MinHeapSize                              = 8388608                                   {product} {ergonomic}
    uintx NonNMethodCodeHeapSize                   = 5839564                                {pd product} {ergonomic}
    uintx NonProfiledCodeHeapSize                  = 122909338                              {pd product} {ergonomic}
    uintx ProfiledCodeHeapSize                     = 122909338                              {pd product} {ergonomic}
   size_t SoftMaxHeapSize                          = 2801795072                             {manageable} {ergonomic}
openjdk version "21.0.5" 2024-10-15 LTS
OpenJDK Runtime Environment (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS)
OpenJDK 64-Bit Server VM (Red_Hat-21.0.5.0.10-1) (build 21.0.5+10-LTS, mixed mode, sharing)
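
For reference, this matches the expected arithmetic: 70% of the 4G limit is 0.7 × 4,000,000,000 ≈ 2.8 GB (the MaxHeapSize of 2801795072 above), and InitialRAMPercentage=50 yields the ~2.0 GB InitialHeapSize.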

@sschu

sschu commented Nov 7, 2024

I don't think computing the heap memory via MaxRAMPercentage is a good way to go. In my experience, the amount of non-heap, non-metaspace memory you need is somewhat constant. We use 700MB for it, and we know that with this setting we never get an OOMKill. With MaxRAMPercentage, the size of this area depends on the pod memory limit, which means that with small pod memory you get OOMKills, while with large enough pod memory you won't see them.
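
To make that concrete with the numbers from this thread: with a 2GB pod limit and MaxRAMPercentage=70, the heap gets ~1.4GB and everything else (metaspace, code cache, thread stacks, native buffers) has to fit into the remaining ~600MB; with a 4GB limit the same 70% leaves ~1.2GB. If the non-heap demand is roughly constant at around 700MB, the first configuration gets OOMKilled while the second does not, even though the application behaves identically.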

@tkyjovsk
Contributor

tkyjovsk commented Nov 7, 2024

Thanks for the feedback. Let me try and emulate the second situation then (trying to avoid the OOM kill).

@tkyjovsk
Contributor

tkyjovsk commented Nov 8, 2024

I think I was able to reproduce the issue. I bumped the memory limits for pods to 4G while keeping the JVM heap limit at the original 1.4G (70% of the original 2G pod memory). This setting avoided the OOM kill and eventually ran the JVM into 70-90% GC overhead, showing messages like:

[2699.249s][info][gc] GC(6940) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1409M->1368M(1434M) 3.549ms
[2699.339s][info][gc] GC(6941) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1420M->1381M(1434M) 2.888ms
[2699.439s][info][gc] GC(6942) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1423M->1387M(1434M) 2.329ms
[2699.488s][info][gc] GC(6943) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1425M->1421M(1434M) 6.626ms
[2699.491s][info][gc] GC(6944) Pause Young (Normal) (G1 Evacuation Pause) (Evacuation Failure) 1426M->1429M(1434M) 2.728ms

This resulted in a large percentage of failed requests. The liveness and readiness probe failures eventually rebooted the pods, but only after several minutes of failures.

The GCTimeRatio option was set to its default value of 12, which should correspond to a target of ~8% GC overhead. It is clear from the test results that the actual overhead was much larger, and stayed that way for a considerable amount of time.
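
For reference, the GC time goal implied by GCTimeRatio is 1/(1+GCTimeRatio), so the default of 12 targets 1/13 ≈ 7.7% of total time spent in GC.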

In a state like this I would expect the GC to throw OutOfMemoryError: GC overhead limit exceeded. I'm wondering whether this can be considered faulty behavior in the GC itself. Should we maybe report this as a possible JVM bug? @ahus1

Screenshot from 2024-11-08 19-34-09

@ahus1
Contributor Author

ahus1 commented Nov 11, 2024

@tkyjovsk - thank you for this detailed analysis and for capturing all these details.

Yes, please create a bug issue against the OpenJDK team at Red Hat in their JIRA. It might then turn out to be a problem on our side in that we are missing a required configuration, or we might get a recommendation to use a different GC (ZGC).

Let's see what ideas and suggestions our friends in the OpenJDK team at Red Hat come up with.

@ahus1
Contributor Author

ahus1 commented Nov 25, 2024

@tkyjovsk
Contributor

tkyjovsk commented Dec 3, 2024

Additional test run to collect more info about GC activity:
keycloak-g1gc-no-oome-diagnostics.zip

Also adding a link to a single-class issue reproducer: https://github.com/tkyjovsk/gc-overhead-reproducer/

@ahus1
Contributor Author

ahus1 commented Dec 6, 2024

Reproducer branch: main...tkyjovsk:keycloak-benchmark:issue-1016

@ahus1
Contributor Author

ahus1 commented Dec 9, 2024

Waiting for a reaction from the OpenJDK team at Red Hat. Putting this into the backlog for now.
