G1GC OOM #1016
I think I was able to reproduce the issue. I tested this with a 3-pod KC 26.0.5 cluster with 8 vCPUs and 2 GB RAM per pod. I increased memory pressure by:
This resulted in the whole KC cluster being restarted due to health probe failures. GC overhead reached peaks of about 2.6%, JVM memory after GC was around 40-50%, and pod memory reached over 90% utilization. However, there still weren't any "G1 Evacuation Pause" messages.
I haven't found any straightforward way to trigger an OOME. I need to read up a bit on how G1 can handle these situations.
This is pretty much what we also saw. @ahus1 mentioned a setting where you could tell the GC to cause an OOME if it is not able to free enough memory or takes too much time.
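For context, here is a minimal sketch of the standard HotSpot flags that are usually brought up for this behavior. The flag names are real HotSpot options, but whether the overhead-limit mechanism actually applies to G1 is exactly the open question in this issue, and passing them via `JAVA_OPTS_APPEND` is an assumption about the container setup:

```sh
# Sketch only: flags related to "give up and throw/exit instead of thrashing in GC".
# UseGCOverheadLimit/GCTimeLimit/GCHeapFreeLimit implement the classic "GC overhead limit exceeded" OOME
# (historically for the parallel collector); ExitOnOutOfMemoryError terminates the JVM once an OOME is thrown,
# which lets Kubernetes restart the pod instead of it limping along.
JAVA_OPTS_APPEND="-XX:+UseGCOverheadLimit -XX:GCTimeLimit=98 -XX:GCHeapFreeLimit=2 -XX:+ExitOnOutOfMemoryError"
```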
@tkyjovsk - the "connection refused" might be the message after the Pod restart, when the monitoring port is not yet available. Did you also see a "request timeout" or similar, indicating a slow response which would be caused by a high GC overhead? Please also confirm that there was no OOM kill from Kubernetes. To decide on the next steps, it would be good to collect all information about the behavior we see in this issue. If the behavior is good enough (Pods eventually restart and the service becomes available again), there might be no action items. If we think it is not good enough (recovery takes too long, all Pods restart, etc.), then we can discuss what next steps we want to take. Can you sum up this information for one of our next team meetings? Today might be a bit short notice - does Monday sound feasible?
@ahus1 Looks like there actually have been OOM kills from Kubernetes. I can see these for all KC pods:
This would mean that the JVM is not keeping within the OS/container memory limit, and the probe failures are only a secondary symptom.
@tkyjovsk are you setting the memory request/limit in the Keycloak CR?
@tkyjovsk How much non-heap, non-metaspace memory do you give to the JVM, i.e. how big is the difference between the pod memory limit and the Java heap+metaspace maximum?
@tkyjovsk - when running in a container, we set
which will allocate a percentage of the container's RAM to the heap, see this code for the details. The "RAM Percentage" sizes the heap memory of the JVM as a percentage, but the JVM will continue to use additional pieces of memory for the Java byte code, threads and other non-heap memory elements. The more memory you allocate to your Pod, the more memory the JVM has available as the left-over 30% from the setting above. I remember we tested this, but it might have either changed over time, or we didn't push the test to its limits. I suggest you set the memory request/limit to a higher value, and then configure
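As a minimal sketch of what that percentage-based sizing means in practice (the 70% figure comes from this thread; the exact flags baked into the image may differ, and `JAVA_OPTS_APPEND` is just one way to pass them):

```sh
# Percentage-based heap sizing as described above: heap = 70% of the container memory limit.
# Everything else (metaspace, threads, code cache, native buffers) must fit into the remaining 30%,
# so with a 2G limit there is roughly 1.4G of heap and ~600M of non-heap headroom,
# and with a 4G limit roughly 2.8G of heap and ~1.2G of headroom.
JAVA_OPTS_APPEND="-XX:MaxRAMPercentage=70"
```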
My theory when I wrote this comment was that the pod has no memory limit. The JVM will then compute a high max heap size, which leads to Kubernetes killing the pod when the worker node runs out of memory. You can check the max heap size the JVM computed by running the following command in the pod.
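Two common ways to check this are sketched below (assuming a JDK `java`/`jcmd` binary is available in the image; the exact command used for the numbers below may have been different):

```sh
# Ask the running server JVM (PID 1 in the Keycloak container) for its ergonomically computed max heap:
jcmd 1 VM.flags | tr ' ' '\n' | grep MaxHeapSize

# Or ask a fresh JVM what it would compute for this container; pass the same RAM-percentage flag
# the server uses, otherwise the JVM's default percentage applies:
java -XX:MaxRAMPercentage=70 -XX:+PrintFlagsFinal -version | grep -i maxheapsize
```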
In our deployment, with the following limits:

```yaml
containers:
  - resources:
      limits:
        memory: 4G
      requests:
        cpu: '6'
        memory: 3G
```

It returns the following (max heap size of ~2.8GB):
I don't think computing the heap memory via MaxRAMPercentage is a good way to go. In my experience, the amount of non-heap, non-metaspace memory you need is somewhat constant. We use 700MB for it, and we know that with this setting we never get an OOMKill. With MaxRAMPercentage, the size of this area depends on the pod memory limit, which means that with a small pod memory limit you get OOMKills, while with a large enough pod memory limit you won't see them.
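A concrete sketch of that fixed-headroom approach, reusing the 4G pod limit from above (the metaspace value and the use of `JAVA_OPTS_APPEND` are illustrative assumptions; the ~700MB reserve is the commenter's own figure):

```sh
# Fixed non-heap headroom instead of a percentage: pod limit = heap + metaspace + constant reserve.
# With a 4G pod limit: 4096M - 3072M heap - 256M metaspace ≈ 768M headroom for threads, code cache, native buffers.
# An explicit -Xmx takes precedence over MaxRAMPercentage, so the headroom is chosen directly
# rather than scaling with the pod memory limit.
JAVA_OPTS_APPEND="-Xmx3g -XX:MaxMetaspaceSize=256m"
```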
Thanks for the feedback. Let me try to emulate the second situation then (trying to avoid the OOM kill).
I think I was able to reproduce the issue. I bumped the memory limits for pods to 4G while keeping the JVM heap limit at the original 1.4G (70% of the original 2G pod memory). This setting avoided the OOM kill and eventually ran the JVM into 70-90% GC overhead, showing messages like:
This resulted in a large percentage of failed requests. The liveness and readiness probe failures eventually rebooted the pods, but only after several minutes of failing. In a state like this I would expect the GC to throw an `OutOfMemoryError`.
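If this run needs to be repeated, GC logging along these lines (a sketch; not necessarily the options used in the test above) makes the overhead and the pause reasons visible in the pod logs:

```sh
# Unified JVM logging: all GC tags at info level to stdout, decorated with wall-clock time, uptime, level and tags.
# Back-to-back Full GCs and "to-space exhausted" evacuation pauses are the typical signs of heap exhaustion.
JAVA_OPTS_APPEND="-Xlog:gc*:stdout:time,uptime,level,tags"
```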
@tkyjovsk - thank you for this detailed analysis and for capturing all these details. Yes, please create a bug issue for the OpenJDK team at Red Hat in their JIRA. It might then turn out to be a problem on our side, e.g. that we are missing a required configuration, or we might get a recommendation to use a different GC (ZGC). Let's see what ideas and suggestions our friends in the OpenJDK team at Red Hat come up with.
Internal discussion: https://groups.google.com/a/redhat.com/g/java-project/c/-YYJLA5O3-s |
Additional test run to collect more info about GC activity:
Also adding a link to a single-class issue reproducer: https://github.com/tkyjovsk/gc-overhead-reproducer/
Reproducer branch: main...tkyjovsk:keycloak-benchmark:issue-1016
Waiting for a reaction from the OpenJDK team at Red Hat. Putting this into the backlog for now.
The community is reporting that with G1GC there is no out-of-memory exception / JVM shutdown when the Keycloak heap usage grows - originally tested with Keycloak 24.
The scenario was that a lot of sessions were used here. For KC 26, maybe creating a lot of authentication sessions can be used, or setting the heap size to a very small value.
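For the small-heap variant, a quick sketch (values are illustrative and not from the original report; assumes the container honors `JAVA_OPTS_APPEND`):

```sh
# Cap the heap very low so that load fills it quickly and the G1 behavior under pressure can be observed.
JAVA_OPTS_APPEND="-Xms64m -Xmx256m"
```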
Originally reported by @sschu