6203: Add ZGC allocation stall rule #664
👋 Welcome back schaturvedi! A progress list of the required criteria for merging this PR into |
❗ This change is not yet ready to be integrated. |
I would be wary about making the heuristic dependent on the number of stall events per recording. Perhaps we can find a better metric, one that is unrelated to the length of the recording? For example, how long a particular thread has been stalled per minute, or the longest time a thread was stalled (or both)?
@@ -751,3 +751,11 @@ VMOperationRuleFactory_TEXT_WARN_COMBINED_DURATION=There are long lasting blocki
# {longestOperationDuration} is a time period, {longestOperation} is a JVM operation type, {longestOperationCaller} is a thread name, {longestOperationStartTime} is a time stamp
VMOperationRuleFactory_TEXT_WARN_LONG=There are long lasting blocking VM operations in this recording. The longest was of type {longestOperation} and lasted for {longestOperationDuration}. It was initiated from thread {longestOperationCaller} and happened at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
VMOperationRuleFactory_TEXT_WARN_LONG_COMBINED_DURATION=There are long lasting blocking VM operations in this recording. The longest was created from multiple close consecutive operations that were of type {longestOperation} and lasted for {longestOperationDuration} in total. They were initiated from thread {longestOperationCaller} and started at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
ZGCAllocationStall_RULE_NAME=ZGC Allocation Stall
ZgcAllocationStall_TEXT_INFO=In ZGC, a type of concurrent Garbage Collection (GC) algorithm, GC threads run concurrently with application threads, resulting in minimal stop-the-world pauses. However, because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory. In such cases, the JVM temporarily stops the application threads from creating new objects. This 'stopping of object creation' is known as an "Allocation Stall." \n Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling. Earlier versions of ZGC (i.e. single-generation ZGC algorithm), are more prone to Allocation Stalls. \n 2. High Object Allocation Rate: If your application creates objects at a very high rate, it can overwhelm the GC's ability to reclaim memory quickly enough, leading to stalls.\n 3. Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.\n
> because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory.
Stalls do not happen because the pauses are brief. They happen because app threads can create objects faster than a GC cycle is able to release memory. Technically, allocating is very cheap, while reclaiming requires traversing the object graph and moving objects, and there are sometimes fewer GC threads than app threads allocating.
> Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling.
I would write it like this: the GC algorithm requires more time to reclaim memory, and therefore we need more GC threads and/or more room in the Java heap, so that there is more time to reclaim before reaching the heap limit.
Non-generational ZGC has a longer GC cycle than generational ZGC, which contributes to more stalls (the whole object graph needs to be scanned before trying to reclaim memory, versus only a fraction of it for generational). So "improper GC settings" is true, but I would suggest naming heap sizing, GC threads, or switching to generational.
> 3. Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.
Even if this is technically possible, I am not sure it is a main issue with ZGC.
As discussed on the JMC dev Slack, perhaps the sum of stalled time per minute could be a good metric? It would have to be a rate to be independent of recording length. Perhaps it could also be combined with a higher score if very long individual allocation stalls are present? I also opened https://bugs.openjdk.org/browse/JDK-8362416 to make the rule more useful in case there are a lot of stalls under the default 10ms threshold of the current ZGC Allocation Stall event. We shouldn't block this PR on that work, but rather ensure we use the new event when available. I will open a new JMC issue to track this.
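The recording-length-independent metrics proposed in this thread (sum of stalled time per minute, plus the longest individual stall) could be computed roughly as follows. This is a minimal sketch and not the code in this PR: the `StallMetrics` class and its method names are hypothetical, and it works on plain `java.time.Duration` values rather than JMC's item and quantity APIs.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper for illustration only; not part of JMC.
final class StallMetrics {

    private StallMetrics() {
    }

    // Total stalled time (in milliseconds) per minute of recording.
    // Dividing by the recording length makes the value comparable
    // between short and long recordings.
    static double stalledMillisPerMinute(List<Duration> stalls, Duration recordingLength) {
        if (recordingLength.isZero() || recordingLength.isNegative()) {
            throw new IllegalArgumentException("recording length must be positive");
        }
        long totalNanos = 0;
        for (Duration stall : stalls) {
            totalNanos += stall.toNanos();
        }
        double totalMillis = totalNanos / 1_000_000.0;
        double minutes = recordingLength.toNanos() / 60_000_000_000.0;
        return totalMillis / minutes;
    }

    // Longest individual stall, or Duration.ZERO if there were none.
    static Duration longestStall(List<Duration> stalls) {
        Duration longest = Duration.ZERO;
        for (Duration stall : stalls) {
            if (stall.compareTo(longest) > 0) {
                longest = stall;
            }
        }
        return longest;
    }
}
```

With this definition, a recording twice as long with twice as many stalls of the same durations yields the same rate, which is the property a raw event count lacks.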
@Suchitainf this pull request can not be integrated into master due to merge conflicts. To resolve them, run the following commands:
git checkout 6203
git fetch https://git.openjdk.org/jmc.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push
The new commit has the text update suggested by @jpbempel, and the metric used to calculate the score has changed from the allocation stall event count to the stall rate. I have also added the stall rate to the rule result, along with the allocation stall count, total stall duration and maximum stall duration. Here are the fresh screenshots: [screenshots] I will create a follow-up enhancement to add a new rule for the maximum stall duration metric.
This enhancement adds a new rule for ZGC Allocation Stall events.
The default configuration: [screenshot]
Here are a few screenshots for reference:
Ignored
If we change the default configuration as below: [screenshot]
Reviewing
Using git
Checkout this PR locally:
$ git fetch https://git.openjdk.org/jmc.git pull/664/head:pull/664
$ git checkout pull/664
Update a local copy of the PR:
$ git checkout pull/664
$ git pull https://git.openjdk.org/jmc.git pull/664/head
Using Skara CLI tools
Checkout this PR locally:
$ git pr checkout 664
View PR using the GUI difftool:
$ git pr show -t 664
Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jmc/pull/664.diff
Using Webrev
Link to Webrev Comment