6203: Add ZGC allocation stall rule #664

Open: wants to merge 3 commits into base: master

Conversation

Suchitainf
Collaborator

@Suchitainf Suchitainf commented Jul 9, 2025

This enhancement adds a new rule for ZGC Allocation Stall events.

The default configuration:

[screenshot]

Here are a few screenshots for reference:

[5 screenshots]

Ignored

[screenshot]

If we change the default configuration as below:

[4 screenshots]

Progress

  • Commit message must refer to an issue
  • Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

  • JMC-6203: Add ZGC allocation stall rule (Enhancement - P2)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jmc.git pull/664/head:pull/664
$ git checkout pull/664

Update a local copy of the PR:
$ git checkout pull/664
$ git pull https://git.openjdk.org/jmc.git pull/664/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 664

View PR using the GUI difftool:
$ git pr show -t 664

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jmc/pull/664.diff

Using Webrev

Link to Webrev Comment

@bridgekeeper

bridgekeeper bot commented Jul 9, 2025

👋 Welcome back schaturvedi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

@openjdk

openjdk bot commented Jul 9, 2025

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

@openjdk openjdk bot added the rfr label Jul 9, 2025
@mlbridge

mlbridge bot commented Jul 9, 2025

Webrevs

@thegreystone
Member

thegreystone commented Jul 14, 2025

I would be wary about making the heuristic dependent on the number of stall events per recording. Perhaps we can find a better thing to look at, that is unrelated to the length of the recording? For example, how long a particular thread has been stalled per minute, or the longest time a thread was stalled (or both)?
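To illustrate the concern, here is a minimal sketch of the two recording-length-independent metrics mentioned above. `StallMetrics` and its methods are hypothetical names, not JMC API, and the sketch aggregates stalls over all threads instead of tracking each thread separately.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical helper, not part of JMC: metrics that do not grow with recording length.
final class StallMetrics {
    private StallMetrics() {}

    /** Total stalled time divided by recording length, in stalled milliseconds per minute. */
    static double stalledMillisPerMinute(List<Duration> stalls, Duration recordingLength) {
        if (recordingLength.isZero() || recordingLength.isNegative()) {
            throw new IllegalArgumentException("recording length must be positive");
        }
        long totalNanos = 0;
        for (Duration stall : stalls) {
            totalNanos += stall.toNanos();
        }
        double minutes = recordingLength.toNanos() / 60e9;
        return (totalNanos / 1e6) / minutes;
    }

    /** Longest single stall, or zero when there were no stalls. */
    static Duration longestStall(List<Duration> stalls) {
        return stalls.stream().max(Duration::compareTo).orElse(Duration.ZERO);
    }
}
```

With a steady stall pattern, a recording that is twice as long contains twice as many stall events, yet both metrics stay the same, which a raw event count would not.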

@@ -751,3 +751,11 @@ VMOperationRuleFactory_TEXT_WARN_COMBINED_DURATION=There are long lasting blocki
# {longestOperationDuration} is a time period, {longestOperation} is a JVM operation type, {longestOperationCaller} is a thread name, {longestOperationStartTime} is a time stamp
VMOperationRuleFactory_TEXT_WARN_LONG=There are long lasting blocking VM operations in this recording. The longest was of type {longestOperation} and lasted for {longestOperationDuration}. It was initiated from thread {longestOperationCaller} and happened at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
VMOperationRuleFactory_TEXT_WARN_LONG_COMBINED_DURATION=There are long lasting blocking VM operations in this recording. The longest was created from multiple close consecutive operations that were of type {longestOperation} and lasted for {longestOperationDuration} in total. They were initiated from thread {longestOperationCaller} and started at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
ZGCAllocationStall_RULE_NAME=ZGC Allocation Stall
ZgcAllocationStall_TEXT_INFO=In ZGC, a type of concurrent Garbage Collection (GC) algorithm, GC threads run concurrently with application threads, resulting in minimal stop-the-world pauses. However, because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory. In such cases, the JVM temporarily stops the application threads from creating new objects. This 'stopping of object creation' is known as an "Allocation Stall." \n Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling. Earlier versions of ZGC (i.e. single-generation ZGC algorithm), are more prone to Allocation Stalls. \n 2. High Object Allocation Rate: If your application creates objects at a very high rate, it can overwhelm the GC's ability to reclaim memory quickly enough, leading to stalls.\n 3. Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.\n
Member

because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory.

The stalls do not happen because the pauses are brief. They happen simply because app threads can create objects faster than a GC cycle is able to release memory. Technically, allocating is very cheap, while reclaiming requires traversing the object graph and moving objects, and sometimes there are fewer GC threads than allocating app threads.

Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling.

I would word it differently. I prefer to say that the GC algorithm requires more time to reclaim memory, and therefore we need more GC threads and/or more room in the Java heap, so that there is more time to reclaim before reaching the heap limit.
Non-generational ZGC has a longer GC cycle than generational ZGC, which contributes to more stalls (the whole object graph needs to be scanned before trying to reclaim memory, versus only a fraction of it for generational). So "improper GC settings" is true, but I would suggest mentioning heap sizing, GC threads, or switching to generational.

  3. Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.

Even if it is technically possible, I am not sure this is a major issue with ZGC.

@thegreystone
Member

As per the JMC dev Slack, perhaps the sum of stalled time per minute could be a good metric? It would have to be a rate to be independent of recording length. Perhaps it could also be combined with a higher score if very long individual allocation stalls are present?

I also opened https://bugs.openjdk.org/browse/JDK-8362416 to make the rule more useful in case there are a lot of stalls under the default 10ms threshold of the current ZGC Allocation Stall event. We shouldn't block this PR on that work though, but rather ensure we use the new event when available. I will open a new JMC issue to track this.
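One way the suggested combination could look, as a rough sketch only: a base score derived from stalled time per minute, raised when a very long individual stall is present. The class name, the 0 to 100 scale, and both thresholds are assumptions made up for illustration, not values from this PR or from JMC.

```java
import java.time.Duration;

// Hypothetical scoring sketch; the thresholds below are invented for illustration.
final class StallScore {
    /** Assumed rate (stalled ms per minute) at which the base score saturates. */
    static final double WARN_RATE_MS_PER_MIN = 1000.0;
    /** Assumed duration above which a single stall counts as "very long". */
    static final Duration LONG_STALL = Duration.ofMillis(500);

    private StallScore() {}

    /** Returns a score in [0, 100]: up to 75 from the rate, plus 25 for a very long stall. */
    static double score(double stalledMsPerMinute, Duration longestStall) {
        double clampedRate = Math.max(0.0, stalledMsPerMinute);
        double base = Math.min(75.0, 75.0 * clampedRate / WARN_RATE_MS_PER_MIN);
        double bump = longestStall.compareTo(LONG_STALL) >= 0 ? 25.0 : 0.0;
        return Math.min(100.0, base + bump);
    }
}
```

Because the base term is a rate, the score does not change when the same stall pattern is recorded for a longer time; only the bump depends on a single event.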

@openjdk

openjdk bot commented Jul 18, 2025

@Suchitainf this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout 6203
git fetch https://git.openjdk.org/jmc.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

@openjdk openjdk bot added the merge-conflict label Jul 18, 2025
@Suchitainf
Collaborator Author

The new commit contains the text update suggested by @jpbempel, and the metric used to calculate the score has changed from the allocation stall event count to the stall rate. I have also added the stall rate to the rule result, along with the allocation stall count, total stall duration, and maximum stall duration.

Here are the fresh screenshots:

[6 screenshots]

I will create a follow-up enhancement to add a new rule for the maximum stall duration metric.
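As a rough sketch of how the four values reported in the rule result could be derived from a list of stall durations: `StallSummary` is a hypothetical type, not code from this PR, and "stall rate" is assumed here to mean stalled milliseconds per minute of recording; the PR may define it differently.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical summary of allocation stalls in one recording (not JMC code).
record StallSummary(long count, Duration total, Duration max, double stalledMsPerMinute) {

    /** Aggregates stall durations; the rate is 0 when the recording length is not positive. */
    static StallSummary of(List<Duration> stalls, Duration recordingLength) {
        Duration total = Duration.ZERO;
        Duration max = Duration.ZERO;
        for (Duration stall : stalls) {
            total = total.plus(stall);
            if (stall.compareTo(max) > 0) {
                max = stall;
            }
        }
        double minutes = recordingLength.toMillis() / 60_000.0;
        double rate = minutes > 0 ? total.toMillis() / minutes : 0.0;
        return new StallSummary(stalls.size(), total, max, rate);
    }
}
```

Keeping the maximum alongside the rate means a follow-up rule on maximum stall duration could reuse the same aggregation.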

@openjdk openjdk bot removed the merge-conflict label Jul 19, 2025
@Suchitainf Suchitainf requested a review from jpbempel July 22, 2025 05:58