6203: Add ZGC allocation stall rule #664

Suchitainf · 2025-07-09T14:17:10Z

This enhancement adds a new rule for ZGC Allocation Stall events.

The default configuration:

Here are a few screenshots for reference:

Ignored

If we change the default configuration as below:

Progress

Commit message must refer to an issue
Change must be properly reviewed (1 review required, with at least 1 Committer)

Issue

JMC-6203: Add ZGC allocation stall rule (Enhancement - P2)

Reviewing

Using git

Checkout this PR locally:
$ git fetch https://git.openjdk.org/jmc.git pull/664/head:pull/664
$ git checkout pull/664

Update a local copy of the PR:
$ git checkout pull/664
$ git pull https://git.openjdk.org/jmc.git pull/664/head

Using Skara CLI tools

Checkout this PR locally:
$ git pr checkout 664

View PR using the GUI difftool:
$ git pr show -t 664

Using diff file

Download this PR as a diff file:
https://git.openjdk.org/jmc/pull/664.diff

Using Webrev

Link to Webrev Comment

bridgekeeper · 2025-07-09T14:17:49Z

👋 Welcome back schaturvedi! A progress list of the required criteria for merging this PR into master will be added to the body of your pull request. There are additional pull request commands available for use with this pull request.

openjdk · 2025-07-09T14:19:01Z

❗ This change is not yet ready to be integrated.
See the Progress checklist in the description for automated requirements.

mlbridge · 2025-07-09T14:22:03Z

Webrevs

thegreystone · 2025-07-14T13:33:46Z

I would be wary about making the heuristic dependent on the number of stall events per recording. Perhaps we can find a better thing to look at, that is unrelated to the length of the recording? For example, how long a particular thread has been stalled per minute, or the longest time a thread was stalled (or both)?

jpbempel · 2025-07-16T11:57:42Z

...ain/resources/org/openjdk/jmc/flightrecorder/rules/jdk/messages/internal/messages.properties

@@ -751,3 +751,11 @@ VMOperationRuleFactory_TEXT_WARN_COMBINED_DURATION=There are long lasting blocki
 # {longestOperationDuration} is a time period, {longestOperation} is a JVM operation type, {longestOperationCaller} is a thread name, {longestOperationStartTime} is a time stamp
 VMOperationRuleFactory_TEXT_WARN_LONG=There are long lasting blocking VM operations in this recording. The longest was of type {longestOperation} and lasted for {longestOperationDuration}. It was initiated from thread {longestOperationCaller} and happened at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
 VMOperationRuleFactory_TEXT_WARN_LONG_COMBINED_DURATION=There are long lasting blocking VM operations in this recording. The longest was created from multiple close consecutive operations that were of type {longestOperation} and lasted for {longestOperationDuration} in total. They were initiated from thread {longestOperationCaller} and started at {longestOperationStartTime}. VM operations are JVM internal operations. Some VM operations are executed synchronously (i.e. will block the calling thread), and some need to be executed at so called safe points. Safe point polling is a cooperative suspension mechanism that halts byte code execution in the JVM. A VM operation occurring at a safe point will effectively be "stopping the world", meaning that no Java code will be executing in any thread while executing VM operations at that safe point. Long lasting VM operations executing at safe points can decrease the responsiveness of an application. If you do find such VM operations, then the type of operation and its caller thread provide vital information to understand why the VM operation happened. To find more details, check if there is an event in the caller thread intersecting this event time wise. Looking at the stack trace for such an event can help determining what caused it. See [Runtime Overview](http://openjdk.java.net/groups/hotspot/docs/RuntimeOverview.html) for further information.
+ZGCAllocationStall_RULE_NAME=ZGC Allocation Stall
+ZgcAllocationStall_TEXT_INFO=In ZGC, a type of concurrent Garbage Collection (GC) algorithm, GC threads run concurrently with application threads, resulting in minimal stop-the-world pauses. However, because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory. In such cases, the JVM temporarily stops the application threads from creating new objects. This 'stopping of object creation' is known as an "Allocation Stall." \n Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling. Earlier versions of ZGC (i.e. single-generation ZGC algorithm), are more prone to Allocation Stalls. \n 2. High Object Allocation Rate: If your application creates objects at a very high rate, it can overwhelm the GC's ability to reclaim memory quickly enough, leading to stalls.\n 3. Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.\n


because these pauses are so brief, application threads may create objects faster than GC threads can reclaim memory.

Stalls do not happen because the pauses are brief. They happen simply because app threads can create objects faster than the GC cycle is able to release memory. Technically, allocating is very cheap, while reclaiming requires traversing the object graph and moving objects, and there are sometimes fewer GC threads than app threads allocating.

Allocation Stall occurs due to following reasons: \n 1. Inefficient GC Algorithm: This is often the primary cause of Allocation Stall. Using a non-optimal GC algorithm or improper GC settings for your application's workload can lead to stalling.

I would write it like this: I prefer to say that the GC algorithm requires more time to reclaim memory, and therefore we need more GC threads and/or more room in the Java heap, so that there is more time to reclaim before reaching the heap limit.
Non-generational ZGC has a longer GC cycle than generational ZGC, which contributes to more stalls (the whole object graph needs to be scanned before trying to reclaim memory, versus only a fraction of it for generational). So "improper GC settings" is true, but I would suggest mentioning heap sizing, GC threads, or switching to generational.

Memory Fragmentation: Even if there is free memory, fragmentation in the heap can prevent large objects from being allocated, contributing to Allocation Stalls.

Even if it is technically possible, I am not sure this is a main issue with ZGC.

thegreystone · 2025-07-16T12:30:41Z

As per the JMC dev slack, perhaps sum of stalled time per minute could be a good one? That one would have to be a rate to be recording length independent. Perhaps also combined with a higher score if very long individual allocation stalls are present?
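
To make the suggestion concrete, here is a minimal Java sketch of a recording-length-independent heuristic: total stalled time per minute, with a score bump when a single stall is very long. The class name, method names, and both thresholds are hypothetical and chosen only for illustration. This is not the code in the PR and it does not use the JMC rules API.

```java
import java.time.Duration;
import java.util.List;

// Hypothetical sketch, not the JMC implementation.
final class StallRateSketch {
    // Assumed limits, for illustration only.
    static final double WARN_RATE_MS_PER_MIN = 1000.0;
    static final Duration LONG_STALL = Duration.ofSeconds(1);

    /** Total stalled time per minute of recording, in milliseconds per minute. */
    static double stalledMillisPerMinute(List<Duration> stalls, Duration recordingLength) {
        if (recordingLength.isZero() || recordingLength.isNegative()) {
            throw new IllegalArgumentException("recording length must be positive");
        }
        long totalNanos = 0;
        for (Duration stall : stalls) {
            totalNanos += stall.toNanos();
        }
        double minutes = recordingLength.toNanos() / 60e9;
        return (totalNanos / 1e6) / minutes;
    }

    /** Longest individual stall, or zero if there were none. */
    static Duration longestStall(List<Duration> stalls) {
        return stalls.stream().max(Duration::compareTo).orElse(Duration.ZERO);
    }

    /** Score in [0, 100]: up to 75 from the rate, plus 25 if any stall is very long. */
    static double score(double rateMsPerMin, Duration longest) {
        double rateScore = Math.min(75.0, 75.0 * rateMsPerMin / WARN_RATE_MS_PER_MIN);
        double longBonus = longest.compareTo(LONG_STALL) >= 0 ? 25.0 : 0.0;
        return rateScore + longBonus;
    }

    public static void main(String[] args) {
        List<Duration> stalls = List.of(Duration.ofMillis(20), Duration.ofMillis(40));
        double rate = stalledMillisPerMinute(stalls, Duration.ofMinutes(2));
        System.out.println("rate (ms/min): " + rate);
        System.out.println("score: " + score(rate, longestStall(stalls)));
    }
}
```

Because the rate is normalized by recording length, the same stall pattern gives the same score for a two-minute and a two-hour recording, which a raw event count does not.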

I also opened https://bugs.openjdk.org/browse/JDK-8362416 to make the rule more useful in case there are a lot of stalls under the default 10ms threshold of the current ZGC Allocation Stall event. We shouldn't block this PR on that work though, but rather ensure we use the new event when available. I will open a new JMC issue to track this.

openjdk · 2025-07-18T18:08:17Z

@Suchitainf this pull request can not be integrated into master due to one or more merge conflicts. To resolve these merge conflicts and update this pull request you can run the following commands in the local repository for your personal fork:

git checkout 6203
git fetch https://git.openjdk.org/jmc.git master
git merge FETCH_HEAD
# resolve conflicts and follow the instructions given by git merge
git commit -m "Merge master"
git push

Suchitainf · 2025-07-18T18:13:58Z

The new commit has the text update suggested by @jpbempel, and the metric used to calculate the score has changed from the allocation stall event count to the stall rate. I have also added the stall rate to the rule result, along with the allocation stall count, total stall duration, and maximum stall duration.

Here are the fresh screenshots:

I will create a follow-up enhancement to add a new rule for the maximum stall duration metric.

6203: Add ZGC allocation stall rule

2d95955

openjdk bot added the rfr label Jul 9, 2025

jpbempel reviewed Jul 16, 2025

Updated the rule as per new metric allocation stall rate

e6c3ab8

openjdk bot added the merge-conflict label Jul 18, 2025

Resolving merge conflict

c180558

openjdk bot removed the merge-conflict label Jul 19, 2025

Suchitainf requested a review from jpbempel July 22, 2025 05:58
