Commit 638c43d (1 parent: 4b25db8)

Sharing is scaring article

6 files changed: +246 -0 lines changed

MethodDataSharing.java (new file, 45 lines):
```java
package redhat.app.services.benchmark;

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.CompilerControl;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

/**
 * This benchmark should be used with the following JVM option to tune the tiered compilation level:
 * -XX:TieredStopAtLevel=
 */
@State(Scope.Benchmark)
@Fork(2)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class MethodDataSharing {

    @Benchmark
    public void doFoo() {
        foo(1000, true);
    }

    // DONT_INLINE keeps foo as a separate compilation unit, so its own profile
    // counters (and not the JMH infrastructure's) dominate the measurement.
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    private static int foo(int count, boolean countAll) {
        int total = 0;
        // Tight loop: when compiled at Tier 3, each iteration also updates
        // shared MethodData counters (loop back-edges, branch taken/not-taken).
        for (int i = 0; i < count; i++) {
            if (countAll) {
                total++;
            }
        }
        return total;
    }
}
```
The article source (new file, 201 lines):
---
title: "Sharing is (S)Caring: how tiered compilation can affect the scalability of Java applications"
date: 2024-12-20T00:00:00Z
categories: ['performance', 'benchmarking', 'methodology']
summary: 'Learn about Tiered Compilation and its effect on the scalability of Java applications'
image: 'sharing_is_scaring.png'
related: ['']
authors:
- Francesco Nigro
---
# JVM Challenges in Containers

The landscape of software deployment has been transformed by containers, which have become the de facto standard for modern applications. Containers offer lightweight, portable, and consistent environments, and with the rise of orchestration platforms like Kubernetes, developers and operators can efficiently deploy, scale, and manage applications across diverse infrastructures.

However, this new containerized world is not without challenges, particularly for applications with nuanced runtime requirements, such as those built on the Java Virtual Machine (JVM). The JVM, a cornerstone of enterprise software, was originally designed in an era when it could assume unrestricted access to the underlying system's resources. Containers, in contrast, abstract these resources and often impose quotas on CPU, memory, and other critical parameters.

While the JVM has evolved to become more container-aware — incorporating features like container-cgroup resource detection — there remains a gap in ensuring that all of its components can function optimally within the constraints of a containerized environment.

This is especially true for the Just-In-Time (JIT) compilers, C1 and C2, which are essential for delivering peak application performance. These components are particularly sensitive to suboptimal resource allocation or misconfiguration, and their efficiency can be severely hampered in containers that are not tuned for JVM workloads.

Unfortunately, while containers excel at abstracting infrastructure and simplifying deployment, they do not natively address the specific needs of the JVM — in short, they are not JVM-aware.
This leaves developers and operators responsible for bridging the gap. Despite the promise of containers to abstract away low-level runtime details, achieving a healthy and efficient JVM runtime still requires an understanding of the underlying system and careful configuration.

In this series of articles we'll present what happens to a Java application running with a shortage of resources and the impact on its performance.
Furthermore, since our Application Service Performance Team is known https://redhatperf.github.io/post/type-check-scalability-issue/[for finding stealthy scalability issues], in this part we'll reveal another scary one.

# Tiered Compilation Basics

First, let's recall a key mechanism employed by OpenJDK HotSpot to optimize the application's code: Tiered Compilation.

Tiered compilation in the HotSpot JVM balances application startup speed and runtime performance by using https://developers.redhat.com/articles/2021/06/23/how-jit-compiler-boosts-java-performance-openjdk[multiple levels] of code execution and optimization.
Initially, it uses an interpreter for immediate execution. As methods are invoked frequently, it employs a fast compiler, C1, to generate native code.
Over time, methods that are heavily used ("hot spots") are further optimized with the optimizing compiler, C2, which applies advanced optimizations for maximum performance.

This tiered approach ensures quick application responsiveness while progressively optimizing performance-critical code paths. The name "HotSpot" reflects this focus on dynamically identifying and optimizing hot spots in code execution for efficiency.
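
These tier transitions can be observed directly with the standard `-XX:+PrintCompilation` flag, which logs each compilation with its tier level; a quick sketch, where `app.jar` is just a placeholder for any application of yours:

```
# Each logged line includes a compile id, the tier level (1-4) and the method
# being compiled; Tier 3 entries are the C1 full-profile versions discussed below.
$ java -XX:+PrintCompilation -jar app.jar
```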

What's less known about tiered compilation is that the C2 compiler can be very CPU intensive and, when it doesn't have enough resources, its activity https://jpbempel.github.io/2020/05/22/startup-containers-tieredcompilation.html[affects startup time].
This has led to different initiatives and efforts, like https://openjdk.org/projects/leyden/[Project Leyden], to help Java applications, especially ones that perform a lot of repetitive work at startup, benefit from saving the CPU resources spent on compilation.

Moreover, since C2's work affects the time to reach peak performance, what happens to the application's runtime performance if C2 hasn't completed its job?

## Full-Profile Compilation under the microscope

To answer the previous question, we need to understand what happens while moving from C1-compiled code to issuing a request to compile at the C2 level.

The so-called C1 full-profile level (i.e. Tier 3) compiles a Java method into native code, adding telemetry which captures different aspects useful to perform a more effective C2 (Tier 4) compilation, e.g.:

- Number of invocations of a method
- Number of iterations of loops
- Branches taken/not-taken and their occurrences
- Type profiling to optimize dynamic calls
- … and many others!

Such telemetry is implemented in the form of a https://github.com/openjdk/jdk/blob/jdk-24%2B26/src/hotspot/share/oops/methodData.hpp#L119-L132[MethodData], each one containing different counters for the same Java method; some are used to trigger further compilations or drive different optimization decisions.
Reading OpenJDK's https://github.com/openjdk/jdk/blob/jdk-24%2B26/src/hotspot/share/oops/methodData.hpp#L119-L132[MethodData documentation], something interesting pops up:

```
// All data in the profile is approximate. It is expected to be accurate
// on the whole, but the system expects occasional inaccuracies, due to
// counter overflow, multiprocessor races during data collection
```

This implies that these counters are shared and updated concurrently by all the Java application threads using a Tier 3 compiled method and, as we've already shown in the https://redhatperf.github.io/post/type-check-scalability-issue/[type pollution article], this could affect the scalability of an application when hit in the hot path.
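
To make the mechanism concrete, here is a minimal sketch of the behavior — plain Java, not HotSpot code, and the class, field names, and iteration counts are ours. Every thread executing the "profiled" method performs plain, racy read-modify-write updates on counters living side by side in one shared object, just like Tier 3 generated code does (note that HotSpot emits these updates at Tier 3, before the aggressive optimizations that might coalesce such plain increments in ordinary Java code):

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for shared MethodData counters.
public class SharedCountersDemo {

    static long invocationCount; // adjacent fields: likely on the same cache line
    static long backEdgeCount;

    static void profiledWork(int iterations) {
        invocationCount++;                      // racy read-modify-write, as in Tier 3 code
        for (int i = 0; i < iterations; i++) {
            backEdgeCount++;                    // one racy update per loop back-edge
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Runnable task = () -> {
            for (int i = 0; i < 100_000; i++) {
                profiledWork(1000);
            }
        };
        Thread t1 = new Thread(task);
        Thread t2 = new Thread(task);
        long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // With 2 threads, the cache line(s) holding the counters bounce between
        // cores: expect far worse than the single-thread time, plus an inaccurate
        // total, which profiling deliberately tolerates.
        System.out.printf("%,d back-edges counted in %d ms%n", backEdgeCount,
                TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
    }
}
```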

Let's see what the performance impacts and implications are if that happens.

# Sharing is (S)Caring

As mentioned in the previous part, we expect some scalability bottleneck due to the MethodData counter updates.
In order to show it, we use link:MethodDataSharing.java[this micro-benchmark], written with https://github.com/openjdk/jmh[JMH]:

image::method_data_sharing.png[MethodDataSharing.java]

In the next benchmark runs we will control the maximum level of compilation available to the whole application (including the JMH infrastructure) via `-XX:TieredStopAtLevel=3`. Since we have concentrated the counter updates in the *foo* method by adding a tight loop, the JMH infrastructure cost of calling it shouldn't be as relevant as the benchmarked method itself.
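
For reference, such a run can be launched like this (assuming the benchmark is packaged as `target/benchmark.jar`, consistent with the `numactl` command lines shown later):

```
$ java -jar target/benchmark.jar MethodDataSharing -t 1 --jvmArgs="-XX:TieredStopAtLevel=3"
```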

Running it with a single thread:

```
Benchmark                Mode  Cnt     Score   Error  Units
MethodDataSharing.doFoo  avgt   20  1374.518 ± 0.676  ns/op
```

With 2 threads, instead, execution time slows down by a relevant factor; if there were no sharing at all, it would stay the same:

```
Benchmark                Mode  Cnt      Score     Error  Units
MethodDataSharing.doFoo  avgt   20  19115.045 ± 736.856  ns/op
```
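
The hot-region assembly below was obtained with JMH's built-in `perfasm` profiler (which requires Linux `perf` to be installed); a sketch of the command, under the same packaging assumption as above:

```
$ java -jar target/benchmark.jar MethodDataSharing -t 2 --jvmArgs="-XX:TieredStopAtLevel=3" -prof perfasm
```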

Inspecting the assembly output (see https://psy-lob-saw.blogspot.com/2015/07/jmh-perfasm.html[JMH perfasm explained] and https://perfwiki.github.io/main/[perf: Linux profiling with performance counters]) reveals a large amount of cycles (leftmost column) spent in this region:

```asm
....[Hottest Region 1]..............................................................................
c1, level 3, redhat.app.services.benchmark.MethodDataSharing::foo, version 2, compile id 719

             0x00007fb68cf14168: mov    $0x0,%edi
          ╭  0x00007fb68cf1416d: jmp    0x00007fb68cf141f7  ;*iload_3 {reexecute=0 rethrow=0 return_oop=0}
          │                                                 ; - redhat.app.services.benchmark.MethodDataSharing::foo@4 (line 39)
          │  0x00007fb68cf14172: nopw   0x0(%rax,%rax,1)
  0.03%   │  0x00007fb68cf14178: cmp    $0x0,%edx
          │  0x00007fb68cf1417b: movabs $0x7fb6104de900,%rbx  ; {metadata(method data for {method} {0x00007fb6104783c8} 'foo' '(IZ)I' in 'redhat/app/services/benchmark/MethodDataSharing')}
          │  0x00007fb68cf14185: movabs $0x158,%rcx
          │╭ 0x00007fb68cf1418f: je     0x00007fb68cf1419f
          ││ 0x00007fb68cf14195: movabs $0x168,%rcx
          │↘ 0x00007fb68cf1419f: mov    (%rbx,%rcx,1),%r8
  9.41%   │  0x00007fb68cf141a3: lea    0x1(%r8),%r8
          │  0x00007fb68cf141a7: mov    %r8,(%rbx,%rcx,1)
  0.11%   │ ╭ 0x00007fb68cf141ab: je    0x00007fb68cf141b3  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
          │ │                                               ; - redhat.app.services.benchmark.MethodDataSharing::foo@10 (line 40)
          │ │ 0x00007fb68cf141b1: inc   %edi
          │ ↘ 0x00007fb68cf141b3: inc   %eax
          │   0x00007fb68cf141b5: movabs $0x7fb6104de900,%rbx  ; {metadata(method data for {method} {0x00007fb6104783c8} 'foo' '(IZ)I' in 'redhat/app/services/benchmark/MethodDataSharing')}
          │   0x00007fb68cf141bf: mov    0xf8(%rbx),%ecx
 17.60%   │   0x00007fb68cf141c5: add    $0x2,%ecx
  0.01%   │   0x00007fb68cf141c8: mov    %ecx,0xf8(%rbx)
  0.02%   │   0x00007fb68cf141ce: and    $0x3ffe,%ecx
  0.05%   │   0x00007fb68cf141d4: cmp    $0x0,%ecx
          │   0x00007fb68cf141d7: je     0x00007fb68cf14266  ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │                                                  ; - redhat.app.services.benchmark.MethodDataSharing::foo@19 (line 39)
          │   0x00007fb68cf141dd: mov    0x458(%r15),%r10  ; ImmutableOopMap {}
          │                                                ;*goto {reexecute=1 rethrow=0 return_oop=0}
          │                                                ; - (reexecute) redhat.app.services.benchmark.MethodDataSharing::foo@19 (line 39)
  0.03%   │   0x00007fb68cf141e4: test   %eax,(%r10)  ; {poll}
          │   0x00007fb68cf141e7: movabs $0x7fb6104de900,%rbx  ; {metadata(method data for {method} {0x00007fb6104783c8} 'foo' '(IZ)I' in 'redhat/app/services/benchmark/MethodDataSharing')}
          │   0x00007fb68cf141f1: incl   0x178(%rbx)  ;*goto {reexecute=0 rethrow=0 return_oop=0}
          │                                           ; - redhat.app.services.benchmark.MethodDataSharing::foo@19 (line 39)
 40.64%   ↘   0x00007fb68cf141f7: cmp    %esi,%eax
              0x00007fb68cf141f9: movabs $0x7fb6104de900,%rbx  ; {metadata(method data for {method} {0x00007fb6104783c8} 'foo' '(IZ)I' in 'redhat/app/services/benchmark/MethodDataSharing')}
              0x00007fb68cf14203: movabs $0x148,%rcx
          ╭   0x00007fb68cf1420d: jl     0x00007fb68cf1421d
          │   0x00007fb68cf14213: movabs $0x138,%rcx
  0.01%   ↘   0x00007fb68cf1421d: mov    (%rbx,%rcx,1),%r8
 30.63%       0x00007fb68cf14221: lea    0x1(%r8),%r8
  0.03%       0x00007fb68cf14225: mov    %r8,(%rbx,%rcx,1)
  0.04%       0x00007fb68cf14229: jl     0x00007fb68cf14178  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                             ; - redhat.app.services.benchmark.MethodDataSharing::foo@6 (line 39)
....................................................................................................
67.86%  <total for region 1>
```

This reports 4 `metadata(method data for {method} {0x00007fb6104783c8}` comments, all referring to the MethodData starting at `0x7fb6104de900`.
These comments are placed near the updates of 6 different fields (offsets in hex):

* `0x158` or `0x168`, based on some condition
* `0xf8`
* `0x178`
* `0x148` or `0x138`, based on some condition

According to these field offsets, the relative distance between counters in the MethodData object can be less than 64 bytes (e.g. `0x178 - 0x158` = 32 bytes), which can trigger https://en.wikipedia.org/wiki/False_sharing[false sharing] among counters sharing the same cache line(s), further impacting scalability.
Sadly, this bad effect is not reliably reproducible, since fields can fall (or not) into different cache lines without much control on our side (i.e. HotSpot is responsible for allocating each MethodData and laying out its fields).

It's still possible to further analyse the number of shared cache lines using https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/detecting-false-sharing_monitoring-and-managing-system-status-and-performance#the-purpose-of-perf-c2c_detecting-false-sharing[perf c2c], although for this benchmark it won't be possible to distinguish when false sharing occurs, because both true and false sharing produce the same effect (i.e. cache-line sharing).
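
A sketch of that workflow with the standard `perf c2c` subcommands (same jar-path assumption as before):

```
# Sample cache-line contention events while the 2-thread run executes:
$ perf c2c record -- java -jar target/benchmark.jar MethodDataSharing -t 2 --jvmArgs="-XX:TieredStopAtLevel=3"
# Then report the cache lines with the most cross-core sharing:
$ perf c2c report
```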

> **NOTE:**
> As a side note for the curious readers: on Intel, false sharing can occur at 128-byte granularity due to the Spatial Prefetcher (discovered in this https://github.com/jbossas/jboss-threads/pull/191[jboss-threads issue]).
>
> image:intel_opt_guide.png[Intel Optimization Guide]
>
> Which means that, given the distance between some of the counters reported in this benchmark (32 bytes), it will likely happen.

# We live in a NUMA world

So far we've seen that a concurrent Java application running Tier 3 compiled code can hit a severe scalability problem, but this is still optimistic compared to what would happen in the "real world".
Nowadays, modern CPUs can have multiple https://en.wikipedia.org/wiki/Non-uniform_memory_access[NUMA nodes]: for example, the benchmarks in this article run on a machine with 2 NUMA nodes (each with 8 physical cores, with SMT):

image::numa.png[NUMA lstopo output]
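
The image above is `lstopo` output (from the hwloc package); the same topology can also be dumped in text form with `numactl`:

```
$ numactl --hardware   # lists each NUMA node with its CPUs and memory
$ lstopo               # draws the cache/core/node hierarchy shown above
```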

We can now verify that binding the application to 2 physical cores belonging to the same NUMA node produces a different effect than spreading it across both nodes.

Running on the same node (node 0):

```
$ numactl --physcpubind 0,1 java -jar target/benchmark.jar MethodDataSharing -t 2 --jvmArgs="-XX:TieredStopAtLevel=3"

Benchmark                Mode  Cnt     Score     Error  Units
MethodDataSharing.doFoo  avgt   20  8662.030 ± 731.919  ns/op
```

or on 2 different nodes:

```
$ numactl --physcpubind 0,8 java -jar target/benchmark.jar MethodDataSharing -t 2 --jvmArgs="-XX:TieredStopAtLevel=3"

Benchmark                Mode  Cnt      Score      Error  Units
MethodDataSharing.doFoo  avgt   20  16427.929 ± 1475.128  ns/op
```

The latter shows a much worse execution time, due to the different way cache-coherency traffic is handled and/or the increased distance between the cores.

More information about the cost of communication between cores on different architectures can be found in https://github.com/nviennot/core-to-core-latency[this GitHub repository].

# …and what about containers?

What we've shown so far is pretty scary:

* Running code at Tier 3, due to slowed-down/blocked C2 compiler thread(s), hits a severe scalability problem, with just 2 cores!
* If that happens, false sharing and NUMA can make it much worse.

So, what about containers?

In containers, users usually set CPU quotas (e.g. in the form of request/limit, in https://docs.openshift.com/container-platform/3.11/dev_guide/compute_resources.html[OpenShift]) without binding a container to run on a specific NUMA node, which can expose their application to the worst possible version of this issue: in the next part we will focus on this use case with a more realistic example.
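
To illustrate the distinction with plain Docker flags (`myapp` is a placeholder image, not something from this article): a quota caps CPU time without pinning, while an explicit cpuset binds the container to chosen cores:

```
# Quota only: threads may be scheduled on any core, across NUMA nodes.
$ docker run --cpus=2 myapp

# Explicit pinning: restrict to cores 0 and 1 (the same node, on our topology).
$ docker run --cpuset-cpus=0,1 myapp
```

The quota-only form is by far the most common in practice, and it is exactly the setup the next part will examine.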