Fix ARM64 Virtual Thread currentCarrierThread Intrinsic Bug #25

Open

macarte wants to merge 12 commits into macarte/rebased-win-arrch64-fixes from macarte/rebased-additional-fixes

Conversation

@macarte (Contributor) commented Feb 23, 2026

Windows AArch64 Fixes — PR Packaging Plan

The 11 commits (combined, with internal reverts resolved) touch 17 files and decompose into 6 logically distinct PRs.


PR 1: Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime

Files:

  • make/autoconf/flags-cflags.m4 — add /volatile:ms to JVM_CFLAGS for AArch64
  • src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp — PlatformLoad/PlatformStore/PlatformOrderedLoad/PlatformOrderedStore using LDAR/STLR intrinsics
  • src/hotspot/os_cpu/windows_aarch64/orderAccess_windows_aarch64.hpp — switch READ_MEM_BARRIER/WRITE_MEM_BARRIER/FULL_MEM_BARRIER from std::atomic_thread_fence to __dmb()

Rationale: This is the foundational fix. MSVC's /volatile:iso (default on ARM64) means volatile reads/writes are plain LDR/STR with no acquire/release semantics, breaking HotSpot's assumption of volatile ≈ acquire/release throughout the C++ runtime. The /volatile:ms flag restores those semantics for all JVM code. The explicit LDAR/STLR intrinsics in atomic_windows_aarch64.hpp produce identical codegen to what /volatile:ms already generates for plain volatile dereferences, but are retained as defense-in-depth — they guarantee correct acquire/release semantics for Atomic::load()/Atomic::store() regardless of the compiler flag setting. The PlatformOrderedLoad/PlatformOrderedStore overrides additionally avoid a redundant dmb that the generic fallback would emit before the LDAR/STLR. The __dmb() barriers in orderAccess replace the previous std::atomic_thread_fence calls with the ARM-specific intrinsics.
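As a Java-level analogy (not the HotSpot C++ code), the acquire/release contract that LDAR/STLR provide can be sketched with VarHandle getAcquire/setRelease; the class and field names here are illustrative only:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Java-level sketch of the acquire/release contract that LDAR/STLR provide:
// a release store makes all prior writes visible to any thread whose acquire
// load observes that store. Names (AcquireReleaseSketch, data, ready) are
// illustrative, not JDK or HotSpot identifiers.
public class AcquireReleaseSketch {
    static final VarHandle READY;
    int data;        // plain field, published via the release store below
    boolean ready;   // accessed only through getAcquire/setRelease

    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        data = value;                  // ordinary store
        READY.setRelease(this, true);  // STLR-like: orders the store above before it
    }

    Integer consume() {
        if ((boolean) READY.getAcquire(this)) {  // LDAR-like: orders later loads after it
            return data;                          // guaranteed to see the published value
        }
        return null;                              // not published yet
    }

    public static void main(String[] args) throws InterruptedException {
        AcquireReleaseSketch box = new AcquireReleaseSketch();
        Thread writer = new Thread(() -> box.publish(42));
        writer.start();
        writer.join();
        System.out.println(box.consume());  // prints 42
    }
}
```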

Tests unblocked: Broad set of intermittent failures across ObjectMonitor, ParkEvent, lock-free algorithms in the HotSpot runtime.


PR 2: ObjectMonitor Dekker protocol StoreLoad barriers in try_spin()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — add OrderAccess::storeload() after set_successor(current) in both places in try_spin()

Rationale: On ARM64, the Dekker protocol between try_spin() (ST _succ → LD _owner) and exit() is broken without an explicit StoreLoad barrier. Even with PR 1's /volatile:ms (which gives acquire/release semantics), volatile store + volatile load to different addresses does not imply StoreLoad ordering on ARM64. This causes missed wakeups and thread starvation. This is a shared (not Windows-specific) fix that is correct on all ARM64 platforms.
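The broken Dekker handshake can be reduced to a Java-level litmus sketch (illustrative names; not the ObjectMonitor code): each thread stores its own flag, then loads the other's, and the StoreLoad fence is what forbids the outcome in which both loads miss the other side's store:

```java
import java.lang.invoke.VarHandle;

// Dekker litmus sketch: T1 stores x then loads y; T2 stores y then loads x.
// With VarHandle.fullFence() (a StoreLoad barrier) between each store and
// load, the outcome r1 == 0 && r2 == 0 is forbidden. Names are illustrative.
public class DekkerSketch {
    static int x, y;    // plain fields: the ordering comes from the fences
    static int r1, r2;

    // One Dekker round. join() publishes r1/r2 back to the calling thread.
    static void runOnce() throws InterruptedException {
        x = 0; y = 0;
        Thread t1 = new Thread(() -> { x = 1; VarHandle.fullFence(); r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; VarHandle.fullFence(); r2 = x; });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10_000; i++) {
            runOnce();
            if (r1 == 0 && r2 == 0)
                throw new AssertionError("forbidden (0, 0) outcome observed");
        }
        System.out.println("no (0, 0) outcome in 10000 rounds");
    }
}
```

Without the two fullFence() calls, the (0, 0) outcome is permitted on ARM64, which is exactly the missed-wakeup shape described above.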

Tests unblocked: Starvation test, various monitor contention hangs.

Note: Could be folded into PR 1 since it depends on the barrier infrastructure, but it's a clean logical unit on its own and touches shared code (not os_cpu).


PR 3: ARM64 currentCarrierThread intrinsic — MO_ACQUIRE for OopHandle loads + @DontInline on Continuation.yield()

Files:

  • src/hotspot/share/c1/c1_GraphBuilder.cpp — don't inline @ChangesCurrentThread methods into non-@ChangesCurrentThread callers
  • src/hotspot/share/c1/c1_LIRGenerator.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in do_JavaThreadField
  • src/hotspot/share/gc/shared/c1/barrierSetC1.cpp — honor MO_ACQUIRE decorator with membar_acquire()
  • src/hotspot/share/opto/library_call.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in C2 current_thread_helper
  • src/java.base/share/classes/jdk/internal/vm/Continuation.java — @DontInline on Continuation.yield()

Rationale: On ARM64, after a virtual thread migrates between carriers, the OopHandle dereference for currentCarrierThread can be reordered with subsequent dependent loads (e.g., Thread.cont), observing stale data. The MO_ACQUIRE ensures ordering. The @DontInline on yield() and the @ChangesCurrentThread inlining guard prevent the compiler from inlining across carrier-change boundaries. This is a shared ARM64 fix (not Windows-specific).
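A simplified, hypothetical Java analogy of the caching half of this hazard (all names here are made up; the real bug lives in compiled code, not source): reusing a carrier reference that was read before a carrier switch yields the stale value, while re-reading after an out-of-line switch does not:

```java
// Hypothetical analogy of the @ChangesCurrentThread hazard: if a value read
// before a carrier switch is cached and reused afterwards, it is stale.
// All names (CarrierSwitchSketch, carrier, switchCarrier) are illustrative.
public class CarrierSwitchSketch {
    static volatile String carrier = "carrier-1";

    static void switchCarrier() {      // stands in for yield/mount migrating the VT
        carrier = "carrier-2";
    }

    // Mimics the bug: the carrier is read once ("register-held") and reused
    // after the switch, observing the pre-switch value.
    static String cachedAcrossSwitch() {
        String c = carrier;
        switchCarrier();
        return c;                      // stale: still "carrier-1"
    }

    // Mimics the fix: keeping the switch as a real call forces a re-read.
    static String reReadAfterSwitch() {
        switchCarrier();
        return carrier;                // fresh: "carrier-2"
    }

    public static void main(String[] args) {
        carrier = "carrier-1";
        System.out.println(cachedAcrossSwitch());  // prints carrier-1
        carrier = "carrier-1";
        System.out.println(reReadAfterSwitch());   // prints carrier-2
    }
}
```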

Tests unblocked: Virtual thread scheduling failures, PingPong test.


PR 4: ForkJoinPool carrier starvation — tryCompensateForMonitor + CarrierThread.beginMonitorBlock()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — compensate_pinned_carrier() / end_compensate_pinned_carrier() + call sites in enter_with_contention_mark
  • src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java — tryCompensateForMonitor(), beginMonitorCompensatedBlock(), updated endCompensatedBlock(), FJP access registration
  • src/java.base/share/classes/jdk/internal/access/JavaUtilConcurrentFJPAccess.java — add beginMonitorCompensatedBlock to interface
  • src/java.base/share/classes/jdk/internal/misc/CarrierThread.java — beginMonitorBlock() + ForkJoinPools.beginMonitorCompensatedBlock()

Rationale: When a pinned virtual thread's carrier blocks on a contended monitor, the ForkJoinPool has no visibility that its worker is blocked. All carriers can end up pinned, preventing the VT holding the contested lock from ever getting a carrier → deadlock. This PR adds FJP compensation that activates an idle worker or creates a spare carrier. This is a correctness fix for virtual threads on all platforms.
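For context, FJP already exposes a compensation hook for cooperative blocking: ForkJoinPool.managedBlock(). The sketch below shows that existing public API (the PR adds an analogous internal path for monitor blocking, which cannot route through managedBlock); the latch usage is illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Sketch of the existing FJP compensation API: managedBlock() tells the pool
// a worker is about to block, so it can activate an idle worker or create a
// spare. The PR adds a similar notification for carriers blocking on
// monitors. The blocking operation here (a latch) is illustrative.
public class CompensationSketch {

    // Submits a task that blocks on the latch via managedBlock(), then
    // releases it and reports whether the worker resumed.
    static boolean blockAndResume() throws Exception {
        CountDownLatch latch = new CountDownLatch(1);

        var task = ForkJoinPool.commonPool().submit(() -> {
            // Wrap the blocking wait so the pool can compensate while this
            // worker is blocked.
            ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                public boolean block() throws InterruptedException {
                    latch.await();                 // the actual blocking operation
                    return true;                   // done blocking
                }
                public boolean isReleasable() {
                    return latch.getCount() == 0;  // true once released
                }
            });
            return true;
        });

        latch.countDown();                         // release the blocked worker
        return task.get(5, TimeUnit.SECONDS);      // true if the worker resumed
    }

    public static void main(String[] args) throws Exception {
        System.out.println(blockAndResume() ? "worker resumed" : "timed out");
    }
}
```

The design problem the PR solves is precisely that a carrier entering a contended ObjectMonitor blocks inside the VM, where no managedBlock() wrapper can be applied, so a dedicated notification path is needed.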

Tests unblocked: Starvation test (fully), carrier deadlock scenarios.

Note: The introduced fences execute on every platform. However, the fences are correctness-critical on ARM64, functionally redundant on x86, and low-cost on the non-hot paths where they appear.

  • Without the fences, on ARM64 both sides can miss each other's writes: acquire doesn't see the state update, release doesn't see WAITING, and the thread is never unparked.

  • On x86, these are effectively redundant — HotSpot already emits a trailing StoreLoad barrier (lock addl) after volatile stores, so a volatile store followed by a volatile load to a different address is already ordered there. (Plain x86-TSO alone does not forbid store→load reordering; the trailing barrier does.) The fullFence() compiles to a lock addl but has negligible cost since release/releaseShared are not hot tight loops (they're called once per lock release).

Dependency: Depends on PR 2's includes (symbolTable.hpp, javaCalls.hpp) being in objectMonitor.cpp. Package them together or add the includes here.


PR 5: ARM64 Dekker-pattern StoreLoad fences in java.util.concurrent + VirtualThread

Files:

  • src/java.base/share/classes/java/lang/VirtualThread.java — U.fullFence() in afterYield (PARKING, BLOCKING, WAITING paths), afterDone, unpark, unblock, joinNanos
  • src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java — VarHandle.fullFence() after CAS in xfer
  • src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java — VarHandle.fullFence() after CAS in xferLifo
  • src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java — U.fullFence() in acquire (after setting WAITING), release, and releaseShared

Rationale: On ARM64, volatile write (STLR) + volatile read (LDAR) to different addresses does NOT provide StoreLoad ordering. All these sites have Dekker-like patterns where one side writes field A then reads field B, while the other writes B then reads A. Without an explicit StoreLoad fence, both sides can miss each other's stores. This is a shared ARM64 correctness fix that applies to all ARM64 JDKs (Linux too, though timing makes it less likely to manifest).
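All of these sites share one shape, reduced here to a parker/unparker sketch with illustrative names (not the AQS or VirtualThread fields; volatile is kept for Java-level safety while the fullFence() calls mirror the ones the PR adds): the parker announces WAITING then checks the signal, the unparker publishes the signal then checks for a waiter, and the two fences guarantee at least one side sees the other:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.locks.LockSupport;

// Reduced parker/unparker Dekker sketch. Parker: store WAITING, fence, load
// signal. Unparker: store signal, fence, load WAITING. With both fences at
// least one side observes the other, so the parker cannot sleep forever.
// All names (state, signal, WAITING) are illustrative.
public class ParkSketch {
    static final int RUNNING = 0, WAITING = 1;
    static volatile int state = RUNNING;
    static volatile boolean signal = false;

    // Runs one parker/unparker race; returns true if the parker woke up.
    static boolean runOnce() throws InterruptedException {
        state = RUNNING;
        signal = false;

        Thread parker = new Thread(() -> {
            state = WAITING;        // announce intent to park
            VarHandle.fullFence();  // StoreLoad before re-checking the signal
            while (!signal) {
                LockSupport.park(); // loop guards against spurious wakeups
            }
        });
        parker.start();

        signal = true;              // publish the wakeup condition
        VarHandle.fullFence();      // StoreLoad before checking for a waiter
        if (state == WAITING) {     // saw the announcement: deliver the unpark
            LockSupport.unpark(parker);
        }

        parker.join(5000);          // with both fences this should not time out
        return !parker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce() ? "woke up" : "hang");
    }
}
```

Drop either fence and, on ARM64, both checks can read the stale value: the unparker skips the unpark and the parker parks without re-reading the signal.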

Tests unblocked: SynchronousQueue, LinkedTransferQueue, virtual thread parking/unparking races, JSR166TestCase failures.


PR 6: TestTrampoline Windows fix

Files:

  • test/hotspot/jtreg/compiler/c2/aarch64/TestTrampoline.java — warm up UTF-16 String.charAt path on Windows before trampoline test

Rationale: On Windows, the JNU encoding may prevent the UTF-16 path of String.charAt from getting hot enough for C2 inlining, which changes trampoline generation behavior. Standalone test fix.


Suggested Integration Order

Order  PR                                                    Risk                                                   Scope
1      MSVC /volatile:iso fix (build + platform atomics)     Low — Windows AArch64 only                             os_cpu, make
2      ObjectMonitor try_spin() StoreLoad                    Low — all ARM64                                        runtime
3      currentCarrierThread intrinsic acquire + @DontInline  Medium — C1/C2/shared                                  compiler, java.base
4      FJP carrier starvation compensation                   Medium — runtime + java.base                           runtime, j.u.c
5      Dekker fences in j.u.c + VirtualThread                High reviewability concern — touches Doug Lea's code   java.base
6      TestTrampoline Windows fix                            Trivial                                                test

PR 1 should go first as the foundation. PRs 2 and 6 are independent and can go in parallel. PR 3 is independent of 2/4. PR 4 depends on 2 (for the includes). PR 5 is the most contentious (touching j.u.c. core) and benefits from having all other fixes in first to narrow down remaining test failures.

…dering

On ARM64's weakly-ordered memory model, pointer loads (LoadP/LoadN)
can be reordered with subsequent dependent loads by the hardware. This
causes failures in concurrent data structures used by virtual thread
scheduling (SynchronousQueue, LinkedTransferQueue) where a loaded
pointer is immediately dereferenced but the dependent field load sees
stale data.

Force MemNode::acquire at IR creation time in LoadNode::make() for all
LoadP/LoadN nodes on AARCH64 when barrier_data == 0. This must happen
at IR creation rather than instruction selection because C2's optimizer
can eliminate or reorder loads between those phases — a load removed by
LoadNode::Identity() or hoisted by Ideal() will never reach the matcher
to receive a barrier.

Loads with barrier_data != 0 (GC reference barriers) are excluded since
GC barrier patterns emit their own ordering sequences and have no
acquire variants in the .ad file.

Also reverts needs_acquiring_load() to the upstream form (is_acquire()
only), since the opcode-based check is now redundant — all pointer
loads already have acquire set on the IR node.

…atching

Lower the cost of loadP_acq and loadN_acq from (4*INSN_COST + VOLATILE_REF_COST)
to VOLATILE_REF_COST so the instruction matcher prefers these patterns over the
loadP_volatile/loadN_volatile + addP_reg_imm combination when the address has a
foldable constant offset.

The acq patterns use ldr+dmb with memory8/memory4 operands that support offset
addressing, while the volatile patterns use ldar with an indirect operand (no
offset). At the previous higher cost, the matcher would choose the cheaper
ldar variant plus a separate addP_reg_imm instruction rather than folding the
offset into a single ldr+dmb. This left unexpected addP_reg_imm nodes in the
compiled output, causing TestCombineAddPWithConstantOffsets to fail on AArch64
fastdebug builds.

MSVC's /volatile:iso means volatile reads/writes have no implicit
acquire/release on ARM64, unlike GCC/Clang. This causes memory ordering
violations in the interpreter, C1, and C2 JIT tiers.

Changes:

1. orderAccess_windows_aarch64.hpp: Use __dmb() intrinsics instead of
   std::atomic_thread_fence() for READ_MEM_BARRIER (dmb ishld),
   WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish).

2. atomic_windows_aarch64.hpp: Implement PlatformOrderedLoad (__ldar)
   and PlatformOrderedStore (__stlr) for 1/2/4/8-byte widths, ensuring
   acquire/release semantics for Atomic::load_acquire/store_release.

3. barrierSetAssembler_aarch64.cpp: Add dmb ishld after interpreter
   oop reference loads to enforce load ordering.

4. barrierSetC1.cpp: Add membar_acquire() after C1 oop reference loads.

5. memnode.cpp: Add MemNode::acquire to C2 LoadP/LoadN for T_OBJECT
   and T_NARROWOOP when barrier_data == 0. T_ADDRESS is intentionally
   excluded to avoid breaking C2 TLAB inline allocation ordering.

Add /volatile:ms to JVM_CFLAGS for Windows AArch64 in build config
Override PlatformLoad/PlatformStore with LDAR/STLR intrinsics as defense-in-depth
Add StoreLoad barrier in ObjectMonitor::try_spin() Dekker protocol
Use acquire loads and opaque base update in ForkJoinPool work-stealing
@swesonga (Member)

@macarte, which docs or sources are you looking at that show that "GCC/Clang on AArch64 default to /volatile:ms"?

@macarte (Contributor, Author) commented Feb 27, 2026

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

# and GCC/Clang AArch64). Use /volatile:ms to restore those semantics and
# prevent memory ordering bugs in ObjectMonitor, ParkEvent, and other
# lock-free algorithms that use plain volatile fields.
$1_CFLAGS_CPU_JVM="/volatile:ms"
Review comment (Member):

Consider using the same style for these flags as the rest of the make files -volatile:ms

@swesonga (Member)

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

sorry, I was asking about a reference for GCC/Clang, which the PR mentions

@swesonga (Member)

out of curiosity, what was the cause of the earlier failure in the debug build? was it a syntax issue in the .ad file?


// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
Review comment (Member):

Given that we're fixing atomics handling on Windows AArch64, won't we need additional justification for why we're changing code that other platforms are using given that they haven't run into any of these concurrency issues?

@macarte (Contributor, Author) commented Mar 6, 2026

[CoPilot] This should not be guarded with an architecture check. Here's why:

The problem this solves is a compiler correctness issue, not a hardware barrier issue. If C1 inlines a @ChangesCurrentThread method (like Continuation.yield() or carrier-thread switching code) into a caller that isn't annotated @ChangesCurrentThread, the compiler can cache the result of Thread.currentCarrierThread() across the inlined call — observing the old carrier thread after the virtual thread has migrated to a new one.

This is broken on all architectures, not just ARM64:

  • On ARM64, it manifests more easily because the stale OopHandle dereference can also be reordered (which the MO_ACQUIRE fix in c1_LIRGenerator/library_call addresses separately).

  • On x86, it could still manifest if C1 keeps the currentCarrierThread result in a register across the inlined yield — TSO doesn't help because there's no memory access to reorder, the value is simply never re-read.

The check is also very low cost — it only runs during C1 inlining decisions, not at runtime. And @ChangesCurrentThread is used on a tiny number of methods, so this almost never triggers.

Bottom line: This is a correct-by-construction inlining guard that prevents a real compiler bug on all platforms. No architecture guard needed. The existing code is right as-is.

// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
&& !compilation()->method()->changes_current_thread()) {
Review comment (Member):

For this specific change, for example, wouldn't it be a bug for an inlined method that changes threads to not read the correct state given that all its registers are correct and the load/acquire semantics have been fixed?

@macarte (Contributor, Author)

[CoPilot] Good question. The concern is specifically about compiler value caching, not memory ordering.

When C1 inlines a @ChangesCurrentThread method into a non-@ChangesCurrentThread caller, the compiler can CSE (common subexpression elimination) or hoist the currentCarrierThread intrinsic across the carrier-change boundary. The result: the caller reuses a register-held value from before the yield/mount/unmount and never re-reads it.

The MO_ACQUIRE fix addresses a different problem — it ensures that when you do load currentCarrierThread, subsequent dependent loads (e.g., Thread.cont) see consistent data. But it can't force a re-load that the compiler optimized away entirely.

Consider this concrete scenario after inlining:

// Caller (not @ChangesCurrentThread):
Thread carrier = Thread.currentCarrierThread();  // C1 intrinsic → LDAR, result in R0
// ... inlined @ChangesCurrentThread code ...
//   yield() freezes frame (saves R0=old_carrier)
//   VM mounts VT on new carrier
//   thaw restores frame (R0 still = old_carrier)
// ... back in caller ...
carrier.cont  // uses stale R0 — wrong carrier!

The MO_ACQUIRE on the LDAR is irrelevant here because there is no second LDAR — C1 reused R0. And the "correct registers" after thaw are correct in the sense that they're faithfully restored from freeze — but they contain the pre-yield values, which is exactly the stale data.

The inlining guard ensures @ChangesCurrentThread methods stay as real calls, which forces C1 to treat currentCarrierThread as potentially invalidated across the call boundary and re-read it.

So: MO_ACQUIRE = correct ordering on loads that happen. Inlining guard = ensuring loads actually happen. They're complementary.

q = p.next;
if (p.isData != haveData && haveData != (m != null) &&
p.cmpExItem(m, e) == m) {
// Full fence (StoreLoad) for Dekker with await() which
@macarte (Contributor, Author)

VarHandle.fullFence() executes on all platforms, not just ARM64. On x86, it compiles to an MFENCE or lock addl $0, (%rsp) — which is redundant because x86 already provides StoreLoad ordering for its CAS instructions (they include a full barrier via the lock prefix). So x86 pays a small performance cost for a fence it doesn't need.

That said, the cost is low in practice:

  • This fence is only hit on the matching path (successful cmpExItem), which is the fast path that immediately unparks the waiter and exits. It's not in a tight spin loop.

  • On x86, processors often elide redundant fences when the store buffer is already drained (which it is after the lock cmpxchg from the CAS).

If we want to make this zero-cost on x86, we could guard it with an architecture check, but that would add complexity to Doug Lea's code for negligible gain. The comment already documents the ARM64-specific motivation clearly.

tl;dr: Functionally no-op on x86/x64 (CAS already has full-barrier semantics), but the instruction does get emitted and executed. The performance impact is negligible on the match path.

else if (p.cmpExItem(m, e) != m)
p = head; // missed; restart
else { // matched complementary node
// Full fence (StoreLoad) for Dekker with await() which
@macarte (Contributor, Author)

Same comment as on the xfer match path above: the fullFence() is functionally a no-op on x86/x64 (the lock-prefixed CAS already has full-barrier semantics), but the instruction is emitted; its cost on the match path is negligible.
