Fix ARM64 Virtual Thread currentCarrierThread Intrinsic Bug #25

Open

macarte wants to merge 12 commits into macarte/rebased-win-arrch64-fixes from macarte/rebased-additional-fixes

Conversation

@macarte (Contributor) commented Feb 23, 2026

Windows AArch64 Fixes — PR Packaging Plan

The 11 commits (combined, with internal reverts resolved) touch 17 files and decompose into 6 logically distinct PRs.


PR 1: Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime

Files:

  • make/autoconf/flags-cflags.m4 — add /volatile:ms to JVM_CFLAGS for AArch64
  • src/hotspot/os_cpu/windows_aarch64/atomic_windows_aarch64.hpp — PlatformLoad/PlatformStore/PlatformOrderedLoad/PlatformOrderedStore using LDAR/STLR intrinsics
  • src/hotspot/os_cpu/windows_aarch64/orderAccess_windows_aarch64.hpp — switch READ_MEM_BARRIER/WRITE_MEM_BARRIER/FULL_MEM_BARRIER from std::atomic_thread_fence to __dmb()

Rationale: This is the foundational fix. MSVC's /volatile:iso (default on ARM64) means volatile reads/writes are plain LDR/STR with no acquire/release semantics, breaking HotSpot's assumption of volatile ≈ acquire/release throughout the C++ runtime. The /volatile:ms flag restores those semantics for all JVM code. The explicit LDAR/STLR intrinsics in atomic_windows_aarch64.hpp produce identical codegen to what /volatile:ms already generates for plain volatile dereferences, but are retained as defense-in-depth — they guarantee correct acquire/release semantics for Atomic::load()/Atomic::store() regardless of the compiler flag setting. The PlatformOrderedLoad/PlatformOrderedStore overrides additionally avoid a redundant dmb that the generic fallback would emit before the LDAR/STLR. The __dmb() barriers in orderAccess replace the previous std::atomic_thread_fence calls with the ARM-specific intrinsics.
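As a Java-level analogy (not the HotSpot C++ code), the acquire/release contract that LDAR/STLR provide can be sketched with VarHandle getAcquire/setRelease; the class and field names here are illustrative only:

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

// Java-level sketch of the acquire/release contract that LDAR/STLR provide:
// a release store makes all prior writes visible to any thread whose acquire
// load observes that store. Names (AcquireReleaseSketch, data, ready) are
// illustrative, not JDK or HotSpot identifiers.
public class AcquireReleaseSketch {
    static final VarHandle READY;
    int data;        // plain field, published via the release store below
    boolean ready;   // accessed only through getAcquire/setRelease

    static {
        try {
            READY = MethodHandles.lookup()
                    .findVarHandle(AcquireReleaseSketch.class, "ready", boolean.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    void publish(int value) {
        data = value;                  // ordinary store
        READY.setRelease(this, true);  // STLR-like: orders the store above before it
    }

    Integer consume() {
        if ((boolean) READY.getAcquire(this)) {  // LDAR-like: orders later loads after it
            return data;                          // guaranteed to see the published value
        }
        return null;                              // not published yet
    }

    public static void main(String[] args) throws InterruptedException {
        AcquireReleaseSketch box = new AcquireReleaseSketch();
        Thread writer = new Thread(() -> box.publish(42));
        writer.start();
        writer.join();
        System.out.println(box.consume());  // prints 42
    }
}
```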

Tests unblocked: Broad set of intermittent failures across ObjectMonitor, ParkEvent, lock-free algorithms in the HotSpot runtime.


PR 2: ObjectMonitor Dekker protocol StoreLoad barriers in try_spin()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — add OrderAccess::storeload() after set_successor(current) in both places in try_spin()

Rationale: On ARM64, the Dekker protocol between try_spin() (ST _succ → LD _owner) and exit() is broken without an explicit StoreLoad barrier. Even with PR 1's /volatile:ms (which gives acquire/release semantics), volatile store + volatile load to different addresses does not imply StoreLoad ordering on ARM64. This causes missed wakeups and thread starvation. This is a shared (not Windows-specific) fix that is correct on all ARM64 platforms.
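The broken Dekker handshake can be reduced to a Java-level litmus sketch (illustrative names; not the ObjectMonitor code): each thread stores its own flag, then loads the other's, and the StoreLoad fence is what forbids the outcome in which both loads miss the other side's store:

```java
import java.lang.invoke.VarHandle;

// Dekker litmus sketch: T1 stores x then loads y; T2 stores y then loads x.
// With VarHandle.fullFence() (a StoreLoad barrier) between each store and
// load, the outcome r1 == 0 && r2 == 0 is forbidden. Names are illustrative.
public class DekkerSketch {
    static int x, y;    // plain fields: the ordering comes from the fences
    static int r1, r2;

    // One Dekker round. join() publishes r1/r2 back to the calling thread.
    static void runOnce() throws InterruptedException {
        x = 0; y = 0;
        Thread t1 = new Thread(() -> { x = 1; VarHandle.fullFence(); r1 = y; });
        Thread t2 = new Thread(() -> { y = 1; VarHandle.fullFence(); r2 = x; });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10_000; i++) {
            runOnce();
            if (r1 == 0 && r2 == 0)
                throw new AssertionError("forbidden (0, 0) outcome observed");
        }
        System.out.println("no (0, 0) outcome in 10000 rounds");
    }
}
```

Without the two fullFence() calls, the (0, 0) outcome is permitted on ARM64, which is exactly the missed-wakeup shape described above.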

Tests unblocked: Starvation test, various monitor contention hangs.

Note: Could be folded into PR 1 since it depends on the barrier infrastructure, but it's a clean logical unit on its own and touches shared code (not os_cpu).


PR 3: ARM64 currentCarrierThread intrinsic — MO_ACQUIRE for OopHandle loads + @DontInline on Continuation.yield()

Files:

  • src/hotspot/share/c1/c1_GraphBuilder.cpp — don't inline @ChangesCurrentThread methods into non-@ChangesCurrentThread callers
  • src/hotspot/share/c1/c1_LIRGenerator.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in do_JavaThreadField
  • src/hotspot/share/gc/shared/c1/barrierSetC1.cpp — honor MO_ACQUIRE decorator with membar_acquire()
  • src/hotspot/share/opto/library_call.cpp — IN_NATIVE | MO_ACQUIRE on OopHandle load in C2 current_thread_helper
  • src/java.base/share/classes/jdk/internal/vm/Continuation.java — @DontInline on Continuation.yield()

Rationale: On ARM64, after a virtual thread migrates between carriers, the OopHandle dereference for currentCarrierThread can be reordered with subsequent dependent loads (e.g., Thread.cont), observing stale data. The MO_ACQUIRE ensures ordering. The @DontInline on yield() and the @ChangesCurrentThread inlining guard prevent the compiler from inlining across carrier-change boundaries. This is a shared ARM64 fix (not Windows-specific).
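A simplified, hypothetical Java analogy of the caching half of this hazard (all names here are made up; the real bug lives in compiled code, not source): reusing a carrier reference that was read before a carrier switch yields the stale value, while re-reading after an out-of-line switch does not:

```java
// Hypothetical analogy of the @ChangesCurrentThread hazard: if a value read
// before a carrier switch is cached and reused afterwards, it is stale.
// All names (CarrierSwitchSketch, carrier, switchCarrier) are illustrative.
public class CarrierSwitchSketch {
    static volatile String carrier = "carrier-1";

    static void switchCarrier() {      // stands in for yield/mount migrating the VT
        carrier = "carrier-2";
    }

    // Mimics the bug: the carrier is read once ("register-held") and reused
    // after the switch, observing the pre-switch value.
    static String cachedAcrossSwitch() {
        String c = carrier;
        switchCarrier();
        return c;                      // stale: still "carrier-1"
    }

    // Mimics the fix: keeping the switch as a real call forces a re-read.
    static String reReadAfterSwitch() {
        switchCarrier();
        return carrier;                // fresh: "carrier-2"
    }

    public static void main(String[] args) {
        carrier = "carrier-1";
        System.out.println(cachedAcrossSwitch());  // prints carrier-1
        carrier = "carrier-1";
        System.out.println(reReadAfterSwitch());   // prints carrier-2
    }
}
```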

Tests unblocked: Virtual thread scheduling failures, PingPong test.


PR 4: ForkJoinPool carrier starvation — tryCompensateForMonitor + CarrierThread.beginMonitorBlock()

Files:

  • src/hotspot/share/runtime/objectMonitor.cpp — compensate_pinned_carrier() / end_compensate_pinned_carrier() + call sites in enter_with_contention_mark
  • src/java.base/share/classes/java/util/concurrent/ForkJoinPool.java — tryCompensateForMonitor(), beginMonitorCompensatedBlock(), updated endCompensatedBlock(), FJP access registration
  • src/java.base/share/classes/jdk/internal/access/JavaUtilConcurrentFJPAccess.java — add beginMonitorCompensatedBlock to interface
  • src/java.base/share/classes/jdk/internal/misc/CarrierThread.java — beginMonitorBlock() + ForkJoinPools.beginMonitorCompensatedBlock()

Rationale: When a pinned virtual thread's carrier blocks on a contended monitor, the ForkJoinPool has no visibility that its worker is blocked. All carriers can end up pinned, preventing the VT holding the contested lock from ever getting a carrier → deadlock. This PR adds FJP compensation that activates an idle worker or creates a spare carrier. This is a correctness fix for virtual threads on all platforms.
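For context, FJP already exposes a compensation hook for cooperative blocking: ForkJoinPool.managedBlock(). The sketch below shows that existing public API (the PR adds an analogous internal path for monitor blocking, which cannot route through managedBlock); the latch usage is illustrative:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.TimeUnit;

// Sketch of the existing FJP compensation API: managedBlock() tells the pool
// a worker is about to block, so it can activate an idle worker or create a
// spare. The PR adds a similar notification for carriers blocking on
// monitors. The blocking operation here (a latch) is illustrative.
public class CompensationSketch {

    // Submits a task that blocks on the latch via managedBlock(), then
    // releases it and reports whether the worker resumed.
    static boolean blockAndResume() throws Exception {
        CountDownLatch latch = new CountDownLatch(1);

        var task = ForkJoinPool.commonPool().submit(() -> {
            // Wrap the blocking wait so the pool can compensate while this
            // worker is blocked.
            ForkJoinPool.managedBlock(new ForkJoinPool.ManagedBlocker() {
                public boolean block() throws InterruptedException {
                    latch.await();                 // the actual blocking operation
                    return true;                   // done blocking
                }
                public boolean isReleasable() {
                    return latch.getCount() == 0;  // true once released
                }
            });
            return true;
        });

        latch.countDown();                         // release the blocked worker
        return task.get(5, TimeUnit.SECONDS);      // true if the worker resumed
    }

    public static void main(String[] args) throws Exception {
        System.out.println(blockAndResume() ? "worker resumed" : "timed out");
    }
}
```

The design problem the PR solves is precisely that a carrier entering a contended ObjectMonitor blocks inside the VM, where no managedBlock() wrapper can be applied, so a dedicated notification path is needed.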

Tests unblocked: Starvation test (fully), carrier deadlock scenarios.

Note: The introduced fences execute on every platform. However, the fences are correctness-critical on ARM64, functionally redundant on x86, and low-cost on the non-hot paths where they appear.

  • Without the fences, on ARM64 both sides can miss each other's writes: acquire doesn't see the state update, release doesn't see WAITING, and the thread is never unparked.

  • On x86, these are effectively redundant — HotSpot already emits a trailing StoreLoad barrier (lock addl) after volatile stores, so a volatile store followed by a volatile load to a different address is already ordered there. (Plain x86-TSO alone does not forbid store→load reordering; the trailing barrier does.) The fullFence() compiles to a lock addl but has negligible cost since release/releaseShared are not hot tight loops (they're called once per lock release).

Dependency: Depends on PR 2's includes (symbolTable.hpp, javaCalls.hpp) being in objectMonitor.cpp. Package them together or add the includes here.


PR 5: ARM64 Dekker-pattern StoreLoad fences in java.util.concurrent + VirtualThread

Files:

  • src/java.base/share/classes/java/lang/VirtualThread.java — U.fullFence() in afterYield (PARKING, BLOCKING, WAITING paths), afterDone, unpark, unblock, joinNanos
  • src/java.base/share/classes/java/util/concurrent/LinkedTransferQueue.java — VarHandle.fullFence() after CAS in xfer
  • src/java.base/share/classes/java/util/concurrent/SynchronousQueue.java — VarHandle.fullFence() after CAS in xferLifo
  • src/java.base/share/classes/java/util/concurrent/locks/AbstractQueuedSynchronizer.java — U.fullFence() in acquire (after setting WAITING), release, and releaseShared

Rationale: On ARM64, volatile write (STLR) + volatile read (LDAR) to different addresses does NOT provide StoreLoad ordering. All these sites have Dekker-like patterns where one side writes field A then reads field B, while the other writes B then reads A. Without an explicit StoreLoad fence, both sides can miss each other's stores. This is a shared ARM64 correctness fix that applies to all ARM64 JDKs (Linux too, though timing makes it less likely to manifest).
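All of these sites share one shape, reduced here to a parker/unparker sketch with illustrative names (not the AQS or VirtualThread fields; volatile is kept for Java-level safety while the fullFence() calls mirror the ones the PR adds): the parker announces WAITING then checks the signal, the unparker publishes the signal then checks for a waiter, and the two fences guarantee at least one side sees the other:

```java
import java.lang.invoke.VarHandle;
import java.util.concurrent.locks.LockSupport;

// Reduced parker/unparker Dekker sketch. Parker: store WAITING, fence, load
// signal. Unparker: store signal, fence, load WAITING. With both fences at
// least one side observes the other, so the parker cannot sleep forever.
// All names (state, signal, WAITING) are illustrative.
public class ParkSketch {
    static final int RUNNING = 0, WAITING = 1;
    static volatile int state = RUNNING;
    static volatile boolean signal = false;

    // Runs one parker/unparker race; returns true if the parker woke up.
    static boolean runOnce() throws InterruptedException {
        state = RUNNING;
        signal = false;

        Thread parker = new Thread(() -> {
            state = WAITING;        // announce intent to park
            VarHandle.fullFence();  // StoreLoad before re-checking the signal
            while (!signal) {
                LockSupport.park(); // loop guards against spurious wakeups
            }
        });
        parker.start();

        signal = true;              // publish the wakeup condition
        VarHandle.fullFence();      // StoreLoad before checking for a waiter
        if (state == WAITING) {     // saw the announcement: deliver the unpark
            LockSupport.unpark(parker);
        }

        parker.join(5000);          // with both fences this should not time out
        return !parker.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runOnce() ? "woke up" : "hang");
    }
}
```

Drop either fence and, on ARM64, both checks can read the stale value: the unparker skips the unpark and the parker parks without re-reading the signal.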

Tests unblocked: SynchronousQueue, LinkedTransferQueue, virtual thread parking/unparking races, JSR166TestCase failures.


PR 6: TestTrampoline Windows fix

Files:

  • test/hotspot/jtreg/compiler/c2/aarch64/TestTrampoline.java — warm up UTF-16 String.charAt path on Windows before trampoline test

Rationale: On Windows, the JNU encoding may prevent the UTF-16 path of String.charAt from getting hot enough for C2 inlining, which changes trampoline generation behavior. Standalone test fix.


Suggested Integration Order

Order  PR                                                    Risk                                                   Scope
1      MSVC /volatile:iso fix (build + platform atomics)     Low — Windows AArch64 only                             os_cpu, make
2      ObjectMonitor try_spin() StoreLoad                    Low — all ARM64                                        runtime
3      currentCarrierThread intrinsic acquire + @DontInline  Medium — C1/C2/shared                                  compiler, java.base
4      FJP carrier starvation compensation                   Medium — runtime + java.base                           runtime, j.u.c
5      Dekker fences in j.u.c + VirtualThread                High reviewability concern — touches Doug Lea's code   java.base
6      TestTrampoline Windows fix                            Trivial                                                test

PR 1 should go first as the foundation. PRs 2 and 6 are independent and can go in parallel. PR 3 is independent of 2/4. PR 4 depends on 2 (for the includes). PR 5 is the most contentious (touching j.u.c. core) and benefits from having all other fixes in first to narrow down remaining test failures.

…dering

On ARM64's weakly-ordered memory model, pointer loads (LoadP/LoadN)
can be reordered with subsequent dependent loads by the hardware. This
causes failures in concurrent data structures used by virtual thread
scheduling (SynchronousQueue, LinkedTransferQueue) where a loaded
pointer is immediately dereferenced but the dependent field load sees
stale data.

Force MemNode::acquire at IR creation time in LoadNode::make() for all
LoadP/LoadN nodes on AARCH64 when barrier_data == 0. This must happen
at IR creation rather than instruction selection because C2's optimizer
can eliminate or reorder loads between those phases — a load removed by
LoadNode::Identity() or hoisted by Ideal() will never reach the matcher
to receive a barrier.

Loads with barrier_data != 0 (GC reference barriers) are excluded since
GC barrier patterns emit their own ordering sequences and have no
acquire variants in the .ad file.

Also reverts needs_acquiring_load() to the upstream form (is_acquire()
only), since the opcode-based check is now redundant — all pointer
loads already have acquire set on the IR node.

…atching

Lower the cost of loadP_acq and loadN_acq from (4*INSN_COST + VOLATILE_REF_COST)
to VOLATILE_REF_COST so the instruction matcher prefers these patterns over the
loadP_volatile/loadN_volatile + addP_reg_imm combination when the address has a
foldable constant offset.

The acq patterns use ldr+dmb with memory8/memory4 operands that support offset
addressing, while the volatile patterns use ldar with an indirect operand (no
offset). At the previous higher cost, the matcher would choose the cheaper
ldar variant plus a separate addP_reg_imm instruction rather than folding the
offset into a single ldr+dmb. This left unexpected addP_reg_imm nodes in the
compiled output, causing TestCombineAddPWithConstantOffsets to fail on AArch64
fastdebug builds.

MSVC's /volatile:iso means volatile reads/writes have no implicit
acquire/release on ARM64, unlike GCC/Clang. This causes memory ordering
violations in the interpreter, C1, and C2 JIT tiers.

Changes:

1. orderAccess_windows_aarch64.hpp: Use __dmb() intrinsics instead of
   std::atomic_thread_fence() for READ_MEM_BARRIER (dmb ishld),
   WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish).

2. atomic_windows_aarch64.hpp: Implement PlatformOrderedLoad (__ldar)
   and PlatformOrderedStore (__stlr) for 1/2/4/8-byte widths, ensuring
   acquire/release semantics for Atomic::load_acquire/store_release.

3. barrierSetAssembler_aarch64.cpp: Add dmb ishld after interpreter
   oop reference loads to enforce load ordering.

4. barrierSetC1.cpp: Add membar_acquire() after C1 oop reference loads.

5. memnode.cpp: Add MemNode::acquire to C2 LoadP/LoadN for T_OBJECT
   and T_NARROWOOP when barrier_data == 0. T_ADDRESS is intentionally
   excluded to avoid breaking C2 TLAB inline allocation ordering.

Add /volatile:ms to JVM_CFLAGS for Windows AArch64 in build config
Override PlatformLoad/PlatformStore with LDAR/STLR intrinsics as defense-in-depth
Add StoreLoad barrier in ObjectMonitor::try_spin() Dekker protocol
Use acquire loads and opaque base update in ForkJoinPool work-stealing
@swesonga (Member)

@macarte, which docs or sources are you looking at that show that "GCC/Clang on AArch64 default to /volatile:ms"?

@macarte (Contributor, Author) commented Feb 27, 2026

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

# and GCC/Clang AArch64). Use /volatile:ms to restore those semantics and
# prevent memory ordering bugs in ObjectMonitor, ParkEvent, and other
# lock-free algorithms that use plain volatile fields.
$1_CFLAGS_CPU_JVM="/volatile:ms"
Review comment (Member):

Consider using the same style for these flags as the rest of the make files -volatile:ms

@swesonga (Member)

/volatile:ms

The default is /volatile:iso

reference: https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170

sorry, I was asking about a reference for GCC/Clang, which the PR mentions

@swesonga (Member)

out of curiosity, what was the cause of the earlier failure in the debug build? was it a syntax issue in the .ad file?


// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
Review comment (Member):

Given that we're fixing atomics handling on Windows AArch64, won't we need additional justification for why we're changing code that other platforms are using given that they haven't run into any of these concurrency issues?

@macarte (Contributor, Author) commented Mar 6, 2026

[CoPilot] This should not be guarded with an architecture check. Here's why:

The problem this solves is a compiler correctness issue, not a hardware barrier issue. If C1 inlines a @ChangesCurrentThread method (like Continuation.yield() or carrier-thread switching code) into a caller that isn't annotated @ChangesCurrentThread, the compiler can cache the result of Thread.currentCarrierThread() across the inlined call — observing the old carrier thread after the virtual thread has migrated to a new one.

This is broken on all architectures, not just ARM64:

  • On ARM64, it manifests more easily because the stale OopHandle dereference can also be reordered (which the MO_ACQUIRE fix in c1_LIRGenerator/library_call addresses separately).

  • On x86, it could still manifest if C1 keeps the currentCarrierThread result in a register across the inlined yield — TSO doesn't help because there's no memory access to reorder, the value is simply never re-read.

The check is also very low cost — it only runs during C1 inlining decisions, not at runtime. And @ChangesCurrentThread is used on a tiny number of methods, so this almost never triggers.

Bottom line: This is a correct-by-construction inlining guard that prevents a real compiler bug on all platforms. No architecture guard needed. The existing code is right as-is.

// Don't inline a method that changes Thread.currentThread() except
// into another method that is annotated @ChangesCurrentThread.
if (callee->changes_current_thread()
&& !compilation()->method()->changes_current_thread()) {
Review comment (Member):

For this specific change, for example, wouldn't it be a bug for an inlined method that changes threads to not read the correct state given that all its registers are correct and the load/acquire semantics have been fixed?

@macarte (Contributor, Author)

[CoPilot] Good question. The concern is specifically about compiler value caching, not memory ordering.

When C1 inlines a @ChangesCurrentThread method into a non-@ChangesCurrentThread caller, the compiler can CSE (common subexpression elimination) or hoist the currentCarrierThread intrinsic across the carrier-change boundary. The result: the caller reuses a register-held value from before the yield/mount/unmount and never re-reads it.

The MO_ACQUIRE fix addresses a different problem — it ensures that when you do load currentCarrierThread, subsequent dependent loads (e.g., Thread.cont) see consistent data. But it can't force a re-load that the compiler optimized away entirely.

Consider this concrete scenario after inlining:

// Caller (not @ChangesCurrentThread):
Thread carrier = Thread.currentCarrierThread();  // C1 intrinsic → LDAR, result in R0
// ... inlined @ChangesCurrentThread code ...
//   yield() freezes frame (saves R0=old_carrier)
//   VM mounts VT on new carrier
//   thaw restores frame (R0 still = old_carrier)
// ... back in caller ...
carrier.cont  // uses stale R0 — wrong carrier!

The MO_ACQUIRE on the LDAR is irrelevant here because there is no second LDAR — C1 reused R0. And the "correct registers" after thaw are correct in the sense that they're faithfully restored from freeze — but they contain the pre-yield values, which is exactly the stale data.

The inlining guard ensures @ChangesCurrentThread methods stay as real calls, which forces C1 to treat currentCarrierThread as potentially invalidated across the call boundary and re-read it.

So: MO_ACQUIRE = correct ordering on loads that happen. Inlining guard = ensuring loads actually happen. They're complementary.

q = p.next;
if (p.isData != haveData && haveData != (m != null) &&
p.cmpExItem(m, e) == m) {
// Full fence (StoreLoad) for Dekker with await() which
@macarte (Contributor, Author)

VarHandle.fullFence() executes on all platforms, not just ARM64. On x86, it compiles to an MFENCE or lock addl $0, (%rsp) — which is redundant because x86 already provides StoreLoad ordering for its CAS instructions (they include a full barrier via the lock prefix). So x86 pays a small performance cost for a fence it doesn't need.

That said, the cost is low in practice:

  • This fence is only hit on the matching path (successful cmpExItem), which is the fast path that immediately unparks the waiter and exits. It's not in a tight spin loop.

  • On x86, processors often elide redundant fences when the store buffer is already drained (which it is after the lock cmpxchg from the CAS).

If we want to make this zero-cost on x86, we could guard it with an architecture check, but that would add complexity to Doug Lea's code for negligible gain. The comment already documents the ARM64-specific motivation clearly.

tl;dr: Functionally no-op on x86/x64 (CAS already has full-barrier semantics), but the instruction does get emitted and executed. The performance impact is negligible on the match path.

else if (p.cmpExItem(m, e) != m)
p = head; // missed; restart
else { // matched complementary node
// Full fence (StoreLoad) for Dekker with await() which
@macarte (Contributor, Author)

Same comment as on the xfer match path above: the fullFence() is functionally a no-op on x86/x64 (the lock-prefixed CAS already has full-barrier semantics), but the instruction is emitted; its cost on the match path is negligible.
