Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime#48
Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime#48macarte wants to merge 1 commit intomacarte/baselinePR-TestTrampolineFixfrom
Conversation
MSVC's /volatile:iso (default on ARM64) makes volatile reads/writes plain LDR/STR with no acquire/release barriers. HotSpot's C++ runtime was written assuming volatile provides acquire/release semantics. Changes: 1. flags-cflags.m4: Add /volatile:ms to JVM_CFLAGS for Windows AArch64 to restore acquire/release semantics for volatile accesses. 2. orderAccess_windows_aarch64.hpp: Replace std::atomic_thread_fence() with __dmb() intrinsics for READ_MEM_BARRIER (dmb ishld), WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish). The __dmb() intrinsic acts as both a hardware barrier and compiler barrier for volatile/non-atomic accesses, which std::atomic_thread_fence() does not guarantee under /volatile:iso. 3. atomicAccess_windows_aarch64.hpp: Override PlatformLoad/PlatformStore with __ldar/__stlr intrinsics (defense-in-depth for Atomic::load/ store). Add PlatformOrderedLoad/PlatformOrderedStore specializations using __ldar/__stlr to avoid redundant dmb in load_acquire/ release_store paths, matching the Linux AArch64 approach.
| # and GCC/Clang AArch64). Use /volatile:ms to restore those semantics and | ||
| # prevent memory ordering bugs in ObjectMonitor, ParkEvent, and other | ||
| # lock-free algorithms that use plain volatile fields. | ||
| $1_CFLAGS_CPU_JVM="-volatile:ms" |
There was a problem hiding this comment.
Is this change necessary given the atomics changes in atomicAccess_windows_aarch64.hpp?
There was a problem hiding this comment.
Is this change necessary given the atomics changes in atomicAccess_windows_aarch64.hpp?
I ask because https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170 makes it sound like the compiler expects /volatile:iso to be used along with explicit synchronization primitives where the code needs it.
There was a problem hiding this comment.
So while the atomics changes address many issues, there's still too many weak memory issues in the JVM cpp source itself that don't use atomics. So setting the flag addresses all compiled JVM code issues
There was a problem hiding this comment.
To be clearer, the issues in the JVM are only a problem when those code paths are in virtual thread execution or support virtual threads
|
results |
|
Note that not listed as an error in the GHA results is an intermittent failure in Starvation.java; this occurs more frequently on machines with > 64 processors |
| // The generic PlatformLoad and PlatformStore use plain volatile dereferences. | ||
| // With /volatile:ms (set in flags-cflags.m4 for AArch64), MSVC already compiles | ||
| // those to LDAR/STLR, so these overrides produce identical codegen. They are | ||
| // retained as defense-in-depth: they guarantee acquire/release semantics for |
There was a problem hiding this comment.
@mo-beck, was the use of -volatile:iso (by not specifying this flag) when porting to windows-aarch64 an intentional choice or just the natural outcome of not specifying this flag? I'm curious because I would think that in the long term, it would be better to have these in Platform overloads in place while keeping -volatile:iso to actually catch other locations where synchronization is required but wasn't implemented. Thoughts?
| #define FULL_MEM_BARRIER atomic_thread_fence(std::memory_order_seq_cst); | ||
| #define READ_MEM_BARRIER __dmb(_ARM64_BARRIER_ISHLD) | ||
| #define WRITE_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH) | ||
| #define FULL_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH) |
There was a problem hiding this comment.
Shouldn't this be the _ARM64_BARRIER_SY barrier type?
There was a problem hiding this comment.
Shouldn't this be the _ARM64_BARRIER_SY barrier type?
I'm specifically referring to the FULL_MEM_BARRIER based on explanation at https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions
There was a problem hiding this comment.
I'll check on this more (thanks for bringing up) ... an initial analysis by CoPilot:
The ISH (Inner Shareable) domain is correct and matches the rest of HotSpot's AArch64 backend. Here's the reasoning:
- ISH covers all CPUs sharing the same OS/hypervisor — i.e., all cores that could run Java threads. This is the standard domain
for thread-to-thread synchronization.
- SY extends to the full system including external devices (DMA engines, GPUs). It's only needed for device MMIO synchronization,
not for inter-thread memory ordering.
- Linux AArch64's orderAccess uses __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE) which GCC/Clang compile to dmb ishld/dmb ish —
the same ISH domain.
- HotSpot's own Membar_mask_bits enum maps StoreLoad and AnyAny to ISH, LoadLoad/LoadStore to ISHLD. The only SY usage in the
entire AArch64 backend is isb() (instruction barrier).
Using _ARM64_BARRIER_SY would be functionally correct but unnecessarily strong, potentially causing performance degradation on
systems with outer-shareable observers.
There was a problem hiding this comment.
Saint, following yours and Monica's input I ran test on jdk25: microsoft/openjdk-jdk25u#27
Where I reverted this change and also the AtomicAccess::PlatformLoad, AtomicAccess::PlatformStore pairs (kept the AtomicAccess::PlatformOrderedLoad, AtomicAccess::PlatformOrderedStore pairs)
winarm64 GHA tests are passing; will make the same change here on tip
| #define FULL_MEM_BARRIER atomic_thread_fence(std::memory_order_seq_cst); | ||
| #define READ_MEM_BARRIER __dmb(_ARM64_BARRIER_ISHLD) | ||
| #define WRITE_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH) | ||
| #define FULL_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH) |
There was a problem hiding this comment.
revalidate that dmb is needed (its a pause vs sequencing) - this was targetting the same issue that volatile:ms ended up solving
MSVC's /volatile:iso (default on ARM64) makes volatile reads/writes plain LDR/STR with no acquire/release barriers. HotSpot's C++ runtime was written assuming volatile provides acquire/release semantics.
Changes:
flags-cflags.m4: Add /volatile:ms to JVM_CFLAGS for Windows AArch64 to restore acquire/release semantics for volatile accesses.
orderAccess_windows_aarch64.hpp: Replace std::atomic_thread_fence() with __dmb() intrinsics for READ_MEM_BARRIER (dmb ishld), WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish). The __dmb() intrinsic acts as both a hardware barrier and compiler barrier for volatile/non-atomic accesses, which std::atomic_thread_fence() does not guarantee under /volatile:iso.
atomicAccess_windows_aarch64.hpp: Override PlatformLoad/PlatformStore with __ldar/__stlr intrinsics (defense-in-depth for Atomic::load/ store). Add PlatformOrderedLoad/PlatformOrderedStore specializations using __ldar/__stlr to avoid redundant dmb in load_acquire/ release_store paths, matching the Linux AArch64 approach.
Rationale: This is the foundational fix. MSVC's /volatile:iso (default on ARM64) means volatile reads/writes are plain LDR/STR with no acquire/release semantics, breaking HotSpot's assumption of volatile ≈ acquire/release throughout the C++ runtime. The /volatile:ms flag restores those semantics for all JVM code. The explicit LDAR/STLR intrinsics in atomic_windows_aarch64.hpp produce identical codegen to what /volatile:ms already generates for plain volatile dereferences, but are retained as defense-in-depth — they guarantee correct acquire/release semantics for Atomic::load()/Atomic::store() regardless of the compiler flag setting. The PlatformOrderedLoad/PlatformOrderedStore overrides additionally avoid a redundant dmb that the generic fallback would emit before the LDAR/STLR. The __dmb() barriers in orderAccess replace the previous std::atomic_thread_fence calls with the ARM-specific intrinsics.
Tests unblocked: Broad set of intermittent failures across ObjectMonitor, ParkEvent, lock-free algorithms in the HotSpot runtime.