Skip to content

Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime#48

Open
macarte wants to merge 1 commit intomacarte/baselinePR-TestTrampolineFixfrom
macarte/PR1-winarm64
Open

Windows AArch64 MSVC /volatile:iso memory ordering — HotSpot C++ runtime#48
macarte wants to merge 1 commit intomacarte/baselinePR-TestTrampolineFixfrom
macarte/PR1-winarm64

Conversation

@macarte
Copy link

@macarte macarte commented Mar 6, 2026

MSVC's /volatile:iso (default on ARM64) makes volatile reads/writes plain LDR/STR with no acquire/release barriers. HotSpot's C++ runtime was written assuming volatile provides acquire/release semantics.

Changes:

  1. flags-cflags.m4: Add /volatile:ms to JVM_CFLAGS for Windows AArch64 to restore acquire/release semantics for volatile accesses.

  2. orderAccess_windows_aarch64.hpp: Replace std::atomic_thread_fence() with __dmb() intrinsics for READ_MEM_BARRIER (dmb ishld), WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish). The __dmb() intrinsic acts as both a hardware barrier and compiler barrier for volatile/non-atomic accesses, which std::atomic_thread_fence() does not guarantee under /volatile:iso.

  3. atomicAccess_windows_aarch64.hpp: Override PlatformLoad/PlatformStore with __ldar/__stlr intrinsics (defense-in-depth for Atomic::load/ store). Add PlatformOrderedLoad/PlatformOrderedStore specializations using __ldar/__stlr to avoid redundant dmb in load_acquire/ release_store paths, matching the Linux AArch64 approach.

Rationale: This is the foundational fix. MSVC's /volatile:iso (default on ARM64) means volatile reads/writes are plain LDR/STR with no acquire/release semantics, breaking HotSpot's assumption of volatile ≈ acquire/release throughout the C++ runtime. The /volatile:ms flag restores those semantics for all JVM code. The explicit LDAR/STLR intrinsics in atomic_windows_aarch64.hpp produce identical codegen to what /volatile:ms already generates for plain volatile dereferences, but are retained as defense-in-depth — they guarantee correct acquire/release semantics for Atomic::load()/Atomic::store() regardless of the compiler flag setting. The PlatformOrderedLoad/PlatformOrderedStore overrides additionally avoid a redundant dmb that the generic fallback would emit before the LDAR/STLR. The __dmb() barriers in orderAccess replace the previous std::atomic_thread_fence calls with the ARM-specific intrinsics.

Tests unblocked: Broad set of intermittent failures across ObjectMonitor, ParkEvent, lock-free algorithms in the HotSpot runtime.

MSVC's /volatile:iso (default on ARM64) makes volatile reads/writes
plain LDR/STR with no acquire/release barriers. HotSpot's C++ runtime
was written assuming volatile provides acquire/release semantics.

Changes:

1. flags-cflags.m4: Add /volatile:ms to JVM_CFLAGS for Windows AArch64
   to restore acquire/release semantics for volatile accesses.

2. orderAccess_windows_aarch64.hpp: Replace std::atomic_thread_fence()
   with __dmb() intrinsics for READ_MEM_BARRIER (dmb ishld),
   WRITE_MEM_BARRIER and FULL_MEM_BARRIER (dmb ish). The __dmb()
   intrinsic acts as both a hardware barrier and compiler barrier for
   volatile/non-atomic accesses, which std::atomic_thread_fence() does
   not guarantee under /volatile:iso.

3. atomicAccess_windows_aarch64.hpp: Override PlatformLoad/PlatformStore
   with __ldar/__stlr intrinsics (defense-in-depth for Atomic::load/
   store). Add PlatformOrderedLoad/PlatformOrderedStore specializations
   using __ldar/__stlr to avoid redundant dmb in load_acquire/
   release_store paths, matching the Linux AArch64 approach.
# and GCC/Clang AArch64). Use /volatile:ms to restore those semantics and
# prevent memory ordering bugs in ObjectMonitor, ParkEvent, and other
# lock-free algorithms that use plain volatile fields.
$1_CFLAGS_CPU_JVM="-volatile:ms"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary given the atomics changes in atomicAccess_windows_aarch64.hpp?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change necessary given the atomics changes in atomicAccess_windows_aarch64.hpp?

I ask because https://learn.microsoft.com/en-us/cpp/cpp/volatile-cpp?view=msvc-170 makes it sound like the compiler expects /volatile:iso to be used along with explicit synchronization primitives where the code needs it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So while the atomics changes address many issues, there's still too many weak memory issues in the JVM cpp source itself that don't use atomics. So setting the flag addresses all compiled JVM code issues

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be clearer, the issues in the JVM are only a problem when those code paths are in virtual thread execution or support virtual threads

@macarte
Copy link
Author

macarte commented Mar 9, 2026

results

	
	  ┌──────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────┬────────────┐
	  │ Group            │ Test                                                                                            │ Error      │
	  ├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
	  │ jdk/tier1 part 1 │ java/lang/Thread/virtual/stress/PingPong.java#ltq                                               │ Timed out  │
	  │                  │                                                                                                 │ 480s       │
	  ├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
	  │ jdk/tier1 part 1 │ java/lang/Thread/virtual/stress/PingPong.java#sq                                                │ Timed out  │
	  │                  │                                                                                                 │ 480s       │
	  ├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
	  │ jdk/tier1 part 1 │ java/lang/Thread/virtual/stress/Skynet.java#default                                             │ JVM timed  │
	  │                  │                                                                                                 │ out 6400s  │
	  ├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
	  │ jdk/tier1 part 2 │ java/util/concurrent/BlockingQueue/SingleProducerMultipleConsumerLoops.java                     │ Timed out  │
	  │                  │                                                                                                 │ 480s       │
	  ├──────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────┼────────────┤
	  │ hs/tier1         │ serviceability/jvmti/stress/StackTrace/NotSuspended/GetStackTraceNotSuspendedStressTest.java    │ JVM timed  │
	  │ serviceability   │                                                                                                 │ out 480s   │
	  └──────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────┴────────────┘
	

@macarte
Copy link
Author

macarte commented Mar 9, 2026

Note that not listed as an error in the GHA results is an intermittent failure in Starvation.java; this occurs more frequently on machines with > 64 processors

// The generic PlatformLoad and PlatformStore use plain volatile dereferences.
// With /volatile:ms (set in flags-cflags.m4 for AArch64), MSVC already compiles
// those to LDAR/STLR, so these overrides produce identical codegen. They are
// retained as defense-in-depth: they guarantee acquire/release semantics for
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mo-beck, was the use of -volatile:iso (by not specifying this flag) when porting to windows-aarch64 an intentional choice or just the natural outcome of not specifying this flag? I'm curious because I would think that in the long term, it would be better to have these in Platform overloads in place while keeping -volatile:iso to actually catch other locations where synchronization is required but wasn't implemented. Thoughts?

#define FULL_MEM_BARRIER atomic_thread_fence(std::memory_order_seq_cst);
#define READ_MEM_BARRIER __dmb(_ARM64_BARRIER_ISHLD)
#define WRITE_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH)
#define FULL_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be the _ARM64_BARRIER_SY barrier type?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be the _ARM64_BARRIER_SY barrier type?

I'm specifically referring to the FULL_MEM_BARRIER based on explanation at https://learn.microsoft.com/en-us/cpp/intrinsics/arm64-intrinsics?view=msvc-170#BarrierRestrictions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll check on this more (thanks for bringing up) ... an initial analysis by CoPilot:

The ISH (Inner Shareable) domain is correct and matches the rest of HotSpot's AArch64 backend. Here's the reasoning:

   - ISH covers all CPUs sharing the same OS/hypervisor — i.e., all cores that could run Java threads. This is the standard domain
  for thread-to-thread synchronization.
   - SY extends to the full system including external devices (DMA engines, GPUs). It's only needed for device MMIO synchronization,
  not for inter-thread memory ordering.
   - Linux AArch64's orderAccess uses __atomic_thread_fence(__ATOMIC_ACQUIRE/RELEASE) which GCC/Clang compile to dmb ishld/dmb ish —
  the same ISH domain.
   - HotSpot's own Membar_mask_bits enum maps StoreLoad and AnyAny to ISH, LoadLoad/LoadStore to ISHLD. The only SY usage in the
  entire AArch64 backend is isb() (instruction barrier).

  Using _ARM64_BARRIER_SY would be functionally correct but unnecessarily strong, potentially causing performance degradation on
  systems with outer-shareable observers.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saint, following yours and Monica's input I ran test on jdk25: microsoft/openjdk-jdk25u#27

Where I reverted this change and also the AtomicAccess::PlatformLoad, AtomicAccess::PlatformStore pairs (kept the AtomicAccess::PlatformOrderedLoad, AtomicAccess::PlatformOrderedStore pairs)

winarm64 GHA tests are passing; will make the same change here on tip

#define FULL_MEM_BARRIER atomic_thread_fence(std::memory_order_seq_cst);
#define READ_MEM_BARRIER __dmb(_ARM64_BARRIER_ISHLD)
#define WRITE_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH)
#define FULL_MEM_BARRIER __dmb(_ARM64_BARRIER_ISH)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

revalidate that dmb is needed (its a pause vs sequencing) - this was targetting the same issue that volatile:ms ended up solving

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants