[LLVM][AMDGPU] Track external-runtime AddressSanitizer support for GPU device code #88

@benvanik

This tracks the LLVM/clang-side work needed for AMDGPU device code to use
compiler-emitted AddressSanitizer checks with an external device runtime.

The external-runtime model is:

  • clang/LLVM inserts ASAN memory access checks into selected GPU device code
  • generated code calls the stable __asan_* hook ABI
  • a caller-provided device bitcode library defines those hooks
  • the launcher/runtime owns shadow mapping, poison/unpoison policy, and report delivery
  • the mode does not depend on the stock ROCm ASAN hostcall reporting path
  • the mode does not require XNACK because it is not relying on replayable GPU faults

This is a general infrastructure request for GPU execution environments with
structured allocation ownership. It should not change stock ROCm ASAN behavior.

Reproducers

Standalone reproducers are in this public gist:

https://gist.github.com/benvanik/b01df0c0914e6a5849bc4256b21e69b5

Zip download:

https://gist.github.com/benvanik/b01df0c0914e6a5849bc4256b21e69b5/archive/HEAD.zip

The gist contains:

  • sample_checked_kernel.hip: a tiny HIP kernel with three byte loads and one
    32-bit store.
  • runtime_bitcode_no_sanitize.c: a tiny external runtime bitcode source with
    two support globals and a few ASAN hook definitions.
  • reproduce.sh: commands that demonstrate the current compiler behavior and
    the desired local-only no-sanitize bitcode shape.
  • README.md: expected outputs and setup notes.

The reproducer uses -nogpulib where possible so the interesting behavior is
compiler policy, LLVM IR composition, and code object metadata rather than a
particular ROCm package layout.

Known Blockers

1. Non-XNACK AMDGPU targets cannot request explicit ASAN checks with an external runtime

For AMDGPU, clang currently treats device AddressSanitizer support as equivalent
to the stock ROCm ASAN runtime model. On a non-XNACK target such as gfx1100,
the driver ignores -fsanitize=address for device code:

clang++ \
  --offload-arch=gfx1100 \
  --cuda-device-only \
  -x hip sample_checked_kernel.hip \
  --hip-path="$ROCM_PATH" \
  --rocm-path="$ROCM_PATH" \
  -nogpulib \
  -std=c++17 \
  -fsanitize=address \
  -fsanitize-stable-abi \
  -O1 \
  -S -emit-llvm \
  -o sample_checked_kernel.stock.ll

Observed warnings:

warning: ignoring '-fsanitize=address' option for offload arch 'gfx1100' as it is not currently supported there. Use it with an offload arch containing 'xnack+' instead
warning: argument unused during compilation: '-fsanitize=address'
warning: argument unused during compilation: '-fsanitize-stable-abi'

Observed IR contains raw loads/stores and no sanitizer hook calls:

%4 = load i8, ptr addrspace(1) %3, align 1
%7 = load i8, ptr addrspace(1) %6, align 1
%11 = load i8, ptr addrspace(1) %10, align 1
store i32 %13, ptr addrspace(1) %1, align 4

With only the non-XNACK policy gate bypassed in a local clang build, the same
source emits the expected stable hook calls:

call void @__asan_load1(i64 %0)
%1 = load i8, ptr addrspace(1) %arrayidx, align 1
call void @__asan_load1(i64 %2)
%3 = load i8, ptr addrspace(1) %arrayidx3, align 1
call void @__asan_load1(i64 %4)
%5 = load i8, ptr addrspace(1) %arrayidx6, align 1
call void @__asan_store4(i64 %6)
store i32 %add8, ptr addrspace(1) %output.coerce, align 4

That is the key observation: the backend can already represent and lower this
instrumented code shape for a non-XNACK target. The blocker is policy coupling
to the stock runtime model, not the target's ability to execute explicit shadow
checks.

The XNACK requirement can remain valid for stock ROCm ASAN if that runtime
depends on replayable faults, host-visible fault handling, or unified CPU/GPU
ASAN assumptions. The external mode is a different contract: the launcher/runtime
publishes shadow state explicitly and the compiler inserts explicit checks.
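
The explicit check that such a hook performs can be modeled on the host. This
is a minimal sketch, assuming the conventional ASAN encoding of one signed
shadow byte per 8 application bytes. The demo_* names, the 64-byte modeled
heap, and the use of heap offsets in place of device addresses are illustrative
assumptions, not the gist's runtime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one signed shadow byte describes 8 application bytes. */
enum { DEMO_GRANULE_SHIFT = 3, DEMO_HEAP_BYTES = 64 };

/* Shadow state that a launcher/runtime would publish before the launch. */
int8_t demo_shadow[DEMO_HEAP_BYTES >> DEMO_GRANULE_SHIFT];

/* Mark [0, valid_bytes) addressable and the rest of the modeled heap poisoned. */
void demo_publish_allocation(size_t valid_bytes) {
  assert(valid_bytes <= DEMO_HEAP_BYTES);
  for (size_t g = 0; g < sizeof(demo_shadow); ++g) {
    size_t start = g << DEMO_GRANULE_SHIFT;
    if (start + 8 <= valid_bytes)
      demo_shadow[g] = 0;                              /* all 8 bytes valid */
    else if (start < valid_bytes)
      demo_shadow[g] = (int8_t)(valid_bytes - start);  /* first k bytes valid */
    else
      demo_shadow[g] = -1;                             /* fully poisoned */
  }
}

/* Returns true when a naturally aligned access of `size` bytes (1, 2, 4, or 8)
 * at heap offset `addr` touches poisoned memory. No fault replay is involved:
 * the decision is made before the access would execute. */
bool demo_access_is_poisoned(uint64_t addr, uint32_t size) {
  if (addr >= DEMO_HEAP_BYTES) return true;  /* outside the modeled range */
  int8_t s = demo_shadow[addr >> DEMO_GRANULE_SHIFT];
  if (s == 0) return false;
  /* Last byte touched within the granule must stay below the valid count. */
  return (int8_t)((addr & 7) + size - 1) >= s;
}
```

With a 13-byte allocation published, a byte load at offset 12 passes and a
32-bit store at offset 12 is rejected, because it would touch offsets 13 to 15.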

2. External runtime bitcode needs local no-sanitize state without suppressing the consumer module

An external ASAN runtime bitcode module needs two things at once:

  • selected support globals/functions are not instrumented
  • the linked consumer GPU module remains instrumented

The support globals may have a binary ABI with the launcher/runtime. ASAN global
instrumentation can add padding, ODR indicator symbols, and registration data,
which changes the symbol layout and breaks that ABI.

The support functions implement the __asan_* hooks. Instrumenting those hook
implementations is wrong: their own memory accesses would call back into the
hooks.
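
The source-level shape of such support code might look like the sketch below.
It assumes the runtime marks its globals with clang's no_sanitize("address")
attribute and its functions with disable_sanitizer_instrumentation; the gist's
actual source may differ. The struct layout and every demo_* name are
hypothetical, since the real globals' types are elided in this issue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand to nothing on compilers that lack these clang attributes. */
#ifndef __has_attribute
#define __has_attribute(x) 0
#endif
#if defined(__clang__) && __has_attribute(no_sanitize)
#define DEMO_NO_ASAN_GLOBAL __attribute__((no_sanitize("address")))
#else
#define DEMO_NO_ASAN_GLOBAL
#endif
#if defined(__clang__) && __has_attribute(disable_sanitizer_instrumentation)
#define DEMO_NO_INSTRUMENT __attribute__((disable_sanitizer_instrumentation))
#else
#define DEMO_NO_INSTRUMENT
#endif

/* Hypothetical launcher-visible layout. The launcher writes these fields by
 * offset, so ASAN padding or redzones around the global would break it. */
typedef struct {
  uint64_t shadow_base;   /* address of the shadow region */
  uint64_t shadow_bytes;  /* size of the shadow region in bytes */
} demo_shadow_config_t;

_Static_assert(sizeof(demo_shadow_config_t) == 16, "binary ABI with launcher");
_Static_assert(offsetof(demo_shadow_config_t, shadow_bytes) == 8,
               "binary ABI with launcher");

/* Support global: must keep its exact size and symbol layout. */
DEMO_NO_ASAN_GLOBAL demo_shadow_config_t demo_shadow_config;

/* Support function: reads the config without being instrumented itself. */
DEMO_NO_INSTRUMENT uint64_t demo_shadow_address(uint64_t addr) {
  return demo_shadow_config.shadow_base + (addr >> 3);
}
```

The static assertions state the layout contract in the source, but they cannot
detect the padding and ODR indicator symbols that ASAN global instrumentation
adds later in the pipeline, which is why the per-global marker has to survive
into the IR.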

Compiling the support runtime without ASAN flags preserves
disable_sanitizer_instrumentation on functions, but does not preserve
no_sanitize_address on the support globals in emitted IR:

@external_shadow_config = protected addrspace(1) global ..., align 8
@external_feedback_config = protected addrspace(1) global ..., align 8

Linking that non-ASAN runtime bitcode into a sanitized GPU module produces final
object symbols like:

B __odr_asan_gen_external_feedback_config
B __odr_asan_gen_external_shadow_config
B external_feedback_config
B external_shadow_config

Those __odr_asan_gen_* symbols show that the support globals were treated as
ASAN globals. That is the wrong result for ABI globals owned by an external
runtime.

Compiling the support runtime through ASAN cc1 flags emits the per-global
markers the support library needs:

@external_shadow_config = protected addrspace(1) global ..., no_sanitize_address, align 8
@external_feedback_config = protected addrspace(1) global ..., no_sanitize_address, align 8

But it also emits a module-wide flag:

!llvm.module.flags = !{!0, !1, !2}
!2 = !{i32 4, !"nosanitize_address", i32 1}

When this bitcode is linked into the sanitized consumer module, clang reports:

warning: Redundant instrumentation detected, with module flag: nosanitize_address

and the final kernel body contains raw loads/stores instead of calls into the
ASAN hook path:

global_load_u8 ...
global_load_u8 ...
global_load_u8 ...
global_store_b32 ...

This is the opposite failure mode: the support globals keep their exact layout,
but the consumer module loses the instrumentation we wanted.

The desired behavior is local-only no-sanitize state:

  • keep per-global no_sanitize_address on selected support globals
  • keep disable_sanitizer_instrumentation on selected support functions
  • do not propagate a module-wide nosanitize_address flag into the linked
    consumer module
  • keep the consumer module eligible for ASAN instrumentation

The gist script demonstrates this by removing only the module-wide
nosanitize_address flag from the ASAN-compiled support runtime IR and keeping
the per-global attributes. The final object then contains the support globals
with their exact layout and no ODR indicators:

B external_feedback_config
B external_shadow_config

The final kernel still calls sanitizer hooks before memory operations:

s_swappc_b64 ...   ; call __asan_load1
global_load_u8 ...
s_swappc_b64 ...   ; call __asan_load1
global_load_u8 ...
s_swappc_b64 ...   ; call __asan_load1
global_load_u8 ...
s_swappc_b64 ...   ; call __asan_store4
global_store_b32 ...

LLVM needs a principled way to build or link external sanitizer runtime bitcode
with item-local sanitizer exclusions but without transitive module-wide
nosanitize_address suppression.

Possible implementation shapes:

  • preserve no_sanitize_address on annotated globals without requiring the
    runtime source file to be compiled under whole-module ASAN mode
  • or treat external ASAN runtime bitcode as trusted support code, preserve
    item-level no_sanitize_address attributes, and drop/ignore the module-wide
    nosanitize_address flag for consumer instrumentation decisions

The important distinction is local exclusion vs. transitive exclusion. External
runtime bitcode needs local exclusion.

3. External ASAN should not require the stock hostcall ABI

Current AMDGPU ASAN instrumentation still introduces stock hostcall-related
kernel metadata even when the linked runtime hooks do not use hostcall.

Using the local-only support bitcode shape above, the final raw AMDGPU object
still contains:

.value_kind: hidden_hostcall_buffer
.kernarg_segment_size: 272
.name: sample_checked_kernel

The source kernel only has two explicit pointer arguments. The hidden hostcall
buffer is stock-runtime ABI coupling. It is not needed by an external runtime
that reports through a different device-global, queue, signal, trap, or
launcher-owned feedback mechanism.

The external-runtime mode should separate memory access check insertion from
the stock HIP/ROCm ASAN reporting transport.

In external mode:

  • compiler-emitted memory checks still call __asan_* hooks
  • stock hidden hostcall kernel arguments are not added unless explicitly needed
  • stock hostcall metadata is not required for the code object to be considered valid
  • the external runtime supplies any reporting globals, queues, or signals it needs

Proposed LLVM Capability

Add an explicit external-runtime mode for AMDGPU AddressSanitizer device code.
Flag spelling is open; the semantic shape matters more than the exact name:

-fsanitize=address
-fsanitize-stable-abi
-fgpu-address-sanitizer-runtime=external

or:

-fsanitize=address
-fsanitize-address-gpu-runtime=external

The mode would:

  • enable compiler insertion of ASAN memory access checks for AMDGPU device code
  • allow non-XNACK targets when the external mode is explicit
  • emit calls to the stable __asan_* hook ABI
  • link or otherwise allow a caller-provided device runtime bitcode library
  • preserve item-local no-sanitize state in the external runtime bitcode
  • avoid stock hostcall hidden kernel arguments when external mode does not use them
  • leave stock ROCm ASAN behavior unchanged when external mode is not selected

The runtime bitcode link mechanism should be a supported driver-level path, not
a user-visible dependency on cc1-only -mlink-bitcode-file choreography. The
cc1 flag is useful as a proof mechanism, but an external runtime feature needs
clear driver semantics.

Hook ABI Surface

The minimal reproducer only needs:

__asan_load1
__asan_store4
__asan_init
__asan_version_mismatch_check_v8
__asan_register_elf_globals
__asan_unregister_elf_globals

A complete external runtime must match the actual hook surface emitted by the
selected ASAN lowering mode, including fixed-width loads/stores, variable-width
accesses, report/noabort forms, poison/unpoison helpers, and any device-library
support symbols the frontend references.

This does not necessarily require LLVM to invent a new ABI. The better shape is:

  • external mode emits the same stable hook names where possible
  • any stock-runtime-only hooks are documented or disabled in external mode
  • the selected hook surface is testable without linking the stock runtime

The relationship with -fsanitize-stable-abi should be explicit. External GPU
ASAN likely wants that mode required or implied; unstable private ABI expansion
would make external runtimes brittle.
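
For orientation, the six hooks of the minimal reproducer can be stubbed as
below. The parameter lists follow my understanding of the compiler-rt ASAN
interface declarations (an address-sized integer for the access hooks; a flag
pointer plus start/stop pointers for the ELF globals hooks) and should be
checked against the LLVM revision in use. The demo_ prefix is an assumption
that keeps this host build from colliding with a host ASAN runtime; device
bitcode would define the real __asan_* names.

```c
#include <assert.h>
#include <stdint.h>

/* Counters so a test can observe which hooks ran. */
unsigned demo_load1_calls;
unsigned demo_store4_calls;
unsigned demo_init_calls;

/* Stand-in for __asan_load1: one call per checked byte load. */
void demo_asan_load1(uintptr_t addr) { (void)addr; ++demo_load1_calls; }

/* Stand-in for __asan_store4: one call per checked 32-bit store. */
void demo_asan_store4(uintptr_t addr) { (void)addr; ++demo_store4_calls; }

/* Stand-in for __asan_init: module constructor hook. */
void demo_asan_init(void) { ++demo_init_calls; }

/* Stand-in for __asan_version_mismatch_check_v8: nothing to verify here. */
void demo_asan_version_mismatch_check_v8(void) {}

/* Stand-ins for __asan_register_elf_globals / __asan_unregister_elf_globals.
 * An external runtime that excludes its globals may leave these nearly empty;
 * the flag tracks whether registration already happened. */
void demo_asan_register_elf_globals(uintptr_t *flag, void *start, void *stop) {
  (void)start;
  (void)stop;
  if (*flag) return;
  *flag = 1;
}

void demo_asan_unregister_elf_globals(uintptr_t *flag, void *start, void *stop) {
  (void)start;
  (void)stop;
  *flag = 0;
}
```

Driving the stubs in the same pattern as the sample kernel (three byte loads,
one 32-bit store) gives a link-level check of the hook surface without the
stock runtime.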

Non-Goals

This is not a request to change stock ROCm ASAN semantics.

This is not a request to make stock ASAN work without XNACK.

This is not a request for LLVM to standardize one downstream shadow memory
layout, feedback ring, or report packet format.

This is not a request to support arbitrary legacy pointer behavior that only
works with replayable faults. External-runtime ASAN is for launchers/runtimes
that can make allocation ranges and shadow state explicit.

Suggested Acceptance Criteria

Stock behavior remains unchanged without the new external-runtime option.

With the external-runtime option, compiling the sample kernel for gfx1100
with -fsanitize=address -fsanitize-stable-abi emits __asan_load* and
__asan_store* calls instead of warning that ASAN is ignored.

The emitted object can be linked against a caller-provided device runtime that
defines the required hooks.

Support bitcode can keep selected globals/functions uninstrumented without
disabling consumer instrumentation.

External mode does not add stock hostcall hidden kernel arguments when the
external runtime does not request them.

LLVM tests cover this with generic HIP/C/LLVM IR inputs independent of any
downstream runtime source tree.

Current Confidence

Directly observed:

  • stock clang ignores device ASAN on non-XNACK AMDGPU targets
  • bypassing only that policy gate causes the sample kernel to emit __asan_* hook calls
  • support bitcode compiled without ASAN flags loses per-global no_sanitize_address
  • ASAN-compiled support bitcode preserves per-global no_sanitize_address but
    emits a transitive module flag
  • removing only the module-wide flag gives the desired final object shape
  • current ASAN lowering still emits hidden_hostcall_buffer metadata

Inferred:

  • an upstream external-runtime mode can be mostly policy/linkage-shaped
  • the backend does not need a new lowering strategy for explicit check calls
  • stock ASAN can remain unchanged

Evidence that would change this conclusion:

  • AMDGPU backend lowering failures for real external-runtime hook bodies
  • code object loader requirements that make hidden_hostcall_buffer mandatory
    for any ASAN-marked kernel
  • an LLVM module-flag reason that prevents local-only support bitcode semantics

None of those appeared in the reproducers above.
