rocr: Drain ASAN quarantine on runtime teardown #7764

Open
ApurvMishra-amd wants to merge 2 commits into develop from users/apumishr/ROCM-26385

@ApurvMishra-amd ApurvMishra-amd commented Jun 24, 2026

Contributor

Drain ASan quarantine on runtime teardown to avoid stale device chunks

Motivation

On full teardown, Unload() unmaps ROCr device memory still held in the ASAN
device allocator's quarantine, leaving dangling chunks in the sanitizer's
process-global allocator. A later hsa_amd_pointer_info() then reads an
uninitialized chunk header and aborts in DeviceAllocatorT::GetBlockBegin.

Separately, under ASAN, hsa_amd_memory_pool_free() does not release the memory
immediately. The sanitizer holds it in a quarantine and performs the real
release (and runs ROCr's deallocation notifier) later. rocrtst tests that check
the result of a free right after calling it therefore fail: Memory_Available
sees the freed VRAM as still in use, and Deallocation_Notifier sees that its
callback has not yet run.

Technical Details

Call __sanitizer_purge_allocator() in Runtime::Release() before Unload() so
the quarantine is drained while device memory is still mapped. Guarded by
SANITIZER_AMDGPU.
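
A minimal sketch of the ordering described above, not the actual ROCr code. The `sketch::Runtime` class, its `trace()` accessor, and `TeardownOrder()` are hypothetical names used only for illustration; the real `Runtime::Release()` does more (reference counting, locking). The `#if defined(SANITIZER_AMDGPU)` form of the guard is also an assumption. The only real API used is `__sanitizer_purge_allocator()`, which is declared in `<sanitizer/allocator_interface.h>`.

```cpp
#include <cassert>
#include <string>
#include <vector>

#if defined(SANITIZER_AMDGPU)
// Same prototype as in <sanitizer/allocator_interface.h>.
extern "C" void __sanitizer_purge_allocator(void);
#endif

namespace sketch {

// Hypothetical stand-in for the runtime; it only records the step order.
class Runtime {
 public:
  void Release() {
#if defined(SANITIZER_AMDGPU)
    // Drain the quarantine while device memory is still mapped, so no
    // quarantined chunk outlives the mappings removed by Unload().
    __sanitizer_purge_allocator();
    trace_.push_back("purge");
#endif
    Unload();
  }

  const std::vector<std::string>& trace() const { return trace_; }

 private:
  // Placeholder for the real teardown that unmaps device memory.
  void Unload() { trace_.push_back("unload"); }

  std::vector<std::string> trace_;
};

// Runs one teardown and returns the recorded step order.
inline std::vector<std::string> TeardownOrder() {
  Runtime rt;
  rt.Release();
  assert(!rt.trace().empty());
  return rt.trace();
}

}  // namespace sketch
```

In a build without `SANITIZER_AMDGPU` the purge step is compiled out and only `"unload"` is recorded; with it defined, `"purge"` always precedes `"unload"`.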

In the affected rocrtst tests, drain the quarantine with
__sanitizer_purge_allocator() after the free so the deferred release completes
before each check (ROCRTST_ASAN-guarded; no-op otherwise).
The recursive and double-free notifier callbacks run while the quarantine is
already being drained, and the drain cannot be invoked again from inside
itself, so their inner same-callback checks are skipped under ASAN. The outer
notifier behavior is still validated.
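
The deferred-release behavior and the re-entrancy limit can be illustrated with a toy model. This is a sketch under stated assumptions, not the sanitizer's allocator: `ToyQuarantine`, `CheckResult`, and `RunToyCheck()` are hypothetical, and the toy simply refuses a nested purge, whereas the PR only states that the real drain cannot be invoked again from inside itself.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for a quarantining allocator (hypothetical, illustration only).
class ToyQuarantine {
 public:
  using Notifier = std::function<void()>;

  void Allocate(std::size_t bytes) { in_use_ += bytes; }

  // "Free" only parks the block; usage and the notifier are left untouched.
  void DeferredFree(std::size_t bytes, Notifier on_release) {
    pending_.push_back(Entry{bytes, std::move(on_release)});
  }

  // Performs the real release. Returns false when called re-entrantly,
  // i.e. from a notifier that runs while a drain is already in progress.
  bool Purge() {
    if (draining_) return false;
    draining_ = true;
    while (!pending_.empty()) {
      Entry e = std::move(pending_.back());
      pending_.pop_back();
      in_use_ -= e.bytes;
      if (e.on_release) e.on_release();
    }
    draining_ = false;
    return true;
  }

  std::size_t in_use() const { return in_use_; }

 private:
  struct Entry {
    std::size_t bytes;
    Notifier on_release;
  };
  std::vector<Entry> pending_;
  std::size_t in_use_ = 0;
  bool draining_ = false;
};

struct CheckResult {
  std::size_t in_use_before_drain = 0;
  std::size_t in_use_after_drain = 0;
  bool callback_ran_before_drain = false;
  bool callback_ran_after_drain = false;
  bool inner_purge_accepted = false;
};

// Mirrors the test pattern: free, check, drain, check again.
inline CheckResult RunToyCheck() {
  ToyQuarantine q;
  CheckResult r;
  bool ran = false;
  q.Allocate(4096);
  q.DeferredFree(4096, [&] {
    ran = true;
    // A drain requested from inside the notifier is refused by the toy.
    r.inner_purge_accepted = q.Purge();
  });
  r.in_use_before_drain = q.in_use();
  r.callback_ran_before_drain = ran;
  q.Purge();
  r.in_use_after_drain = q.in_use();
  r.callback_ran_after_drain = ran;
  return r;
}
```

In this model a check placed right after the free still sees the memory in use and the callback not run; both change only after the explicit drain, and the nested purge attempted from the callback has no effect, which is why the inner checks are skipped.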

JIRA ID

ROCM-26384, 26746

Test Plan

Run the full rocrtst suite.

Copilot AI left a comment

Pull request overview

Drains the AddressSanitizer (AMDGPU) device allocator quarantine during ROCr runtime teardown to prevent stale/dangling device chunks from surviving past Runtime::Unload(), which can later trip hsa_amd_pointer_info().

Changes:

  • Forward-declare __sanitizer_purge_allocator() behind SANITIZER_AMDGPU.
  • Invoke __sanitizer_purge_allocator() in Runtime::Release() immediately before Unload().

Call __sanitizer_purge_allocator() in Runtime::Release() before
Unload() so the ASAN quarantine is drained while device memory
is still mapped. Guarded by SANITIZER_AMDGPU.

Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>

Drain the ASAN quarantine with __sanitizer_purge_allocator(), so the
deferred release completes before each check (ROCRTST_ASAN-guarded).

The recursive and double-free notifier callbacks run while the quarantine
is already being drained, and the drain cannot be invoked again from inside
itself, so their inner repeated callback checks are skipped under ASAN. The
outer notifier behavior is still validated.

Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>