rocr: Close shareable dmabuf fds after use instead of holding them open by dayatsin-amd · Pull Request #7786 · ROCm/rocm-systems

dayatsin-amd · 2026-06-24T22:39:49Z

Summary

Stop holding a shareable dmabuf_fd open for the entire lifetime of a VMM shareable handle; CreateShareableHandle now leaves dmabuf_fd as -1 after converting the allocation to a driver handle.
Lazily export a dmabuf fd in VMemorySetAccessPerHandle when peer import needs it, then close it again via a scope guard before returning.
Pass DriverMemoryHandle* through ImportMemoryHandle (reading .dmabuf_fd or .fabric_handle) and close fds in MemoryHandle destructor / DestroyMemoryHandle.

Fabric-handle export/import is unchanged and continues to use the internal BO handle path.
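The lazy-export-then-close flow described in the summary can be sketched as follows. This is a minimal illustration, not the PR's code: `FakeFdTable`, `ScopeGuard`, `HandleSketch`, and `SetAccessSketch` are hypothetical names, and the fake table only counts open fds instead of calling the real export and close routines.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the real export/close calls; it only counts fds.
struct FakeFdTable {
  int next_fd = 3;
  int open_count = 0;
  int Export() { ++open_count; return next_fd++; }
  void Close(int fd) { assert(fd >= 0); --open_count; }
};

// Minimal scope guard: runs the stored callable when it goes out of scope.
template <typename F>
class ScopeGuard {
 public:
  explicit ScopeGuard(F f) : f_(std::move(f)) {}
  ~ScopeGuard() { f_(); }
  ScopeGuard(const ScopeGuard&) = delete;
  ScopeGuard& operator=(const ScopeGuard&) = delete;

 private:
  F f_;
};

struct HandleSketch {
  int dmabuf_fd = -1;  // -1 means "no fd currently exported"
};

// Exports an fd only if none is present, then closes it on every exit path.
bool SetAccessSketch(FakeFdTable& fds, HandleSketch& h, bool import_ok) {
  bool created_dmabuf_fd = false;
  if (h.dmabuf_fd < 0) {
    h.dmabuf_fd = fds.Export();
    created_dmabuf_fd = true;
  }
  auto cleanup = [&] {
    if (created_dmabuf_fd) {
      fds.Close(h.dmabuf_fd);
      h.dmabuf_fd = -1;
    }
  };
  ScopeGuard<decltype(cleanup)> guard(cleanup);

  if (!import_ok) return false;  // early return: the guard still closes the fd
  return true;
}
```

The guard only closes an fd that this call created, so a handle that already carried an fd on entry (an assumption about the imported-handle path) keeps it.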

Test plan

Built libhsa-runtime64 on S83-1 (MI300X) against TheRock toolchain
Unit_hipMemExportToShareableHandle_Positive_Basic — pass
Unit_hipMemImportFromShareableHandle_Positive_Basic — pass
Full VirtualMemoryManagementTest suite (85 cases, HIP_VISIBLE_DEVICES=0): 69 passed / 6 failed / 10 skipped
A/B vs pre-patch baseline ROCr: same 6 failures in full suite; all 6 previously-failing tests pass in isolation on both builds — failures are not regressions from this change (suite-order / VA exhaustion)

Made with Cursor

Defer creating the shareable dmabuf_fd until VMemorySetAccessPerHandle needs it for peer import, then close it again via a scope guard. This avoids leaking open fds for the lifetime of a VMM shareable handle while keeping fabric-handle export/import on the existing BO-based path. Co-authored-by: Cursor <cursoragent@cursor.com>

cfreeamd

I don't see any big problems. I have a few suggestions for consideration, but nothing blocking.

cfreeamd · 2026-06-25T04:20:01Z

@@ -4148,6 +4153,11 @@ Runtime::MemoryHandle::MemoryHandle(hsa_fabric_handle_t fabric_handle)
 Runtime::MemoryHandle::~MemoryHandle() {
  if (driver_handle.handle != 0 && region != nullptr)
    agentOwner()->driver().DestroyMemoryHandle(&driver_handle);


Suggestion from Claude related to line 4155 and the block at 4157. In a nutshell: if we derive from the base class in the future, we may miss the cleanup that happens here.

~MemoryHandle dual-cleanup coupling — runtime.cpp:4153

The destructor now has two cleanup sites:
  if (driver_handle.handle != 0 && region != nullptr)
    agentOwner()->driver().DestroyMemoryHandle(&driver_handle);  // (A)

  if (driver_handle.dmabuf_fd >= 0) {  // (B)
    os::DmaBufClose(driver_handle.dmabuf_fd);
    driver_handle.dmabuf_fd = -1;
  }

This is safe today because KfdDriver::DestroyMemoryHandle now resets dmabuf_fd to -1 before returning (added in this PR), so block (B) won't fire for owned handles. But the two cleanup sites are coupled through an informal convention: any future Driver subclass whose DestroyMemoryHandle closes the fd without resetting dmabuf_fd will silently double-close it. That is a latent fd-aliasing bug with no compiler enforcement, because the fd number may already have been reused by an unrelated open.

Suggestion: Either make the base class reset dmabuf_fd to -1 after the virtual call, or merge the dmabuf_fd close into DestroyMemoryHandle entirely and remove block (B), with a brief comment explaining the imported-handle path.
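One way to read this suggestion is the non-virtual interface pattern: the base class owns the fd close and reset, and subclasses override only a protected hook. The sketch below is an assumption about how that could look, not the ROCr class layout; `DriverSketch`, `DoDestroyMemoryHandle`, `DriverMemoryHandleSketch`, and `CloseFdSketch` are hypothetical names.

```cpp
#include <cassert>

// Hypothetical, simplified version of the driver handle.
struct DriverMemoryHandleSketch {
  unsigned long handle = 0;
  int dmabuf_fd = -1;  // -1 means "no fd held"
};

int g_close_calls = 0;

// Stand-in for the real close routine; it only counts calls.
void CloseFdSketch(int fd) {
  assert(fd >= 0);
  ++g_close_calls;
}

class DriverSketch {
 public:
  virtual ~DriverSketch() = default;

  // Non-virtual entry point: the fd is closed and reset here, exactly once,
  // regardless of what the subclass hook does.
  void DestroyMemoryHandle(DriverMemoryHandleSketch* h) {
    DoDestroyMemoryHandle(h);
    if (h->dmabuf_fd >= 0) {
      CloseFdSketch(h->dmabuf_fd);
      h->dmabuf_fd = -1;
    }
    h->handle = 0;
  }

 protected:
  // Subclasses release driver-specific resources only; they never touch the fd.
  virtual void DoDestroyMemoryHandle(DriverMemoryHandleSketch* h) = 0;
};

// A subclass that knows nothing about dmabuf_fd is still safe.
class ForgetfulDriver : public DriverSketch {
 protected:
  void DoDestroyMemoryHandle(DriverMemoryHandleSketch*) override {}
};
```

With this shape, a second DestroyMemoryHandle call on the same handle finds dmabuf_fd == -1 and does nothing, so the destructor would not need its own block (B) for owned handles.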

cfreeamd · 2026-06-25T04:29:28Z

+   * holding an fd open for the lifetime of the handle. Export it lazily here so the target agents
+   * can import the memory below, then close it again before returning.
+   */
+  bool created_dmabuf_fd = false;


I'm not sure how often a single MemoryHandle is mapped across multiple VA chunks. If it's rare, this probably isn't worth worrying about, but here's Claude's feedback on it:

[MINOR] Per-chunk re-export amplification — runtime.cpp:4166

When a single MemoryHandle is mapped across multiple VA chunks (e.g., the same handle mapped twice via VMemoryHandleMap), VMemorySetAccess calls
VMemorySetAccessPerHandle once per chunk. Each call finds dmabuf_fd == -1 (the scope guard reset it) and does a fresh ExportMemoryHandle → kernel
ioctl. Cost becomes O(chunks × agents) rather than O(agents). Not a correctness issue, but worth lifting the lazy-export out to the per-handle
level in VMemorySetAccess if this path gets hot.
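As a rough illustration of the amplification, the toy model below counts export calls when the export sits inside the per-chunk function versus when it is hoisted to the per-handle level. It is a sketch under stated assumptions: `PerChunkExport` and `PerHandleExport` are hypothetical names, a single handle backs every chunk, and imports are assumed to stay per chunk and per agent in both variants.

```cpp
#include <cassert>

struct Counters {
  int exports = 0;  // export ioctls issued
  int imports = 0;  // per-agent imports performed
};

// Export inside the per-chunk call: each chunk sees dmabuf_fd == -1 and re-exports.
Counters PerChunkExport(int chunks, int agents) {
  assert(chunks >= 0 && agents >= 0);
  Counters c;
  for (int chunk = 0; chunk < chunks; ++chunk) {
    ++c.exports;  // fresh export for this chunk
    for (int agent = 0; agent < agents; ++agent) ++c.imports;
    // the scope guard would close the fd here
  }
  return c;
}

// Export hoisted to the per-handle level: one export covers all chunks.
Counters PerHandleExport(int chunks, int agents) {
  assert(chunks >= 0 && agents >= 0);
  Counters c;
  if (chunks > 0) ++c.exports;  // single export for the handle
  for (int chunk = 0; chunk < chunks; ++chunk) {
    for (int agent = 0; agent < agents; ++agent) ++c.imports;
  }
  // the fd would be closed once here, after the last chunk
  return c;
}
```

In this model only the number of exports changes; whether imports could also be shared across chunks depends on details of the import path that the comment does not cover.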

dayatsin-amd requested review from HongleiHuang-amd, atgutier, cfreeamd, kentrussell and ypapadop-amd as code owners June 24, 2026 22:39

github-actions Bot added the project: rocr-runtime label Jun 24, 2026

systems-assistant Bot added the organization: ROCm label Jun 24, 2026

cfreeamd approved these changes Jun 25, 2026

dayatsin-amd wants to merge 1 commit into develop from users/dayatsin/close-dmabuf-handles
