Use caller CUDA stream for D2H and H2D copies (#20498) by Conarnar · Pull Request #20498 · pytorch/executorch

Conarnar · 2026-06-24T22:51:11Z

Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via getCallerStream()), copy_host_to_device and copy_device_to_host use cudaMemcpyAsync. When no caller stream is set, the synchronous cudaMemcpy path is used as before.

Additionally:

Added null pointer and zero-byte validation — null dst/src return Error::InvalidArgument instead of aborting in cudaMemcpy, and zero-byte copies return Error::Ok early.
Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
Wired //executorch/extension/cuda:caller_stream dependency in TARGETS.
Added extension_cuda dependencies to CMakeLists.txt.
Added test_cuda_allocator with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531

pytorch-bot · 2026-06-24T22:51:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 20 Pending

As of commit c657616 with merge base b919db7 ():
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-06-24T22:51:21Z

@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

github-actions · 2026-06-24T22:52:05Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

+  cudaError_t err = cudaSuccess;
+  const auto caller_stream = executorch::extension::cuda::getCallerStream();
+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
+  } else {
+    err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
+  }


+  // TODO: validate caller stream device matches index.
+  // For now assert single-GPU case.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_host_to_device only supports device 0, got %d",
+      static_cast<int>(index));


+  // TODO: validate caller stream device matches index.
+  // For now assert single-GPU case.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_device_to_host only supports device 0, got %d",
+      static_cast<int>(index));


+  cudaStream_t s;
+  ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
+  executorch::extension::cuda::CallerStreamGuard g(s);
+
+  CudaAllocator& a = CudaAllocator::instance();
+  auto res = a.allocate(256, 0);
+  ASSERT_TRUE(res.ok());
+  void* d = res.get();
+  std::vector<uint8_t> h(256, 7);
+  // should take async branch internally, still return Ok
+  EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
+  a.deallocate(d, 0);
+  cudaStreamDestroy(s);


+  cudaStream_t s;
+  ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
+  executorch::extension::cuda::CallerStreamGuard g(s);
+
+  CudaAllocator& a = CudaAllocator::instance();
+  auto res = a.allocate(256, 0);
+  ASSERT_TRUE(res.ok());
+  void* d = res.get();
+  std::vector<uint8_t> h_src(256, 5), h_dst(256, 0);
+  ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
+  EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
+  EXPECT_EQ(h_src, h_dst);
+
+  a.deallocate(d, 0);
+  cudaStreamDestroy(s);


Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  cudaError_t err = cudaSuccess;
+  const auto caller_stream = executorch::extension::cuda::getCallerStream();
+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
+    // We don't synchronize the stream here because the caller is expected to


+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
+    if (err == cudaSuccess) {
+      err = cudaStreamSynchronize(*caller_stream);
+    }


+    ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
+    EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
+    EXPECT_EQ(h_src, h_dst);


Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

 #include <cuda_runtime.h>

+#include <executorch/extension/cuda/caller_stream.h>
 #include <executorch/runtime/platform/log.h>


+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(
+      index == -1 || index == 0,
+      InvalidArgument,
+      "CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
+      static_cast<int>(index));


+  if (caller_stream) {
+    err = cudaMemcpyAsync(
+        dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
+    if (err == cudaSuccess) {
+      err = cudaStreamSynchronize(*caller_stream);
+    }
+  } else {


+    std::vector<uint8_t> h(256, 7);
+    // should take async branch internally, still return Ok
+    EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
+    ASSERT_EQ(cudaStreamSynchronize(s), cudaSuccess);
+    a.deallocate(d, 0);


+  std::vector<uint8_t> h_src(N, 42);
+  EXPECT_EQ(a.copy_host_to_device(dptr, h_src.data(), N, 0), Error::Ok);
+
+  a.deallocate(dptr, 0);


Gasoonjia · 2026-06-25T21:13:17Z

@claude review the PR plz

claude · 2026-06-25T21:14:01Z

Claude finished @Gasoonjia's task in 2m 45s —— View job

PR Review: Use caller CUDA stream for D2H and H2D copies

Thanks @Conarnar — solid direction, and the null/zero-byte hardening + new tests are welcome. I read all four changed files plus the surrounding build files and the DeviceAllocator contract. Below are the issues I think matter, in priority order. Several echo Copilot's findings, which I've independently confirmed.

🔴 1. H2D async path never synchronizes → breaks the blocking contract (use-after-free risk)

backends/cuda/runtime/cuda_allocator.cpp:162-170

if (caller_stream) {
  err = cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
  // We don't synchronize the stream here because the caller is expected to
  // synchronize the stream.
} else {
  err = cudaMemcpy(...);          // blocking
}

The base API in runtime/core/device_allocator.h:73 is documented as a plain blocking copy, and the sole caller — kernels/portable/cpu/op__device_copy.cpp:79 — returns out immediately with no stream handle to synchronize on:

Error err = allocator->copy_host_to_device(out.mutable_data_ptr(), self.const_data_ptr(), nbytes, device_index);
// ... returns out; the kernel has no access to the caller stream

So "the caller is expected to synchronize" doesn't hold here — the op kernel can't. The concrete hazard: src is a host buffer (self). If it is pinned, cudaMemcpyAsync is truly asynchronous and the host buffer may be freed/overwritten before the copy lands. Even for pageable memory the device result isn't ordered against subsequent non-stream work. copy_device_to_host does synchronize (line 226-228), so the two halves of the same API now have opposite blocking semantics — that asymmetry is the core problem.

Pick one and make both consistent:

(a) synchronize H2D too (cudaStreamSynchronize on success), keeping the blocking contract — simplest and safest; or
(b) keep H2D async, drop the D2H sync, and update the base-class doc + every caller to synchronize explicitly — larger change.

Given the current callsite, (a) is the right scope for this PR. Fix this →

🔴 2. Likely link/load break on non-MSVC shared builds

backends/cuda/CMakeLists.txt:106, 124, 169-175

On non-MSVC, cuda_allocator.cpp is compiled only into the aoti_cuda_shims SHARED library (line 106 / 124); it is added to aoti_cuda_backend only under _cuda_is_msvc_toolchain (line 194). But aoti_cuda_shims's link list (line 169-175) is cuda_platform, aoti_common_shims_slim, CUDA::cudart, CUDA::curand, ${CMAKE_DL_LIBS} — no extension_cuda, which is where getCallerStream() is defined (extension/cuda/CMakeLists.txt:25).

This PR wired the Buck dep (TARGETS:95) and the test, but not the CMake link for the shared-lib path. The symbol currently only resolves transitively/by accident through aoti_cuda_backend → extension_cuda in the global symbol scope, which is fragile (and would fail under -Wl,--no-undefined). Add extension_cuda to aoti_cuda_shims's target_link_libraries (non-MSVC branch) to match the TARGETS change. Fix this →

🟠 3. The "single-GPU" guard doesn't actually enforce single-GPU

cuda_allocator.cpp:146-150 and 206-210

ET_CHECK_OR_RETURN_ERROR(index == -1 || index == 0, InvalidArgument,
    "...only supports device 0 or -1 (current), got %d", ...);

index == -1 means "current device" (see the allocate() comment at line 25). On a multi-GPU host a caller can cudaSetDevice(1) then pass -1 and sail past this check — exactly the case the TODO says to defend against, and it's the dangerous one given the caller-stream/device mismatch risk you flag. If the intent is to restrict to device 0 for now, also assert the current device is 0 when index == -1 (and it makes the error message accurate). Fix this →

🟡 4. Tests don't validate H2D actually copied data

backends/cuda/runtime/test/test_cuda_allocator.cpp:37-48, 65-87

CopyHostToDevice and CopyHostToDeviceWithCallerStream only assert the return code is Error::Ok. A no-op/broken copy would still pass. Add a copy_device_to_host readback + EXPECT_EQ on the buffers (as CopyDeviceToHost at line 50 already does) so these catch functional regressions.

🟡 5. D2H caller-stream test relies on allocator-internal sync

test_cuda_allocator.cpp:106-107 — CopyDeviceToHostWithCallerStream compares h_dst right after the call, which only works because of the internal cudaStreamSynchronize (item 1). Add an explicit cudaStreamSynchronize(s) before the compare so the test is robust regardless of which way item 1 is resolved.

Minor / nits

Error code for copy failures. Both copies map a failed cudaMemcpy* to Error::Internal (line 183, 244). A bad dst/src/size from the caller is arguably InvalidArgument; Internal is fine as a catch-all but worth a thought.
Test stream scoping (Copilot): cudaStreamDestroy(s) correctly runs after the CallerStreamGuard scope closes (line 86/111), and return values are asserted — this already addresses the earlier Copilot note. No change needed.
device_type()/memcpy_async unchanged and look fine.

Summary

The async-stream plumbing is the right idea, but I'd block on #1 (semantics/safety) and #2 (build) before merge; #3 is a quick correctness tightening that matches the stated intent; #4/#5 harden the new tests. Nice work on the validation and test coverage otherwise.
• branch export-D109590531

shoumikhin · 2026-06-25T22:04:19Z

+  }
+  // TODO: validate caller stream device matches index.
+  // For now assert index is -1 or 0.
+  ET_CHECK_OR_RETURN_ERROR(


The index==-1 || index==0 guard is too broad. The reason to restrict the device is that a green-context caller stream is tied to its creation device. The plain cudaMemcpy path has no caller stream and used to handle any index via setDevice/restore, and tensor_ptr.cpp passes arbitrary indices and aborts on a non-Ok result, so this guard crashes multi-GPU callers that worked before. Suggest checking the device only in the caller-stream branch and letting the no-stream path copy on any index>=-1 like before. Same applies to copy_device_to_host (line 207).

shoumikhin · 2026-06-25T22:04:19Z

    void* dst,
    const void* src,
    size_t nbytes,
    DeviceIndex index) {


copy_host_to_device and copy_device_to_host are almost identical now (null checks, nbytes==0, the index guard, device save/restore, the stream branch), differing only in the copy kind and the sync. Could fold them into one private helper that takes the cudaMemcpyKind to keep the shared logic in one place.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531

pytorch-bot · 2026-06-25T23:34:15Z

~~Workflows were awaiting approval.~~ CI has now been triggered for the ciflow labels on this PR.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

+  if (caller_stream) {
+    err = cudaMemcpyAsync(dst, src, nbytes, kind, *caller_stream);
+    if (err == cudaSuccess && kind == cudaMemcpyDeviceToHost) {
+      err = cudaStreamSynchronize(*caller_stream);
+    }
+  } else {
+    err = cudaMemcpy(dst, src, nbytes, kind);
+  }


+  const auto caller_stream = executorch::extension::cuda::getCallerStream();
+  if (caller_stream) {
+    // TODO: validate caller stream device matches index.
+    // For now assert index is -1 or 0.
+    ET_CHECK_OR_RETURN_ERROR(
+        index == -1 || index == 0,
+        InvalidArgument,
+        "%s: with caller stream, only supports device 0 or -1 (current), got %d",
+        method,
+        static_cast<int>(index));
+  }


Copilot

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Summary: CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before. Additionally: - Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early. - Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added. - Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS. - Added `extension_cuda` dependencies to CMakeLists.txt. - Added `test_cuda_allocator` with coverage for sync/async paths and error handling. Reviewed By: Gasoonjia Differential Revision: D109590531

Copilot AI review requested due to automatic review settings June 24, 2026 22:51

Conarnar requested review from kirklandsign and larryliu0820 as code owners June 24, 2026 22:51

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026

meta-codesync Bot added the meta-exported label Jun 24, 2026

meta-codesync Bot temporarily deployed to cadence June 24, 2026 22:51 Inactive

Copilot started reviewing on behalf of Conarnar June 24, 2026 22:51 View session

Copilot AI reviewed Jun 24, 2026

meta-codesync Bot changed the title ~~Use caller CUDA stream for D2H and H2D copies~~ Use caller CUDA stream for D2H and H2D copies (#20498) Jun 24, 2026

Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 3d8da75 to 07765c3 Compare June 25, 2026 17:10

Copilot AI review requested due to automatic review settings June 25, 2026 17:10

Copilot started reviewing on behalf of Conarnar June 25, 2026 17:10 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Conarnar force-pushed the export-D109590531 branch from 07765c3 to b316b71 Compare June 25, 2026 17:59

Conarnar force-pushed the export-D109590531 branch from b316b71 to 98081dc Compare June 25, 2026 18:57

Copilot AI review requested due to automatic review settings June 25, 2026 18:57

Copilot started reviewing on behalf of Conarnar June 25, 2026 18:58 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Conarnar force-pushed the export-D109590531 branch from 98081dc to 1e001a5 Compare June 25, 2026 20:50

Copilot AI review requested due to automatic review settings June 25, 2026 20:52

Conarnar force-pushed the export-D109590531 branch from 1e001a5 to 4e57af7 Compare June 25, 2026 20:52

Copilot started reviewing on behalf of Conarnar June 25, 2026 20:52 View session

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Gasoonjia approved these changes Jun 25, 2026

View reviewed changes

Gasoonjia added the ciflow/cuda label Jun 25, 2026

Gasoonjia temporarily deployed to cadence June 25, 2026 21:42 — with GitHub Actions Inactive

shoumikhin reviewed Jun 25, 2026

View reviewed changes

Conarnar force-pushed the export-D109590531 branch from 4e57af7 to d19c2d3 Compare June 25, 2026 23:34

Copilot AI review requested due to automatic review settings June 25, 2026 23:46

Conarnar force-pushed the export-D109590531 branch from d19c2d3 to 665a2f8 Compare June 25, 2026 23:46

Copilot started reviewing on behalf of Conarnar June 25, 2026 23:46 View session

Conarnar force-pushed the export-D109590531 branch from 665a2f8 to ae9a55e Compare June 25, 2026 23:48

Copilot AI reviewed Jun 25, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings June 25, 2026 23:50

Copilot AI reviewed Jun 25, 2026

Conarnar force-pushed the export-D109590531 branch from ae9a55e to c657616 Compare June 26, 2026 18:25

Uh oh!

Conversation

Conarnar commented Jun 24, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

⏳ No Failures, 20 Pending

Uh oh!

meta-codesync Bot commented Jun 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jun 24, 2026

This PR needs a release notes: label

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Gasoonjia commented Jun 25, 2026

Uh oh!

claude Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Use caller CUDA stream for D2H and H2D copies

🔴 1. H2D async path never synchronizes → breaks the blocking contract (use-after-free risk)

🔴 2. Likely link/load break on non-MSVC shared builds

🟠 3. The "single-GPU" guard doesn't actually enforce single-GPU

🟡 4. Tests don't validate H2D actually copied data

🟡 5. D2H caller-stream test relies on allocator-internal sync

Minor / nits

Summary

Uh oh!

shoumikhin Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

shoumikhin Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

pytorch-bot Bot commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conarnar commented Jun 24, 2026 •

edited by meta-codesync Bot

Loading

pytorch-bot Bot commented Jun 24, 2026 •

edited

Loading

This PR needs a `release notes:` label

claude Bot commented Jun 25, 2026 •

edited

Loading

pytorch-bot Bot commented Jun 25, 2026 •

edited

Loading