Skip to content

Use caller CUDA stream for D2H and H2D copies (#20498)#20498

Open
Conarnar wants to merge 1 commit into
pytorch:mainfrom
Conarnar:export-D109590531
Open

Use caller CUDA stream for D2H and H2D copies (#20498)#20498
Conarnar wants to merge 1 commit into
pytorch:mainfrom
Conarnar:export-D109590531

Conversation

@Conarnar

@Conarnar Conarnar commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via getCallerStream()), copy_host_to_device and copy_device_to_host use cudaMemcpyAsync. When no caller stream is set, the synchronous cudaMemcpy path is used as before.

Additionally:

  • Added null pointer and zero-byte validation — null dst/src return Error::InvalidArgument instead of aborting in cudaMemcpy, and zero-byte copies return Error::Ok early.
  • Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
  • Wired //executorch/extension/cuda:caller_stream dependency in TARGETS.
  • Added extension_cuda dependencies to CMakeLists.txt.
  • Added test_cuda_allocator with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531

Copilot AI review requested due to automatic review settings June 24, 2026 22:51
@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026

Copy link
Copy Markdown

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20498

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 20 Pending

As of commit c657616 with merge base b919db7 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 24, 2026
@meta-codesync

meta-codesync Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@Conarnar has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109590531.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

Copy link
Copy Markdown

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-codesync meta-codesync Bot changed the title Use caller CUDA stream for D2H and H2D copies Use caller CUDA stream for D2H and H2D copies (#20498) Jun 24, 2026
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 24, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch 2 times, most recently from 3d8da75 to 07765c3 Compare June 25, 2026 17:10
Copilot AI review requested due to automatic review settings June 25, 2026 17:10
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync` and synchronize the stream before returning — preserving the blocking API contract while allowing work to be issued on the caller's stream. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +161 to +168
cudaError_t err = cudaSuccess;
const auto caller_stream = executorch::extension::cuda::getCallerStream();
if (caller_stream) {
err = cudaMemcpyAsync(
dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
} else {
err = cudaMemcpy(dst, src, nbytes, cudaMemcpyHostToDevice);
}
Comment on lines +144 to +150
// TODO: validate caller stream device matches index.
// For now assert single-GPU case.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_host_to_device only supports device 0, got %d",
static_cast<int>(index));
Comment on lines +202 to +208
// TODO: validate caller stream device matches index.
// For now assert single-GPU case.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_device_to_host only supports device 0, got %d",
static_cast<int>(index));
Comment on lines +78 to +90
cudaStream_t s;
ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
executorch::extension::cuda::CallerStreamGuard g(s);

CudaAllocator& a = CudaAllocator::instance();
auto res = a.allocate(256, 0);
ASSERT_TRUE(res.ok());
void* d = res.get();
std::vector<uint8_t> h(256, 7);
// should take async branch internally, still return Ok
EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
a.deallocate(d, 0);
cudaStreamDestroy(s);
Comment on lines +103 to +117
cudaStream_t s;
ASSERT_EQ(cudaStreamCreate(&s), cudaSuccess);
executorch::extension::cuda::CallerStreamGuard g(s);

CudaAllocator& a = CudaAllocator::instance();
auto res = a.allocate(256, 0);
ASSERT_TRUE(res.ok());
void* d = res.get();
std::vector<uint8_t> h_src(256, 5), h_dst(256, 0);
ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
EXPECT_EQ(h_src, h_dst);

a.deallocate(d, 0);
cudaStreamDestroy(s);
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from 07765c3 to b316b71 Compare June 25, 2026 17:59
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from b316b71 to 98081dc Compare June 25, 2026 18:57
Copilot AI review requested due to automatic review settings June 25, 2026 18:57

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Comment on lines +144 to +150
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
static_cast<int>(index));
Comment on lines +204 to +210
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
static_cast<int>(index));
Comment on lines +161 to +166
cudaError_t err = cudaSuccess;
const auto caller_stream = executorch::extension::cuda::getCallerStream();
if (caller_stream) {
err = cudaMemcpyAsync(
dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
// We don't synchronize the stream here because the caller is expected to
Comment on lines +223 to +228
if (caller_stream) {
err = cudaMemcpyAsync(
dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
if (err == cudaSuccess) {
err = cudaStreamSynchronize(*caller_stream);
}
Comment on lines +116 to +118
ASSERT_EQ(a.copy_host_to_device(d, h_src.data(), 256, 0), Error::Ok);
EXPECT_EQ(a.copy_device_to_host(h_dst.data(), d, 256, 0), Error::Ok);
EXPECT_EQ(h_src, h_dst);
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from 98081dc to 1e001a5 Compare June 25, 2026 20:50
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Differential Revision: D109590531
Copilot AI review requested due to automatic review settings June 25, 2026 20:52
@Conarnar Conarnar force-pushed the export-D109590531 branch from 1e001a5 to 4e57af7 Compare June 25, 2026 20:52

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

Comment on lines 11 to 14
#include <cuda_runtime.h>

#include <executorch/extension/cuda/caller_stream.h>
#include <executorch/runtime/platform/log.h>
Comment on lines +144 to +150
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_host_to_device only supports device 0 or -1 (current), got %d",
static_cast<int>(index));
Comment on lines +204 to +210
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"CudaAllocator::copy_device_to_host only supports device 0 or -1 (current), got %d",
static_cast<int>(index));
Comment on lines +223 to +229
if (caller_stream) {
err = cudaMemcpyAsync(
dst, src, nbytes, cudaMemcpyDeviceToHost, *caller_stream);
if (err == cudaSuccess) {
err = cudaStreamSynchronize(*caller_stream);
}
} else {
Comment on lines +80 to +84
std::vector<uint8_t> h(256, 7);
// should take async branch internally, still return Ok
EXPECT_EQ(a.copy_host_to_device(d, h.data(), 256, 0), Error::Ok);
ASSERT_EQ(cudaStreamSynchronize(s), cudaSuccess);
a.deallocate(d, 0);
Comment on lines +44 to +47
std::vector<uint8_t> h_src(N, 42);
EXPECT_EQ(a.copy_host_to_device(dptr, h_src.data(), N, 0), Error::Ok);

a.deallocate(dptr, 0);
@Gasoonjia

Copy link
Copy Markdown
Contributor

@claude review the PR plz

@claude

claude Bot commented Jun 25, 2026

Copy link
Copy Markdown

Claude finished @Gasoonjia's task in 2m 45s —— View job


PR Review: Use caller CUDA stream for D2H and H2D copies

Thanks @Conarnar — solid direction, and the null/zero-byte hardening + new tests are welcome. I read all four changed files plus the surrounding build files and the DeviceAllocator contract. Below are the issues I think matter, in priority order. Several echo Copilot's findings, which I've independently confirmed.


🔴 1. H2D async path never synchronizes → breaks the blocking contract (use-after-free risk)

backends/cuda/runtime/cuda_allocator.cpp:162-170

if (caller_stream) {
  err = cudaMemcpyAsync(dst, src, nbytes, cudaMemcpyHostToDevice, *caller_stream);
  // We don't synchronize the stream here because the caller is expected to
  // synchronize the stream.
} else {
  err = cudaMemcpy(...);          // blocking
}

The base API in runtime/core/device_allocator.h:73 is documented as a plain blocking copy, and the sole caller — kernels/portable/cpu/op__device_copy.cpp:79 — returns out immediately with no stream handle to synchronize on:

Error err = allocator->copy_host_to_device(out.mutable_data_ptr(), self.const_data_ptr(), nbytes, device_index);
// ... returns out; the kernel has no access to the caller stream

So "the caller is expected to synchronize" doesn't hold here — the op kernel can't. The concrete hazard: src is a host buffer (self). If it is pinned, cudaMemcpyAsync is truly asynchronous and the host buffer may be freed/overwritten before the copy lands. Even for pageable memory the device result isn't ordered against subsequent non-stream work. copy_device_to_host does synchronize (line 226-228), so the two halves of the same API now have opposite blocking semantics — that asymmetry is the core problem.

Pick one and make both consistent:

  • (a) synchronize H2D too (cudaStreamSynchronize on success), keeping the blocking contract — simplest and safest; or
  • (b) keep H2D async, drop the D2H sync, and update the base-class doc + every caller to synchronize explicitly — larger change.

Given the current callsite, (a) is the right scope for this PR. Fix this →


🔴 2. Likely link/load break on non-MSVC shared builds

backends/cuda/CMakeLists.txt:106, 124, 169-175

On non-MSVC, cuda_allocator.cpp is compiled only into the aoti_cuda_shims SHARED library (line 106 / 124); it is added to aoti_cuda_backend only under _cuda_is_msvc_toolchain (line 194). But aoti_cuda_shims's link list (line 169-175) is cuda_platform, aoti_common_shims_slim, CUDA::cudart, CUDA::curand, ${CMAKE_DL_LIBS}no extension_cuda, which is where getCallerStream() is defined (extension/cuda/CMakeLists.txt:25).

This PR wired the Buck dep (TARGETS:95) and the test, but not the CMake link for the shared-lib path. The symbol currently only resolves transitively/by accident through aoti_cuda_backend → extension_cuda in the global symbol scope, which is fragile (and would fail under -Wl,--no-undefined). Add extension_cuda to aoti_cuda_shims's target_link_libraries (non-MSVC branch) to match the TARGETS change. Fix this →


🟠 3. The "single-GPU" guard doesn't actually enforce single-GPU

cuda_allocator.cpp:146-150 and 206-210

ET_CHECK_OR_RETURN_ERROR(index == -1 || index == 0, InvalidArgument,
    "...only supports device 0 or -1 (current), got %d", ...);

index == -1 means "current device" (see the allocate() comment at line 25). On a multi-GPU host a caller can cudaSetDevice(1) then pass -1 and sail past this check — exactly the case the TODO says to defend against, and it's the dangerous one given the caller-stream/device mismatch risk you flag. If the intent is to restrict to device 0 for now, also assert the current device is 0 when index == -1 (and it makes the error message accurate). Fix this →


🟡 4. Tests don't validate H2D actually copied data

backends/cuda/runtime/test/test_cuda_allocator.cpp:37-48, 65-87

CopyHostToDevice and CopyHostToDeviceWithCallerStream only assert the return code is Error::Ok. A no-op/broken copy would still pass. Add a copy_device_to_host readback + EXPECT_EQ on the buffers (as CopyDeviceToHost at line 50 already does) so these catch functional regressions.

🟡 5. D2H caller-stream test relies on allocator-internal sync

test_cuda_allocator.cpp:106-107CopyDeviceToHostWithCallerStream compares h_dst right after the call, which only works because of the internal cudaStreamSynchronize (item 1). Add an explicit cudaStreamSynchronize(s) before the compare so the test is robust regardless of which way item 1 is resolved.


Minor / nits

  • Error code for copy failures. Both copies map a failed cudaMemcpy* to Error::Internal (line 183, 244). A bad dst/src/size from the caller is arguably InvalidArgument; Internal is fine as a catch-all but worth a thought.
  • Test stream scoping (Copilot): cudaStreamDestroy(s) correctly runs after the CallerStreamGuard scope closes (line 86/111), and return values are asserted — this already addresses the earlier Copilot note. No change needed.
  • device_type()/memcpy_async unchanged and look fine.

Summary

The async-stream plumbing is the right idea, but I'd block on #1 (semantics/safety) and #2 (build) before merge; #3 is a quick correctness tightening that matches the stated intent; #4/#5 harden the new tests. Nice work on the validation and test coverage otherwise.
• branch export-D109590531

}
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The index==-1 || index==0 guard is too broad. The reason to restrict the device is that a green-context caller stream is tied to its creation device. The plain cudaMemcpy path has no caller stream and used to handle any index via setDevice/restore, and tensor_ptr.cpp passes arbitrary indices and aborts on a non-Ok result, so this guard crashes multi-GPU callers that worked before. Suggest checking the device only in the caller-stream branch and letting the no-stream path copy on any index>=-1 like before. Same applies to copy_device_to_host (line 207).

void* dst,
const void* src,
size_t nbytes,
DeviceIndex index) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

copy_host_to_device and copy_device_to_host are almost identical now (null checks, nbytes==0, the index guard, device save/restore, the stream branch), differing only in the copy kind and the sync. Could fold them into one private helper that takes the cudaMemcpyKind to keep the shared logic in one place.

Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from 4e57af7 to d19c2d3 Compare June 25, 2026 23:34
@pytorch-bot

pytorch-bot Bot commented Jun 25, 2026

Copy link
Copy Markdown

Workflows were awaiting approval. CI has now been triggered for the ciflow labels on this PR.

Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531
Copilot AI review requested due to automatic review settings June 25, 2026 23:46
@Conarnar Conarnar force-pushed the export-D109590531 branch from d19c2d3 to 665a2f8 Compare June 25, 2026 23:46
Conarnar added a commit to Conarnar/executorch that referenced this pull request Jun 25, 2026
Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from 665a2f8 to ae9a55e Compare June 25, 2026 23:48

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment on lines +74 to +81
if (caller_stream) {
err = cudaMemcpyAsync(dst, src, nbytes, kind, *caller_stream);
if (err == cudaSuccess && kind == cudaMemcpyDeviceToHost) {
err = cudaStreamSynchronize(*caller_stream);
}
} else {
err = cudaMemcpy(dst, src, nbytes, kind);
}
Comment on lines +49 to +59
const auto caller_stream = executorch::extension::cuda::getCallerStream();
if (caller_stream) {
// TODO: validate caller stream device matches index.
// For now assert index is -1 or 0.
ET_CHECK_OR_RETURN_ERROR(
index == -1 || index == 0,
InvalidArgument,
"%s: with caller stream, only supports device 0 or -1 (current), got %d",
method,
static_cast<int>(index));
}
Copilot AI review requested due to automatic review settings June 25, 2026 23:50

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

Summary:

CudaAllocator memory copies now support async copy on a caller-provided CUDA stream. When a caller stream is available (via `getCallerStream()`), `copy_host_to_device` and `copy_device_to_host` use `cudaMemcpyAsync`. When no caller stream is set, the synchronous `cudaMemcpy` path is used as before.

Additionally:
- Added null pointer and zero-byte validation — null `dst`/`src` return `Error::InvalidArgument` instead of aborting in `cudaMemcpy`, and zero-byte copies return `Error::Ok` early.
- Assert single-GPU case (index 0 or -1) until multi-GPU stream validation is added.
- Wired `//executorch/extension/cuda:caller_stream` dependency in TARGETS.
- Added `extension_cuda` dependencies to CMakeLists.txt.
- Added `test_cuda_allocator` with coverage for sync/async paths and error handling.

Reviewed By: Gasoonjia

Differential Revision: D109590531
@Conarnar Conarnar force-pushed the export-D109590531 branch from ae9a55e to c657616 Compare June 26, 2026 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ciflow/cuda CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants