
Commit

Merge branch 'branch-24.06' into jawe/ops-2621
jameslamb authored Apr 19, 2024
2 parents ba03497 + 9e6db74 commit 9dfd907
Showing 87 changed files with 260 additions and 226 deletions.
4 changes: 2 additions & 2 deletions .pre-commit-config.yaml
@@ -14,14 +14,14 @@ repos:
rev: 5.12.0
hooks:
- id: isort
args: ["--settings-path=python/pyproject.toml"]
args: ["--settings-path=python/rmm/pyproject.toml"]
files: python/.*
types_or: [python, cython, pyi]
- repo: https://github.com/ambv/black
rev: 22.3.0
hooks:
- id: black
args: ["--config=python/pyproject.toml"]
args: ["--config=python/rmm/pyproject.toml"]
- repo: https://github.com/MarcoGorelli/cython-lint
rev: v0.15.0
hooks:
54 changes: 54 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,57 @@
# RMM 24.04.00 (10 Apr 2024)

## 🚨 Breaking Changes

- Accept stream argument in DeviceMemoryResource allocate/deallocate ([#1494](https://github.com/rapidsai/rmm/pull/1494)) [@wence-](https://github.com/wence-)
- Replace all internal usage of `get_upstream` with `get_upstream_resource` ([#1491](https://github.com/rapidsai/rmm/pull/1491)) [@miscco](https://github.com/miscco)
- Deprecate rmm::mr::device_memory_resource::supports_streams() ([#1452](https://github.com/rapidsai/rmm/pull/1452)) [@harrism](https://github.com/harrism)
- Remove deprecated rmm::detail::available_device_memory ([#1438](https://github.com/rapidsai/rmm/pull/1438)) [@harrism](https://github.com/harrism)
- Make device_memory_resource::supports_streams() not pure virtual. Remove derived implementations and calls in RMM ([#1437](https://github.com/rapidsai/rmm/pull/1437)) [@harrism](https://github.com/harrism)
- Deprecate rmm::mr::device_memory_resource::get_mem_info() and supports_get_mem_info(). ([#1436](https://github.com/rapidsai/rmm/pull/1436)) [@harrism](https://github.com/harrism)

## 🐛 Bug Fixes

- Fix search path for torch allocator in editable installs and ensure CUDA support is available ([#1498](https://github.com/rapidsai/rmm/pull/1498)) [@vyasr](https://github.com/vyasr)
- Accept stream argument in DeviceMemoryResource allocate/deallocate ([#1494](https://github.com/rapidsai/rmm/pull/1494)) [@wence-](https://github.com/wence-)
- Run STATISTICS_TEST and TRACKING_TEST in serial to avoid OOM errors. ([#1487](https://github.com/rapidsai/rmm/pull/1487)) [@bdice](https://github.com/bdice)

## 📖 Documentation

- Pin to recent breathe, to prevent getting an unsupported sphinx version. ([#1495](https://github.com/rapidsai/rmm/pull/1495)) [@bdice](https://github.com/bdice)

## 🚀 New Features

- Replace all internal usage of `get_upstream` with `get_upstream_resource` ([#1491](https://github.com/rapidsai/rmm/pull/1491)) [@miscco](https://github.com/miscco)
- Add complete set of resource ref aliases ([#1479](https://github.com/rapidsai/rmm/pull/1479)) [@nvdbaranec](https://github.com/nvdbaranec)
- Automate include grouping using clang-format ([#1463](https://github.com/rapidsai/rmm/pull/1463)) [@harrism](https://github.com/harrism)
- Add `get_upstream_resource` to resource adaptors ([#1456](https://github.com/rapidsai/rmm/pull/1456)) [@miscco](https://github.com/miscco)
- Deprecate rmm::mr::device_memory_resource::supports_streams() ([#1452](https://github.com/rapidsai/rmm/pull/1452)) [@harrism](https://github.com/harrism)
- Remove duplicated memory_resource_tests ([#1451](https://github.com/rapidsai/rmm/pull/1451)) [@miscco](https://github.com/miscco)
- Change `rmm::exec_policy` to take `async_resource_ref` ([#1449](https://github.com/rapidsai/rmm/pull/1449)) [@miscco](https://github.com/miscco)
- Change `device_scalar` to take `async_resource_ref` ([#1447](https://github.com/rapidsai/rmm/pull/1447)) [@miscco](https://github.com/miscco)
- Add device_async_resource_ref convenience alias ([#1441](https://github.com/rapidsai/rmm/pull/1441)) [@harrism](https://github.com/harrism)
- Remove deprecated rmm::detail::available_device_memory ([#1438](https://github.com/rapidsai/rmm/pull/1438)) [@harrism](https://github.com/harrism)
- Make device_memory_resource::supports_streams() not pure virtual. Remove derived implementations and calls in RMM ([#1437](https://github.com/rapidsai/rmm/pull/1437)) [@harrism](https://github.com/harrism)
- Deprecate rmm::mr::device_memory_resource::get_mem_info() and supports_get_mem_info(). ([#1436](https://github.com/rapidsai/rmm/pull/1436)) [@harrism](https://github.com/harrism)
- Support CUDA 12.2 ([#1419](https://github.com/rapidsai/rmm/pull/1419)) [@jameslamb](https://github.com/jameslamb)

## 🛠️ Improvements

- Use `conda env create --yes` instead of `--force` ([#1509](https://github.com/rapidsai/rmm/pull/1509)) [@bdice](https://github.com/bdice)
- Add upper bound to prevent usage of NumPy 2 ([#1501](https://github.com/rapidsai/rmm/pull/1501)) [@bdice](https://github.com/bdice)
- Remove hard-coding of RAPIDS version where possible ([#1496](https://github.com/rapidsai/rmm/pull/1496)) [@KyleFromNVIDIA](https://github.com/KyleFromNVIDIA)
- Require NumPy 1.23+ ([#1488](https://github.com/rapidsai/rmm/pull/1488)) [@jakirkham](https://github.com/jakirkham)
- Use `rmm::device_async_resource_ref` in multi_stream_allocation benchmark ([#1482](https://github.com/rapidsai/rmm/pull/1482)) [@miscco](https://github.com/miscco)
- Update devcontainers to CUDA Toolkit 12.2 ([#1470](https://github.com/rapidsai/rmm/pull/1470)) [@trxcllnt](https://github.com/trxcllnt)
- Add support for Python 3.11 ([#1469](https://github.com/rapidsai/rmm/pull/1469)) [@jameslamb](https://github.com/jameslamb)
- target branch-24.04 for GitHub Actions workflows ([#1468](https://github.com/rapidsai/rmm/pull/1468)) [@jameslamb](https://github.com/jameslamb)
- [FEA]: Use `std::optional` instead of `thrust::optional` ([#1464](https://github.com/rapidsai/rmm/pull/1464)) [@miscco](https://github.com/miscco)
- Add environment-agnostic scripts for running ctests and pytests ([#1462](https://github.com/rapidsai/rmm/pull/1462)) [@trxcllnt](https://github.com/trxcllnt)
- Ensure that `ctest` is called with `--no-tests=error`. ([#1460](https://github.com/rapidsai/rmm/pull/1460)) [@bdice](https://github.com/bdice)
- Update ops-bot.yaml ([#1458](https://github.com/rapidsai/rmm/pull/1458)) [@AyodeAwe](https://github.com/AyodeAwe)
- Adopt the `rmm::device_async_resource_ref` alias ([#1454](https://github.com/rapidsai/rmm/pull/1454)) [@miscco](https://github.com/miscco)
- Refactor error.hpp out of detail ([#1439](https://github.com/rapidsai/rmm/pull/1439)) [@lamarrr](https://github.com/lamarrr)

# RMM 24.02.00 (12 Feb 2024)

## 🚨 Breaking Changes
3 changes: 2 additions & 1 deletion CMakeLists.txt
@@ -114,7 +114,8 @@ endif()
# * add tests --------------------------------------------------------------------------------------

if(BUILD_TESTS AND CMAKE_PROJECT_NAME STREQUAL PROJECT_NAME)
include(cmake/thirdparty/get_gtest.cmake)
include(${rapids-cmake-dir}/cpm/gtest.cmake)
rapids_cpm_gtest(BUILD_STATIC)
include(CTest) # calls enable_testing()

add_subdirectory(tests)
167 changes: 108 additions & 59 deletions README.md
@@ -207,38 +207,7 @@ alignment argument. All allocations are required to be aligned to at least 256B.
`device_memory_resource` adds an additional `cuda_stream_view` argument to allow specifying the stream
on which to perform the (de)allocation.

## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream`
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.

## `cuda_stream_pool`

`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
`rmm::cuda_stream_view` that represent identical CUDA streams.

### Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
either of these functions from other threads. They are _not_ thread safe with respect to
construction and destruction of the memory resource object.

Note that a class `thread_safe_resource_adapter` is provided which can be used to adapt a memory
resource that is not thread safe to be thread safe (as described above). This adapter is not needed
with any current RMM device memory resources.

### Stream-ordered Memory Allocation
## Stream-ordered Memory Allocation

`rmm::mr::device_memory_resource` is a base class that provides stream-ordered memory allocation.
This allows optimizations such as re-using memory deallocated on the same stream without the
@@ -270,39 +239,39 @@ For further information about stream-ordered memory allocation semantics, read
Allocator](https://developer.nvidia.com/blog/using-cuda-stream-ordered-memory-allocator-part-1/)
on the NVIDIA Developer Blog.
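
As a rough sketch of what this interface looks like in user code (the specific resource and sizes here are illustrative, not taken from the text above):

```c++
rmm::mr::cuda_memory_resource mr;
rmm::cuda_stream stream;

// Memory returned here may be used on `stream` without further synchronization
void* p = mr.allocate(4096, stream.view());
// ... launch kernels that use `p` on `stream` ...
mr.deallocate(p, 4096, stream.view());  // the deallocation is also ordered on `stream`
```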

### Available Resources
## Available Device Resources

RMM provides several `device_memory_resource` derived classes to satisfy various user requirements.
For more detailed information about these resources, see their respective documentation.

#### `cuda_memory_resource`
### `cuda_memory_resource`

Allocates and frees device memory using `cudaMalloc` and `cudaFree`.

#### `managed_memory_resource`
### `managed_memory_resource`

Allocates and frees device memory using `cudaMallocManaged` and `cudaFree`.

Note that `managed_memory_resource` cannot be used with NVIDIA Virtual GPU Software (vGPU, for use
with virtual machines or hypervisors) because [NVIDIA CUDA Unified Memory is not supported by
NVIDIA vGPU](https://docs.nvidia.com/grid/latest/grid-vgpu-user-guide/index.html#cuda-open-cl-support-vgpu).

#### `pool_memory_resource`
### `pool_memory_resource`

A coalescing, best-fit pool sub-allocator.

#### `fixed_size_memory_resource`
### `fixed_size_memory_resource`

A memory resource that can only allocate a single fixed size. Average allocation and deallocation
cost is constant.

#### `binning_memory_resource`
### `binning_memory_resource`

Configurable to use multiple upstream memory resources for allocations that fall within different
bin sizes. Often configured with multiple bins backed by `fixed_size_memory_resource`s and a single
`pool_memory_resource` for allocations larger than the largest bin size.
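
For orientation only, here is a hedged sketch of how a few of these resources might be composed; the pool size and bin exponents are arbitrary placeholders, not recommendations:

```c++
rmm::mr::cuda_memory_resource cuda_mr;

// Pool suballocator drawing its memory from `cuda_mr`; 1 GiB initial pool size (placeholder value)
rmm::mr::pool_memory_resource<rmm::mr::cuda_memory_resource> pool_mr{&cuda_mr, 1u << 30};

// Fixed-size bins for allocations of 2^18..2^22 bytes; larger requests fall through to `pool_mr`
rmm::mr::binning_memory_resource<decltype(pool_mr)> binning_mr{&pool_mr, 18, 22};
```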

### Default Resources and Per-device Resources
## Default Resources and Per-device Resources

RMM users commonly need to configure a `device_memory_resource` object to use for all allocations
where another resource has not explicitly been provided. A common example is configuring a
@@ -327,7 +296,7 @@ Accessing and modifying the default resource is done through two functions:
`get_current_device_resource()`
- For more explicit control, you can use `set_per_device_resource()`, which takes a device ID.

#### Example
### Example

```c++
rmm::mr::cuda_memory_resource cuda_mr;
@@ -339,7 +308,7 @@ rmm::mr::set_current_device_resource(&pool_mr); // Updates the current device re
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource(); // Points to `pool_mr`
```
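
The example above changes the resource for the currently active device. As a small additional sketch (assuming a second GPU with device ID 1 exists), `set_per_device_resource()` targets an explicit device instead:

```c++
RMM_CUDA_TRY(cudaSetDevice(1));  // create the resource with device 1 active
rmm::mr::cuda_memory_resource mr_dev1;
rmm::mr::set_per_device_resource(rmm::cuda_device_id{1}, &mr_dev1);
```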
#### Multiple Devices
### Multiple Devices
A `device_memory_resource` should only be used when the active CUDA device is the same device
that was active when the `device_memory_resource` was created. Otherwise behavior is undefined.
@@ -367,36 +336,116 @@ for(int i = 0; i < N; ++i) {

Note that the CUDA device that is current when creating a `device_memory_resource` must also be
current any time that `device_memory_resource` is used to deallocate memory, including in a
destructor. This affects RAII classes like `rmm::device_buffer` and `rmm::device_uvector`. Here's an
(incorrect) example that assumes the above example loop has been run to create a
`pool_memory_resource` for each device. A correct example adds a call to `cudaSetDevice(0)` on the
line of the error comment.
destructor. The RAII class `rmm::device_buffer` and classes that use it as a backing store
(`rmm::device_scalar` and `rmm::device_uvector`) handle this by storing the active device when the
constructor is called, and then ensuring that the stored device is active whenever an allocation or
deallocation is performed (including in the destructor). The user must therefore only ensure that
the device active during _creation_ of an `rmm::device_buffer` matches the active device of the
memory resource being used.

Here is an incorrect example that creates a memory resource on device zero and then uses it to
allocate a `device_buffer` on device one:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  {
    RMM_CUDA_TRY(cudaSetDevice(1));
    // Invalid, current device is 1, but MR is only valid for device 0
    rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  }
}
```

A correct example creates the device buffer with device zero active. After that it is safe to switch
devices and let the buffer go out of scope and destruct with a different device active. For example,
this code is correct:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_buffer buf(16, rmm::cuda_stream_default, &mr);
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~buf runs
}
```

### Allocators
#### Use of `rmm::device_vector` with multiple devices

`rmm::device_vector` uses an `rmm::mr::thrust_allocator` to enable `thrust::device_vector` to
allocate and deallocate memory using RMM. As such, the usual rules for usage of the backing memory
resource apply: the active device must match the active device at resource construction time. To
facilitate use in an RAII setting, `rmm::mr::thrust_allocator` records the active device at
construction time and ensures that device is active whenever it allocates or deallocates memory.
Usage of `rmm::device_vector` with multiple devices is therefore the same as `rmm::device_buffer`.
One must _create_ `device_vector`s with the correct device active, but it is safe to destroy them
with a different active device.

For example, recapitulating the previous example using `rmm::device_vector`:

```c++
{
  RMM_CUDA_TRY(cudaSetDevice(0));
  auto mr = rmm::mr::cuda_memory_resource{};
  rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));
  RMM_CUDA_TRY(cudaSetDevice(1));
  ...
  // No need to switch back to device 0 before ~vec runs
}
```

> [!NOTE]
> Although allocation and deallocation in the `thrust_allocator` run with the correct active device,
> modification of `rmm::device_vector` might necessitate a kernel launch, and this must run with the
> correct device active. For example, `.resize()` might both allocate _and_ launch a kernel to
> initialize new elements: the user must arrange for this kernel launch to occur with the correct
> device for the memory resource active.
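
For instance, continuing the sketch above (the sizes and the placement of the `resize()` call are illustrative), one might guard a potentially reallocating call like this:

```c++
RMM_CUDA_TRY(cudaSetDevice(0));
auto mr = rmm::mr::cuda_memory_resource{};
rmm::device_vector<int> vec(16, rmm::mr::thrust_allocator<int>(rmm::cuda_stream_default, &mr));

RMM_CUDA_TRY(cudaSetDevice(1));
// ... work on device 1 ...

RMM_CUDA_TRY(cudaSetDevice(0));  // device 0 must be active: resize() may launch a kernel to initialize new elements
vec.resize(32);
```
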
## `cuda_stream_view` and `cuda_stream`

`rmm::cuda_stream_view` is a simple non-owning wrapper around a CUDA `cudaStream_t`. This wrapper's
purpose is to provide strong type safety for stream types. (`cudaStream_t` is an alias for a pointer,
which can lead to ambiguity in APIs when it is assigned `0`.) All RMM stream-ordered APIs take a
`rmm::cuda_stream_view` argument.

`rmm::cuda_stream` is a simple owning wrapper around a CUDA `cudaStream_t`. This class provides
RAII semantics (constructor creates the CUDA stream, destructor destroys it). An `rmm::cuda_stream`
can never represent the CUDA default stream or per-thread default stream; it only ever represents
a single non-default stream. `rmm::cuda_stream` cannot be copied, but can be moved.
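
A brief sketch of how the two types are typically used together (the buffer size is arbitrary):

```c++
rmm::cuda_stream stream;                     // creates and owns a new non-default CUDA stream
rmm::cuda_stream_view view = stream.view();  // non-owning, cheaply copyable view of the same stream
rmm::device_buffer buf(1024, view);          // stream-ordered allocation on that stream
stream.synchronize();                        // wait for work submitted on the stream to finish
```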

## `cuda_stream_pool`

`rmm::cuda_stream_pool` provides fast access to a pool of CUDA streams. This class can be used to
create a set of `cuda_stream` objects whose lifetime is equal to the `cuda_stream_pool`. Using the
stream pool can be faster than creating the streams on the fly. The size of the pool is configurable.
Depending on this size, multiple calls to `cuda_stream_pool::get_stream()` may return instances of
`rmm::cuda_stream_view` that represent identical CUDA streams.
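
A minimal sketch (the pool size here is arbitrary):

```c++
rmm::cuda_stream_pool pool{8};                // construct 8 streams up front
rmm::cuda_stream_view s = pool.get_stream();  // cheap; may return the same stream on later calls
rmm::device_buffer buf(256, s);               // use the pooled stream for stream-ordered work
```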

## Thread Safety

All current device memory resources are thread safe unless documented otherwise. More specifically,
calls to memory resource `allocate()` and `deallocate()` methods are safe with respect to calls to
either of these functions from other threads. They are _not_ thread safe with respect to
construction and destruction of the memory resource object.

Note that a `thread_safe_resource_adaptor` class is provided, which can be used to adapt a memory
resource that is not thread safe so that it becomes thread safe (as described above). This adaptor
is not needed with any current RMM device memory resources.
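
A sketch of wrapping a resource with the adaptor (here wrapping `cuda_memory_resource`, which is already thread safe, purely for illustration):

```c++
rmm::mr::cuda_memory_resource cuda_mr;
// All allocate()/deallocate() calls made through `safe_mr` are serialized with a mutex
rmm::mr::thread_safe_resource_adaptor<rmm::mr::cuda_memory_resource> safe_mr{&cuda_mr};
```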

## Allocators

C++ interfaces commonly allow customizable memory allocation through an [`Allocator`](https://en.cppreference.com/w/cpp/named_req/Allocator) object.
RMM provides several `Allocator` and `Allocator`-like classes.

#### `polymorphic_allocator`
### `polymorphic_allocator`

A [stream-ordered](#stream-ordered-memory-allocation) allocator similar to [`std::pmr::polymorphic_allocator`](https://en.cppreference.com/w/cpp/memory/polymorphic_allocator).
Unlike the standard C++ `Allocator` interface, the `allocate` and `deallocate` functions take a `cuda_stream_view` indicating the stream on which the (de)allocation occurs.

#### `stream_allocator_adaptor`
### `stream_allocator_adaptor`

`stream_allocator_adaptor` can be used to adapt a stream-ordered allocator to present a standard `Allocator` interface to consumers that may not be designed to work with a stream-ordered interface.

@@ -415,7 +464,7 @@

```c++
auto p = adapted.allocate(100);
adapted.deallocate(p,100);
```

#### `thrust_allocator`
### `thrust_allocator`

`thrust_allocator` is a device memory allocator that uses the strongly typed `thrust::device_ptr`, making it usable with containers like `thrust::device_vector`.

@@ -497,13 +546,13 @@ Similar to `device_memory_resource`, it has two key functions for (de)allocation
Unlike `device_memory_resource`, the `host_memory_resource` interface and behavior is identical to
`std::pmr::memory_resource`.
### Available Resources
## Available Host Resources
#### `new_delete_resource`
### `new_delete_resource`
Uses the global `operator new` and `operator delete` to allocate host memory.
#### `pinned_memory_resource`
### `pinned_memory_resource`
Allocates "pinned" host memory using `cuda(Malloc/Free)Host`.
@@ -611,7 +660,7 @@ resources are detectable with Compute Sanitizer Memcheck.
It may be possible in the future to add support for memory bounds checking with other memory
resources using NVTX APIs.
## Using RMM in Python Code
# Using RMM in Python
There are two ways to use RMM in Python code:
@@ -622,7 +671,7 @@ There are two ways to use RMM in Python code:
RMM provides a `MemoryResource` abstraction to control _how_ device
memory is allocated in both the above uses.
### DeviceBuffers
## DeviceBuffer
A DeviceBuffer represents an **untyped, uninitialized device memory
allocation**. DeviceBuffers can be created by providing the
@@ -662,7 +711,7 @@ host:

```python
array([1., 2., 3.])
```

### MemoryResource objects
## MemoryResource objects

`MemoryResource` objects are used to configure how device memory allocations are made by
RMM.