
Set mempool hw_decompress flag if driver supports it #1854

Merged
merged 18 commits on Mar 18, 2025

Conversation

wence-
Contributor

@wence- wence- commented Mar 7, 2025

Description

If the driver supports the flag, unconditionally set the async memory pool usage property to include a request to support HW decompression.
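The change described above amounts to something like the following sketch (not compiled here; the pool-properties `usage` field and the HW-decompress flag are CUDA 12.8 additions, the constant name in the comment is an assumption, and the real rmm internals differ):

```cpp
#include <cuda_runtime_api.h>

// Sketch only: request HW-decompress-capable pool memory when the
// installed driver is new enough (CUDA 12.8 introduced the flag).
cudaMemPoolProps make_pool_props(int device_id)
{
  cudaMemPoolProps props{};
  props.allocType     = cudaMemAllocationTypePinned;
  props.location.type = cudaMemLocationTypeDevice;
  props.location.id   = device_id;

  int driver_version = 0;
  cudaDriverGetVersion(&driver_version);
  if (driver_version >= 12080) {
    // 0x2 matches rmm's mempool_usage::hw_decompress below; the
    // corresponding CUDA constant name is assumed, not quoted.
    props.usage |= 0x2;
  }
  return props;
}
```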

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@wence- wence- requested a review from a team as a code owner March 7, 2025 17:49
@wence- wence- requested review from rongou and bdice March 7, 2025 17:49
@github-actions github-actions bot added the cpp Pertains to C++ code label Mar 7, 2025
@wence- wence- added non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed cpp Pertains to C++ code labels Mar 7, 2025
@@ -64,6 +64,13 @@ class cuda_async_memory_resource final : public device_memory_resource {
fabric = 0x8 ///< Allows a fabric handle to be used for exporting. (cudaMemFabricHandle_t)
};

enum class mempool_usage {
none = 0x0, ///< No specific usage.
hw_decompress = 0x2, ///< If set indicates that the memory will be used for hardware-accelerated decompression.


Would recommend "If set indicates that the memory can be"

Contributor

@bdice bdice left a comment


Looks right to me. Should we add a test that, if on driver >=12.8, the async MR’s pool has the right flags set?

@eschmidt-nvidia

Looks right to me. Should we add a test that, if on driver >=12.8, the async MR’s pool has the right flags set?

Even better, you could check that an allocation from the pool is valid for decomp. I can provide an example for how to do this

@wence-
Contributor Author

wence- commented Mar 7, 2025

Even better, you could check that an allocation from the pool is valid for decomp. I can provide an example for how to do this

This sounds like a good idea.

@github-actions github-actions bot added the cpp Pertains to C++ code label Mar 7, 2025
@eschmidt-nvidia

bool ptrIsHwDecompressCapable;
cuPointerGetAttribute(&ptrIsHwDecompressCapable,
                      CU_POINTER_ATTRIBUTE_IS_HW_DECOMPRESS_CAPABLE,
                      ptr);

@wence- wence- requested a review from a team as a code owner March 10, 2025 12:24
@wence- wence- requested a review from vyasr March 10, 2025 12:24
@github-actions github-actions bot added the CMake label Mar 10, 2025
@wence-
Contributor Author

wence- commented Mar 10, 2025

Did some cargo-cult cmake to link the driver in the async mr test.

ConfigureTest(CUDA_ASYNC_MR_STATIC_CUDART_TEST mr/device/cuda_async_mr_tests.cpp LINK_DRIVER GPUS 1
              CUDART STATIC)
ConfigureTest(CUDA_ASYNC_MR_SHARED_CUDART_TEST mr/device/cuda_async_mr_tests.cpp GPUS 1 PERCENT 60
              CUDART SHARED)
Contributor


To fix the overlinking check, you'll need to update conda/recipes/librmm/recipe.yaml and add cuda-driver-dev to the dependencies in cache.requirements.build:

Contributor Author


I'm going to need some help from @rapidsai/packaging-codeowners here. AIUI, since one of the test files links directly against libcuda.so.1, I need to advertise a host dependency on cuda-driver-dev for the librmm-tests package.

I also should advertise a run dependency for cuda-driver-dev for the same package.

I believe I have done that correctly, but conda build still complains about overlinking. So I am at the end of my ability to cargo cult.

Contributor


Hey @wence- -- it looks like cuda-driver-dev ships libcuda.so but not the libcuda.so.1 symlink to the actual driver version it would link against.

It looks like cuda-compat-impl DOES have the files this might need? I'm unclear on the relationship with the cuda-compat package.

Contributor


Ahh, ok, it looks like cuda-compat will pull in the appropriate cuda-compat-impl subpackage, so maybe add that to host

Contributor Author


Do you mean run? I need -lcuda (cuda-driver-dev) at compile-time (host?). But libcuda.so.1 (cuda-compat?) at runtime, I think.
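In recipe.yaml terms, that split would look roughly like the following (a sketch of what is being proposed in this exchange, not the final recipe; the package names are the ones discussed above):

```yaml
# librmm-tests output (sketch): link against the driver stub at build
# time, and expect the real libcuda.so.1 at run time.
requirements:
  host:
    - cuda-driver-dev   # provides the libcuda.so stub needed for -lcuda
  run:
    - cuda-compat       # expected to (transitively) provide libcuda.so.1
```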

Contributor


Yes, I did mean run

Contributor

@gforsyth gforsyth Mar 10, 2025


hm, the finalized run dependencies don't show cuda-compat-impl despite it being a dependency of cuda-compat. That actually is probably sensible because otherwise the run dependencies would be a transitive explosion of stuff. However, given that libcuda.so.1 is actually in cuda-compat-impl, I wonder if that's why the overlinking check is still failing.

We could try adding cuda-compat-impl as the runtime dependency, and then either leave it that way (assuming the checks pass) or revert that change and add an explicit allowlist entry for libcuda.so.1 since we know that cuda-compat will (transitively) provide it.

I'm generally more for explicit dependencies than transitive ones, but meta- / helper-packages are a weird corner case

Contributor Author


Had one more go with that setup.

Contributor


I was trying this out locally and also not finding libcuda.so.1 -- it looks like it gets installed not into lib/ within the conda environment but into a separate cuda-compat/ directory:

$ ls lib
libatomic.so        libgomp.so.1      libquadmath.so.0
libatomic.so.1      libgomp.so.1.0.0  libquadmath.so.0.0.0
libatomic.so.1.2.0  libitm.so         libstdc++.so
libgcc_s.so         libitm.so.1       libstdc++.so.6
libgcc_s.so.1       libitm.so.1.0.0   libstdc++.so.6.0.33
libgomp.so          libquadmath.so    

$ ls cuda-compat/
libcuda.so                     libnvidia-nvvm.so
libcuda.so.1                   libnvidia-nvvm.so.4
libcuda.so.570.124.06          libnvidia-nvvm.so.570.124.06
libcudadebugger.so.1           libnvidia-ptxjitcompiler.so.1
libcudadebugger.so.570.124.06  libnvidia-ptxjitcompiler.so.570.124.06

Contributor Author


Thanks. I'm out of ideas at this point, so feel free to poke at this

@wence- wence- requested a review from a team as a code owner March 10, 2025 13:22
@github-actions github-actions bot added the conda label Mar 10, 2025
Contributor

@bdice bdice left a comment


Hope these comments help work out the conda build issues.

@@ -114,6 +115,7 @@ outputs:
- ${{ pin_compatible("cuda-version", upper_bound="x", lower_bound="x") }}
- ${{ pin_subpackage("librmm", exact=True) }}
- rapids-logger =0.1
- cuda-compat-impl
Contributor


We should never need compat packages as a RAPIDS dependency. Why was this added?

Suggested change
- cuda-compat-impl

Contributor Author


Because I have no clue why conda is complaining about overlinking of the driver in the tests package, and @gforsyth suggested it. It doesn't work, but nothing I tried works, so 🤷

Contributor

@vyasr vyasr left a comment


We cannot have load-time linkage to the CUDA driver. We need to replace this with a dlopen so that our packages remain importable on nodes without GPUs (with suitably delayed initialization). If we need driver APIs, we'll have to dlopen libcuda.so.

@wence-
Contributor Author

wence- commented Mar 12, 2025

We cannot have load-time linkage to the CUDA driver. We need to replace this with a dlopen so that our packages remain importable on nodes without GPUs (with suitably delayed initialization). If we need driver APIs, we'll have to dlopen libcuda.so.

I only need the driver in the tests.

@wence-
Contributor Author

wence- commented Mar 12, 2025

Hope these comments help work out the conda build issues.

Sadly not :( https://github.com/rapidsai/rmm/actions/runs/13811046966/job/38632706568?pr=1854#step:9:5571

@wence-
Contributor Author

wence- commented Mar 13, 2025

My inclination at this point is to just rip the test out and remove all the conda changes. If anyone has an idea how to actually solve this overlinking error, I am all ears.

If we do want the tests and can solve the linking problem, I hope it is acceptable to link the tests to libcuda.so.1 (rather than dlopening), but if absolutely necessary I suppose I can do that too.

@vyasr
Contributor

vyasr commented Mar 15, 2025

I think we simply have to allowlist the driver in this case. The linkage is intentional, but we also can't get the driver from conda. We have to rely on the test being installed into environments where the driver is already installed on the host (or in the container where the test is being run).
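With rattler-build, the allowlisting suggested here would look something like this (hedged: the `dynamic_linking`/`missing_dso_allowlist` keys follow rattler-build's recipe schema, but the exact placement in rmm's recipe.yaml may differ):

```yaml
build:
  dynamic_linking:
    missing_dso_allowlist:
      - "libcuda.so.1"  # provided by the host driver, not by any conda package
```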

@bdice
Contributor

bdice commented Mar 18, 2025

@wence- @vyasr Do we want to land this in 25.04? I saw this failure in the logs. I updated the branch in case there's something unexpected. I can help dig into this tomorrow if nobody else has time (but would happily take a quick review on #1864 in exchange).

>       with pytest.raises(err_catch, match="My alloc error"):
E       AssertionError: Regex pattern did not match.
E        Regex: 'My alloc error'
E        Input: 'std::bad_alloc'

rmm/tests/test_rmm.py:938: AssertionError

Contributor

@bdice bdice left a comment


This now looks good to me. CI passed on this round. @vyasr Can you approve, since you previously requested changes?
Please merge once you're happy with it.

@bdice bdice requested a review from vyasr March 18, 2025 22:39
Contributor

@vyasr vyasr left a comment


LGTM now, thanks!

@bdice
Contributor

bdice commented Mar 18, 2025

/merge

Labels
CMake · conda · cpp (Pertains to C++ code) · improvement (Improvement / enhancement to an existing function) · non-breaking (Non-breaking change)
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[FEA] Support Blackwell decompression engine with async memory resource
6 participants