Implement the new tuning API for DeviceTransform #6914

bernhardmgruber · 2025-12-08T17:57:22Z

Fixes: #6919

CUB transform tests pass
CCCL.C transform tests pass
Rebase on top of Implement cuda::__all_arch_ids and cuda::__is_specific_arch #6916
Update benchmarks
No SASS difference for cub.bench.transform.babelstream.base
No SASS difference for cub.test.device.transform.lid_0
Compile-time comparison between before and after this PR (including clang in CUDA mode, because we don't have __CUDA_ARCH_LIST__ there.)

Compile time of cub.test.device.transform.lid_0 using nvcc 13.1 and clang 20 for sm86, sm120

branch:
2m8.741s
2m7.726s
2m7.949s

main:
2m7.661s
2m6.072s
2m9.804s

Using clang 20 in CUDA mode:

branch:
real 2m33.447s
real 2m35.653s
real 2m34.587s
(with further tricks down to 1m50)

main:
real 1m39.273s
real 1m39.669s
real 1m39.835s

copy-pr-bot · 2025-12-08T17:57:26Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

bernhardmgruber · 2025-12-09T07:15:55Z

c/parallel/src/transform.cu

+auto make_iterator_info(cccl_iterator_t input_it) -> cdt::iterator_info
+{
+  return {static_cast<int>(input_it.value_type.size),
+          static_cast<int>(input_it.value_type.alignment),
+          /* trivially_relocatable */ true, // TODO(bgruber): how to check this properly?
+          input_it.type == CCCL_POINTER}; // TODO(bgruber): how to check this properly?


I would appreciate some cccl.c maintainer input here. How do I know whether the iterator's value type is trivially relocatable and whether the iterator is contiguous?

bernhardmgruber · 2025-12-09T07:20:29Z

c/parallel/src/transform.cu

+    std::unique_ptr<arch_policies<1>> rtp(static_cast<arch_policies<1>*>(build_ptr->runtime_policy)); // FIXME(bgruber):
+                                                                                                      // handle <2> as
+                                                                                                      // well


Is there any way in this function to distinguish whether we build the unary or binary transform?

miscco · 2025-12-10T08:34:47Z

cub/cub/device/dispatch/kernels/kernel_transform.cuh

 _CCCL_API constexpr int get_block_threads_helper()
 {
-  if constexpr (ActivePolicy::algorithm == Algorithm::prefetch)
+  constexpr transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});


I hate the arcane / 10 here with a passion

I would love to call ::cuda::current_arch_id() but it's not constexpr on NVHPC by design.
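For readers puzzled by the / 10: in device code __CUDA_ARCH__ (which CUB_PTX_ARCH mirrors) encodes the compute capability as major * 100 + minor * 10, so sm_86 is 860 and sm_120 is 1200, while the diff above suggests ::cuda::arch_id expects 86 and 120. Below is a minimal host-only sketch of that arithmetic. The helper name arch_id_from_ptx_arch is hypothetical and not part of CCCL; it only illustrates how a named function could hide the magic constant.

```cpp
#include <cassert>

// Hypothetical helper (NOT a CCCL API): converts a __CUDA_ARCH__-style value
// (major * 100 + minor * 10, e.g. 860 for sm_86) into the compact form
// major * 10 + minor (e.g. 86) by dropping the trailing digit.
constexpr int arch_id_from_ptx_arch(int ptx_arch)
{
  return ptx_arch / 10;
}

// Evaluated at compile time, so the helper stays usable in constexpr contexts.
static_assert(arch_id_from_ptx_arch(860) == 86, "sm_86 maps to 86");
static_assert(arch_id_from_ptx_arch(1200) == 120, "sm_120 maps to 120");
```

Because the helper is constexpr, it could in principle replace the inline division wherever the policy is selected at compile time.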

miscco · 2025-12-10T08:35:31Z

cub/cub/device/dispatch/kernels/kernel_transform.cuh

+#if _CCCL_HAS_CONCEPTS()
+  requires transform_policy_hub<ArchPolicies>
+#endif // _CCCL_HAS_CONCEPTS()


Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too

Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body. And a static_assert would only be evaluated in the device path.

How would I write that using concept emulation and have the concept check before the __launch_bounds__?
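One possible C++17 shape for this, as a host-only sketch: put the check into a defaulted non-type template parameter via std::enable_if_t, so it sits in the template header rather than in the function body. All names here (is_policy_hub, my_hub, block_threads_for) are hypothetical stand-ins, not CCCL's concept emulation macros, and whether nvcc performs this check before it instantiates __launch_bounds__ would still need to be confirmed.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for the transform_policy_hub concept: a type qualifies
// if a default-constructed instance can be called with an int architecture id.
template <typename T, typename = void>
struct is_policy_hub : std::false_type
{};

template <typename T>
struct is_policy_hub<T, std::void_t<decltype(T{}(0))>> : std::true_type
{};

// Hypothetical policy hub used only for this illustration.
struct my_hub
{
  constexpr int operator()(int arch) const
  {
    return arch >= 90 ? 256 : 128;
  }
};

// The constraint lives in the template parameter list, so a non-conforming
// ArchPolicies is rejected during template argument substitution.
template <typename ArchPolicies, std::enable_if_t<is_policy_hub<ArchPolicies>::value, int> = 0>
constexpr int block_threads_for(int arch)
{
  return ArchPolicies{}(arch);
}

static_assert(is_policy_hub<my_hub>::value, "my_hub is callable with an arch id");
static_assert(!is_policy_hub<int>::value, "int is not a policy hub");
```

Calling block_threads_for<int>(90) fails to compile with a "no matching function" diagnostic instead of an error from inside the body.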

miscco · 2025-12-10T08:36:08Z

cub/cub/device/dispatch/kernels/kernel_transform.cuh

  _CCCL_ASSERT(blockDim.y == 1 && blockDim.z == 1, "transform_kernel only supports 1D blocks");

-  if constexpr (MaxPolicy::ActivePolicy::algorithm == Algorithm::prefetch)
+  static constexpr const transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});


Suggested change

-  static constexpr const transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});
+  static constexpr transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});

I had this before when I was still passing a static constexpr transform_arch_policy* as the template argument. The const was needed in addition to constexpr, for a reason that is beyond me.
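A likely explanation for the pointer case, sketched in host-only C++: constexpr on a pointer declaration makes the pointer itself const, not the pointee, while a constexpr object has a const-qualified type. Pointing at such an object therefore needs an explicit const on the pointee. The names policy_sketch, sm90_policy and policy_ptr are hypothetical and only illustrate the language rule, not the earlier revision of this PR.

```cpp
#include <cassert>

// Hypothetical policy type for illustration only.
struct policy_sketch
{
  int block_threads;
};

// A constexpr object has type `const policy_sketch`.
constexpr policy_sketch sm90_policy{256};

// `constexpr` applies to the pointer; the pointee's const must be spelled out.
static constexpr const policy_sketch* policy_ptr = &sm90_policy;
// static constexpr policy_sketch* bad = &sm90_policy; // ill-formed: would drop const

// A pointer non-type template parameter must also be pointer-to-const here.
template <const policy_sketch* Policy>
constexpr int block_threads()
{
  return Policy->block_threads;
}

static_assert(block_threads<&sm90_policy>() == 256, "policy is readable at compile time");
```

For a constexpr object declared by value, as in the hunk above, constexpr already implies const, so the extra const is redundant there.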

miscco · 2025-12-10T08:38:37Z

cub/cub/device/dispatch/tuning/tuning_transform.cuh

+  _CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
+  {
+    return !(lhs == rhs);
+  }


Nitpick: could be

Suggested change

-  _CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
-  {
-    return !(lhs == rhs);
-  }
+#if _CCCL_STD_VER <= 2017
+  _CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
+  {
+    return !(lhs == rhs);
+  }
+#endif // _CCCL_STD_VER <= 2017

Applies throughout

Nah, I feel this is a bit too much. We should just upgrade to C++20 and replace all of this by a defaulted spaceship.
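To make the trade-off concrete, here is a host-only sketch of both variants behind a language-version check. In C++20 a defaulted operator== compares member-wise and lets the compiler rewrite a != b as !(a == b); a defaulted operator<=> would implicitly declare such an operator== as well. The struct prefetch_policy_sketch and its two members are hypothetical and do not reproduce the real prefetch_policy.

```cpp
#include <cassert>

// Hypothetical policy struct; the member names are illustrative only.
struct prefetch_policy_sketch
{
  int block_threads;
  int items_per_thread;

#if __cplusplus >= 202002L
  // C++20: member-wise equality; operator!= is synthesized by the compiler.
  friend constexpr bool operator==(const prefetch_policy_sketch&, const prefetch_policy_sketch&) = default;
#else
  // C++17 and earlier: both operators have to be written by hand.
  friend constexpr bool operator==(const prefetch_policy_sketch& lhs, const prefetch_policy_sketch& rhs)
  {
    return lhs.block_threads == rhs.block_threads && lhs.items_per_thread == rhs.items_per_thread;
  }

  friend constexpr bool operator!=(const prefetch_policy_sketch& lhs, const prefetch_policy_sketch& rhs)
  {
    return !(lhs == rhs);
  }
#endif
};
```

Either branch gives callers the same == and != behavior, so the preprocessor split only pays off while C++17 still has to be supported.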

miscco · 2025-12-10T08:45:51Z

cub/cub/device/dispatch/tuning/tuning_transform.cuh

+    bool all_inputs_contiguous                  = true;
+    bool all_input_values_trivially_reloc       = true;
+    bool can_memcpy_contiguous_inputs           = true;
+    bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size);
+    for (const auto& input : inputs)
+    {
+      all_inputs_contiguous &= input.is_contiguous;
+      all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
+      // the vectorized kernel supports mixing contiguous and non-contiguous iterators
+      can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable;
+      all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size);
+    }


Nitpick: While it is technically more efficient, I believe it would improve readability if we did

const bool all_inputs_contiguous = ::cuda::std::all_of(inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });
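A slightly fuller sketch of that suggestion, using std::all_of from the host standard library instead of ::cuda::std::all_of so that it compiles without CCCL. The struct iterator_info_sketch is a hypothetical stand-in that only borrows the member names from the hunk above. Note that all_of returns true for an empty range, which matches the `= true` initial values of the loop version.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the iterator info struct; members named as in the diff.
struct iterator_info_sketch
{
  int value_type_size;
  bool value_type_is_trivially_relocatable;
  bool is_contiguous;
};

// True if every input iterator is contiguous (and for an empty input list).
inline bool all_inputs_contiguous(const std::vector<iterator_info_sketch>& inputs)
{
  return std::all_of(inputs.begin(), inputs.end(), [](const auto& input) {
    return input.is_contiguous;
  });
}

// Non-contiguous inputs are allowed; contiguous ones must be trivially relocatable.
inline bool can_memcpy_contiguous_inputs(const std::vector<iterator_info_sketch>& inputs)
{
  return std::all_of(inputs.begin(), inputs.end(), [](const auto& input) {
    return !input.is_contiguous || input.value_type_is_trivially_relocatable;
  });
}
```

Each property costs its own pass over the inputs, which is the efficiency trade-off the nitpick mentions; for a handful of input iterators that is unlikely to matter.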

bernhardmgruber · 2025-12-11T11:43:11Z

I see tiny changes in the generated SASS for cub.bench.transform.babelstream.base, notably in the filling kernels (no inputs) for complex<float>. The compiler now generates STG.E.ENL2.256, which it didn't do before.

The fill kernel for int128 seems to have degraded from generating STG.E.128 to a lot more STG.E.

All kernels with a functor marked as __callable_permitting_copied_arguments show no changes. That's good.

It feels a bit like the items per thread changed for the fill kernels.

bernhardmgruber · 2025-12-11T13:00:43Z

It feels a bit like the items per thread changed for the fill kernels.

They did. Before we had a tuning policy for sm_120, that was not taken into account :D This PR now uses it.

bernhardmgruber · 2025-12-11T13:19:45Z

I disabled the sm120 fill policy and now the only SASS diff for filling is on:

void cub::_V_300300_SM_1200::detail::transform::transform_kernel<cub::_V_300300_SM_1200::detail::transform::policy_hub<false, true, cuda::std::__4::tuple<cuda::__4::counting_iterator<long, 0, 0>>, unsigned long*>::policy1000, long, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cuda::__4::counting_iterator<long, 0, 0>>(long, int, bool, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cub::_V_300300_SM_1200::detail::transform::kernel_arg<cuda::__4::counting_iterator<long, 0, 0>>)

which is a thrust::tabulate of a counting_iterator<long> and an unsigned long*.

bernhardmgruber · 2025-12-11T16:45:54Z

Found the final issue with the fill kernels. Disabled the vectorized tunings when we have input streams (they were tuned for output only use cases). SASS of cub.bench.transform.fill.base now matches baseline on sm120.

SASS for cub.bench.transform.fill.base is not identical to baseline

github-actions · 2025-12-12T03:57:01Z

😬 CI Workflow Results

🟥 Finished in 6h 01m: Pass: 92%/27 | Total: 2d 13h | Max: 6h 00m | Hits: 79%/67690

See results here.

github-project-automation bot added this to CCCL Dec 8, 2025

github-project-automation bot moved this to Todo in CCCL Dec 8, 2025

cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 8, 2025

bernhardmgruber force-pushed the tuning_transform branch from 4244463 to 43feb21 Compare December 8, 2025 22:44

bernhardmgruber commented Dec 9, 2025

bernhardmgruber marked this pull request as ready for review December 9, 2025 07:44

bernhardmgruber requested review from a team as code owners December 9, 2025 07:44

bernhardmgruber requested review from fbusato and gevtushenko December 9, 2025 07:44

cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Dec 9, 2025

bernhardmgruber force-pushed the tuning_transform branch from fca1221 to 2aade5f Compare December 9, 2025 08:03

bernhardmgruber force-pushed the tuning_transform branch from 2aade5f to 57cc332 Compare December 10, 2025 08:21

bernhardmgruber requested a review from a team as a code owner December 10, 2025 08:48

miscco reviewed Dec 10, 2025

bernhardmgruber force-pushed the tuning_transform branch from cb0fac5 to 1d14a3e Compare December 10, 2025 17:49

bernhardmgruber force-pushed the tuning_transform branch from 1d14a3e to a661d8f Compare December 11, 2025 11:09

gonidelis self-requested a review December 11, 2025 16:44

Implement the new tuning API for DeviceTransform

f1f7cfa

bernhardmgruber added 22 commits December 11, 2025 18:24

missing operators and other CCCL.C fixes

5f9499a

fixes for CCCL.C

8dcfdea

fixes for CCCL.C

5ece160

not needed

c7b3553

drop comment

118ce1f

Return policy by ref

a25155b

Cleanup

ea66665

Implement tuning query

3eb8714

Try to make babelstream tunable

fd107c2

Try to make other benchmarks tunable

8bac0f9

TUNE_BIF_BIAS

bdb31d3

Refactor

f1543db

fixes

27020b2

apply reviewer feedback

1375b61

Fix clang CUDA

7594e24

nvcc compiler crash workaround

f454f6a

gcc warning fix

28c469b

nvcc crash fixes

5c91f23

MSVC and a bit of renaming

d8d689c

Disable sm120 fill policy

2ad96ef

Disable vector fill policies when we have inputs

d7809f2

SASS for cub.bench.transform.fill.base is not identical to baseline

MSVC < 14.44 workaround

c8b2ef6

bernhardmgruber force-pushed the tuning_transform branch from 1139c44 to c8b2ef6 Compare December 11, 2025 17:24

bernhardmgruber added 3 commits December 11, 2025 19:59

MSVC: Replace lambda by struct

631e275

REMOVE ME: only test MSVC for now

3880d5b

Try to further simplify

8b01df7

bernhardmgruber requested a review from a team as a code owner December 11, 2025 21:48

bernhardmgruber requested review from jrhemstad and removed request for jrhemstad December 11, 2025 21:48
