@bernhardmgruber
Contributor

@bernhardmgruber bernhardmgruber commented Dec 8, 2025

Fixes: #6919

  • CUB transform tests pass
  • CCCL.C transform tests pass
  • Rebase on top of Implement cuda::__all_arch_ids and cuda::__is_specific_arch #6916
  • Update benchmarks
  • No SASS difference for cub.bench.transform.babelstream.base
  • No SASS difference for cub.test.device.transform.lid_0
  • Compile-time comparison between before and after this PR (including clang in CUDA mode, where we don't have __CUDA_ARCH_LIST__)

Compile time of cub.test.device.transform.lid_0 using nvcc 13.1 and clang 20 for sm86, sm120

branch:
2m8.741s
2m7.726s
2m7.949s

main:
2m7.661s
2m6.072s
2m9.804s

Using clang 20 in CUDA mode:

branch:
real 2m33.447s
real 2m35.653s
real 2m34.587s
(with further tricks down to 1m50)

main:
real 1m39.273s
real 1m39.669s
real 1m39.835s

@cccl-authenticator-app cccl-authenticator-app bot moved this from Todo to In Progress in CCCL Dec 8, 2025
Comment on lines 200 to 205
auto make_iterator_info(cccl_iterator_t input_it) -> cdt::iterator_info
{
return {static_cast<int>(input_it.value_type.size),
static_cast<int>(input_it.value_type.alignment),
/* trivially_relocatable */ true, // TODO(bgruber): how to check this properly?
input_it.type == CCCL_POINTER}; // TODO(bgruber): how to check this properly?
Contributor Author

I would appreciate some cccl.c maintainer input here. How do I know whether the iterator's value type is trivially relocatable and the iterator is contiguous?

Comment on lines +635 to +637
std::unique_ptr<arch_policies<1>> rtp(static_cast<arch_policies<1>*>(build_ptr->runtime_policy)); // FIXME(bgruber):
// handle <2> as
// well
Contributor Author

Is there any way in this function to distinguish whether we build the unary or binary transform?

@bernhardmgruber bernhardmgruber marked this pull request as ready for review December 9, 2025 07:44
@bernhardmgruber bernhardmgruber requested review from a team as code owners December 9, 2025 07:44
@cccl-authenticator-app cccl-authenticator-app bot moved this from In Progress to In Review in CCCL Dec 9, 2025
_CCCL_API constexpr int get_block_threads_helper()
{
if constexpr (ActivePolicy::algorithm == Algorithm::prefetch)
constexpr transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});
Contributor

I hate the arcane / 10 here with a passion

Contributor Author

I would love to call ::cuda::current_arch_id() but it's not constexpr on NVHPC by design.
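
A minimal sketch of how the `/ 10` could at least be given a name. It assumes that CUB_PTX_ARCH uses the __CUDA_ARCH__ encoding (for example 860 for sm_86) and that the arch id holds the plain compute capability number (86). The enum and the helper `arch_id_from_ptx_arch` are hypothetical stand-ins, not the actual `::cuda::arch_id` API:

```cpp
// Hypothetical stand-in for ::cuda::arch_id; the real enumerators may differ.
enum class arch_id_sketch : int
{
  sm_86  = 86,
  sm_120 = 120,
};

// Assumption: ptx_arch follows the __CUDA_ARCH__ encoding (major * 100 + minor * 10),
// so dividing by 10 yields the compute capability number.
constexpr arch_id_sketch arch_id_from_ptx_arch(int ptx_arch)
{
  return static_cast<arch_id_sketch>(ptx_arch / 10);
}

static_assert(arch_id_from_ptx_arch(860) == arch_id_sketch::sm_86, "sm_86 is encoded as 860");
static_assert(arch_id_from_ptx_arch(1200) == arch_id_sketch::sm_120, "sm_120 is encoded as 1200");
```

A named constexpr helper like this keeps the conversion usable in constant expressions while documenting the encoding in one place.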

Comment on lines +986 to +1010
#if _CCCL_HAS_CONCEPTS()
requires transform_policy_hub<ArchPolicies>
#endif // _CCCL_HAS_CONCEPTS()
Contributor

Nitpick: I believe we should either use the concept emulation or plain SFINAE in C++17 too

Contributor Author

Hmm. We could also static_assert, but ArchPolicies is already used in the kernel attributes before we reach the body. And using a static_assert would only be evaluated in the device path.

How would I write that using concept emulation and have the concept check before the __launch_bounds__?
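
One possible shape for the plain-SFINAE variant, as a hedged sketch: the constraint sits in the template parameter list, which is written before any attribute on the function. The trait `is_policy_hub_sketch`, the type `hub_sketch`, and the function `block_threads_for` are hypothetical and do not exist in CCCL. Whether this ordering also holds for `__launch_bounds__` on every supported compiler is not verified here, and the extra template parameter would change the kernel's signature:

```cpp
#include <type_traits>
#include <utility>

// Hypothetical trait: a "policy hub" is anything callable with an int arch id.
template <typename T, typename = void>
struct is_policy_hub_sketch : std::false_type
{};

template <typename T>
struct is_policy_hub_sketch<T, decltype(void(std::declval<const T&>()(0)))> : std::true_type
{};

// Hypothetical hub returning a block size per architecture.
struct hub_sketch
{
  constexpr int operator()(int arch) const
  {
    return arch >= 90 ? 256 : 128;
  }
};

// The constraint is checked while the template parameters are substituted,
// before the rest of the declaration is looked at.
template <typename ArchPolicies,
          typename std::enable_if<is_policy_hub_sketch<ArchPolicies>::value, int>::type = 0>
constexpr int block_threads_for(int arch)
{
  return ArchPolicies{}(arch);
}

static_assert(block_threads_for<hub_sketch>(86) == 128, "hub_sketch is accepted");
static_assert(!is_policy_hub_sketch<int>::value, "int is not a policy hub");
```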

_CCCL_ASSERT(blockDim.y == 1 && blockDim.z == 1, "transform_kernel only supports 1D blocks");

if constexpr (MaxPolicy::ActivePolicy::algorithm == Algorithm::prefetch)
static constexpr const transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});
Contributor

Suggested change
static constexpr const transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});
static constexpr transform_arch_policy policy = ArchPolicies{}(::cuda::arch_id{CUB_PTX_ARCH / 10});

Contributor Author

I had this before when I was still passing a static constexpr transform_arch_policy* as the template argument. The const was needed in addition to constexpr, for a reason that is beyond me.

Comment on lines 103 to 106
_CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
{
return !(lhs == rhs);
}
Contributor

Nitpick: could be

Suggested change
_CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
{
return !(lhs == rhs);
}
#if _CCCL_STD_VER <= 2017
_CCCL_API constexpr friend bool operator!=(const prefetch_policy& lhs, const prefetch_policy& rhs)
{
return !(lhs == rhs);
}
#endif // _CCCL_STD_VER <= 2017

Applies throughout

Contributor Author

Nah, I feel this is a bit too much. We should just upgrade to C++20 and replace all of this by a defaulted spaceship.
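
For reference, a self-contained sketch of the trade-off being discussed. The type `prefetch_policy_sketch` and its members are hypothetical, and `__cplusplus` stands in for `_CCCL_STD_VER`. From C++20 on, `a != b` is rewritten by the compiler to `!(a == b)`, so the explicit operator is only needed in older dialects:

```cpp
// Hypothetical policy type; the member names are illustrative only.
struct prefetch_policy_sketch
{
  int block_threads;
  int items_per_thread;

  friend constexpr bool operator==(const prefetch_policy_sketch& lhs, const prefetch_policy_sketch& rhs)
  {
    return lhs.block_threads == rhs.block_threads && lhs.items_per_thread == rhs.items_per_thread;
  }

#if __cplusplus <= 201703L
  // Before C++20, operator!= is not synthesized from operator== and must be spelled out.
  friend constexpr bool operator!=(const prefetch_policy_sketch& lhs, const prefetch_policy_sketch& rhs)
  {
    return !(lhs == rhs);
  }
#endif // __cplusplus <= 201703L
};

static_assert(prefetch_policy_sketch{256, 4} == prefetch_policy_sketch{256, 4}, "equal policies");
static_assert(prefetch_policy_sketch{256, 4} != prefetch_policy_sketch{128, 4}, "different policies");
```

With C++20 as the minimum, both operators could collapse into a single defaulted comparison, which is the simplification the reply above argues for.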

Comment on lines 358 to 376
bool all_inputs_contiguous = true;
bool all_input_values_trivially_reloc = true;
bool can_memcpy_contiguous_inputs = true;
bool all_value_types_have_power_of_two_size = ::cuda::is_power_of_two(output.value_type_size);
for (const auto& input : inputs)
{
all_inputs_contiguous &= input.is_contiguous;
all_input_values_trivially_reloc &= input.value_type_is_trivially_relocatable;
// the vectorized kernel supports mixing contiguous and non-contiguous iterators
can_memcpy_contiguous_inputs &= !input.is_contiguous || input.value_type_is_trivially_relocatable;
all_value_types_have_power_of_two_size &= ::cuda::is_power_of_two(input.value_type_size);
}
Contributor

Nitpick: While the single loop is technically more efficient, I believe it would improve readability if we did

    const bool all_inputs_contiguous = ::cuda::std::all_of(inputs.begin(), inputs.end(), [](const auto& input) { return input.is_contiguous; });
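
A hedged host-side sketch of what the per-flag version might look like. It uses `std::all_of` and `std::vector` instead of the `::cuda::std` equivalents, and `iterator_info_sketch` is a hypothetical stand-in that only borrows the member names from the loop above:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for the iterator description used in the loop above.
struct iterator_info_sketch
{
  int value_type_size;
  bool value_type_is_trivially_relocatable;
  bool is_contiguous;
};

// True if every input iterator is contiguous (vacuously true for no inputs).
bool all_inputs_contiguous(const std::vector<iterator_info_sketch>& inputs)
{
  return std::all_of(inputs.begin(), inputs.end(), [](const iterator_info_sketch& in) {
    return in.is_contiguous;
  });
}

// True if every contiguous input has a trivially relocatable value type;
// non-contiguous inputs do not affect the result.
bool can_memcpy_contiguous_inputs(const std::vector<iterator_info_sketch>& inputs)
{
  return std::all_of(inputs.begin(), inputs.end(), [](const iterator_info_sketch& in) {
    return !in.is_contiguous || in.value_type_is_trivially_relocatable;
  });
}
```

Each flag costs one extra pass over the inputs, which is the efficiency trade-off the comment alludes to; since the number of inputs is small, the difference is unlikely to matter.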

@bernhardmgruber
Contributor Author

bernhardmgruber commented Dec 11, 2025

I see tiny changes in the generated SASS for cub.bench.transform.babelstream.base, notably in the filling kernels (no inputs) for complex<float>. The compiler now generates STG.E.ENL2.256, which it didn't do before.

The fill kernel for int128 seems to have degraded from generating STG.E.128 to a lot more STG.E.

All kernels with a functor marked as __callable_permitting_copied_arguments show no changes. That's good.

It feels a bit like the items per thread changed for the fill kernels.

@bernhardmgruber
Contributor Author

It feels a bit like the items per thread changed for the fill kernels.

They did. Before, we had a tuning policy for sm_120 that was not taken into account :D This PR now uses it.

@bernhardmgruber
Contributor Author

I disabled the sm120 fill policy and now the only SASS diff for filling is on:

void cub::_V_300300_SM_1200::detail::transform::transform_kernel<cub::_V_300300_SM_1200::detail::transform::policy_hub<false, true, cuda::std::__4::tuple<cuda::__4::counting_iterator<long, 0, 0>>, unsigned long*>::policy1000, long, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cuda::__4::counting_iterator<long, 0, 0>>(long, int, bool, cub::_V_300300_SM_1200::detail::transform::always_true_predicate, cuda::__4::__callable_permitting_copied_arguments<(anonymous namespace)::lognormal_adjust_t<unsigned long>>, unsigned long*, cub::_V_300300_SM_1200::detail::transform::kernel_arg<cuda::__4::counting_iterator<long, 0, 0>>)

which is a thrust::tabulate of a counting_iterator<long> and an unsigned long*.

@gonidelis gonidelis self-requested a review December 11, 2025 16:44
@bernhardmgruber
Contributor Author

Found the final issue with the fill kernels. Disabled the vectorized tunings when we have input streams (they were tuned for output-only use cases). SASS of cub.bench.transform.fill.base now matches baseline on sm120.

@bernhardmgruber bernhardmgruber requested a review from a team as a code owner December 11, 2025 21:48
@bernhardmgruber bernhardmgruber requested review from jrhemstad and removed request for jrhemstad December 11, 2025 21:48
@github-actions
Contributor

😬 CI Workflow Results

🟥 Finished in 6h 01m: Pass: 92%/27 | Total: 2d 13h | Max: 6h 00m | Hits: 79%/67690

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Implement the new tuning API for DeviceTransform

2 participants