Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

<bit>: Improve has_single_bit() codegen #5367

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

pps83
Copy link
Contributor

@pps83 pps83 commented Mar 27, 2025

code avoids branches, provides 20% speed up on x64 targets.

x64 results

---------------------------------------
Benchmark                          Time
---------------------------------------
Rand16_has_single_bit            284 ns
Rand16_has_single_bit_new        232 ns
Rand32_has_single_bit            306 ns
Rand32_has_single_bit_new        246 ns
Rand64_has_single_bit            289 ns
Rand64_has_single_bit_new        233 ns

partially addresses #5359

@pps83 pps83 requested a review from a team as a code owner March 27, 2025 00:47
@pps83
Copy link
Contributor Author

pps83 commented Mar 27, 2025

seems like code is better for x86/x64/arm/arm64 : https://godbolt.org/z/fjshqMf4d

the only potential drawback: old code would check for _Val = 0 and jump. New code has no jumps (except on arm)

@StephanTLavavej StephanTLavavej changed the title <bit>: Improve std::has_single_bit <bit>: Improve has_single_bit() codegen Mar 27, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 27, 2025
@AlexGuteniev

This comment was marked as resolved.

@pps83

This comment was marked as resolved.

@pps83 pps83 force-pushed the bithas_single_bit-faster branch 6 times, most recently from 0c35bcb to cae8296 Compare March 27, 2025 15:09
@pps83 pps83 force-pushed the bithas_single_bit-faster branch from cae8296 to b96db7a Compare March 27, 2025 15:18
Copy link
Contributor

@AlexGuteniev AlexGuteniev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, some nitpicks...

@pps83 pps83 force-pushed the bithas_single_bit-faster branch from 4c7e0e7 to 16d3c6a Compare March 27, 2025 15:35
@pps83 pps83 force-pushed the bithas_single_bit-faster branch from 16d3c6a to f546b3b Compare March 27, 2025 15:47
@pps83 pps83 force-pushed the bithas_single_bit-faster branch from 86f7638 to 14656fc Compare March 27, 2025 15:54
@AlexGuteniev

This comment was marked as outdated.

@pps83
Copy link
Contributor Author

pps83 commented Mar 27, 2025

Can you please try this local change also:

_NODISCARD constexpr bool has_single_bit(const _Ty _Val) noexcept {
#if 1 // _POPCNT_INTRINSICS_ALWAYS_AVAILABLE
#if _HAS_CXX20
    if (_STD is_constant_evaluated())
        return 1 == _Popcount_fallback(_Val);
#endif // _HAS_CXX20
    return 1 == _Unchecked_popcount(_Val);
#else
    return (_Val ^ (_Val - 1)) > _Val - 1;
#endif
}

@AlexGuteniev
Copy link
Contributor

AlexGuteniev commented Mar 27, 2025

Turns out this particular benchmark is very much affected by #4496 and popcnt version is slower.

That's why I've hidden the previous results, and obtained the updated results by changing also this line to say Release instead of RelWithDebInfo

set(CMAKE_BUILD_TYPE RelWithDebInfo)

i5-1235U

x64

P-core

Benchmark Before After popcnt
has_single_bit_if<uint16_t> 11.5 ns 10.7 ns 6.68 ns
has_single_bit_if<uint32_t> 11.3 ns 11.3 ns 10.6 ns
has_single_bit_if<uint64_t> 6.98 ns 11.7 ns 6.57 ns
has_single_bit<uint16_t> 6.36 ns 4.51 ns 3.51 ns
has_single_bit<uint32_t> 6.08 ns 4.71 ns 3.52 ns
has_single_bit<uint64_t> 5.27 ns 4.71 ns 3.53 ns

E-core

Benchmark Before After popcnt
has_single_bit_if<uint16_t> 16.2 ns 14.1 ns 14.1 ns
has_single_bit_if<uint32_t> 16.7 ns 14.0 ns 14.0 ns
has_single_bit_if<uint64_t> 16.2 ns 16.2 ns 14.0 ns
has_single_bit<uint16_t> 12.2 ns 9.07 ns 7.24 ns
has_single_bit<uint32_t> 11.9 ns 9.07 ns 7.24 ns
has_single_bit<uint64_t> 11.6 ns 9.07 ns 7.25 ns

Apparently popcnt is a good idea. Even for the "if" case.


#4496 wasn't a severe obstacle for most previous optimization, but at such micro level it is really disrupting, we'll need to get into that.

@AlexGuteniev
Copy link
Contributor

Can you please try this local change also:

Before you go with this in the PR:

  • We don't compare like 1 == function() in new code, although there are some relic precedents. In new code, we use natural order function() == 1
  • Always control flow braces {} even around single statement in if
  • You don't need to check #if _HAS_CXX20. It is C++20 header, and the check is already there at the very beginning for the whole content

@pps83
Copy link
Contributor Author

pps83 commented Mar 27, 2025

Before you go with this in the PR...

This PR doesn't have popcnt change. The popcnt change I suggest above doesn't use the dispatcher, it's the unchecked version that does the popcnt opcode directly and it should only be used when building for AVX+. The default (version with this PR) is better than using dispatcher imo. _Checked_popcount will use the dispatcher.

I can add popcnt it as a separate PR, i think it's better that way.

@StephanTLavavej StephanTLavavej self-assigned this Mar 27, 2025
@StephanTLavavej
Copy link
Member

For the record, ARM32 performance is indeed utterly irrelevant. Windows ripped out their ARM32 build, and will even be dropping support for ARM32 binaries running on ARM64 machines, see https://learn.microsoft.com/en-us/windows/arm/arm32-to-arm64 .

ARM32 still needs to behave correctly (until DevDiv/MSVC drops support for targeting it, which I am constantly asking about), but we don't want to waste any more time on it.

@StephanTLavavej StephanTLavavej removed their assignment Apr 1, 2025
@StephanTLavavej
Copy link
Member

Thanks! I pushed some changes to the benchmark for consistency with repo conventions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Status: Ready To Merge
Development

Successfully merging this pull request may close these issues.

3 participants