Skip to content

Search for first duplicate in vectorized unique #5363

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Apr 10, 2025

Conversation

AlexGuteniev
Copy link
Contributor

@AlexGuteniev AlexGuteniev commented Mar 25, 2025

🔍 Missing piece

Vectorized remove has carefully considered find before actual removal loop. But the similar adjacent_find was missed in unique.

The effect is at least performance. Find is faster as vectorization, and it also avoids extra writes. I doubt about whether they have correctness impact though.

I've added the missing piece in separately compiled code. This is not the same as in remove, where it is in the header. But unique has in-place search for duplicates in scalar part that is tightly coupled with the rest of it, so I thought it is better to avoid disrupting that.

🏁 Benchmark

The modified benchmark adds two regions without duplicates: at the beginning and at the end of the range. The beginning one is avoidable with adjacent_find.

I've ran the modified benchmark for the main before #5092, to have the updated non-vectorized baseline to compare against

⏱️ Benchmark results

Benchmark Before vectorization Before PR After PR
u<alg_type::std_fn, std::uint8_t> 1181 ns 540 ns 217 ns
u<alg_type::std_fn, std::uint16_t> 1075 ns 649 ns 208 ns
u<alg_type::std_fn, std::uint32_t> 1204 ns 775 ns 285 ns
u<alg_type::std_fn, std::uint64_t> 1110 ns 1482 ns 551 ns
u<alg_type::rng, std::uint8_t> 1233 ns 540 ns 221 ns
u<alg_type::rng, std::uint16_t> 1059 ns 653 ns 207 ns
u<alg_type::rng, std::uint32_t> 1298 ns 785 ns 285 ns
u<alg_type::rng, std::uint64_t> 1216 ns 1490 ns 629 ns

💡 Results interpretation

The effect is way more than expected. in particular, the "Before PR" column has shown that vectorization performance has severely degraded after adding already-unique ranges parts to the input.

Apparently it is due to reading just written output. Before the first removed element we read back last element previously written along with newly read elements. The mix of cache data and store buffer data causes the CPU to stall until the store buffer entry reaches the cache. So when we skip to the first non-unique, it is no longer a problem.

@AlexGuteniev AlexGuteniev requested a review from a team as a code owner March 25, 2025 19:35
@github-project-automation github-project-automation bot moved this to Initial Review in STL Code Reviews Mar 25, 2025
@StephanTLavavej StephanTLavavej added the performance Must go faster label Mar 25, 2025
@StephanTLavavej StephanTLavavej self-assigned this Mar 25, 2025
@StephanTLavavej StephanTLavavej removed their assignment Apr 1, 2025
@StephanTLavavej StephanTLavavej moved this from Initial Review to Ready To Merge in STL Code Reviews Apr 1, 2025
@StephanTLavavej StephanTLavavej moved this from Ready To Merge to Merging in STL Code Reviews Apr 9, 2025
@StephanTLavavej StephanTLavavej self-assigned this Apr 9, 2025
@StephanTLavavej
Copy link
Member

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

@StephanTLavavej StephanTLavavej merged commit d3c9bf9 into microsoft:main Apr 10, 2025
39 checks passed
@github-project-automation github-project-automation bot moved this from Merging to Done in STL Code Reviews Apr 10, 2025
@StephanTLavavej
Copy link
Member

🏇 🏎️ 🚀

@AlexGuteniev AlexGuteniev deleted the piano-in-the-bushes branch April 11, 2025 03:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance Must go faster
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

2 participants