Skip to content

Conversation

@Vishwanatha-HD
Copy link
Contributor

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Rationale for this change

This PR is intended to enable Parquet DB support on Big-endian (s390x) systems. The fix in this PR fixes the Bloom Filter logic.

What changes are included in this PR?

The fix includes changes to following file:
cpp/src/parquet/bloom_filter.cc

Are these changes tested?

Yes. The changes are tested on s390x arch to make sure things are working fine. The fix is also tested on x86 arch, to make sure there is no new regression introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions
Copy link

⚠️ GitHub issue #48210 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-48210 Fix Bloom Filter logic to enable Parquet DB support on s390x GH-48210: [C++][Parquet] Fix Bloom Filter logic to enable Parquet DB support on s390x Nov 22, 2025
@k8ika0s
Copy link

k8ika0s commented Nov 23, 2025

@Vishwanatha-HD

Bloom filters are one of those parts of Parquet where tiny byte-order details end up mattering way more than you’d expect, so it’s good to see attention landing here.

Something I ran into on s390x is that the xxhash input/output tends to stay a lot more predictable if the bitset words are kept in a single canonical order (LE in our case) and the reader/writer treat them as such. In my own experiments I normalized the bitset once at the boundary and let the rest of the logic operate on native values.

In this patch, the per-word FromLittleEndian/ToLittleEndian inside the find/insert loops definitely keeps things correct, though it does create a slightly tighter coupling between the hashing logic and the byte-swapping. I only mention it because it can sometimes show up in profiling when bloom filters are exercised heavily over wide row groups.

Not calling this out as a problem — the behavior you’re targeting here lines up with what I’ve seen on s390x, especially around making sure the mask checks behave the same across BE/LE hosts. Just sharing observations in case it’s useful while these pieces get polished.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants