
Conversation

@Vishwanatha-HD commented Nov 21, 2025

…t DB support on s390x

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the encoder and decoder logic. Encoders and decoders are at the core of most parquet and arrow-parquet test cases, and they need fixes for various encoding and decoding types.

What changes are included in this PR?

The fix includes changes to the following files:
cpp/src/parquet/decoder.cc
cpp/src/parquet/encoder.cc

Are these changes tested?

Yes. The changes were tested on the s390x architecture to make sure everything works as expected. The fix was also tested on x86 to confirm that no new regressions were introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions

⚠️ GitHub issue #48202 has been automatically assigned in GitHub to PR creator.

@k8ika0s commented Nov 23, 2025

@Vishwanatha-HD Hey! Really appreciate you taking on the encoder/decoder paths for s390x — these two files are where a lot of the subtle BE issues first show up.

One thing I ran into on real s390x hardware is that Arrow’s array buffers already store their scalars in canonical little-endian format. Because of that, per-value swapping inside the Plain/Arrow fast paths can sometimes lead to an unintended double-swap, especially when mixing Arrow-originated inputs with non-Arrow callers (e.g., DeltaBitPack or ByteStreamSplit feeding into the same decode path).
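To make the double-swap hazard concrete, here is a minimal standalone sketch (the helper names are hypothetical, not Arrow/Parquet APIs). On a BE host, one swap converts host order to canonical LE; applying it a second time to bytes that are already LE just restores BE order and corrupts the value:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reverse the bytes of a 32-bit value.
inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0xFF000000u) >> 24) | ((v & 0x00FF0000u) >> 8) |
         ((v & 0x0000FF00u) << 8)  | ((v & 0x000000FFu) << 24);
}

// On a big-endian host, host -> canonical-LE is one ByteSwap32.  If the
// input buffer is *already* canonical LE (as Arrow array buffers are),
// swapping again "double-swaps" the value back to BE byte order: the
// swap is its own inverse, so the second application undoes the first.
```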

A couple of spots I’m curious about here:

• PlainDecoder → Arrow fast path
The #if ARROW_LITTLE_ENDIAN branches check out for host-native buffers, but Arrow itself always hands you LE data. Does this path avoid re-swapping values that are already canonical LE from Arrow? I’ve seen that cause subtle mismatches on BE when DeltaBitPack or BSS push Arrow-arrays directly into the decoder.
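One way to sidestep the re-swap is to make the decode path explicit about whether its source bytes are canonical LE, and only swap when that disagrees with the host. This is a hedged sketch under that assumption, not the actual decoder.cc code; `HostIsLittleEndian` stands in for the real `ARROW_LITTLE_ENDIAN` check:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0xFF000000u) >> 24) | ((v & 0x00FF0000u) >> 8) |
         ((v & 0x0000FF00u) << 8)  | ((v & 0x000000FFu) << 24);
}

// Runtime endianness probe (stand-in for ARROW_LITTLE_ENDIAN).
inline bool HostIsLittleEndian() {
  const uint16_t probe = 0x0102;
  uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 0x02;
}

// Hypothetical decode helper: copy n uint32 values from `src` into
// host-order `dst`, swapping only when the source byte order differs
// from the host's.  Passing src_is_canonical_le=true for Arrow-origin
// buffers prevents the double swap on BE.
inline void DecodeToHost32(const uint8_t* src, uint32_t* dst, size_t n,
                           bool src_is_canonical_le) {
  for (size_t i = 0; i < n; ++i) {
    uint32_t v;
    std::memcpy(&v, src + 4 * i, 4);
    const bool need_swap = (src_is_canonical_le != HostIsLittleEndian());
    dst[i] = need_swap ? ByteSwap32(v) : v;
  }
}
```

With this shape, a DeltaBitPack or BSS caller that feeds Arrow arrays into the same path just passes the flag instead of re-entering the `#if` branch. The test below is host-independent: the same LE bytes must decode to the same host value on x86 and s390x.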

• PlainEncoder (primitive path)
On BE, the per-value ToLittleEndian write works for correctness, though I’ve found cases where staging to a single LE scratch buffer helps avoid partial/mixed-endian outputs when builders and sinks run back-to-back.
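The staging idea can be sketched like this (hypothetical helper, not the PlainEncoder code): write every value into one LE scratch buffer byte-by-byte, so the sink only ever sees fully converted output regardless of host endianness, and there is no window where some values are swapped and others are not:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: stage all values into a single canonical-LE
// scratch buffer before handing bytes to the sink.  Emitting each byte
// by shifting makes the output identical on LE and BE hosts.
std::vector<uint8_t> EncodePlainLE32(const uint32_t* values, size_t n) {
  std::vector<uint8_t> out(4 * n);
  for (size_t i = 0; i < n; ++i) {
    const uint32_t v = values[i];
    out[4 * i + 0] = static_cast<uint8_t>(v & 0xFF);
    out[4 * i + 1] = static_cast<uint8_t>((v >> 8) & 0xFF);
    out[4 * i + 2] = static_cast<uint8_t>((v >> 16) & 0xFF);
    out[4 * i + 3] = static_cast<uint8_t>((v >> 24) & 0xFF);
  }
  return out;
}
```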

• ByteStreamSplit
Here the code assumes DoSplitStreams handles endianness, but BSS usually expects inputs to already be in canonical LE order before the streams are interleaved. With native-order buffers on BE, the shuffle sometimes produces different stats/dictionary bytes across architectures. Curious if you’ve tested mixed Arrow/non-Arrow inputs through this path?
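A toy split illustrates why the input order matters (this is a simplified stand-in for DoSplitStreams, not the real implementation): BSS interleaves byte k of every value into stream k, so splitting canonical-LE bytes yields identical streams on every architecture, while splitting host-native bytes on BE reverses the stream contents and changes the on-disk bytes that stats and dictionaries see:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical BSS sketch for 4-byte values: stream k collects byte k of
// each value.  `le_bytes` must already be in canonical LE order for the
// result to match across LE and BE hosts.
std::array<std::vector<uint8_t>, 4> SplitStreams32(const uint8_t* le_bytes,
                                                   size_t n) {
  std::array<std::vector<uint8_t>, 4> streams;
  for (auto& s : streams) s.resize(n);
  for (size_t i = 0; i < n; ++i) {
    for (size_t k = 0; k < 4; ++k) {
      streams[k][i] = le_bytes[4 * i + k];
    }
  }
  return streams;
}
```

If native-order BE bytes were passed in instead, stream 0 would hold the most-significant bytes rather than the least-significant ones, which is exactly the cross-architecture divergence described above.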

None of this blocks the PR — just sharing things I hit in BE testing across the encode/decode → stats → page-index chain.

