
Conversation

@Vishwanatha-HD commented Nov 21, 2025

…t DB support on s390x

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the encoder and decoder logic. Encoders and decoders are at the core of most parquet and arrow-parquet test cases, and they need fixes for various encoding and decoding types.

What changes are included in this PR?

The fix includes changes to the following files:
cpp/src/parquet/decoder.cc
cpp/src/parquet/encoder.cc

Are these changes tested?

Yes. The changes were tested on the s390x architecture to make sure everything works as expected. The fix was also tested on x86 to confirm that no new regressions were introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions

⚠️ GitHub issue #48202 has been automatically assigned in GitHub to PR creator.

@k8ika0s commented Nov 23, 2025

@Vishwanatha-HD Hey! Really appreciate you taking on the encoder/decoder paths for s390x — these two files are where a lot of the subtle BE issues first show up.

One thing I ran into on real s390x hardware is that Arrow’s array buffers already store their scalars in canonical little-endian format. Because of that, per-value swapping inside the Plain/Arrow fast paths can sometimes lead to an unintended double-swap, especially when mixing Arrow-originated inputs with non-Arrow callers (e.g., DeltaBitPack or ByteStreamSplit feeding into the same decode path).
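To make the double-swap hazard concrete, here is a minimal standalone sketch (the helper names are hypothetical, not Arrow/Parquet APIs). On a BE host, one swap converts host order to canonical LE; applying it a second time to bytes that are already LE just restores BE order and corrupts the value:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: reverse the bytes of a 32-bit value.
inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0xFF000000u) >> 24) | ((v & 0x00FF0000u) >> 8) |
         ((v & 0x0000FF00u) << 8)  | ((v & 0x000000FFu) << 24);
}

// On a big-endian host, host -> canonical-LE is one ByteSwap32.  If the
// input buffer is *already* canonical LE (as Arrow array buffers are),
// swapping again "double-swaps" the value back to BE byte order: the
// swap is its own inverse, so the second application undoes the first.
```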

A couple of spots I’m curious about here:

• PlainDecoder → Arrow fast path
The #if ARROW_LITTLE_ENDIAN branches check out for host-native buffers, but Arrow itself always hands you LE data. Does this path avoid re-swapping values that are already canonical LE from Arrow? I’ve seen that cause subtle mismatches on BE when DeltaBitPack or BSS push Arrow-arrays directly into the decoder.
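One way to sidestep the re-swap is to make the decode path explicit about whether its source bytes are canonical LE, and only swap when that disagrees with the host. This is a hedged sketch under that assumption, not the actual decoder.cc code; `HostIsLittleEndian` stands in for the real `ARROW_LITTLE_ENDIAN` check:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

inline uint32_t ByteSwap32(uint32_t v) {
  return ((v & 0xFF000000u) >> 24) | ((v & 0x00FF0000u) >> 8) |
         ((v & 0x0000FF00u) << 8)  | ((v & 0x000000FFu) << 24);
}

// Runtime endianness probe (stand-in for ARROW_LITTLE_ENDIAN).
inline bool HostIsLittleEndian() {
  const uint16_t probe = 0x0102;
  uint8_t first;
  std::memcpy(&first, &probe, 1);
  return first == 0x02;
}

// Hypothetical decode helper: copy n uint32 values from `src` into
// host-order `dst`, swapping only when the source byte order differs
// from the host's.  Passing src_is_canonical_le=true for Arrow-origin
// buffers prevents the double swap on BE.
inline void DecodeToHost32(const uint8_t* src, uint32_t* dst, size_t n,
                           bool src_is_canonical_le) {
  for (size_t i = 0; i < n; ++i) {
    uint32_t v;
    std::memcpy(&v, src + 4 * i, 4);
    const bool need_swap = (src_is_canonical_le != HostIsLittleEndian());
    dst[i] = need_swap ? ByteSwap32(v) : v;
  }
}
```

With this shape, a DeltaBitPack or BSS caller that feeds Arrow arrays into the same path just passes the flag instead of re-entering the `#if` branch. The test below is host-independent: the same LE bytes must decode to the same host value on x86 and s390x.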

• PlainEncoder (primitive path)
On BE, the per-value ToLittleEndian write works for correctness, though I’ve found cases where staging to a single LE scratch buffer helps avoid partial/mixed-endian outputs when builders and sinks run back-to-back.
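The staging idea can be sketched like this (hypothetical helper, not the PlainEncoder code): write every value into one LE scratch buffer byte-by-byte, so the sink only ever sees fully converted output regardless of host endianness, and there is no window where some values are swapped and others are not:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical sketch: stage all values into a single canonical-LE
// scratch buffer before handing bytes to the sink.  Emitting each byte
// by shifting makes the output identical on LE and BE hosts.
std::vector<uint8_t> EncodePlainLE32(const uint32_t* values, size_t n) {
  std::vector<uint8_t> out(4 * n);
  for (size_t i = 0; i < n; ++i) {
    const uint32_t v = values[i];
    out[4 * i + 0] = static_cast<uint8_t>(v & 0xFF);
    out[4 * i + 1] = static_cast<uint8_t>((v >> 8) & 0xFF);
    out[4 * i + 2] = static_cast<uint8_t>((v >> 16) & 0xFF);
    out[4 * i + 3] = static_cast<uint8_t>((v >> 24) & 0xFF);
  }
  return out;
}
```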

• ByteStreamSplit
Here the code assumes DoSplitStreams handles endianness, but BSS usually expects inputs to already be in canonical LE order before the streams are interleaved. With native-order buffers on BE, the shuffle sometimes produces different stats/dictionary bytes across architectures. Curious if you’ve tested mixed Arrow/non-Arrow inputs through this path?
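A toy split illustrates why the input order matters (this is a simplified stand-in for DoSplitStreams, not the real implementation): BSS interleaves byte k of every value into stream k, so splitting canonical-LE bytes yields identical streams on every architecture, while splitting host-native bytes on BE reverses the stream contents and changes the on-disk bytes that stats and dictionaries see:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical BSS sketch for 4-byte values: stream k collects byte k of
// each value.  `le_bytes` must already be in canonical LE order for the
// result to match across LE and BE hosts.
std::array<std::vector<uint8_t>, 4> SplitStreams32(const uint8_t* le_bytes,
                                                   size_t n) {
  std::array<std::vector<uint8_t>, 4> streams;
  for (auto& s : streams) s.resize(n);
  for (size_t i = 0; i < n; ++i) {
    for (size_t k = 0; k < 4; ++k) {
      streams[k][i] = le_bytes[4 * i + k];
    }
  }
  return streams;
}
```

If native-order BE bytes were passed in instead, stream 0 would hold the most-significant bytes rather than the least-significant ones, which is exactly the cross-architecture divergence described above.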

None of this blocks the PR — just sharing things I hit in BE testing across the encode/decode → stats → page-index chain.

