Conversation

@Vishwanatha-HD Vishwanatha-HD commented Nov 21, 2025

Rationale for this change

This PR enables Parquet DB support on big-endian (s390x) systems by fixing the column reader and writer logic. The column reader and writer are exercised by most of the parquet and arrow-parquet test cases.

What changes are included in this PR?

The fix includes changes to the following files:

- cpp/src/parquet/column_reader.cc
- cpp/src/parquet/column_writer.cc
- cpp/src/parquet/column_writer.h

Are these changes tested?

Yes. The changes were tested on the s390x architecture to verify correct behavior, and also on x86 to confirm that no new regressions were introduced.

Are there any user-facing changes?

No

GitHub main Issue link: #48151

@github-actions

⚠️ GitHub issue #48204 has been automatically assigned in GitHub to PR creator.

@kou kou changed the title GH-48204 Fix Column Reader & Writer logic to enable Parquet DB suppor… GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x Nov 22, 2025
Comment on lines 269 to +273
auto last_day_nanos = last_day_units * NanosecondsPerUnit;
#if ARROW_LITTLE_ENDIAN
// impala_timestamp will be unaligned every other entry so do memcpy instead
// of assign and reinterpret cast to avoid undefined behavior.
std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
#else
(*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
(*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Member:

Can we use the following instead of #if?

auto last_day_nanos = last_day_units * NanosecondsPerUnit;
auto last_day_nanos_little_endian = ::arrow::bit_util::ToLittleEndian(last_day_nanos);
std::memcpy(impala_timestamp, &last_day_nanos_little_endian, sizeof(int64_t));

Comment on lines 137 to 141
#if ARROW_LITTLE_ENDIAN
if (num_bytes < 0 || num_bytes > data_size - 4) {
#else
if (num_bytes < 0 || num_bytes > data_size) {
#endif
Member:

@pitrou You added `- 4` in #6848. Do you think that we need `- 4` with big endian too?

Member:

Thanks for the ping @kou . I've re-read through this code and I now think the original change was a mistake. I'll submit a separate issue/PR to fix it.

Member:

@Vishwanatha-HD Can you rebase/merge from git main and remove this change?


k8ika0s commented Nov 23, 2025

@Vishwanatha-HD

Working through this one, I’m reminded how many odd little corners show up when Arrow’s layout meets Parquet’s expectations — especially around levels, decimals, and the legacy INT96 bits.

Looking at the pieces that overlap with the work I’ve been doing, the overall direction makes sense. A few notes from what I’ve seen on real s390x hardware:

• BIT_PACKED level headers
Your patch keeps the data_size - 4 bound under ARROW_LITTLE_ENDIAN, whereas my tree accepts the full BIT_PACKED buffer and logs failures rather than rejecting early. Neither approach is wrong, but on BE machines I've found that the "minus 4" guard sometimes rejects buffers that are actually fine, depending on how many values the upstream encoder produced.

• Decimal serialization
This is one of the trickier spots. Parquet expects decimals in a big-endian 128-bit payload, but Arrow materializes them in little-endian limbs even on BE hardware. In my implementation I reverse the Arrow words ([low, high] → [high, low]) before handing them to the writer so the final byte stream matches the canonical Parquet format.
Your patch uses ToBigEndian on each limb directly in host order, which works for many cases but can produce a differently ordered representation when Arrow’s in-memory layout doesn’t match the 128-bit big-endian wire format. Just sharing that in case you’ve seen similar behavior when mixing different decimal widths.

• Half-floats in FLBA
The BE path you added with ToLittleEndian(values[i]) aligns with the intent. I ended up staging the FLBA structs and the 2-byte payloads together in one scratch buffer, mostly because some downstream consumers treat the pointer lifetime very strictly. Either way, normalizing those 2-byte halves before page assembly helps avoid the cross-architecture drift I’ve run into.

• Paging / DoInBatches
Your rewrite to enforce max_rows_per_page is a meaningful cleanup. My patches didn’t touch this area, so no conflicts there — but just to mention it, keeping the paging logic predictable on BE made debugging the level stream quite a bit easier for me.

• INT96 (Impala timestamp)
Your implementation writes host-order limbs on BE and memcpy on LE. In my case I leaned heavily on always emitting LE limbs so the decode path doesn’t have to branch on architecture. Both approaches work as long as the corresponding reader expects the same convention.

None of this is blocking — just trying to pass along the details I’ve seen crop up when running the full parquet-encode → parquet-decode cycle on big-endian hardware.

if constexpr (std::is_same_v<ArrowType, ::arrow::Decimal64Type>) {
*p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v<ArrowType, ::arrow::Decimal128Type>) {
#if ARROW_LITTLE_ENDIAN
Member:

Please take a step back and read the comments above:

// Requires a custom serializer because decimal in parquet are in big-endian
// format. Thus, a temporary local buffer is required.

If we're on a big-endian system, this entire code is unnecessary and we can just use the FIXED_LEN_BYTE_ARRAY SerializeFunctor.

Contributor Author:

@pitrou, thanks for your review comments. I will work on this change in the next pass. Thanks.

Member:

@Vishwanatha-HD Please don't resolve discussions until they are actually resolved. This one hasn't been addressed.

Contributor Author:

@pitrou, OK, sure. Thanks.

> @Vishwanatha-HD Please don't resolve discussions until they are actually resolved. This one hasn't been addressed.

@Vishwanatha-HD (Contributor Author), Nov 28, 2025:

@pitrou, I have rebased this onto git main and removed the piece of code below:

-#if ARROW_LITTLE_ENDIAN
-      if (num_bytes < 0 || num_bytes > data_size - 4) {
-#else
       if (num_bytes < 0 || num_bytes > data_size) {  // <-- only retaining this line now
-#endif

@github-actions github-actions bot added awaiting committer review Awaiting committer review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting committer review Awaiting committer review labels Nov 24, 2025

pitrou commented Nov 24, 2025

I'm frankly surprised that so few changes are required, given that Parquet C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you try to read the files in https://github.com/apache/parquet-testing/tree/master/data and check the contents were properly decoded?

@Vishwanatha-HD

> I'm frankly surprised that so few changes are required, given that Parquet C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you try to read the files in https://github.com/apache/parquet-testing/tree/master/data and check the contents were properly decoded?

Hi @pitrou,
Thanks for your comments. Please note that this PR is not the only change required to enable support on s390x; I have raised 12 other PRs. The main issue is #48151.
Please check that issue for links to all the remaining PRs.


pitrou commented Nov 24, 2025

Thanks @Vishwanatha-HD, and thanks for splitting it up like this.

@Vishwanatha-HD

> Thanks @Vishwanatha-HD, and thanks for splitting it up like this.

@pitrou, sure, my pleasure. Thanks a lot for spending so much time reviewing the code change and giving your comments. I appreciate it!

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Nov 24, 2025
@Vishwanatha-HD left a comment:

I have addressed all the review comments. Thanks.

@Vishwanatha-HD Vishwanatha-HD force-pushed the fixColumnReaderWriter branch 2 times, most recently from 3e0b644 to f05e139 Compare November 28, 2025 15:26