GH-48204: [C++][Parquet] Fix Column Reader & Writer logic to enable Parquet DB support on s390x #48205
Conversation
auto last_day_nanos = last_day_units * NanosecondsPerUnit;
#if ARROW_LITTLE_ENDIAN
  // impala_timestamp will be unaligned every other entry so do memcpy instead
  // of assign and reinterpret cast to avoid undefined behavior.
  std::memcpy(impala_timestamp, &last_day_nanos, sizeof(int64_t));
#else
  (*impala_timestamp).value[0] = static_cast<uint32_t>(last_day_nanos);
  (*impala_timestamp).value[1] = static_cast<uint32_t>(last_day_nanos >> 32);
Can we use the following instead of #if?
auto last_day_nanos = last_day_units * NanosecondsPerUnit;
auto last_day_nanos_little_endian = ::arrow::bit_util::ToLittleEndian(last_day_nanos);
std::memcpy(impala_timestamp, &last_day_nanos_little_endian, sizeof(int64_t));
cpp/src/parquet/column_reader.cc
#if ARROW_LITTLE_ENDIAN
  if (num_bytes < 0 || num_bytes > data_size - 4) {
#else
  if (num_bytes < 0 || num_bytes > data_size) {
#endif
Thanks for the ping @kou . I've re-read through this code and I now think the original change was a mistake. I'll submit a separate issue/PR to fix it.
@Vishwanatha-HD Can you rebase/merge from git main and remove this change?
Working through this one, I'm reminded how many odd little corners show up when Arrow's layout meets Parquet's expectations, especially around levels, decimals, and the legacy INT96 bits. Looking at the pieces that overlap with the work I've been doing, the overall direction makes sense. A few notes from what I've seen on real s390x hardware:

• BIT_PACKED level headers
• Decimal serialization
• Half-floats in FLBA
• Paging / DoInBatches
• INT96 (Impala timestamp)

None of this is blocking; just trying to pass along the details I've seen crop up when running the full parquet-encode → parquet-decode cycle on big-endian hardware.
if constexpr (std::is_same_v<ArrowType, ::arrow::Decimal64Type>) {
  *p++ = ::arrow::bit_util::ToBigEndian(u64_in[0]);
} else if constexpr (std::is_same_v<ArrowType, ::arrow::Decimal128Type>) {
#if ARROW_LITTLE_ENDIAN
Please take a step back and read the comments above:
// Requires a custom serializer because decimal in parquet are in big-endian
// format. Thus, a temporary local buffer is required.

If we're on a big-endian system, this entire code is unnecessary and we can just use the FIXED_LEN_BYTE_ARRAY SerializeFunctor.
@pitrou Thanks for your review comments. I will probably work on this change in the next pass. Thanks.
@Vishwanatha-HD Please don't resolve discussions until they are actually resolved. This one hasn't been addressed.
@pitrou Ok, sure. Thanks.
@pitrou I have rebased this onto git main and removed the piece of code below; only the unconditional check is retained now:

-#if ARROW_LITTLE_ENDIAN
-  if (num_bytes < 0 || num_bytes > data_size - 4) {
-#else
   if (num_bytes < 0 || num_bytes > data_size) {
-#endif
I'm frankly surprised that so few changes are required, given that Parquet C++ was never successfully tested on BE systems before. @Vishwanatha-HD Did you try to read the files in https://github.com/apache/parquet-testing/tree/master/data and check the contents were properly decoded?
Hi @pitrou..
Thanks @Vishwanatha-HD, and thanks for splitting it up like this.
@pitrou Sure, my pleasure. Thanks a lot for spending so much time reviewing the code change and giving your comments. I appreciate it!
Vishwanatha-HD left a comment:
I have addressed all the review comments. Thanks.
Rationale for this change
This PR enables Parquet DB support on big-endian (s390x) systems by fixing the column reader & writer logic. The column reader & writer are at the core of most of the parquet and arrow-parquet test cases.
What changes are included in this PR?
The fix includes changes to the following files:
cpp/src/parquet/column_reader.cc
cpp/src/parquet/column_writer.cc
cpp/src/parquet/column_writer.h
Are these changes tested?
Yes. The changes are tested on the s390x architecture to confirm things work correctly. The fix is also tested on x86 to make sure no new regression is introduced.
Are there any user-facing changes?
No
GitHub main Issue link: #48151