# GH-48213: [C++][Parquet] Fix endianness and test failures on s390x (big-endian) (supersedes partial fixes) #48212
### Rationale for this change
Running Apache Arrow and Parquet 22.0.0 on s390x (big-endian) exposed a wide class of endianness-related correctness bugs and several test fragilities.
These included:
- Encoders emitting host-order bytes instead of canonical little-endian (a minimal sketch of this bug class follows the list).
- Decoders double-swapping values or interpreting host-order statistics pages incorrectly.
- INT96 timestamps mixing host-order and little-endian limbs.
- Page index expectations using host-order bytes.
- Half-float (Float16) serialization producing non-canonical layouts.
- Dictionary and BYTE_ARRAY length prefixes encoded inconsistently.
- Bit-packing and VLQ/ZigZag paths assuming little-endian layout even on big-endian hosts.
- Several tests comparing against macOS/ARM IPC snapshots, which do not match the canonical bytes produced on big-endian architectures.
- Synthetic OOM tests causing allocator-level aborts (particularly with mimalloc on large-memory s390x systems).
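For context, here is a minimal sketch of the first bug class. This is illustrative code, not Arrow's actual implementation: copying an integer's raw bytes emits host-order bytes, which on s390x are big-endian, while the Parquet format mandates little-endian on disk.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration of the bug class; not Arrow's actual code.
// Parquet PLAIN encoding mandates little-endian bytes on disk.

// Buggy: copies host-order bytes, which are big-endian on s390x.
void WriteInt32HostOrder(int32_t value, uint8_t* out) {
  std::memcpy(out, &value, sizeof(value));
}

// Fixed: serialize byte by byte so the on-disk layout is little-endian
// regardless of the host's native byte order.
void WriteInt32LittleEndian(int32_t value, uint8_t* out) {
  uint32_t v = static_cast<uint32_t>(value);
  out[0] = static_cast<uint8_t>(v);
  out[1] = static_cast<uint8_t>(v >> 8);
  out[2] = static_cast<uint8_t>(v >> 16);
  out[3] = static_cast<uint8_t>(v >> 24);
}
```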
This PR makes Arrow+Parquet fully correct and stable on big-endian platforms by enforcing canonical little-endian encoding/decoding across all Parquet IO paths and updating tests to match the intended canonical format. The changes also improve debug-time instrumentation, correctness guarantees, and future maintainability.
⸻
### What changes are included in this PR?

#### Canonical Little-Endian I/O (Parquet & Arrow)
- Introduces `ARROW_ENSURE_S390X_ENDIANNESS` (default `ON` on s390x) to force Parquet to treat big-endian hosts as requiring byte-swap shims.
- Adds `parquet/endian_internal.h` with portable helpers (`ToLittleEndianValue`, `LoadLittleEndianScalar`, `ConvertLittleEndianInPlace`, etc.); a sketch of their shape follows this list.
- All Parquet encoders/decoders (Plain, ByteStreamSplit, Dictionary, DeltaBitPack, and FLBA) now unconditionally stage or convert values through these shims.
- Binary and BYTE_ARRAY 4-byte length prefixes always read/write little-endian values.
- On big-endian hosts, bit-packing utilities fall back to portable implementations instead of the LE-optimized `unpack64*` fast paths.
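The helper names below match those listed above; the bodies are an illustrative sketch of how such shims are commonly structured, not the PR's exact implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>

// Sketch only: the real header likely detects endianness via Arrow's
// existing configuration rather than a raw architecture check.
#if defined(__s390x__)
constexpr bool kHostIsBigEndian = true;
#else
constexpr bool kHostIsBigEndian = false;
#endif

template <typename T>
T ByteSwap(T value) {
  static_assert(std::is_trivially_copyable_v<T>);
  uint8_t bytes[sizeof(T)];
  std::memcpy(bytes, &value, sizeof(T));
  for (size_t i = 0; i < sizeof(T) / 2; ++i) {
    std::swap(bytes[i], bytes[sizeof(T) - 1 - i]);
  }
  std::memcpy(&value, bytes, sizeof(T));
  return value;
}

// Host value -> canonical little-endian representation.
template <typename T>
T ToLittleEndianValue(T value) {
  return kHostIsBigEndian ? ByteSwap(value) : value;
}

// Load a little-endian scalar from a (possibly unaligned) buffer.
template <typename T>
T LoadLittleEndianScalar(const uint8_t* src) {
  T value;
  std::memcpy(&value, src, sizeof(T));
  return kHostIsBigEndian ? ByteSwap(value) : value;
}
```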
#### INT96 Canonicalization
- INT96 nanosecond and day fields are written via `ToLittleEndianValue` and decoded via `FromLittleEndian` (see the sketch after this list).
- Statistics, comparator logic, and all test fixtures updated to expect canonical LE limb ordering.
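As a hedged illustration of the canonical layout (Parquet's INT96 stores 8 bytes of nanoseconds-of-day followed by a 4-byte Julian day, all little-endian); `WriteLE64`/`WriteLE32` are hypothetical stand-ins for the PR's `ToLittleEndianValue` path:

```cpp
#include <cstdint>

// Hypothetical helpers; stand-ins for the PR's LE serialization path.
void WriteLE64(uint64_t v, uint8_t* out) {
  for (int i = 0; i < 8; ++i) out[i] = static_cast<uint8_t>(v >> (8 * i));
}
void WriteLE32(uint32_t v, uint8_t* out) {
  for (int i = 0; i < 4; ++i) out[i] = static_cast<uint8_t>(v >> (8 * i));
}

// Canonical INT96 layout: 8 LE bytes of nanoseconds-of-day,
// then 4 LE bytes of Julian day number.
void EncodeInt96(uint64_t nanos_of_day, uint32_t julian_day, uint8_t out[12]) {
  WriteLE64(nanos_of_day, out);
  WriteLE32(julian_day, out + 8);
}
```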
#### Page Index & Float16 Fixes
- Page index helpers encode expected min/max values as LE.
- Half-float serialization uses a single staging buffer that copies Arrow's canonical little-endian bytes.
#### Level Headers & RLE Prefix Consistency
- All RLE length prefixes (definition/repetition levels, BIT_PACKED pages, and test fixtures) are now written through `ToLittleEndianValue`; a sketch of the prefix layout follows.
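Concretely, Parquet v1 data pages prefix the RLE-encoded level run with a 4-byte little-endian byte count. A hypothetical sketch of that framing (not the PR's exact code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: prepend the 4-byte little-endian length prefix
// that Parquet v1 data pages require before RLE-encoded levels.
std::vector<uint8_t> PrefixRleLevels(const std::vector<uint8_t>& rle_bytes) {
  std::vector<uint8_t> out;
  uint32_t len = static_cast<uint32_t>(rle_bytes.size());
  for (int i = 0; i < 4; ++i) {
    out.push_back(static_cast<uint8_t>(len >> (8 * i)));  // LE byte order
  }
  out.insert(out.end(), rle_bytes.begin(), rle_bytes.end());
  return out;
}
```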
#### Dictionary Stability
- Dictionary byte-array length prefixes are written in LE; payload copying remains host-safe.
#### WKB Geospatial Fixes
- WKB parsing and statistics extraction now use the unified little-endian shim (see the sketch below).
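For background, WKB carries its own byte-order marker in the first byte of each geometry (0x01 = little-endian, 0x00 = big-endian), so parsing must compare that flag against the host order. A hedged sketch, not the PR's exact code:

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Hedged sketch of byte-order-aware WKB parsing; not the PR's exact code.
// The first byte of a WKB geometry is 0x01 (LE) or 0x00 (BE).
double ReadWkbDouble(const uint8_t* src, bool wkb_is_little_endian) {
  uint8_t bytes[8];
  std::memcpy(bytes, src, 8);
#if defined(__s390x__)  // big-endian host
  const bool host_is_little_endian = false;
#else
  const bool host_is_little_endian = true;
#endif
  if (wkb_is_little_endian != host_is_little_endian) {
    for (int i = 0; i < 4; ++i) std::swap(bytes[i], bytes[7 - i]);
  }
  double value;
  std::memcpy(&value, bytes, 8);
  return value;
}
```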
#### IPC Byte-Identical Tests
- The IPC message byte-identical test is updated to a canonical s390x snapshot, with hex diffs printed on mismatch for future maintainability (sketch below).
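A minimal sketch of the kind of hex-diff output described above (hypothetical helper, not the PR's exact code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical sketch: report mismatching bytes between the expected
// snapshot and the actual buffer so failures are diagnosable from CI logs.
void HexDiff(const uint8_t* expected, const uint8_t* actual, size_t n) {
  for (size_t i = 0; i < n; ++i) {
    if (expected[i] != actual[i]) {
      std::printf("mismatch at offset %zu: expected 0x%02x, got 0x%02x\n",
                  i, expected[i], actual[i]);
    }
  }
}
```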
#### Test Fixture Corrections
- All affected tests (INT96, page index, Float16, dictionary/BYTE_ARRAY, dataset stats, geospatial, bit utilities) updated to serialize expected values via `ToLittleEndianValue`.
#### OOM Test Stability
- Synthetic OOM tests clamp allocations to `1 << 48` bytes and skip mimalloc to avoid backend-level aborts on large-memory s390x hosts (see the sketch after this list).
- Tests still verify Arrow's `OutOfMemory` behavior for supported allocators.
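A hedged sketch of the test-side pattern using GoogleTest and Arrow's memory-pool API; the test name is hypothetical and the PR's actual test may be structured differently:

```cpp
#include <cstdint>
#include <gtest/gtest.h>

#include "arrow/memory_pool.h"
#include "arrow/testing/gtest_util.h"  // for ASSERT_RAISES

// Sketch of the pattern described above; names are illustrative.
TEST(MemoryPoolTest, OutOfMemorySyntheticAllocation) {
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  // Skip mimalloc: its backend may abort instead of returning an error
  // for absurdly large requests on large-memory hosts.
  if (pool->backend_name() == "mimalloc") {
    GTEST_SKIP() << "mimalloc may abort on synthetic huge allocations";
  }
  // Clamp the synthetic request to 1 << 48 bytes: far beyond physical
  // memory, but within what allocators reject gracefully.
  const int64_t huge = int64_t(1) << 48;
  uint8_t* data = nullptr;
  ASSERT_RAISES(OutOfMemory, pool->Allocate(huge, &data));
}
```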
#### Debugging Enhancements
- Optional debug macros (`PARQUET_DEBUG_DELTA_BITPACK`, `PARQUET_DEBUG_BYTE_STREAM_SPLIT`) summarize miniblock widths, deltas, Arrow-buffer conversion, and page behavior without flooding logs; a sketch of the guard pattern follows.
- VLQ/ZigZag and miniblock debug helpers consolidated and corrected.
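For illustration, one common way such compile-time-guarded debug output is structured (a sketch; the actual macros in the PR may differ):

```cpp
#include <cstdio>

// Sketch: debug logging compiled out entirely unless the macro is defined,
// so release builds pay no cost and logs are not flooded by default.
#ifdef PARQUET_DEBUG_DELTA_BITPACK
#define DELTA_BITPACK_DEBUG(fmt, ...) \
  std::fprintf(stderr, "[delta-bitpack] " fmt "\n", __VA_ARGS__)
#else
#define DELTA_BITPACK_DEBUG(fmt, ...) ((void)0)
#endif

void OnMiniblockDecoded(int bit_width, long num_values) {
  DELTA_BITPACK_DEBUG("miniblock width=%d values=%ld", bit_width, num_values);
}
```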
⸻
### Are these changes tested?

Yes.
- All Parquet and Arrow C++ test suites now pass on s390x, including:
  - parquet-reader/writer
  - parquet-chunker
  - parquet-arrow reader/writer
  - page index tests
  - ByteStreamSplit/DeltaBitPack encoding/decoding
  - geospatial/WKB tests
  - IPC message tests
  - Arrow compute, bit utilities, and dataset stats
- INT96, page index, Float16, RLE, dictionary, and decimal cases validated against canonical byte expectations.
- OOM tests exercise the OOM path where applicable; mimalloc cases are safely skipped.
- Manual verification performed for canonical IPC bytes and expected Parquet page layouts.
⸻
### Are there any user-facing changes?

No API changes.

Behavior changes on big-endian platforms:
- Parquet files written on s390x are now canonical little-endian (correct and interoperable), fixing previous corruption and double-swap decode bugs.
- IPC message bytes are now stable and match canonical output.
- Synthetic OOM tests skip when mimalloc is in use.

No public interfaces or APIs were removed or modified.
⸻
### Breaking API Changes?

No. No public API or user-visible structure was altered.
⸻
### Does this PR contain a “Critical Fix”?

Yes.

This PR resolves multiple correctness bugs that previously caused:
- Wrong bytes written to Parquet pages (statistics, INT96, level headers, dictionary, ByteStreamSplit, DeltaBitPack).
- Incorrect min/max values and mis-sorted row groups.
- Invalid or truncated binary pages on big-endian hosts.
- Incorrect decoding of INT96 timestamps, floats/doubles, DeltaBitPack blocks, and page index metadata.
- Crashes in synthetic OOM tests (allocator aborts).

These issues produce invalid or non-portable Parquet files, mis-decoded data, and crashes, qualifying as critical correctness and stability fixes.
### Commit structure

This branch is rebased on top of the Arrow 22.0.0 release series; the first 18 commits in this PR are upstream commits pulled from `main` to get a clean baseline. All s390x-related functional changes are contained in the final commit: `7d6941d` (arrow_endianness_fix). Reviewers who prefer to focus only on the new logic can start from that commit.
### Relationship to other s390x PRs

There are several narrower s390x-related PRs that touch overlapping areas. This PR is intended as the comprehensive, end-to-end fix for big-endian (s390x) correctness: it introduces a single shared shim header (`parquet/endian_internal.h`) and a CMake option (`ARROW_ENSURE_S390X_ENDIANNESS`) so that all Parquet I/O paths consistently produce canonical little-endian bytes on big-endian hosts.

I'm very happy to collaborate on converging this into the final shape the Arrow maintainers prefer (single PR or split), but the goal is to have one coherent, invariants-based solution rather than a series of partial patches that might interact in surprising ways.
Closes #48213.