@ArnavBalyan
Member

@ArnavBalyan ArnavBalyan commented Nov 23, 2025

Rationale for this change

  • Adds support for FSST Encoding for Parquet

What changes are included in this PR?

  • FSST Encoder/Decoder and third party dependencies for Parquet

Are these changes tested?

  • Unit tests, tested with live jobs

Are there any user-facing changes?

  • Yes, new encoding
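
For readers unfamiliar with FSST, the idea behind the encoder/decoder can be sketched roughly as follows. FSST replaces frequent substrings (symbols of 1 to 8 bytes) with one-byte codes and reserves code 255 as an escape that is followed by one literal byte. The names below (SymbolTable, Encode, Decode) are hypothetical and are not the API of cwida/fsst or of this PR. The symbol table is written by hand here, whereas the real library builds it from a sample of the data and uses a much faster matching strategy than this linear scan.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Illustrative sketch only: hypothetical names, not the cwida/fsst API.
constexpr uint8_t kEscape = 255;  // code reserved to introduce a literal byte

// Code i (0..254) expands to symbols[i], a byte string of length 1 to 8.
struct SymbolTable {
  std::vector<std::string> symbols;
};

// Greedy encoder: at each position emit the code of the longest matching
// symbol, or an escape followed by the literal byte when nothing matches.
std::vector<uint8_t> Encode(const SymbolTable& table, std::string_view input) {
  std::vector<uint8_t> out;
  std::size_t pos = 0;
  while (pos < input.size()) {
    int best_code = -1;
    std::size_t best_len = 0;
    for (std::size_t code = 0; code < table.symbols.size(); ++code) {
      const std::string& sym = table.symbols[code];
      if (sym.size() > best_len && input.compare(pos, sym.size(), sym) == 0) {
        best_code = static_cast<int>(code);
        best_len = sym.size();
      }
    }
    if (best_code >= 0) {
      out.push_back(static_cast<uint8_t>(best_code));
      pos += best_len;
    } else {
      out.push_back(kEscape);
      out.push_back(static_cast<uint8_t>(input[pos]));
      pos += 1;
    }
  }
  return out;
}

// Decoder: each code is a table lookup; an escape copies the next byte as is.
std::string Decode(const SymbolTable& table, const std::vector<uint8_t>& codes) {
  std::string out;
  for (std::size_t i = 0; i < codes.size(); ++i) {
    const uint8_t code = codes[i];
    if (code == kEscape) {
      if (++i == codes.size()) throw std::invalid_argument("truncated escape");
      out.push_back(static_cast<char>(codes[i]));
    } else if (code < table.symbols.size()) {
      out += table.symbols[code];
    } else {
      throw std::invalid_argument("code not present in symbol table");
    }
  }
  return out;
}
```

With a table containing "http://", "www." and ".com", the 16-byte string "http://www.a.com" encodes to 5 bytes: three symbol codes plus one escaped literal. Decoding needs only the table and a lookup per code, which is why a single value can be decoded without decompressing a whole block.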

@ArnavBalyan ArnavBalyan marked this pull request as draft November 23, 2025 14:08
@github-actions

⚠️ GitHub issue #48231 has been automatically assigned in GitHub to PR creator.

@ArnavBalyan ArnavBalyan changed the title from "GH-48231 [C++] [Parquet] Add FSST encoding support for Parquet" to "GH-48231 [C++][Parquet] Add FSST encoding support for Parquet" Nov 23, 2025
@ArnavBalyan
Member Author

cc @julienledem @emkornfield. I will fix the builds soon, thanks.

@wgtmac
Member

wgtmac commented Nov 25, 2025

Thanks for creating the PoC implementation! I haven't yet checked the details of the FSST algorithm. IMHO it is generally fine to depend directly on https://github.com/cwida/fsst for a PoC and benchmarks. I'm not sure how much effort it would take to write our own FSST implementation. We care about maintainability and are strict about adding new third-party dependencies, especially since we already depend on xsimd for vectorization.

cc @pitrou

@ArnavBalyan
Member Author

Thanks for taking a look. Yes, this is something we briefly discussed in the Parquet sync; in general, https://github.com/cwida/fsst should be reliable for FSST. Re-implementing it might require some duplication, so I will discuss in the Parquet sync whether we can get consensus on the dependency.

@emkornfield
Contributor

> Re-implementing it might require some duplication, will discuss in Parquet sync if we can get a consensus on the dependency.

Apologies, I haven't had a chance to look at this yet, but a reminder that the sync is not an official place to reach consensus (official decisions should be discussed and finalized on the mailing list). Another option is to vendor/copy most of the FSST library into the source tree. This also impacts Arrow, so it should probably be brought up on both mailing lists.

@ArnavBalyan
Member Author

ArnavBalyan commented Nov 25, 2025

> > Re-implementing it might require some duplication, will discuss in Parquet sync if we can get a consensus on the dependency.
>
> Apologies, I haven't had a chance to look at this yet, but a reminder that the sync is not an official place to reach consensus (official decisions should be discussed and finalized on the mailing list). Another option is to vendor/copy most of the FSST library into the source tree. This also impacts Arrow, so it should probably be brought up on both mailing lists.

Sure, that works too! I just wanted to get consensus with the community, so I will start a mailing-list thread instead. Let me check the vendor/copy option; it should be 6-7 files from FSST if we opt to duplicate the relevant code.

@ArnavBalyan ArnavBalyan marked this pull request as ready for review November 25, 2025 11:43
@ArnavBalyan
Member Author

Eliminated the direct dependency on fsst, and that is working well. I will check the community feedback on the email thread and update if needed.

@ArnavBalyan
Member Author

cc @wgtmac could you please re-run the test? Just checking that the failure isn't related to fsst by any chance, thanks!

Member

If we bundle a dependency (copy to cpp/src/arrow/vendored/), we don't need to change this file.
See also:

set(ARROW_VENDORED_SRCS
    vendored/base64.cpp
    vendored/datetime.cpp
    vendored/double-conversion/bignum-dtoa.cc
    vendored/double-conversion/bignum.cc
    vendored/double-conversion/cached-powers.cc
    vendored/double-conversion/double-to-string.cc
    vendored/double-conversion/fast-dtoa.cc
    vendored/double-conversion/fixed-dtoa.cc
    vendored/double-conversion/string-to-double.cc
    vendored/double-conversion/strtod.cc
    vendored/musl/strptime.c
    vendored/uriparser/UriCommon.c
    vendored/uriparser/UriCompare.c
    vendored/uriparser/UriEscape.c
    vendored/uriparser/UriFile.c
    vendored/uriparser/UriIp4.c
    vendored/uriparser/UriIp4Base.c
    vendored/uriparser/UriMemory.c
    vendored/uriparser/UriNormalize.c
    vendored/uriparser/UriNormalizeBase.c
    vendored/uriparser/UriParse.c
    vendored/uriparser/UriParseBase.c
    vendored/uriparser/UriQuery.c
    vendored/uriparser/UriRecompose.c
    vendored/uriparser/UriResolve.c
    vendored/uriparser/UriShorten.c)
if(APPLE)
  list(APPEND ARROW_VENDORED_SRCS vendored/datetime/ios.mm)
endif()
set_source_files_properties(vendored/datetime.cpp PROPERTIES SKIP_UNITY_BUILD_INCLUSION
                                                             ON)
arrow_add_object_library(ARROW_VENDORED ${ARROW_VENDORED_SRCS})
# Disable DLL exports in vendored uriparser library
foreach(ARROW_VENDORED_TARGET ${ARROW_VENDORED_TARGETS})
  target_compile_definitions(${ARROW_VENDORED_TARGET} PRIVATE URI_STATIC_BUILD)
endforeach()

Member

If we want to bundle a dependency, could you use cpp/src/arrow/vendored/ instead of cpp/thirdparty/?
https://github.com/apache/arrow/tree/main/cpp/src/arrow/vendored

@github-actions github-actions bot added the "awaiting changes" label and removed the "awaiting review" label Nov 27, 2025
@emkornfield
Contributor

I think before we spend a lot of time reviewing this we should try to close out on the overall design on the parquet mailing list. Could we maybe mark this as a draft?

@ArnavBalyan ArnavBalyan marked this pull request as draft November 28, 2025 06:22