-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-48231 [C++][Parquet] Add FSST encoding support for Parquet #48232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
|
cc @julienledem @emkornfield will fix builds soon thanks |
|
Thanks for creating the PoC implementation! I haven't yet checked the detail about the FSST algorithm. IMHO it is generally fine to directly depend on https://github.com/cwida/fsst for PoC and benchmark. I'm not sure how much effort is required to write our own FSST implementation. We care about maintainability and are strict with adding a new 3rd party dependency, especially when we have already depended on cc @pitrou |
|
Thanks for taking a look, yeah this is something we briefly discussed in the Parquet sync, generally https://github.com/cwida/fsst should be reliable for FSST. Re-implementing it might require some duplication, will discuss in Parquet sync if we can get a consensus on the dependency. |
Apologies, i haven't had a chance to look at this yet, but a reminder the sync is not an official place to come to consensus (official decisions should be discussed and finalized on the mailing list). Another option is to vendor/copy most of the FSST library in the source tree. This also impacts Arrow should probably be brought up on both mailing lists. |
Sure that works too! Just wanted to get a consensus with the community, will start a mail thread instead. Let me check the vendor/copy option, should be 6-7 files from FSST if we opt to duplicate relevant code. |
|
Eliminated the direct dependency on fsst, which is working well. Will check on email thread the feedback from community, and update if needed. |
|
cc @wgtmac could you please re-run the test. Just checking if it's not related by fsst by any chance thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we bundle a dependency (copy to cpp/src/arrow/vendored/), we don't need to change this file.
See also:
arrow/cpp/src/arrow/CMakeLists.txt
Lines 447 to 483 in 79d6458
| set(ARROW_VENDORED_SRCS | |
| vendored/base64.cpp | |
| vendored/datetime.cpp | |
| vendored/double-conversion/bignum-dtoa.cc | |
| vendored/double-conversion/bignum.cc | |
| vendored/double-conversion/cached-powers.cc | |
| vendored/double-conversion/double-to-string.cc | |
| vendored/double-conversion/fast-dtoa.cc | |
| vendored/double-conversion/fixed-dtoa.cc | |
| vendored/double-conversion/string-to-double.cc | |
| vendored/double-conversion/strtod.cc | |
| vendored/musl/strptime.c | |
| vendored/uriparser/UriCommon.c | |
| vendored/uriparser/UriCompare.c | |
| vendored/uriparser/UriEscape.c | |
| vendored/uriparser/UriFile.c | |
| vendored/uriparser/UriIp4.c | |
| vendored/uriparser/UriIp4Base.c | |
| vendored/uriparser/UriMemory.c | |
| vendored/uriparser/UriNormalize.c | |
| vendored/uriparser/UriNormalizeBase.c | |
| vendored/uriparser/UriParse.c | |
| vendored/uriparser/UriParseBase.c | |
| vendored/uriparser/UriQuery.c | |
| vendored/uriparser/UriRecompose.c | |
| vendored/uriparser/UriResolve.c | |
| vendored/uriparser/UriShorten.c) | |
| if(APPLE) | |
| list(APPEND ARROW_VENDORED_SRCS vendored/datetime/ios.mm) | |
| endif() | |
| set_source_files_properties(vendored/datetime.cpp PROPERTIES SKIP_UNITY_BUILD_INCLUSION | |
| ON) | |
| arrow_add_object_library(ARROW_VENDORED ${ARROW_VENDORED_SRCS}) | |
| # Disable DLL exports in vendored uriparser library | |
| foreach(ARROW_VENDORED_TARGET ${ARROW_VENDORED_TARGETS}) | |
| target_compile_definitions(${ARROW_VENDORED_TARGET} PRIVATE URI_STATIC_BUILD) | |
| endforeach() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we want to bundle a dependency, could you use cpp/src/arrow/vendored/ instead of cpp/thirdparty/?
https://github.com/apache/arrow/tree/main/cpp/src/arrow/vendored
|
I think before we spend a lot of time reviewing this we should try to close out on the overall design on the parquet mailing list. Could we maybe mark this as a draft? |
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?