Enable split_file_groups_by_statistics by default #10336
Comments
Example test coverage we should add I think: #9593 (comment) |
I'd like to help with this. 🙌 |
Thank you @yyy1000 🙏 I think a good place to start would be to write some sqllogictest-level tests to cover the important cases. Perhaps for the first test:
I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt cc @suremarc |
One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded ( The consequence to this issue being that turning this on by default would regress performance for users that have |
Leaving some thoughts here as I was asked in another issue about what it would take to turn this feature on, and I don't want to take over that thread --
|
FYI, I'm working on it. |
Should this issue have been closed? Did #15473 change default behaviour? |
|
I think one of the asks in the original post was additional tests. Some of the asks are already covered in the sqllogictest (parquet_sorted_statistics.slt) and some are not, so I'll try to summarize here:

Case 1: Flexible file schemas
As far as I know this isn't covered in any tests. Based on my understanding it shouldn't break anything, but obviously we'd love to have that verified in a test 😄

Case 2: Order by subset of columns
This is covered in basically every single query in the sqllogictest, so I think this is fine.

Case 3: Order by non-ORDER BY columns
I believe this is missing. If I understand correctly, the expected behavior here is failure.

Case 4: Files start out of order
I think this is probably covered by the sqllogictests, specifically the ones doing descending ordering. However, it should be pretty easy to add a single new test case to the unit tests (not sqllogictests) for --

I realize we're eager to get this feature out, but I think this is one of the first optimizations that rely on statistics for correctness, so it's important we get this right and ensure a healthy amount of tests are in place. cc @alamb as I know you asked specifically for these tests |
Also, there are two other issues I'd like to call out: Unit tests for
|
Oh, checked the issue again and got it lol |
Is your feature request related to a problem or challenge?
In #9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge if the sort key ranges do not overlap.
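To make the idea concrete, here is a minimal sketch of grouping files by non-overlapping sort key ranges. It is not DataFusion's implementation: the `FileRange` type, the `split_groups_by_ranges` function, the integer keys, and the strict `<` comparison are all assumptions chosen for illustration. Files are ordered by their minimum key and placed first-fit into a group whose last file ends before the new file starts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRange:
    """Hypothetical stand-in for a file's min/max statistics on the sort key."""
    name: str
    min_key: int
    max_key: int


def split_groups_by_ranges(files):
    """First-fit grouping so that ranges within each group never overlap.

    Each returned group can be read file after file and still yield sorted
    output, provided every file is itself sorted on the key.
    """
    groups = []
    for f in sorted(files, key=lambda f: (f.min_key, f.max_key)):
        for group in groups:
            # Assumption: strict '<', so files sharing a boundary value
            # are treated as overlapping and kept in separate groups.
            if group[-1].max_key < f.min_key:
                group.append(f)
                break
        else:
            # No existing group can take this file without an overlap.
            groups.append([f])
    return groups


# Files listed out of order; b.parquet overlaps a.parquet.
files = [
    FileRange("c.parquet", 20, 29),
    FileRange("a.parquet", 0, 9),
    FileRange("b.parquet", 5, 15),
    FileRange("d.parquet", 16, 18),
]
groups = split_groups_by_ranges(files)
group_names = [[f.name for f in g] for g in groups]
print(group_names)
```

In this sketch, a dataset whose files never overlap collapses into a single group, which is one way to picture the loss of scan parallelism reported in the comments above.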
This feature is behind a feature flag, split_file_groups_by_statistics, which defaults to false, as I think there needs to be some more tests in place before we turn it on.
Describe the solution you'd like
Add additional tests and then enable split_file_groups_by_statistics by default.
Describe alternatives you've considered
No response
Additional context
No response