Skip to content

Enable split_file_groups_by_statistics by default #10336

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
Tracked by #10313 ...
alamb opened this issue May 1, 2024 · 12 comments · Fixed by #15473
Open
Tracked by #10313 ...

Enable split_file_groups_by_statistics by default #10336

alamb opened this issue May 1, 2024 · 12 comments · Fixed by #15473
Assignees
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented May 1, 2024

Is your feature request related to a problem or challenge?

In #9593, @suremarc added a way to reorganize input files in a ListingTable to avoid a merge, if the sort key ranges do not overlap

This feature is behind a feature flag, split_file_groups_by_statistics which defaults to false as I think there needs to be some more tests in place before we turn it on

Describe the solution you'd like

Add additional tests and then enable split_file_groups_by_statistics by default

Describe alternatives you've considered

No response

Additional context

No response

@alamb alamb added the enhancement New feature or request label May 1, 2024
@alamb
Copy link
Contributor Author

alamb commented May 1, 2024

Example test coverage we should add I think: #9593 (comment)

@yyy1000
Copy link
Contributor

yyy1000 commented May 4, 2024

I'd like to help it. 🙌

@alamb
Copy link
Contributor Author

alamb commented May 4, 2024

THank you @yyy1000 🙏

I think a good place to start would be to write some sqllogic level tests to cover the important cases

Perhaos for the first test:

  1. Create files: file1.parquet, file2.parquet both sorted on a but file 1 has the columns in the order a, b, c and file has the columns in the order c, b, a. The keyranges of values of a should be non overlapping
  2. Create an external table a, b, c with explicit order by a, and then query SELECT ... ORDER BY a and make sure the output plan doesn't use sort preserving merge

I think we could extend https://github.com/apache/datafusion/blob/main/datafusion/sqllogictest/test_files/parquet_sorted_statistics.slt

cc @suremarc

@leoyvens
Copy link
Contributor

One thing I've noticed is that after DataFusion 40 this actually works in my use case, likely thanks to the statistics code getting fixed, so good news there! It does require additionally setting execution.collect_statistics = true, which makes sense.

However for my entirely sorted and non-overlapping dataset it did make Parquet scanning single-threaded (ParquetScan with a single file group), which is a big performance regression. So it didn't really help me, maybe I actually want #10316.

The consequence to this issue being that turning this on by default would regress performance for users that have execution.collect_statistics = true. Maybe the flag should be merged with prefer_existing_sort, which has the semantics of avoiding sorts at the cost of limiting parallelism. Or maybe just wait for #10316, so we can both avoid the sort and still have a parallel ParquetExec.

@alamb
Copy link
Contributor Author

alamb commented Jul 29, 2024

Sorry for the delay @leoyvens and thank you for this analysis

#11170

I would personally love to take this approach

@suremarc
Copy link
Contributor

suremarc commented Mar 27, 2025

Leaving some thoughts here as I was asked in another issue about what it would take to turn this feature on, and I don't want to take over that thread --

  1. As pointed out in @leoyvens's comment, the current impl is too aggressive about limiting parallelism, so we will need to modify the first fit algorithm to distribute into at least k groups. This can maybe be done if we initialize k empty groups and then use some sort of priority queue in the first fit algorithm that prioritizes the smallest groups.
  2. I am worried about the performance of this feature for large ListingTables with tens or hundreds of thousands of files. Collecting statistics for that many files and sorting by the minimum values is going to take up a measurable amount of planning time at that scale, so we may cause a performance regression. IMO, we need to set up a benchmark on a large ListingTable to quantify the impact.

@xudong963
Copy link
Member

Fyi, I'm working on it.

@leoyvens
Copy link
Contributor

leoyvens commented Apr 4, 2025

Should this issue have been closed? Did #15473 change default behaviour?

@xudong963
Copy link
Member

xudong963 commented Apr 5, 2025

I'll open a follow-up PR to make it default

@xudong963 xudong963 reopened this Apr 5, 2025
@suremarc
Copy link
Contributor

suremarc commented Apr 5, 2025

I'll open a follow-up PR to make it default

I think one of the asks in the original post was additional tests. I think some of the asks are already covered in the sqllogictest (parquet_sorted_statistics.slt), some not, so I'll try to summarize here:

Case 1: Flexible file schemas

The schema of the files is different but compatible (e.g. one file as (time, date, symbol) but the other file had (date, symbol, time) for example (source)

Create files: file1.parquet, file2.parquet both sorted on a but file 1 has the columns in the order a, b, c and file has the columns in the order c, b, a. The keyranges of values of a should be non overlapping (source)

As far as I know this isn't covered in any tests, based on my understanding it shouldn't break anything but obviously we'd love to have that verified in a test 😄

Case 2: Order by subset of columns

The query orders by a subset of the columns (e.g. ORDER BY time) (source)

This is covered in basically every single query in the sqllogictest, so I think this is fine.

Case 3: Order by non-ORDER BY columns

The query orders by a subset of the columns that is not the sort order (ORDER BY date) (source)

I believe this is missing, if I understand correctly expected behavior here is failure.

Case 4: Files start out of order

I think all these tests also always have the first file with the minimum stastistics value -- can you possibly also test what happens when it is not (aka add a test that runs this test with file ids 2, 1, 0)? (source)

I think this is probably covered by the sqllogictests, specifically the ones doing descending ordering. However it should be pretty easy to add a single new test case to the unit tests (not sqllogictests) for FileScanConfig::split_groups_by_statistics, which are located here

--

I realize we're eager to get this feature out, but I think this is one of the first optimizations that rely on statistics for correctness, so it's important we get this right and ensure a healthy amount of tests are in place.

cc @alamb as I know you asked specifically for these tests

@suremarc
Copy link
Contributor

suremarc commented Apr 5, 2025

Also, there are two other issues I'd like to call out:

Unit tests for FileScanConfig::split_groups_by_statistics_with_target_partitions

There are some table-driven unit tests for FileScanConfig::split_groups_by_statistics located here but they don't cover the new FileScanConfig::split_groups_by_statistics_with_target_partitions method which is the one that is actually called after #15473. I think we could extend the test in the following manner:

  1. Run FileScanConfig::split_groups_by_statistics
  2. Count the number of file groups produced, call it N. This is the minimum number of file groups needed
  3. Run FileScanConfig::split_groups_by_statistics_with_target_partitions with N as the target partitions. This should produce the exact same result

Inexact statistics

After rereading the code, I do not think the MinMaxStatistics data structure actually differentiates between exact and inexact statistics. Technically this is fine as long as the min/max values are treated as conservative estimates, which I believe is how they are used in the FileScanConfig-level statistics with all of the built-in file formats. However, it seems possible that someone could implement their own DataSource/FileSource that doesn't respect this.

TBH I am tempted to consider such a case a bug on the user's side, but it is technically within the semantics of Statistics as of today. So we would need to wait for migration to StatisticsV2 (migrating from Precision to Distribution)

@xudong963
Copy link
Member

I think one of the asks in the original post was additional tests.

Oh, checked the issue again and got it lol

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants