
WIP: example lexical range handling #63


Draft · wiedld wants to merge 4 commits into main from wiedld/lexical-range-code

Conversation

@wiedld (Collaborator) commented on Apr 1, 2025:

Sharing some code showing how we extract the lexical ranges and use them to reorder (ungrouped) partitions.

It's likely more complex than needed, since we're handling legacy (in-house) bits and we've been constraining ourselves to currently available (as of version 44) APIs. I made some minor tweaks to the code to get it compiling and tests passing.

This is not intended as a proposed solution. Rather, a description of the problem space and how we've been solving it (within our constraints). I think we all want a better upstream solution 🙏🏼 .

```rust
// Presumably a PhysicalOptimizerRule::optimize impl (the diff starts mid-signature):
fn optimize(
    &self,
    plan: Arc<dyn ExecutionPlan>,
    _config: &datafusion_common::config::ConfigOptions,
) -> Result<Arc<dyn ExecutionPlan>> {
    plan.transform_up(|plan| {
```
@wiedld (Collaborator, Author):

This is the starting point of the code, providing an overview.
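
Since the snippet above is only a fragment, here is a minimal, self-contained illustration of the `transform_up` pattern it uses, assuming DataFusion's `TreeNode` API; the rule body is a placeholder, not the PR's actual matching logic:

```rust
use std::sync::Arc;

use datafusion_common::tree_node::{Transformed, TreeNode};
use datafusion_common::Result;
use datafusion_physical_plan::ExecutionPlan;

/// Walk the plan bottom-up, rewriting nodes that match some predicate.
fn rewrite_plan(plan: Arc<dyn ExecutionPlan>) -> Result<Arc<dyn ExecutionPlan>> {
    plan.transform_up(|node| {
        // Placeholder: a real rule would match (say) a SortPreservingMergeExec
        // whose input partitions have disjoint lexical ranges, and replace it.
        Ok(Transformed::no(node))
    })
    // `transform_up` returns a `Transformed` wrapper; unwrap the plan itself.
    .map(|t| t.data)
}
```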

Comment on lines +42 to +45
```rust
pub fn split_parquet_files(
    plan: Arc<dyn ExecutionPlan>,
    ordering_req: &LexOrdering,
) -> Result<Transformed<Arc<dyn ExecutionPlan>>> {
```
@wiedld (Collaborator, Author) commented on Apr 1, 2025:

Here is where we ungroup the file source, prior to any calculation of lexical ranges.

We do look at the lexical range at the file source to decide how to ungroup. But the plan then gets transformed in the DAG before it reaches the SortPreservingMergeExec (SPM). Therefore we re-calculate the lexical ranges seen at the SPM (which may get replaced with the ProgressiveEvalExec).
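
As a rough illustration of "ungrouping" (a sketch under assumptions, not the PR's code): assuming DataFusion 44's `FileScanConfig`, where `file_groups: Vec<Vec<PartitionedFile>>` and each inner `Vec` becomes one output partition, ungrouping flattens the groups so every file gets its own partition:

```rust
use datafusion::datasource::physical_plan::FileScanConfig;

/// Give every file its own partition so that each output partition
/// spans a single lexical range.
fn ungroup(mut config: FileScanConfig) -> FileScanConfig {
    config.file_groups = config
        .file_groups
        .drain(..)
        .flatten()
        .map(|file| vec![file]) // one file per output partition
        .collect();
    config
}
```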

Comment on lines +38 to +41
```rust
pub fn extract_disjoint_ranges_from_plan(
    exprs: &LexOrdering,
    input_plan: &Arc<dyn ExecutionPlan>,
) -> Result<Option<NonOverlappingOrderedLexicalRanges>> {
```
@wiedld (Collaborator, Author) commented on Apr 1, 2025:

In this function, the lexical ranges (per sort key) are extracted as plain min/max values per partition, without any notion yet of how we will order them.

It's then the NonOverlappingOrderedLexicalRanges that carries the sense of proper ordering.
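
A small, self-contained sketch of that two-step split (illustrative names, not the PR's API): first a plain (min, max) per partition, then a separate step that orders the partitions and checks that the ranges are disjoint.

```rust
use datafusion_common::ScalarValue;

/// Min/max of the sort key within one partition; no ordering implied yet.
struct LexicalRange {
    min: ScalarValue,
    max: ScalarValue,
}

/// Order partitions by their minimum value and verify the ranges don't
/// overlap; returns the partition permutation if so.
fn order_if_disjoint(ranges: &[LexicalRange]) -> Option<Vec<usize>> {
    let mut order: Vec<usize> = (0..ranges.len()).collect();
    // ScalarValue comparisons are partial; treat incomparable values as
    // equal for this sketch.
    order.sort_by(|&a, &b| {
        ranges[a]
            .min
            .partial_cmp(&ranges[b].min)
            .unwrap_or(std::cmp::Ordering::Equal)
    });
    // A boundary-equal max/min pair still concatenates in sorted order,
    // so `<=` suffices here.
    let disjoint = order
        .windows(2)
        .all(|w| ranges[w[0]].max <= ranges[w[1]].min);
    disjoint.then(|| order)
}
```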

Comment on lines 186 to 203
```rust
fn f_down(&mut self, node: &'n Self::Node) -> DatafusionResult<TreeNodeRecursion> {
    if !is_supported(node) {
        // Unsupported node: give up on statistics for this subtree.
        self.statistics = None;
        Ok(TreeNodeRecursion::Stop)
    } else if is_leaf(node) {
        // Leaf (data source): extract per-partition statistics directly.
        self.statistics = self.extract_from_data_source(node)?;
        Ok(TreeNodeRecursion::Stop)
    } else if should_merge_partition_statistics(node) {
        // Node that merges partitions: combine column statistics across them.
        self.statistics = self.merge_column_statistics_across_partitions(node)?;
        Ok(TreeNodeRecursion::Jump)
    } else if should_pass_thru_partition_statistics(node) {
        // Node that preserves partitioning: keep the per-partition statistics
        // gathered from its children.
        self.statistics =
            self.find_stats_per_partition_within_multiple_children(node)?;
        Ok(TreeNodeRecursion::Stop)
    } else {
        self.statistics = None;
        Ok(TreeNodeRecursion::Stop)
    }
```
@wiedld (Collaborator, Author) commented on Apr 1, 2025:

This is where I tried to generalize the rules for extracting partition min/maxes, without special-casing each node type.

I prefer the approach suggested by xudong of having callers use ExecutionPlan::statistics_per_partition.
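
For reference, a hedged sketch of the method shape being referred to (illustrative only; the API that actually lands upstream in apache#15852 may be named and typed differently):

```rust
use datafusion_common::{Result, Statistics};

/// Illustrative trait extension; the real API would live on ExecutionPlan.
pub trait PartitionStatistics {
    /// Statistics for each output partition of this node, in partition order.
    fn statistics_per_partition(&self) -> Result<Vec<Statistics>>;
}
```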

@xudong963:

I'm gonna merge PR apache#15852, then you can update this PR to use it.

@wiedld (Collaborator, Author):

Thanks @xudong963 .

I ended up quickly porting over our own statistics_per_partition code, since it handles a few use cases we need. (Note: it doesn't handle joins and a few other use cases, which you have covered.)

I pushed it all up anyway, for the purposes of sharing code.

@wiedld force-pushed the wiedld/lexical-range-code branch 3 times, most recently from 5e05a0f to 4871fdd on May 8, 2025 at 19:09
@wiedld (Collaborator, Author) commented on May 8, 2025:

The dependency pathing is a bit weird (and wrong), since I'm just quickly porting over code from our internal codebase. An actual implementation should not have this issue.

Test cases are all passing locally. I'm not bothering to debug the one test case that fails only in remote CI; hopefully we can fix that as we turn this into anything real.

```rust
/// Merge a collection of [`Statistics`] into a single stat.
///
/// This takes statistics references, which may or may not be arc'ed.
pub(super) fn merge_stats_collection<
```


DataFusion has similar logic now.
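
For context, the merge rule itself is simple; here is a self-contained sketch with toy types (not DataFusion's `Statistics`): min of mins, max of maxes, sum of row counts.

```rust
/// Toy per-partition column stats for illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ColStats {
    min: i64,
    max: i64,
    rows: u64,
}

/// Merge a collection of per-partition stats into one.
fn merge_stats(stats: impl IntoIterator<Item = ColStats>) -> Option<ColStats> {
    stats.into_iter().reduce(|a, b| ColStats {
        min: a.min.min(b.min),
        max: a.max.max(b.max),
        rows: a.rows + b.rows,
    })
}
```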

@xudong963 left a comment:

Thanks for the excellent work!

Maybe we can split the PR into parts and port them upstream separately, such as the ProgressiveEvalExec node, the InsertProgressiveEval rule, and the statistics features that are needed. (I can help add the statistics features.)
