Refactor EnforceDistribution test cases to demonstrate dependencies across optimizer runs. #15074

wiedld · 2025-03-07T21:11:31Z

Which issue does this PR close?

Part of Improve EnforceDistribution testings. #15003
Follow on to Refactor test suite in EnforceDistribution, to use standard test config. #15010

Rationale for this change

Enable us to write test cases using:

the same TestConfig
then apply the optimizer runs in any order chosen

The benefits of this approach are:

explicitly demonstrate when there is an ordering dependency, i.e. which optimizer has to run first (sorting or distribution)
quickly enable us to write test cases addressing idempotency of an optimizer run
(hopefully) help identify strengths/gaps in the current test coverage. 🙏🏼
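As a rough illustration of the idea above (one shared config, runs applied in whatever order the test chooses), here is a minimal sketch. `ToyConfig`, its `target_partitions` field, and the list-of-strings "plan" are hypothetical stand-ins invented for this example; only the `Run` variant names come from the PR, and the real `TestConfig::run` operates on `Arc<dyn ExecutionPlan>`.

```rust
/// Which optimizer pass to apply (variant names mirror the `Run` enum in the PR).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Distribution,
    Sorting,
}

/// Hypothetical stand-in for `TestConfig`: settings shared by every run.
#[derive(Debug, Clone)]
struct ToyConfig {
    target_partitions: usize,
}

impl ToyConfig {
    /// Apply `runs` in order to a toy plan (a list of operator names) and
    /// return the resulting plan so the caller can feed it into another call.
    fn run(&self, mut plan: Vec<String>, runs: &[Run]) -> Vec<String> {
        for run in runs {
            let op = match run {
                Run::Distribution => format!("Repartition({})", self.target_partitions),
                Run::Sorting => "Sort".to_string(),
            };
            // Each toy pass adds its operator at most once.
            if !plan.contains(&op) {
                plan.push(op);
            }
        }
        plan
    }
}
```

The same `ToyConfig` value can be reused with `&[Run::Distribution, Run::Distribution, Run::Sorting]` or `&[Run::Sorting, Run::Distribution, Run::Distribution]`, which is the property the refactor is after.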

What changes are included in this PR?

The above, as well as a small increase in test coverage to demonstrate exactly when our test cases (the plan results we see) are ordering dependent.

Are these changes tested?

Yes

Are there any user-facing changes?

No

…and highlight when the same testing setup, but different ordering of optimizer runs, affect the outcome.

wiedld · 2025-03-07T21:12:11Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+    /// Perform a series of runs using the current [`TestConfig`],
+    /// assert the expected plan result,
+    /// and return the result plan (for potential subsequent runs).
+    fn run(
+        &self,
+        expected_lines: &[&str],
+        plan: Arc<dyn ExecutionPlan>,
+        optimizers_to_run: Vec<Run>,
+    ) -> Result<Arc<dyn ExecutionPlan>> {


This returns the plan so that we can easily write test cases for idempotency.

let plan_after_first_run = test_config.run(
    expected,
    plan,
    vec![Run::Distribution],
)?;
let plan_after_second_run = test_config.run(
    expected,
    plan_after_first_run.clone(),
    vec![Run::Distribution], // exact same run again
)?;
assert_eq!(
    get_plan_string(&plan_after_first_run),
    get_plan_string(&plan_after_second_run),
);
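The idempotency pattern in the snippet above can be sketched in isolation. This is not code from the PR: `is_idempotent` and the two toy passes are hypothetical names, and the "plan" is just a vector of operator names rather than an `ExecutionPlan`.

```rust
/// Returns true when applying `pass` a second time leaves the result unchanged.
fn is_idempotent<T, F>(pass: F, input: T) -> bool
where
    T: Clone + PartialEq,
    F: Fn(T) -> T,
{
    let after_first = pass(input);
    let after_second = pass(after_first.clone());
    after_first == after_second
}

/// Hypothetical pass: adds a repartition step only if one is missing.
fn add_repartition_if_missing(mut plan: Vec<&'static str>) -> Vec<&'static str> {
    if !plan.contains(&"Repartition") {
        plan.push("Repartition");
    }
    plan
}

/// Hypothetical buggy pass: adds a repartition step unconditionally,
/// so every extra run introduces an unnecessary operator.
fn always_add_repartition(mut plan: Vec<&'static str>) -> Vec<&'static str> {
    plan.push("Repartition");
    plan
}
```

Comparing the output of the first and second run, rather than comparing against a fixed expectation, is what lets the check catch a pass that keeps adding operators.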

@xudong963 may find this new test setup useful.

wiedld · 2025-03-07T21:12:42Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+const DISTRIB_DISTRIB_SORT: [Run; 3] =
+    [Run::Distribution, Run::Distribution, Run::Sorting];


Equivalent to main:

datafusion/datafusion/core/tests/physical_optimizer/enforce_distribution.rs

Lines 459 to 469 in 9a4c9d5

// Run enforce distribution rule first:
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// The rule should be idempotent.
// Re-running this rule shouldn't introduce unnecessary operators.
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// Run the enforce sorting rule:
let optimizer = EnforceSorting::new();
let optimized = optimizer.optimize(optimized, &config)?;
optimized
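To make the claimed equivalence concrete, the sketch below drives toy passes from a constant array and compares the result with the hand-written sequence. The constant and the `Run` variants follow the PR; `enforce_distribution`, `enforce_sorting`, `apply`, and `hand_written` are hypothetical toy functions over a string "plan", not DataFusion APIs. The `.into()` call relies on the standard `From<[T; N]> for Vec<T>` conversion.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Distribution,
    Sorting,
}

const DISTRIB_DISTRIB_SORT: [Run; 3] =
    [Run::Distribution, Run::Distribution, Run::Sorting];

/// Toy pass: wrap the plan in a repartition unless one is already present.
fn enforce_distribution(plan: String) -> String {
    if plan.contains("Repartition") {
        plan
    } else {
        format!("Repartition <- {}", plan)
    }
}

/// Toy pass: put a sort on top unless the plan already ends in one.
fn enforce_sorting(plan: String) -> String {
    if plan.starts_with("Sort") {
        plan
    } else {
        format!("Sort <- {}", plan)
    }
}

/// Data-driven version: the order of passes comes from `runs`.
fn apply(plan: String, runs: Vec<Run>) -> String {
    runs.into_iter().fold(plan, |plan, run| match run {
        Run::Distribution => enforce_distribution(plan),
        Run::Sorting => enforce_sorting(plan),
    })
}

/// Hand-written version, shaped like the sequence quoted from main.
fn hand_written(plan: String) -> String {
    let plan = enforce_distribution(plan);
    let plan = enforce_distribution(plan);
    enforce_sorting(plan)
}
```

Calling `apply("Scan".to_string(), DISTRIB_DISTRIB_SORT.into())` yields the same string as `hand_written("Scan".to_string())`, with the run order visible at the call site instead of hidden inside a helper.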

wiedld · 2025-03-07T21:13:00Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+const SORT_DISTRIB_DISTRIB: [Run; 3] =
+    [Run::Sorting, Run::Distribution, Run::Distribution];


Equivalent to main:

datafusion/datafusion/core/tests/physical_optimizer/enforce_distribution.rs

Lines 471 to 481 in 9a4c9d5

// Run the enforce sorting rule first:
let optimizer = EnforceSorting::new();
let optimized = optimizer.optimize(optimized, &config)?;
// Run enforce distribution rule:
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// The rule should be idempotent.
// Re-running this rule shouldn't introduce unnecessary operators.
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
optimized


wiedld · 2025-03-07T21:13:30Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+        // TODO(wiedld): show different test result if enforce distribution first.
+        test_config.run(
+            &expected_first_sort_enforcement,
            top_join,
-            &TestConfig::new(DoFirst::Sorting).with_prefer_existing_sort()
-        );
+            SORT_DISTRIB_DISTRIB.into(),
+        )?;


I left a few TODO(wiedld) in the test multi_smj_joins.

In some cases the test case shows the outcome when the sort runs BEFORE the distribution run, which does not match the optimizer ordering in the default planner. I thought that should be highlighted in the test cases.
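A toy model can show why run order may change the resulting plan. Everything here is an assumption made for illustration: `ToyPlan`, its flags, and the rule that repartitioning discards ordering are simplifications, not a description of how `EnforceDistribution` or `EnforceSorting` actually behave.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Distribution,
    Sorting,
}

/// Toy plan: operators added so far, plus two properties of its output.
#[derive(Debug, Clone, PartialEq)]
struct ToyPlan {
    ops: Vec<&'static str>,
    sorted: bool,
    partitioned: bool,
}

/// Apply the toy passes in the given order.
fn apply(mut plan: ToyPlan, runs: &[Run]) -> ToyPlan {
    for run in runs {
        match run {
            // Toy rule: repartitioning interleaves rows, so ordering is lost.
            Run::Distribution if !plan.partitioned => {
                plan.ops.push("Repartition");
                plan.partitioned = true;
                plan.sorted = false;
            }
            // Toy rule: add a sort only when the output is not already sorted.
            Run::Sorting if !plan.sorted => {
                plan.ops.push("Sort");
                plan.sorted = true;
            }
            // Requirement already satisfied: the pass is a no-op.
            _ => {}
        }
    }
    plan
}
```

In this model, running distribution first ends with a sorted output, while running sorting first leaves a sort that the later repartition undoes. That is the kind of difference a test asserting on both orderings would surface.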

wiedld · 2025-03-07T21:13:54Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs


+    // Test: result IS DIFFERENT, if EnforceSorting is run first:


Highlighting the dependency of run ordering in the existing test cases.

alamb

Thank you @wiedld -- I think this PR really improves the readability of these tests and I look forward to working with this code from now on

A note to other reviewers: I found the whitespace-blind diff easier to understand: https://github.com/apache/datafusion/pull/15074/files?w=1

FYI @berkaysynnada

alamb · 2025-03-08T10:14:55Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+const SORT_DISTRIB_DISTRIB: [Run; 3] =
+    [Run::Sorting, Run::Distribution, Run::Distribution];


Thank you -- I find this much clearer than having to look in a macro to know the runs are being invoked in a certain order


alamb · 2025-03-08T10:30:02Z

datafusion/core/tests/physical_optimizer/enforce_distribution.rs

+        plan_parquet.clone(),
+        DISTRIB_DISTRIB_SORT.into(),
+    )?;
+    let expected_parquet_first_sort_enforcement = &[


This looks like you added some coverage too (which is good, I just wanted to verify that was the case)

alamb · 2025-03-10T14:35:43Z

I'll plan to merge this later today unless anyone else would like more time to review

berkaysynnada

I took a quick look and this is clearly an improvement for understandability, making it easier to extend, and simplifying debugging. Thanks @wiedld and @alamb

A few TODOs stand out, are you planning to address them in a follow-up PR?

btw, I haven't forgotten debugging the issue in influxdata#58, but I still need to finish some urgent tasks.

alamb · 2025-03-10T17:10:22Z

Thanks for the review @berkaysynnada ! We'll keep adding more coverage too

wiedld · 2025-03-10T17:17:11Z

A few TODOs stand out, are you planning to address them in a follow-up PR?

@berkaysynnada -- Yes, absolutely. I have one higher priority item to fix first, then I'll switch back to address these TODOs this week.

wiedld added 2 commits March 7, 2025 13:04

refactor(15003): permit any combination of runs desired

e4a2dc1

refactor(15003): convert macro to a function call on the TestConfig, …

9411528

…and highlight when the same testing setup, but different ordering of optimizer runs, affect the outcome.

github-actions bot added the core Core DataFusion crate label Mar 7, 2025

wiedld commented Mar 7, 2025


wiedld marked this pull request as ready for review March 7, 2025 21:39

alamb mentioned this pull request Mar 8, 2025

Make code more concise influxdata/arrow-datafusion#62

Closed

alamb approved these changes Mar 8, 2025


wiedld added 2 commits March 9, 2025 22:00

chore: remove unneeded comments

19ede37

test: update test harness to use passed ref

0ead8d2

berkaysynnada approved these changes Mar 10, 2025


alamb merged commit 80cb0af into apache:main Mar 10, 2025
24 checks passed

wiedld mentioned this pull request Mar 10, 2025

Improve EnforceDistribution testings. #15003

Open

9 tasks

alamb deleted the 15003/remove-macro branch March 10, 2025 17:10
