Refactor EnforceDistribution test cases to demonstrate dependencies across optimizer runs. #15074

Merged
merged 4 commits into from
Mar 10, 2025

Conversation

wiedld
Contributor

@wiedld wiedld commented Mar 7, 2025

Which issue does this PR close?

Rationale for this change

Enable us to write test cases using:

  • the same TestConfig
  • then apply the optimizer runs in any order chosen

The benefits of this approach are:

  • explicitly demonstrate when there is an ordering dependency in which optimizer has to run first (sorting or distribution)
  • quickly enable us to write test cases addressing idempotency of an optimizer run
  • (hopefully) helps identify strengths/gaps in the current test coverage. 🙏🏼
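
As a rough illustration of the ordering dependency, here is a toy sketch (not DataFusion code). It assumes a plan can be modeled as a list of operator names from root to leaf. `toy_enforce_distribution`, `toy_enforce_sorting`, and `apply_in_order` are hypothetical stand-ins for the real rules and test harness:

```rust
// Toy model only: a "plan" is a list of operator names from root to leaf.
// These rules are hypothetical stand-ins, not DataFusion's optimizers.
type Plan = Vec<&'static str>;

/// Adds a repartition at the root unless the plan already contains one.
fn toy_enforce_distribution(mut plan: Plan) -> Plan {
    if !plan.contains(&"RepartitionExec") {
        plan.insert(0, "RepartitionExec");
    }
    plan
}

/// Adds a sort at the root unless the plan already contains one.
fn toy_enforce_sorting(mut plan: Plan) -> Plan {
    if !plan.contains(&"SortExec") {
        plan.insert(0, "SortExec");
    }
    plan
}

/// Applies the given rules from left to right.
fn apply_in_order(plan: Plan, rules: &[fn(Plan) -> Plan]) -> Plan {
    rules.iter().fold(plan, |acc, rule| rule(acc))
}
```

Starting from `["ScanExec"]`, running the toy distribution rule first leaves `SortExec` at the root, while running the toy sorting rule first leaves `RepartitionExec` at the root, so the two orderings need different expected plans.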

What changes are included in this PR?

The above, as well as a modest increase in test coverage to demonstrate exactly when our test cases (the plan results we see) are ordering-dependent.

Are these changes tested?

Yes

Are there any user-facing changes?

No

wiedld added 2 commits March 7, 2025 13:04
…and highlight when the same testing setup, but a different ordering of optimizer runs, affects the outcome.
@github-actions github-actions bot added the core Core DataFusion crate label Mar 7, 2025
Comment on lines 449 to 457
/// Perform a series of runs using the current [`TestConfig`],
/// assert the expected plan result,
/// and return the result plan (for potential subsequent runs).
fn run(
    &self,
    expected_lines: &[&str],
    plan: Arc<dyn ExecutionPlan>,
    optimizers_to_run: Vec<Run>,
) -> Result<Arc<dyn ExecutionPlan>> {
Contributor Author

This returns the plan so that we can easily write test cases for idempotency.

let plan_after_first_run = test_config.run(
    expected,
    plan,
    vec![Run::Distribution],
)?;
let plan_after_second_run = test_config.run(
    expected,
    plan_after_first_run.clone(),
    vec![Run::Distribution], // exact same run again
)?;
assert_eq!(
    get_plan_string(&plan_after_first_run),
    get_plan_string(&plan_after_second_run),
);
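
The same pattern could be wrapped in a small generic helper. This is only a sketch: `check_idempotent` is a hypothetical name, and it assumes the rule can be treated as a plain plan-to-plan function over any comparable plan type rather than over `Arc<dyn ExecutionPlan>`:

```rust
use std::fmt::Debug;

/// Runs `rule` twice and checks that the second run changes nothing.
/// Hypothetical helper: `rule` is any plan-to-plan function.
fn check_idempotent<P, F>(rule: F, input: P) -> Result<P, String>
where
    P: Clone + PartialEq + Debug,
    F: Fn(P) -> P,
{
    let after_first = rule(input);
    // Feed the first result back into the exact same rule.
    let after_second = rule(after_first.clone());
    if after_first == after_second {
        Ok(after_first)
    } else {
        Err(format!(
            "rule is not idempotent:\n first run: {:?}\nsecond run: {:?}",
            after_first, after_second
        ))
    }
}
```

A rule that only adds an operator when it is missing passes this check, while a rule that adds one unconditionally on every run returns the `Err` variant.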

Contributor Author

@xudong963 may find this new test setup useful.

Comment on lines +404 to +405
const DISTRIB_DISTRIB_SORT: [Run; 3] =
    [Run::Distribution, Run::Distribution, Run::Sorting];
Contributor Author

Equivalent to main:

// Run enforce distribution rule first:
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// The rule should be idempotent.
// Re-running this rule shouldn't introduce unnecessary operators.
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// Run the enforce sorting rule:
let optimizer = EnforceSorting::new();
let optimized = optimizer.optimize(optimized, &config)?;
optimized

Comment on lines +406 to +407
const SORT_DISTRIB_DISTRIB: [Run; 3] =
    [Run::Sorting, Run::Distribution, Run::Distribution];
Contributor Author

Equivalent to main:

// Run the enforce sorting rule first:
let optimizer = EnforceSorting::new();
let optimized = optimizer.optimize(optimized, &config)?;
// Run enforce distribution rule:
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
// The rule should be idempotent.
// Re-running this rule shouldn't introduce unnecessary operators.
let optimizer = EnforceDistribution::new();
let optimized = optimizer.optimize(optimized, &config)?;
optimized
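
To make the mapping from a named run order to the explicit rule sequence concrete, here is a minimal sketch. The `Run` enum below is a hypothetical mirror of the one in the test module (the real definition may differ), and `rule_sequence` is an illustrative helper that only records rule names instead of calling `optimize`:

```rust
// Hypothetical mirror of the test module's `Run` enum.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Run {
    Distribution,
    Sorting,
}

const DISTRIB_DISTRIB_SORT: [Run; 3] =
    [Run::Distribution, Run::Distribution, Run::Sorting];
const SORT_DISTRIB_DISTRIB: [Run; 3] =
    [Run::Sorting, Run::Distribution, Run::Distribution];

/// Returns the names of the rules that would be invoked, in order.
/// A real harness would call each rule's `optimize` here instead.
fn rule_sequence(optimizers_to_run: Vec<Run>) -> Vec<&'static str> {
    optimizers_to_run
        .into_iter()
        .map(|run| match run {
            Run::Distribution => "EnforceDistribution",
            Run::Sorting => "EnforceSorting",
        })
        .collect()
}
```

Because the standard library implements `From<[T; N]>` for `Vec<T>`, a call such as `rule_sequence(SORT_DISTRIB_DISTRIB.into())` converts the constant array into the `Vec<Run>` argument directly.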

Contributor

Thank you -- I find this much clearer than having to look in a macro to know the runs are being invoked in a certain order

Comment on lines 1565 to 1570
  // TODO(wiedld): show different test result if enforce distribution first.
  test_config.run(
      &expected_first_sort_enforcement,
      top_join,
-     &TestConfig::new(DoFirst::Sorting).with_prefer_existing_sort()
- );
+     SORT_DISTRIB_DISTRIB.into(),
+ )?;
Contributor Author

I left a few TODO(wiedld) in the test multi_smj_joins.

In some cases, the test case shows the outcome when the sorting run occurs BEFORE the distribution run, which does not match the optimizer ordering in the default planner, and I thought that should be highlighted in the test cases.


// Test: result IS DIFFERENT, if EnforceSorting is run first:
Contributor Author

Highlighting the dependency of run ordering in the existing test cases.

@wiedld wiedld marked this pull request as ready for review March 7, 2025 21:39
Contributor

@alamb alamb left a comment

Thank you @wiedld -- I think this PR really improves the readability of these tests and I look forward to working with this code from now on

A note to other reviewers: I found the whitespace-blind diff easier to understand: https://github.com/apache/datafusion/pull/15074/files?w=1

FYI @berkaysynnada

plan_parquet.clone(),
DISTRIB_DISTRIB_SORT.into(),
)?;
let expected_parquet_first_sort_enforcement = &[
Contributor

This looks like you added some coverage too (which is good, I just wanted to verify that was the case)

Contributor

alamb commented Mar 10, 2025

I'll plan to merge this later today unless anyone else would like more time to review

Contributor

@berkaysynnada berkaysynnada left a comment

I took a quick look and this is clearly an improvement for understandability, making it easier to extend, and simplifying debugging. Thanks @wiedld and @alamb

A few TODOs stand out, are you planning to address them in a follow-up PR?

btw, I haven't forgotten debugging the issue in influxdata#58, but I still need to finish some urgent tasks.

Contributor

alamb commented Mar 10, 2025

Thanks for the review @berkaysynnada ! We'll keep adding more coverage too

@alamb alamb merged commit 80cb0af into apache:main Mar 10, 2025
24 checks passed
@alamb alamb deleted the 15003/remove-macro branch March 10, 2025 17:10
Contributor Author

wiedld commented Mar 10, 2025

> A few TODOs stand out, are you planning to address them in a follow-up PR?

@berkaysynnada -- Yes, absolutely. I have one higher priority item to fix first, then I'll switch back to address these TODOs this week.
