-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Refactor EnforceDistribution test cases to demonstrate dependencies across optimizer runs. #15074
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…and highlight when the same testing setup, but different ordering of optimizer runs, effect the outcome.
/// Perform a series of runs using the current [`TestConfig`], | ||
/// assert the expected plan result, | ||
/// and return the result plan (for potentional subsequent runs). | ||
fn run( | ||
&self, | ||
expected_lines: &[&str], | ||
plan: Arc<dyn ExecutionPlan>, | ||
optimizers_to_run: Vec<Run>, | ||
) -> Result<Arc<dyn ExecutionPlan>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this returns the plan, is so that we can easily write test cases for idempotency.
let plan_after_first_run = test_config.run(
expected,
plan,
vec![Run::Distribution],
)?;
let plan_after_second_run = test_config.run(
expected,
plan_after_first_run.clone(),
vec![Run::Distribution], // exact same run again
)?;
assert_eq!(
get_plan_string(&plan_after_first_run),
get_plan_string(&plan_after_second_run),
);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@xudong963 may find this new test setup useful.
const DISTRIB_DISTRIB_SORT: [Run; 3] = | ||
[Run::Distribution, Run::Distribution, Run::Sorting]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equivalent to main:
datafusion/datafusion/core/tests/physical_optimizer/enforce_distribution.rs
Lines 459 to 469 in 9a4c9d5
// Run enforce distribution rule first: | |
let optimizer = EnforceDistribution::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
// The rule should be idempotent. | |
// Re-running this rule shouldn't introduce unnecessary operators. | |
let optimizer = EnforceDistribution::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
// Run the enforce sorting rule: | |
let optimizer = EnforceSorting::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
optimized |
const SORT_DISTRIB_DISTRIB: [Run; 3] = | ||
[Run::Sorting, Run::Distribution, Run::Distribution]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equivalent to main:
datafusion/datafusion/core/tests/physical_optimizer/enforce_distribution.rs
Lines 471 to 481 in 9a4c9d5
// Run the enforce sorting rule first: | |
let optimizer = EnforceSorting::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
// Run enforce distribution rule: | |
let optimizer = EnforceDistribution::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
// The rule should be idempotent. | |
// Re-running this rule shouldn't introduce unnecessary operators. | |
let optimizer = EnforceDistribution::new(); | |
let optimized = optimizer.optimize(optimized, &config)?; | |
optimized |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you -- I find this much clearer than having to look in a macro to know the runs are being invoked in a certain order
// TODO(wiedld): show different test result if enforce distribution first. | ||
test_config.run( | ||
&expected_first_sort_enforcement, | ||
top_join, | ||
&TestConfig::new(DoFirst::Sorting).with_prefer_existing_sort() | ||
); | ||
SORT_DISTRIB_DISTRIB.into(), | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few TODO(wiedld)
in the test multi_smj_joins.
In some cases the test case is showing the outcome for the sort occurring first BEFORE the distribution run, which does not match the ordering of the optimizer in the default planner -- and I thought that should be highlighted in the test cases.
|
||
// Test: result IS DIFFERENT, if EnforceSorting is run first: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Highlighting the dependency of run ordering in the existing test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @wiedld -- I think this PR really improves the readability of these tests and I look forward to working with this code from now on
A note to other reviewers that I found looking at the whitespace blind diff was easier to understand https://github.com/apache/datafusion/pull/15074/files?w=1
FYI @berkaysynnada
const SORT_DISTRIB_DISTRIB: [Run; 3] = | ||
[Run::Sorting, Run::Distribution, Run::Distribution]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you -- I find this much clearer than having to look in a macro to know the runs are being invoked in a certain order
datafusion/core/tests/physical_optimizer/enforce_distribution.rs
Outdated
Show resolved
Hide resolved
datafusion/core/tests/physical_optimizer/enforce_distribution.rs
Outdated
Show resolved
Hide resolved
plan_parquet.clone(), | ||
DISTRIB_DISTRIB_SORT.into(), | ||
)?; | ||
let expected_parquet_first_sort_enforcement = &[ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like you added some coverage too (which is good, I just wanted to verify that was the case)
I'll plan to merge this later today unless anyone else would like more time to review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look and this is clearly an improvement for understandability, making it easier to extend, and simplifying debugging. Thanks @wiedld and @alamb
A few TODOs stand out, are you planning to address them in a follow-up PR?
btw, I haven't forgotten debugging the issue in influxdata#58, but I still need to finish some urgent tasks.
Thanks for the review @berkaysynnada ! We'll keep adding more coverage too |
@berkaysynnada -- Yes, absolutely. I have one higher priority item to fix first, then I'll switch back to address these TODOs this week. |
Which issue does this PR close?
Rationale for this change
Enable us to write test cases using:
TestConfig
The benefits of this approach are:
What changes are included in this PR?
The above, as we as a bit of an increase in test coverage to demonstrate exactly when our test cases (the plan results we see) are ordering dependent.
Are these changes tested?
Yes
Are there any user-facing changes?
No