
Fix predicate pushdown for custom SchemaAdapters #15263

Merged (19 commits) into apache:main on Mar 19, 2025

Conversation

@adriangb (Contributor) commented Mar 17, 2025

The github-actions bot added the core (Core DataFusion crate) and datasource (Changes to the datasource crate) labels on Mar 17, 2025.
Comment on lines +227 to +228
#[tokio::test]
async fn test_pushdown_with_missing_column_in_file() {
@adriangb (author):

Replacing the unit test with a more end-to-end test that shows that things work as expected.

Comment on lines -131 to -137
// ArrowPredicate::evaluate is passed columns in the order they appear in the file
// If the predicate has multiple columns, we therefore must project the columns based
// on the order they appear in the file
let projection = match candidate.projection.len() {
    0 | 1 => vec![],
    2.. => remap_projection(&candidate.projection),
};
@adriangb (author):

I think this is no longer necessary; it's handled by the SchemaAdapter. It would be nice to have a test to point to that confirms this.

Comment on lines 164 to 146
 fn evaluate(&mut self, batch: RecordBatch) -> ArrowResult<BooleanArray> {
-    let batch = self.schema_mapping.map_partial_batch(batch)?;
+    let batch = self.schema_mapping.map_batch(batch)?;
@adriangb (author):

Here is where we ditch map_partial_batch in favor of map_batch
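
For context, map_batch comes from the SchemaMapper that a SchemaAdapter produces for a file: it adapts a file-order RecordBatch to the target schema before the predicate is evaluated. A minimal sketch of what such a mapping can do -- illustrative only, not DataFusion's actual SchemaMapping, and the field_mappings layout is an assumption for this sketch:

use arrow::array::new_null_array;
use arrow::compute::cast;
use arrow::datatypes::SchemaRef;
use arrow::error::ArrowError;
use arrow::record_batch::RecordBatch;

// Illustrative sketch only -- not DataFusion's actual implementation.
// `field_mappings` records, for each target field, which file column
// (if any) feeds it; missing columns are filled with nulls.
fn map_batch(
    target: &SchemaRef,
    field_mappings: &[Option<usize>],
    batch: &RecordBatch,
) -> Result<RecordBatch, ArrowError> {
    let columns = target
        .fields()
        .iter()
        .zip(field_mappings)
        .map(|(field, mapping)| match mapping {
            // Column present in the file: cast it to the table's type
            Some(idx) => cast(batch.column(*idx), field.data_type()),
            // Column missing from the file: substitute nulls
            None => Ok(new_null_array(field.data_type(), batch.num_rows())),
        })
        .collect::<Result<Vec<_>, ArrowError>>()?;
    RecordBatch::try_new(target.clone(), columns)
}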

Comment on lines -450 to -454
/// Computes the projection required to go from the file's schema order to the projected
/// order expected by this filter
///
/// Effectively this computes the rank of each element in `src`
fn remap_projection(src: &[usize]) -> Vec<usize> {
@adriangb (author):

I believe this is taken care of by SchemaAdapter now 😄. Again it would be nice to be able to point at a (maybe existing) test to confirm. Maybe I need to try removing this on main and confirming which tests break.

@adriangb (author):

Okay, I can confirm that this fails on main if I replace remap_projection with a no-op. So I think this change is 👍🏻
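
For reference, here is a sketch of what the removed helper computed, going by its doc comment ("the rank of each element in src"); illustrative, not necessarily the exact deleted code:

// Illustrative sketch of the removed helper, per its doc comment:
// the output assigns each element of `src` its rank.
fn remap_projection(src: &[usize]) -> Vec<usize> {
    // Sort positions by their value to find each element's rank
    let mut order: Vec<usize> = (0..src.len()).collect();
    order.sort_unstable_by_key(|&i| src[i]);
    // Invert the permutation: projection[i] is the rank of src[i]
    let mut projection = vec![0; src.len()];
    for (rank, &i) in order.iter().enumerate() {
        projection[i] = rank;
    }
    projection
}

// e.g. remap_projection(&[4, 1, 3]) == vec![2, 0, 1]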

Comment on lines 467 to 468
let file_schema = Arc::new(file_schema.clone());
let table_schema = Arc::new(table_schema.clone());
@adriangb (author):

We could change the signature of build_row_filter, since the caller might already have an Arc'd version, but since it's pub that would introduce more breaking changes, and the clone seemed cheap enough. Open to doing that, though.

@alamb (Contributor):

I think you can avoid cloning the schema with a pretty simple change. Here is a proposal:
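
(The inline proposal itself isn't captured in this export. As a purely hypothetical illustration of one way to avoid the clones -- not necessarily what was proposed -- the function could accept Arc'd schemas directly:)

use std::sync::Arc;
use arrow::datatypes::{Schema, SchemaRef};

// Hypothetical signature, for illustration only: accepting SchemaRef means a
// caller that already holds an Arc<Schema> pays a reference-count bump
// instead of a deep Schema clone.
fn build_row_filter_sketch(file_schema: SchemaRef, table_schema: SchemaRef) {
    // ... use the Arcs directly; no Schema::clone required ...
    let _ = (file_schema, table_schema);
}

// A caller without an Arc can still wrap cheaply at the call site:
// build_row_filter_sketch(Arc::new(file_schema), Arc::new(table_schema));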

Comment on lines -626 to -628
// If a column exists in the table schema but not the file schema it should be rewritten to a null expression
#[test]
fn test_filter_candidate_builder_rewrite_missing_column() {
@adriangb (author):

See the newly added e2e test.

)
.expect("creating filter predicate");

let mut parquet_reader = parquet_reader_builder
    .with_projection(row_filter.projection().clone())
@adriangb (author):

Moved down because it needs access to the row filter's projection

Comment on lines 690 to +693
     let table_schema = get_basic_table_schema();

-    let file_schema = Schema::new(vec![Field::new("str_col", DataType::Utf8, true)]);
+    let file_schema =
+        Schema::new(vec![Field::new("string_col", DataType::Utf8, true)]);
@adriangb (author):

There is no str_col in the data returned by get_basic_table_schema(), but there is string_col.

@adriangb (author) commented Mar 17, 2025:

cc @jeffreyssmith2nd as the original author of map_partial_batch in #10716 to see if this makes sense to you

@adriangb (author):

I'll point out that I think this is synergistic with my work in #15057, in the sense that both introduce the concept of a "filter"'s schema, although #15057 does it more formally and rigorously.

@alamb (Contributor) left a comment:

Thank you @adriangb -- in my mind, it is the mark of a great engineer to fix bugs by deleting code.


I think the only thing this PR needs is a few more tests (I specified what they are below). I do think pydantic#9 is worth considering too though.

FYI @itsjunetime who worked on #12135 and @jeffreyssmith2nd who worked on #10716

Comment on lines -355 to -363
/// After visiting all children, rewrite column references to nulls if
/// they are not in the file schema.
/// We do this because they won't be relevant if they're not in the file schema, since that's
/// the only thing we're dealing with here as this is only used for the parquet pushdown during
/// scanning
fn f_up(
    &mut self,
    expr: Arc<dyn PhysicalExpr>,
) -> Result<Transformed<Arc<dyn PhysicalExpr>>> {
@alamb (Contributor):

I agree adding an API for stats on the new column would be 💯

@alamb (Contributor), also on lines -355 to -363:
I do think in general we need to be correct first, then fast.

As someone once told me: "if you don't constrain it (the compiler) to be correct, I'll make it as fast as you want!"


@alamb (Contributor) left a comment:

Thank you @adriangb -- I think this one is ready to go except for the datafusion-testing pin.

Without fixing the pin, I think the extended tests are going to fail on main. For example, running

INCLUDE_SQLITE=true nice cargo test --profile release-nonlto --test sqllogictests

will error, I think. Here is a PR to revert the change.


let file_schema = Arc::new(Schema::new(vec![
    Field::new("c3", DataType::Int32, true),
    Field::new("c3", DataType::Int32, true),
@alamb (Contributor):

was it intentional to repeat the "c3" column here?

@adriangb (author), Mar 18, 2025:

Yes, because that's what you suggested in #15263 (comment) (a file schema like c3, c3). I was confused by what you meant, so maybe I misunderstood, but I just ran with it. Maybe c3, c1 is a better test?
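
A sketch of that alternative file schema, which would exercise both a reordered column and a missing column at once (assuming, as in the snippet above, a table schema that declares c1, c2, c3):

// Mirrors the quoted snippet above, but with distinct, reordered columns:
// the file carries c3 before c1 and omits c2 entirely.
let file_schema = Arc::new(Schema::new(vec![
    Field::new("c3", DataType::Int32, true),
    Field::new("c1", DataType::Int32, true),
]));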

@alamb (Contributor) left a comment:

Thank you @adriangb

It seems that clippy is now failing due to the new lint added in #15284

@adriangb (author):

2620c6a 😄

@alamb (Contributor) commented Mar 18, 2025:

Thanks for bearing with me here

@alamb alamb merged commit 8e2bfa4 into apache:main Mar 19, 2025
27 checks passed
@alamb (Contributor) commented Mar 19, 2025:

🚀

@adriangb (author):

Amazing, thank you so much for pushing this forward, Andrew!

@alamb (Contributor) left a comment:

🤦 somehow I didn't see that the datafusion-testing link got deleted in this PR.

I made a PR to fix it: https://github.com/apache/datafusion/pull/15318/files

@adriangb (author):

Oh, I'm terribly sorry -- that's probably my bad... I constantly have issues with those submodules and haven't yet spent the time to figure out how to avoid it.

Jiashu-Hu pushed a commit to Jiashu-Hu/datafusion that referenced this pull request Mar 19, 2025
* wip

* wip

* wip

* add tests

* wip

* wip

* fix

* fix

* fix

* better test

* more reverts

* fix

* Reduce Schema clones in predicate

* add more tests

* add another test

* Fix datafusion testing pin

* fix clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
@alamb (Contributor) commented Mar 21, 2025:

> Oh, I'm terribly sorry -- that's probably my bad... I constantly have issues with those submodules and haven't yet spent the time to figure out how to avoid it.

no worries -- I think the trick for me is, whenever I see a change to the submodule, to run git submodule update to get the most recent version.
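
For example, a common invocation (the exact flags are a matter of taste) that syncs every submodule to the commit the repository pins:

git submodule update --init --recursive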
