Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FileScanConfigBuilder #15352

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

blaginin
Copy link
Contributor

@blaginin blaginin commented Mar 21, 2025

Related to #14685 (comment)

Rationale for this change

FileScanConfig now violates single responsibility from SOLID. It serves two conflicting roles:

  • As a builder, though this should be changed as discussed in

    // TODO: This function should be moved into DataSourceExec once FileScanConfig moved out of datafusion/core

  • As a business logic provider (e.g., fn project, impl DataSource, etc.)

These conflicting roles lead to issues like #14905 and #14679, where provider features are accessed even before the build process is complete.

What changes are included in this PR?

I've added FileScanConfigBuilder and deprecated builder approach for FileScanConfig

Are these changes tested?

Yes, updated exiting tests into the new interface

Are there any user-facing changes?

Yes, new builder interface - but the switch is quite easy (all builder-methods from FileScanConfig are supported)

@github-actions github-actions bot added core Core DataFusion crate proto Related to proto crate datasource Changes to the datasource crate labels Mar 21, 2025
Comment on lines -247 to +250
FileScanConfig::new(object_store_url, self.schema(), source)
FileScanConfigBuilder::new(object_store_url, self.schema(), source)
.with_projection(projection.cloned())
.with_limit(limit);
.with_limit(limit)
.build();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mertak-synnada, hey, this PR is still WIP but I was wondering if you're happy with this approach. That's what we've discussed in #14685 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good so far, thank you! I haven't been able to test it thoroughly yet, but the legacy ParquetExecBuilder could be helpful for understanding specific cases, just FYI.

@github-actions github-actions bot added the substrait Changes to the substrait crate label Mar 24, 2025
@blaginin blaginin changed the title WIP: Add FileScanConfigBuilder and switch some cases WIP: Add FileScanConfigBuilder Mar 24, 2025
@blaginin blaginin marked this pull request as ready for review March 24, 2025 21:22
@blaginin
Copy link
Contributor Author

also fyi @AdamGS 👀

@blaginin blaginin changed the title WIP: Add FileScanConfigBuilder Add FileScanConfigBuilder Mar 24, 2025
@blaginin blaginin requested a review from mertak-synnada March 24, 2025 21:24

// Finally, put it all together into a DataSourceExec
Ok(file_scan_config.build())
Ok(Arc::new(DataSourceExec::new(Arc::new(file_scan_config))))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't it possible to have this return as a function? Is it because of import cycles?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you mean as aDataSourceExec function? It also looks a bit verbose to me, but the inner Arc is needed for dynamic dispatch, and the outer one makes the return type more explicit. Happy to make it DataSourceExec::new_arc if you want, but i don't think we use that a lot in datafusion

@AdamGS
Copy link
Contributor

AdamGS commented Mar 25, 2025

my 2c - this looks great, I would love to also rename FileScanConfig to something else but I get the backwards compatibility concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants