Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Move Memtable to catalog #15459

Merged
merged 9 commits into from
Mar 31, 2025
Merged

Conversation

logan-keede
Copy link
Contributor

@logan-keede logan-keede commented Mar 27, 2025

Which issue does this PR close?

Rationale for this change

now that catalog is dependent on datasource, MemTable can go to catalog.

What changes are included in this PR?

move In-Memory Table Provider to catalog.
and statistics to datasource.

Are these changes tested?

By Github CI.

Are there any user-facing changes?

you can access datafusion::datasource::memory::{DataSourceExec, MemorySourceConfig, Tableproviders.., CatalogProviders.., SchemaProviders..},
Reason:

  1. Backward Compatibility DataSourceExec, MemorySourceConfig, MemTable.
  2. it is intuitive to access Tableproviders.., CatalogProviders.., SchemaProviders.. together

@github-actions github-actions bot added core Core DataFusion crate catalog Related to the catalog crate labels Mar 27, 2025
@github-actions github-actions bot added the datasource Changes to the datasource crate label Mar 27, 2025
@logan-keede logan-keede marked this pull request as ready for review March 27, 2025 19:20
}

#[async_trait]
impl DataSink for MemSink {
Copy link
Contributor

@jayzhan211 jayzhan211 Mar 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think MemSink should be in datasource crate

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you would specifically like to make it public (which is also reasonable).
I think it is fine to have it in catalog because MemSink is a private struct used only by MemTable even though it logically makes more sense to have it with its DataSource implementation to like other DataSink impls.

I have pushed a commit feel free to drop it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Jay. We shouldn't try to push every piece of code to the higher level crates, as maintaining integrity and logical structure is more important to keep each crate meaningful.

@alamb alamb merged commit 97bd84a into apache:main Mar 31, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Mar 31, 2025

Maybe as a follow on PR we can make a new crate catalog-memory to hold the in memory catalog implementations from the catalog traits.

Something I noticed recently while building datafusion is that compiling the catalog crate itself takes non trivial time too (7s on my laptop as I recall)

I left more notes here: #13814 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
catalog Related to the catalog crate core Core DataFusion crate datasource Changes to the datasource crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants