Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor: add FileGroup structure for Vec<PartitionedFile> #15379

Merged
merged 5 commits into from
Mar 26, 2025

Conversation

xudong963
Copy link
Member

@xudong963 xudong963 commented Mar 24, 2025

Which issue does this PR close?

Rationale for this change

  1. FileGroup encapsulates both files and their statistics, enabling more effective query planning and optimization based on file group characteristics. Moving from raw Vec<PartitionedFile> to a dedicated structure creates a proper abstraction that bundles related functionality together.
  2. We can add the FileGroup's specific API in the future, currently, one that comes into my mind is part of the split_groups_by_statistics method, it make any two files in a file group are ordered.
  3. Make the file_groups clearer, from pub file_groups: Vec<Vec<PartitionedFile>> to pub file_groups: Vec<FileGroup>

What changes are included in this PR?

Introduce FileGroup and implement the related APIs, replace all Vec<PartitionedFile> with FileGroup

Are these changes tested?

Yes

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate substrait Changes to the substrait crate proto Related to proto crate datasource Changes to the datasource crate labels Mar 24, 2025
/// The files in this group
pub files: Vec<PartitionedFile>,
/// Optional statistics for all files in the group
pub statistics: Option<Statistics>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can compute the statistics in the method of ListingTable

async fn list_files_for_scan<'a>(
&'a self,
ctx: &'a dyn Session,
filters: &'a [Expr],
limit: Option<usize>,
) -> Result<(Vec<Vec<PartitionedFile>>, Statistics)> {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend:

  1. Remove the pub from the fields (so we can potentially change the representation later)
  2. Add a into_inner() method that returns the inner Vec

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done: 7415f7c

) -> Result<(Vec<PartitionedFile>, Statistics)> {
let mut result_files = vec![];
) -> Result<(FileGroup, Statistics)> {
let mut result_files = FileGroup::default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the difference between the statistics in FileGroup and the returned value here? Do we need statistics in FileGroup, if so, could we add the returned Statistics inside FileGroup?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The returned Statistics here is all files' merged statistics, or we can say it's the FileGroups statistics.

This is an issue that may need the FileGroup's statistic: #10316 and a corresponding PR: #13296

Copy link
Contributor

@jayzhan211 jayzhan211 Mar 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we return result_files.with_statistics(statistics)? I don't see any other place where with_statistics is called. The inner optional statistics in FileGroup is a bit confusing to me.

Copy link
Member Author

@xudong963 xudong963 Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jayzhan211 , yeah, it's a todo task, I'll compute the FileGroup's statistic in the method and use the with_statistics method to set statistics for FileGroup later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fyi: #15432

@alamb alamb added the api change Changes the API exposed to users of the crate label Mar 24, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @xudong963

I think encapsulating Vec<PartitionedFile> into FileGroup is a very nice idea and long in coming. However I do think it is an API change so I have maked this PR with that tag.

I also added this to the list of API changes that should have entries in the upgrade guide on

@@ -121,40 +120,6 @@ pub fn expr_applicable_for_cols(col_names: &[&str], expr: &Expr) -> bool {
/// The maximum number of concurrent listing requests
const CONCURRENCY_LIMIT: usize = 100;

/// Partition the list of files into `n` groups
pub fn split_files(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an API change -- can you perhaps leave the function here and mark it #deprecated per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done 7415f7c

/// The files in this group
pub files: Vec<PartitionedFile>,
/// Optional statistics for all files in the group
pub statistics: Option<Statistics>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend:

  1. Remove the pub from the fields (so we can potentially change the representation later)
  2. Add a into_inner() method that returns the inner Vec

}

/// Partition the list of files into `n` groups
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than taking &mut self here, I think this would be easier to use if it took mut self

As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes a lot sense

let files = iter.into_iter().collect();
FileGroup::new(files)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend a From impl for the vec

Suggested change
}
}
impl From<Vec<PartitionedFile>> for FileGroup {
...
}

Copy link
Member Author

@xudong963 xudong963 Mar 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, added in 7415f7c

Now I have three ways to create FileGroup by Vec<PartitionedFile>.

let files = vec![...];

let file_group = FileGroup::new(files);

let file_group = files.into();

let file_group = FileGroup::from(files);

I don't strongly prefer which method to use, but the second is simpler!

}

/// Partition the list of files into `n` groups
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than taking &mut self here, I think this would be easier to use if it took mut self

As written it will leave self empty which is not super obvious. If it took mut self then it would be clear that the FileGroup is consumed

@alamb alamb changed the title Refactor: add FileGroup structure Refactor: add FileGroup structure for Vec<PartitionedFile> Mar 24, 2025
@xudong963
Copy link
Member Author

However I do think it is an API change so I have maked this PR with that tag.

I also added this to the list of API changes that should have entries in the upgrade guide on

Yes, I agree. Thank you for adding it.

@xudong963
Copy link
Member Author

Thanks for your review. I'll merge it after @alamb has another look!

@@ -410,7 +419,7 @@ impl FileGroup {
}

/// Partition the list of files into `n` groups
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> {
pub fn split_files(mut self, n: usize) -> Vec<FileGroup> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@jayzhan211 jayzhan211 merged commit de8bcc5 into apache:main Mar 26, 2025
28 checks passed
@jayzhan211
Copy link
Contributor

Thanks @xudong963 @alamb

@xudong963 xudong963 deleted the add_file_group branch March 26, 2025 08:59
qstommyshu pushed a commit to qstommyshu/datafusion that referenced this pull request Mar 27, 2025
…e#15379)

* Refactor: add FileGroup structure

* update

* address comments

* address comments from alamb

* fix clippy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate datasource Changes to the datasource crate proto Related to proto crate substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants