-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor: add FileGroup
structure for Vec<PartitionedFile>
#15379
Conversation
/// The files in this group | ||
pub files: Vec<PartitionedFile>, | ||
/// Optional statistics for all files in the group | ||
pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can compute the statistics
in the method of ListingTable
datafusion/datafusion/core/src/datasource/listing/table.rs
Lines 1086 to 1091 in 5210a2b
async fn list_files_for_scan<'a>( | |
&'a self, | |
ctx: &'a dyn Session, | |
filters: &'a [Expr], | |
limit: Option<usize>, | |
) -> Result<(Vec<Vec<PartitionedFile>>, Statistics)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Remove the
pub
from the fields (so we can potentially change the representation later) - Add a
into_inner()
method that returns the inner Vec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done: 7415f7c
cda1c47
to
1dc5153
Compare
1dc5153
to
7df36ed
Compare
) -> Result<(Vec<PartitionedFile>, Statistics)> { | ||
let mut result_files = vec![]; | ||
) -> Result<(FileGroup, Statistics)> { | ||
let mut result_files = FileGroup::default(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the difference between the statistics
in FileGroup
and the returned value here? Do we need statistics
in FileGroup, if so, could we add the returned Statistics
inside FileGroup
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we return result_files.with_statistics(statistics)
? I don't see any other place where with_statistics is called. The inner optional statistics in FileGroup is a bit confusing to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jayzhan211 , yeah, it's a todo task, I'll compute the FileGroup
's statistic in the method and use the with_statistics
method to set statistics for FileGroup
later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fyi: #15432
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @xudong963
I think encapsulating Vec<PartitionedFile>
into FileGroup
is a very nice idea and long in coming. However I do think it is an API change so I have maked this PR with that tag.
I also added this to the list of API changes that should have entries in the upgrade guide on
@@ -121,40 +120,6 @@ pub fn expr_applicable_for_cols(col_names: &[&str], expr: &Expr) -> bool { | |||
/// The maximum number of concurrent listing requests | |||
const CONCURRENCY_LIMIT: usize = 100; | |||
|
|||
/// Partition the list of files into `n` groups | |||
pub fn split_files( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an API change -- can you perhaps leave the function here and mark it #deprecated
per https://datafusion.apache.org/contributor-guide/api-health.html#deprecation-guidelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done 7415f7c
/// The files in this group | ||
pub files: Vec<PartitionedFile>, | ||
/// Optional statistics for all files in the group | ||
pub statistics: Option<Statistics>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend:
- Remove the
pub
from the fields (so we can potentially change the representation later) - Add a
into_inner()
method that returns the inner Vec
} | ||
|
||
/// Partition the list of files into `n` groups | ||
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than taking &mut self
here, I think this would be easier to use if it took mut self
As written it will leave self
empty which is not super obvious. If it took mut self
then it would be clear that the FileGroup is consumed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, makes a lot sense
let files = iter.into_iter().collect(); | ||
FileGroup::new(files) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend a From impl for the vec
} | |
} | |
impl From<Vec<PartitionedFile>> for FileGroup { | |
... | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, added in 7415f7c
Now I have three ways to create FileGroup
by Vec<PartitionedFile>
.
let files = vec![...];
let file_group = FileGroup::new(files);
let file_group = files.into();
let file_group = FileGroup::from(files);
I don't strongly prefer which method to use, but the second is simpler!
} | ||
|
||
/// Partition the list of files into `n` groups | ||
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than taking &mut self
here, I think this would be easier to use if it took mut self
As written it will leave self
empty which is not super obvious. If it took mut self
then it would be clear that the FileGroup is consumed
FileGroup
structureFileGroup
structure for Vec<PartitionedFile>
Yes, I agree. Thank you for adding it. |
Thanks for your review. I'll merge it after @alamb has another look! |
@@ -410,7 +419,7 @@ impl FileGroup { | |||
} | |||
|
|||
/// Partition the list of files into `n` groups | |||
pub fn split_files(&mut self, n: usize) -> Vec<FileGroup> { | |||
pub fn split_files(mut self, n: usize) -> Vec<FileGroup> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
Thanks @xudong963 @alamb |
…e#15379) * Refactor: add FileGroup structure * update * address comments * address comments from alamb * fix clippy
Which issue does this PR close?
Rationale for this change
FileGroup
encapsulates both files and their statistics, enabling more effective query planning and optimization based on file group characteristics. Moving from rawVec<PartitionedFile>
to a dedicated structure creates a proper abstraction that bundles related functionality together.FileGroup
's specific API in the future, currently, one that comes into my mind is part of thesplit_groups_by_statistics
method, it make any two files in a file group are ordered.file_groups
clearer, frompub file_groups: Vec<Vec<PartitionedFile>>
topub file_groups: Vec<FileGroup>
What changes are included in this PR?
Introduce
FileGroup
and implement the related APIs, replace allVec<PartitionedFile>
withFileGroup
Are these changes tested?
Yes
Are there any user-facing changes?