-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add doc for the statistics_from_parquet_meta_calc method
#15330
Add doc for the statistics_from_parquet_meta_calc method
#15330
Conversation
/// For columns without statistics, | ||
/// - For min/max, there are two situations: | ||
/// 1. The column isn't in arrow schema, then min/max values are set to Precision::Absent | ||
/// 2. The column is in arrow schema, but not in parquet schema due to schema revolution, min/max values are set to Precision::Exact(null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, I have questions about this behavior, shouldn't it be Precision::Absent
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case, the default schema adapter will fill in the constant value null for all columns like this so Precision::Exact(null) is correct
However, as @adriangb found in #15263 and elsewhere when users use custom Schema adapters a value other than NULL is filled in
Maybe this is another place where the schema adapter could/should be used 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense, I'll try to make the potential bug surface, thanks @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much @xudong963 -- I think this makes the code better and describes how the code works that matches my understanding
Your question is a good one -- and I think it may be pointing at a potential bug here when using custom schema adapters. Nice!
/// | ||
/// The statistics are calculated for each column in the table schema | ||
/// using the row group statistics in the parquet metadata. | ||
/// | ||
/// # Key behaviors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
/// For columns without statistics, | ||
/// - For min/max, there are two situations: | ||
/// 1. The column isn't in arrow schema, then min/max values are set to Precision::Absent | ||
/// 2. The column is in arrow schema, but not in parquet schema due to schema revolution, min/max values are set to Precision::Exact(null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case, the default schema adapter will fill in the constant value null for all columns like this so Precision::Exact(null) is correct
However, as @adriangb found in #15263 and elsewhere when users use custom Schema adapters a value other than NULL is filled in
Maybe this is another place where the schema adapter could/should be used 🤔
Which issue does this PR close?
any
instead offor_each
#15289Rationale for this change
I'm refactoring the method
statistics_from_parquet_meta_calc
in #15289, to make sure we're on the same page, I think it's better to write the doc in detail for the method.What changes are included in this PR?
Add doc
Are these changes tested?
Are there any user-facing changes?