feat(parquet): add BloomFilterPropertiesBuilder#9877
feat(parquet): add BloomFilterPropertiesBuilder#9877CuteChuanChuan wants to merge 2 commits intoapache:mainfrom
Conversation
- BloomFilterPropertiesBuilder with with_fpp / with_max_ndv / build / try_build. Two entry points: BloomFilterProperties::builder() and BloomFilterPropertiesBuilder::new(). - New WriterPropertiesBuilder setters: set_bloom_filter_properties and set_column_bloom_filter_properties. NDV from the supplied struct is honoured and not overridden by the row-group-size fallback. - Renamed set_bloom_filter_ndv to set_bloom_filter_max_ndv; kept old names but marked deprecated. - fpp range check extracted into a single shared helper.
alamb
left a comment
There was a problem hiding this comment.
Thank you @CuteChuanChuan -- I think this is a grreat step forward and is not an breaking API change (so it is good!)
#9667 says
Make BloomFilterProperties fields non pub
add relevant accessors
It might be worth considering making fpp/ndv NOT public fields that can only be set via accessors, but we can do that as a follow on PR
Also FYI @adriangb
| self | ||
| } | ||
|
|
||
| /// Deprecated alias for [`Self::set_bloom_filter_max_ndv`]. |
| /// | ||
| /// Use [`BloomFilterProperties::builder`] or [`BloomFilterPropertiesBuilder::new`] | ||
| /// as the entry point. Unset fields fall back to [`DEFAULT_BLOOM_FILTER_FPP`] | ||
| /// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time. |
There was a problem hiding this comment.
the note about unset fields seems like an implmentation detail and is redundant with the comments on with_fpp and with_ndv
I recommend removing this mention
| /// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time. | |
| /// as the entry point. |
There was a problem hiding this comment.
I do think we should mention somewhere what the defaults are. In particular what fpp we target by default and what the memory usage will be during building if you don't provide a custom / known max ndv.
There was a problem hiding this comment.
Good call — added concrete defaults and a worked example to with_max_ndv in 6a2eacb:
with_fppnow states the default is0.05(5%).with_max_ndvdocuments the default of1_048_576and shows that with the default fpp this reserves ~1 MiB per column for the filter bitset, with the full derivation:num_bits = -8 * ndv / ln(1 - fpp^(1/8)), thennext_power_of_two
There was a problem hiding this comment.
the note about unset fields seems like an implmentation detail and is redundant with the comments on with_fpp and with_ndv
I recommend removing this mention
Removed this mentions. Thanks!
| /// Panics if the configured `fpp` is not in `(0.0, 1.0)` exclusive. | ||
| /// Use [`Self::try_build`] for a non-panicking alternative. | ||
| pub fn build(self) -> BloomFilterProperties { | ||
| self.try_build().unwrap_or_else(|e| panic!("{e}")) |
There was a problem hiding this comment.
this is not a new panic (it just moved from set_bloom_filter_fpp)
adriangb
left a comment
There was a problem hiding this comment.
Great thank you for working on this!
| /// | ||
| /// Use [`BloomFilterProperties::builder`] or [`BloomFilterPropertiesBuilder::new`] | ||
| /// as the entry point. Unset fields fall back to [`DEFAULT_BLOOM_FILTER_FPP`] | ||
| /// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time. |
There was a problem hiding this comment.
I do think we should mention somewhere what the defaults are. In particular what fpp we target by default and what the memory usage will be during building if you don't provide a custom / known max ndv.
- Drop the redundant note on the struct-level doc that mentioned `DEFAULT_BLOOM_FILTER_FPP` / `DEFAULT_BLOOM_FILTER_NDV` fall-back - Spell out the concrete defaults on `with_fpp` (0.05 = 5%) and `with_max_ndv` (1,048,576)
Which issue does this PR close?
Rationale for this change
No builder exists for
BloomFilterProperties, so callers writeBloomFilterProperties { fpp, ndv }literals — pinning field layout to the API and skipping fpp validation.WriterPropertiesBuilderalso has no setter that takes a builtBloomFilterProperties.What changes are included in this PR?
BloomFilterPropertiesBuilder(with_fpp,with_max_ndv,build,try_build). Two entry points per discussion in the issue:BloomFilterProperties::builder()andBloomFilterPropertiesBuilder::new().WriterPropertiesBuilder::set_bloom_filter_properties+ per-column variant. NDV from the passed-in struct is honoured (no row-group-size override). For dynamic NDV, keep usingset_bloom_filter_enabled/set_bloom_filter_fpp.set_bloom_filter_ndv→set_bloom_filter_max_ndv(also per-column). Old names are#[deprecated(since = "59.0.0")]aliases.Are these changes tested?
Yes — 10 new unit tests + a doc-test.
Are there any user-facing changes?
Additive only.
set_bloom_filter_ndv(and per-column) emit a deprecation warning pointing to_max_ndv.