Skip to content

feat(parquet): add BloomFilterPropertiesBuilder#9877

Open
CuteChuanChuan wants to merge 2 commits intoapache:mainfrom
CuteChuanChuan:raymond/issue-9667-bloomfilterproperties-builder
Open

feat(parquet): add BloomFilterPropertiesBuilder#9877
CuteChuanChuan wants to merge 2 commits intoapache:mainfrom
CuteChuanChuan:raymond/issue-9667-bloomfilterproperties-builder

Conversation

@CuteChuanChuan
Copy link
Copy Markdown
Contributor

@CuteChuanChuan CuteChuanChuan commented May 2, 2026

Which issue does this PR close?

Rationale for this change

No builder exists for BloomFilterProperties, so callers write BloomFilterProperties { fpp, ndv } literals — pinning field layout to the API and skipping fpp validation. WriterPropertiesBuilder also has no setter that takes a built BloomFilterProperties.

What changes are included in this PR?

  • BloomFilterPropertiesBuilder (with_fpp, with_max_ndv, build, try_build). Two entry points per discussion in the issue: BloomFilterProperties::builder() and BloomFilterPropertiesBuilder::new().
  • WriterPropertiesBuilder::set_bloom_filter_properties + per-column variant. NDV from the passed-in struct is honoured (no row-group-size override). For dynamic NDV, keep using set_bloom_filter_enabled / set_bloom_filter_fpp.
  • Renamed set_bloom_filter_ndvset_bloom_filter_max_ndv (also per-column). Old names are #[deprecated(since = "59.0.0")] aliases.

Are these changes tested?

Yes — 10 new unit tests + a doc-test.

Are there any user-facing changes?

Additive only. set_bloom_filter_ndv (and per-column) emit a deprecation warning pointing to _max_ndv.

- BloomFilterPropertiesBuilder with with_fpp / with_max_ndv / build / try_build. Two entry points: BloomFilterProperties::builder() and BloomFilterPropertiesBuilder::new().
- New WriterPropertiesBuilder setters: set_bloom_filter_properties and set_column_bloom_filter_properties. NDV from the supplied struct is honoured and not overridden by the row-group-size fallback.
- Renamed set_bloom_filter_ndv to set_bloom_filter_max_ndv; kept old names but marked deprecated.
- fpp range check extracted into a single shared helper.
@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 2, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @CuteChuanChuan -- I think this is a grreat step forward and is not an breaking API change (so it is good!)

#9667 says

Make BloomFilterProperties fields non pub
add relevant accessors

It might be worth considering making fpp/ndv NOT public fields that can only be set via accessors, but we can do that as a follow on PR

Also FYI @adriangb

self
}

/// Deprecated alias for [`Self::set_bloom_filter_max_ndv`].
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment thread parquet/src/file/properties.rs Outdated
///
/// Use [`BloomFilterProperties::builder`] or [`BloomFilterPropertiesBuilder::new`]
/// as the entry point. Unset fields fall back to [`DEFAULT_BLOOM_FILTER_FPP`]
/// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the note about unset fields seems like an implmentation detail and is redundant with the comments on with_fpp and with_ndv

I recommend removing this mention

Suggested change
/// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time.
/// as the entry point.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we should mention somewhere what the defaults are. In particular what fpp we target by default and what the memory usage will be during building if you don't provide a custom / known max ndv.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — added concrete defaults and a worked example to with_max_ndv in 6a2eacb:

  • with_fpp now states the default is 0.05 (5%).
  • with_max_ndv documents the default of 1_048_576 and shows that with the default fpp this reserves ~1 MiB per column for the filter bitset, with the full derivation: num_bits = -8 * ndv / ln(1 - fpp^(1/8)), then next_power_of_two

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the note about unset fields seems like an implmentation detail and is redundant with the comments on with_fpp and with_ndv

I recommend removing this mention

Removed this mentions. Thanks!

/// Panics if the configured `fpp` is not in `(0.0, 1.0)` exclusive.
/// Use [`Self::try_build`] for a non-panicking alternative.
pub fn build(self) -> BloomFilterProperties {
self.try_build().unwrap_or_else(|e| panic!("{e}"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a new panic (it just moved from set_bloom_filter_fpp)

Copy link
Copy Markdown
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great thank you for working on this!

Comment thread parquet/src/file/properties.rs Outdated
///
/// Use [`BloomFilterProperties::builder`] or [`BloomFilterPropertiesBuilder::new`]
/// as the entry point. Unset fields fall back to [`DEFAULT_BLOOM_FILTER_FPP`]
/// and [`DEFAULT_BLOOM_FILTER_NDV`] at build time.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think we should mention somewhere what the defaults are. In particular what fpp we target by default and what the memory usage will be during building if you don't provide a custom / known max ndv.

- Drop the redundant note on the struct-level doc that mentioned `DEFAULT_BLOOM_FILTER_FPP` / `DEFAULT_BLOOM_FILTER_NDV` fall-back
- Spell out the concrete defaults on `with_fpp` (0.05 = 5%) and `with_max_ndv` (1,048,576)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[parquet] Add BloomFilterProperties builder API to make bloom filter configuration explicit

3 participants