Support `merge` for `Distribution` #15290

xudong963 · 2025-03-18T07:57:21Z

Is your feature request related to a problem or challenge?

I'm working on the ticket: #10316.

Since we plan to replace all uses of Precision with Distribution (synnada-ai#63), I'll presumably use Distribution for statistics in the design for #10316.

The design needs to merge statistics in one place, and that merge will have to carry over to Distribution.

Specifically, I need to compute partition-level statistics: files are grouped into file groups, each file group is treated as a partition, and the partitions are processed in parallel. The partition-level statistics therefore come from merging the statistics of the files in a file group.

Describe the solution you'd like

Create a function that combines the statistical properties of two distributions into a new distribution. The most appropriate approach is to create a GenericDistribution that approximates the mixture of the two input distributions.

pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {
    ...
}
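To make the "mixture" idea concrete, here is a minimal sketch of how the summary statistics of two inputs could be combined. It does not use DataFusion's `Distribution` type: `Summary` and `merge_summaries` are hypothetical names, and weighting by row count is an assumption that the signature above does not include. The mean and variance follow the standard formulas for a two-component mixture, and the range is the union of the input ranges.

```rust
// Hypothetical stand-in for a distribution summary (NOT DataFusion's `Distribution`).
#[derive(Debug, Clone, PartialEq)]
struct Summary {
    count: u64,    // number of rows the summary describes (assumed to be known)
    mean: f64,
    variance: f64, // population variance
    min: f64,
    max: f64,
}

// Approximates the mixture of `a` and `b`, weighting each input by its row count.
// Returns `None` if both inputs are empty or the total row count overflows.
fn merge_summaries(a: &Summary, b: &Summary) -> Option<Summary> {
    let total = a.count.checked_add(b.count)?;
    if total == 0 {
        return None;
    }
    let wa = a.count as f64 / total as f64;
    let wb = b.count as f64 / total as f64;

    // Mixture mean: weighted average of the component means.
    let mean = wa * a.mean + wb * b.mean;
    // Mixture variance: weighted second moment minus the squared mixture mean.
    let second_moment =
        wa * (a.variance + a.mean * a.mean) + wb * (b.variance + b.mean * b.mean);
    // Clamp at zero to absorb floating-point rounding.
    let variance = (second_moment - mean * mean).max(0.0);

    Some(Summary {
        count: total,
        mean,
        variance,
        // The merged range is the union of the two input ranges.
        min: a.min.min(b.min),
        max: a.max.max(b.max),
    })
}
```

Folding this function over the files of a file group would give one partition-level summary per group. The merged summary only keeps moments and a range, so it is an approximation of the true mixture, not an exact representation of it.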

I'll open a PR and we can discuss the details there.

Describe alternatives you've considered

No

Additional context

No


alamb · 2025-03-18T08:33:43Z

> I'm working on the ticket: #10316.

> Create a function that combines their statistical properties into a new distribution. The most appropriate approach is to create a GenericDistribution that approximates the mixture of the two input distributions.

I think implementing the actual analysis in "Optimized version of SortPreservingMerge that doesn't actually compare sort keys of the key ranges are ordered" (#10316) will require an accurate distribution (not just an approximation). Using ProgressiveEval requires (for correctness) that we know the ranges do not overlap.
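As an illustration of why exactness matters here, the non-overlap condition could be checked roughly as follows. This is only a sketch: `KeyRange` and `ranges_are_disjoint` are hypothetical names, not part of DataFusion, and the check is only meaningful if every `min` and `max` is an exact bound for its partition.

```rust
// Hypothetical exact range of the sort key within one partition.
#[derive(Debug, Clone, Copy)]
struct KeyRange {
    min: i64,
    max: i64,
}

// Returns true if no two ranges share any value.
// Touching ranges (one range's max equals another's min) count as overlapping,
// which is the conservative choice.
fn ranges_are_disjoint(ranges: &[KeyRange]) -> bool {
    let mut sorted = ranges.to_vec();
    sorted.sort_by_key(|r| r.min);
    // After sorting by `min`, it is enough to compare each range with its successor.
    sorted.windows(2).all(|w| w[0].max < w[1].min)
}
```

If the ranges came from approximate statistics, a `true` result would not guarantee that the underlying data is actually disjoint, so the check could not be used to skip sort-key comparisons.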

xudong963 · 2025-03-18T08:42:50Z

> Will require an accurate distribution (not just an approximation).

Yes, it depends on whether each input distribution is accurate. If they are, the merged distribution should be accurate too; otherwise, we should merge them conservatively.
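One way to read "merge conservatively" is sketched below, under two assumptions that are mine and not from the proposal: each input carries a flag saying whether its range is exact, and an inexact range is still an outer bound on the data. `RangeEstimate` and `merge_conservatively` are hypothetical names.

```rust
// Hypothetical range with an exactness flag (NOT a DataFusion type).
// Assumption: when `exact` is false, [lo, hi] still contains every value.
#[derive(Debug, Clone, Copy, PartialEq)]
struct RangeEstimate {
    lo: i64,
    hi: i64,
    exact: bool,
}

// The merged range is the union of the inputs, and it is only marked exact
// when both inputs are exact.
fn merge_conservatively(a: RangeEstimate, b: RangeEstimate) -> RangeEstimate {
    RangeEstimate {
        lo: a.lo.min(b.lo),
        hi: a.hi.max(b.hi),
        exact: a.exact && b.exact,
    }
}
```

With this rule, a single inexact file in a file group makes the partition-level range inexact, so a consumer that needs exact ranges can tell that it must fall back to the regular path.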

xudong963 · 2025-03-18T09:50:41Z

There is a proposal: #15296

xudong963 added the enhancement New feature or request label Mar 18, 2025

xudong963 self-assigned this Mar 18, 2025

xudong963 mentioned this issue Mar 18, 2025

feat: support merge for Distribution #15296

Closed

xudong963 closed this as completed Mar 22, 2025
