Support merge for Distribution #15290

Closed
xudong963 opened this issue Mar 18, 2025 · 3 comments
Labels
enhancement New feature or request

Comments

@xudong963
Member

Is your feature request related to a problem or challenge?

I'm working on the ticket: #10316.

Since we plan to replace all Precision with Distribution (synnada-ai#63), I will presumably use Distribution in statistics while designing #10316.

There is one place where I need to merge statistics, and that merge will have to extend to Distribution.

Specifically, I need to compute partition-level statistics: files are grouped into file groups, each file group is treated as a partition, and different partitions are processed in parallel. The partition-level statistics therefore come from merging the statistics of the files in a file group.
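To make the file group case concrete, here is a minimal sketch of folding per-file statistics into one partition-level value. `FileStats` and `partition_stats` are hypothetical names made up for illustration, not DataFusion types, and the sketch assumes each file reports a row count plus integer min/max for a single column.

```rust
/// Hypothetical per-file statistics for one column (not a DataFusion type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct FileStats {
    pub num_rows: u64,
    pub min: i64,
    pub max: i64,
}

/// Fold the statistics of every file in a file group into one
/// partition-level value. Returns `None` for an empty file group.
pub fn partition_stats(file_group: &[FileStats]) -> Option<FileStats> {
    file_group.iter().copied().reduce(|acc, f| FileStats {
        // Row counts add up; saturate instead of overflowing.
        num_rows: acc.num_rows.saturating_add(f.num_rows),
        // The partition range is the envelope of the file ranges.
        min: acc.min.min(f.min),
        max: acc.max.max(f.max),
    })
}
```

Because each file group is folded independently, this kind of merge can run per partition without coordination between partitions.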

Describe the solution you'd like

Create a function that combines their statistical properties into a new distribution. The most appropriate approach is to create a GenericDistribution that approximates the mixture of the two input distributions.

pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {
    ...
}
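As a rough illustration of what "approximating the mixture" could mean, the sketch below merges two summaries using the standard mixture formulas for mean and variance. Everything here is an assumption for discussion: `Summary` is a simplified stand-in (plain `f64` fields, not DataFusion's Distribution), and the row counts used as mixture weights are not part of the signature proposed above.

```rust
/// Hypothetical, simplified summary of a distribution.
/// This is NOT DataFusion's `Distribution` type.
#[derive(Debug, Clone, PartialEq)]
pub struct Summary {
    pub mean: f64,
    pub variance: f64,
    pub min: f64,
    pub max: f64,
}

/// Merge two summaries as a weighted mixture.
/// `rows_a` and `rows_b` are assumed row counts used as mixture weights.
/// Returns `None` when the total row count is zero or overflows.
pub fn merge_summaries(
    a: &Summary,
    rows_a: u64,
    b: &Summary,
    rows_b: u64,
) -> Option<Summary> {
    let total = rows_a.checked_add(rows_b)?;
    if total == 0 {
        return None;
    }
    let wa = rows_a as f64 / total as f64;
    let wb = rows_b as f64 / total as f64;

    // Mixture mean is the weighted average of the component means.
    let mean = wa * a.mean + wb * b.mean;
    // E[X^2] of the mixture is the weighted sum of each component's E[X^2].
    let second_moment = wa * (a.variance + a.mean * a.mean)
        + wb * (b.variance + b.mean * b.mean);
    // Clamp at zero to absorb floating-point rounding.
    let variance = (second_moment - mean * mean).max(0.0);

    Some(Summary {
        mean,
        variance,
        // The merged range is the envelope of both input ranges.
        min: a.min.min(b.min),
        max: a.max.max(b.max),
    })
}
```

The mean, variance, and range can be combined exactly this way when the inputs and weights are exact. The median cannot be derived from two medians alone, so it is left out of the sketch.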

I'll open a PR and we can do more discussions based on the PR.

Describe alternatives you've considered

No

Additional context

No

xudong963 added the enhancement label on Mar 18, 2025
xudong963 self-assigned this on Mar 18, 2025
@alamb
Contributor

alamb commented Mar 18, 2025

I'm working on the ticket: #10316.

Create a function that combines their statistical properties into a new distribution. The most appropriate approach is to create a GenericDistribution that approximates the mixture of the two input distributions.

I think implementing the actual analysis in

Will require an accurate distribution (not just an approximation). Using ProgressiveEval requires (for correctness) that we know the ranges do not overlap.
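To illustrate the non-overlap requirement, here is a minimal sketch of a disjointness check over per-file value ranges. `ranges_are_disjoint` is a hypothetical helper, not an existing DataFusion function. It assumes each range is an inclusive `(min, max)` pair with `min <= max`, and the answer is only trustworthy if those bounds are exact rather than estimated.

```rust
/// Returns true if no two inclusive `(min, max)` ranges overlap.
/// Hypothetical helper; assumes `min <= max` for every range and that
/// the bounds are exact, not estimates.
pub fn ranges_are_disjoint(ranges: &[(i64, i64)]) -> bool {
    let mut sorted = ranges.to_vec();
    // After sorting by lower bound, only neighbours need to be compared.
    sorted.sort_by_key(|r| r.0);
    sorted.windows(2).all(|w| w[0].1 < w[1].0)
}
```

With inclusive bounds, two ranges that merely touch (for example `(0, 10)` and `(10, 19)`) count as overlapping, which is the conservative choice here.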

@xudong963
Member Author

Will require an accurate distribution (not just an approximation)

Yes, it depends on whether each input distribution is accurate. If they are, the merged distribution should be accurate too; otherwise, we should merge them conservatively.
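One way to read "accurate if the inputs are accurate, otherwise conservative" is sketched below for a merged maximum. `Estimate` and `merge_max` are hypothetical names made up for this example. The enum only loosely mirrors the exact/inexact idea and is not an existing DataFusion type, and treating "conservative" as "downgrade to inexact" is an assumption.

```rust
/// Hypothetical value that is either known exactly or only estimated.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Estimate {
    Exact(i64),
    Inexact(i64),
}

impl Estimate {
    /// The underlying value, regardless of how certain it is.
    fn value(self) -> i64 {
        match self {
            Estimate::Exact(v) | Estimate::Inexact(v) => v,
        }
    }
}

/// Merge two maxima: the result is exact only when both inputs are exact;
/// any inexact input downgrades the merged value to inexact.
pub fn merge_max(a: Estimate, b: Estimate) -> Estimate {
    let merged = a.value().max(b.value());
    match (a, b) {
        (Estimate::Exact(_), Estimate::Exact(_)) => Estimate::Exact(merged),
        _ => Estimate::Inexact(merged),
    }
}
```

A consumer that needs correctness guarantees, such as the non-overlap check discussed above, would then only act on `Exact` results.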

@xudong963
Member Author

There is a proposal: #15296
