Skip to content

[Variant] Add VariantBuilder::new_with_buffers to write to existing buffers #7912

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jul 13, 2025

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 11, 2025

Which issue does this PR close?

Rationale for this change

I would like to be able to write Variants directly into the target buffer when writing multiple variants However, the current VariantBuilder allocates a new bufffer for each variant

What changes are included in this PR?

  1. Add VariantBuilder::new_with_buffers and docs and tests

You can see how this API can be used to write directly into a buffer in VariantArrayBuilder in this PR:

Are these changes tested?

Yes new tests

Are there any user-facing changes?

New API

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2025
@@ -1916,6 +1989,80 @@ mod tests {
assert_eq!(metadata.num_field_names(), 3);
}

/// Test reusing buffers with nested objects
#[test]
fn test_with_existing_buffers_nested() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a really cool test case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would have loved to append the variant but I hit this bug:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you apply this diff, it should give you what you need: #7914.

I don't love that we need to use a builder to reencode the object/list. I'm going to spend some time thinking about a more elegant solution.

I wonder if this PR with its existing buffers can help 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you apply this diff, it should give you what you need: #7914.

Thank you @friendlymatthew -- that looks great ❤️ -- I left a few comments

I don't love that we need to use a builder to reencode the object/list. I'm going to spend some time thinking about a more elegant solution.

I think potentially a more performant solution would be to optimize #7914 to copy the Variant bytes directly and then update the field_ids.

However, that will get tricky I think if the field_ids in the variant under construction are different sizes than the old one.

I suggest we start with the simple thing (what you have) and then we can optimize if/when we have benchmarks that show it would help

@alamb
Copy link
Contributor Author

alamb commented Jul 13, 2025

I originally split this PR out from #7911 thinking it would make it eaiser to review

However, @scovich has been providing great feedback directly on #7911 directly, and thus to avoid too many PRs to keep track of I plan to merge this one in and keep working on #7911

@alamb alamb merged commit daf31be into apache:main Jul 13, 2025
12 checks passed
@alamb alamb deleted the alamb/new_from_metadata branch July 14, 2025 13:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Support VariantBuilder to write to buffers owned by the caller
2 participants