[Variant] Avoid extra allocation in object builder #7935

Open: wants to merge 4 commits into base: main

klion26 (Member) commented Jul 16, 2025

Which issue does this PR close?

This PR avoids the extra allocation in the object builder and the subsequent buffer copy.

Rationale for this change

Avoid the extra allocation in the object builder, as described in the issue.

What changes are included in this PR?

  • Add object_start_offset to ObjectBuilder: the offset in the parent buffer at which the current object starts.
  • Add has_been_finished to ObjectBuilder: whether the current object has been finished. It is used in the Drop implementation.
  • Update the logic of new, finish, parent_state, and drop to match these changes.
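
The changes listed above can be pictured with a minimal sketch. It is not the code from this PR: the Vec<u8> parent buffer, the append_bytes helper, the demo function, and the truncate-on-drop rollback are assumptions made for illustration. Only the field names object_start_offset and has_been_finished come from the description.

```rust
/// Hypothetical, simplified stand-in for the real builder: the parent owns
/// the byte buffer and the object builder only borrows it.
struct ObjectBuilder<'a> {
    parent_buffer: &'a mut Vec<u8>,
    /// Offset in the parent buffer where this object's bytes begin.
    object_start_offset: usize,
    /// Set by `finish`; checked in `Drop` to decide whether to roll back.
    has_been_finished: bool,
}

impl<'a> ObjectBuilder<'a> {
    fn new(parent_buffer: &'a mut Vec<u8>) -> Self {
        // Remember where the parent buffer ended before this object started.
        let object_start_offset = parent_buffer.len();
        Self {
            parent_buffer,
            object_start_offset,
            has_been_finished: false,
        }
    }

    /// Write directly into the parent buffer; no private buffer is allocated.
    fn append_bytes(&mut self, bytes: &[u8]) {
        self.parent_buffer.extend_from_slice(bytes);
    }

    /// Mark the object as complete so that `Drop` keeps its bytes.
    fn finish(mut self) {
        self.has_been_finished = true;
    }
}

impl Drop for ObjectBuilder<'_> {
    fn drop(&mut self) {
        if !self.has_been_finished {
            // Assumed rollback: discard everything written since the object started.
            self.parent_buffer.truncate(self.object_start_offset);
        }
    }
}

/// Returns the parent buffer after a finished object and after an abandoned one.
fn demo() -> (Vec<u8>, Vec<u8>) {
    let mut finished = vec![0xAA];
    let mut builder = ObjectBuilder::new(&mut finished);
    builder.append_bytes(&[1, 2]);
    builder.finish();

    let mut rolled_back = vec![0xAA];
    {
        let mut builder = ObjectBuilder::new(&mut rolled_back);
        builder.append_bytes(&[1, 2]);
        // Dropped here without calling `finish`.
    }
    (finished, rolled_back)
}
```

In this sketch the finished object leaves its bytes in the parent buffer, while the abandoned one restores the buffer to its length at construction time.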

Are these changes tested?

The logic is covered by the existing tests.

Are there any user-facing changes?

No

github-actions bot added the parquet (Changes to the parquet crate) label Jul 16, 2025

klion26 (Member Author) commented Jul 16, 2025

@alamb Please help to review this when you have time, thanks.

(state, self.validate_unique_fields)
let validate_unique_fields = self.validate_unique_fields;

match &mut self.parent_state {
klion26 (Member Author), Jul 16, 2025

Did not find a better solution for this. I can change this if there is a better solution.

Contributor

I think we can use the pattern in this PR: klion26#1

Member Author

Fixed, it becomes cleaner now.

klion26 (Member Author) commented Jul 16, 2025

The test in builder.rs completed successfully; I will investigate why CI fails.
Pushed a fixup for the failing CI.

klion26 force-pushed the 7899-avoid-extra-allocation-in-object-builder branch from 91cfb73 to 1f2bcc3, July 16, 2025 10:29

alamb (Contributor) left a comment

Thank you for this PR @klion26 -- I plan to review it carefully tomorrow

klion26 added 2 commits July 17, 2025 09:59
This commit will reuse the parent buffer for object builder.
It can avoid the extra allocation for the object and the later buffer copy.
klion26 force-pushed the 7899-avoid-extra-allocation-in-object-builder branch from 1f2bcc3 to 6096566, July 17, 2025 03:48
@@ -1064,20 +1084,58 @@ impl<'a> ObjectBuilder<'a> {
key: &str,
value: T,
) -> Result<(), ArrowError> {
// Get metadata_builder from parent state
let metadata_builder = self.parent_state.metadata_builder();
match &mut self.parent_state {
Member Author

same as below

Contributor

Here is a proposal of how to avoid the duplication:

Member Author

Fixed

klion26 (Member Author) commented Jul 17, 2025

@alamb thank you! I've rebased on the main branch and hardened the nested object test.

alamb (Contributor) left a comment

Thank you @klion26 -- this looks quite cool.

I think we should avoid the duplication where possible -- I left a suggestion on how to do that while keeping the compiler happy

Likewise I think it would be nice to add a few more tests. Let me know if it makes sense

assert_eq!(inner_inner_object_d.len(), 1);
assert_eq!(inner_inner_object_d.field_name(0).unwrap(), "cc");
assert_eq!(inner_inner_object_d.field(0).unwrap(), Variant::from("dd"));

assert_eq!(outer_object.field_name(1).unwrap(), "b");
assert_eq!(outer_object.field(1).unwrap(), Variant::from(true));
}
Contributor

I think we should also add tests for the rollback behavior (that is, starting an ObjectBuilder but never calling finish).

Similarly, we should test a list builder rollback too.

klion26 (Member Author), Jul 18, 2025

Okay, I will add more tests for this. Currently, coverage of a list inside an object depends on the from_json test; a unit test in builder.rs would be better, so I'll add one.

Are tests like test_xx_no_finish() (for example test_object_builder_to_list_builder_inner_no_finish()) enough to cover the rollback logic?

The test_xx_no_finish() tests already call drop; do new tests for the rollback logic need to call drop as well?

@@ -999,8 +1000,17 @@ impl<'a> ListBuilder<'a> {
let offset_size = int_size(data_size);

// Get parent's buffer
let offset_shift = match &self.parent_state {
Contributor

It might be nice if this was a function in ParentState, something like

Suggested change
- let offset_shift = match &self.parent_state {
+ let offset_shift = self.parent_state.object_start_offset();

Member Author

Fixed

Comment on lines 1053 to 1057
let start_offset = match &parent_state {
ParentState::Variant { buffer, .. } => buffer.offset(),
ParentState::List { buffer, .. } => buffer.offset(),
ParentState::Object { buffer, .. } => buffer.offset(),
};
Contributor

Suggested change
- let start_offset = match &parent_state {
-     ParentState::Variant { buffer, .. } => buffer.offset(),
-     ParentState::List { buffer, .. } => buffer.offset(),
-     ParentState::Object { buffer, .. } => buffer.offset(),
- };
+ let start_offset = parent_state.buffer().offset();

Member Author

Tried this way, but it needs to change the parent_state to mutable. Added a function to retrieve the current offset of the buffer.
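
A rough sketch of the trade-off described in this reply, under stated assumptions: ValueBuffer is reduced to a wrapper around Vec<u8>, the other ParentState fields are omitted, and the name buffer_current_offset is hypothetical (the thread does not name the added function). A mutable accessor has to take &mut self, while a read-only offset accessor can take &self and so works on an immutable parent_state.

```rust
/// Hypothetical stand-in for the buffer type used by the builders.
struct ValueBuffer(Vec<u8>);

impl ValueBuffer {
    /// Current write position, i.e. the number of bytes written so far.
    fn offset(&self) -> usize {
        self.0.len()
    }
}

/// Simplified parent state: every variant holds a mutable borrow of its buffer.
/// All other fields are left out of this sketch.
enum ParentState<'a> {
    Variant { buffer: &'a mut ValueBuffer },
    List { buffer: &'a mut ValueBuffer },
    Object { buffer: &'a mut ValueBuffer },
}

impl ParentState<'_> {
    /// Handing out `&mut ValueBuffer` requires `&mut self`, so callers need
    /// a mutable `parent_state` even if they only want to read the offset.
    fn buffer(&mut self) -> &mut ValueBuffer {
        match self {
            ParentState::Variant { buffer }
            | ParentState::List { buffer }
            | ParentState::Object { buffer } => &mut **buffer,
        }
    }

    /// Read-only accessor (hypothetical name): works through `&self`.
    fn buffer_current_offset(&self) -> usize {
        match self {
            ParentState::Variant { buffer }
            | ParentState::List { buffer }
            | ParentState::Object { buffer } => buffer.offset(),
        }
    }
}

/// Usage: the start offset can be read without making `parent_state` mutable.
fn start_offset_of(parent_state: &ParentState<'_>) -> usize {
    parent_state.buffer_current_offset()
}
```

Either accessor replaces the three-arm inline match at the call site; the &self variant simply avoids changing the mutability of the caller's binding.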

let data_size = self.buffer.offset();
let num_fields = self.fields.len();
let is_large = num_fields > u8::MAX as usize;
let metadata_builder = match &self.parent_state {
Contributor

It would be nice to put this into a method as well, rather than an inline match statement.

Member Author

fixed


let starting_offset = self.object_start_offset;

// Shift existing data to make room for the header
Contributor

As a follow on PR we can consider avoiding this extra splice somehow (by preallocating the size or something). Future work though

Member Author

Filed an issue (#7960) to track this.
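
For readers unfamiliar with the splice under discussion, here is a minimal sketch of the idea, assuming the shared buffer is a plain Vec<u8>. The function name insert_header and the header bytes are illustrative and not taken from the PR. Because the object's data is already in the buffer when the header size becomes known, the data is shifted right to make room.

```rust
/// Insert `header` in front of object data that already sits in the shared
/// buffer starting at `starting_offset`.
fn insert_header(buffer: &mut Vec<u8>, starting_offset: usize, header: &[u8]) {
    // An empty range removes nothing: the bytes from `starting_offset` onward
    // are shifted right by `header.len()` and the header is written in the gap.
    // The change is applied when the returned iterator is dropped.
    drop(buffer.splice(starting_offset..starting_offset, header.iter().copied()));
}
```

The shift moves every byte of the object's data once, which is the cost the follow-up issue would aim to remove, for example by reserving header space up front.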

buffer[header_pos..header_pos + offset_size as usize]
.copy_from_slice(&data_size_bytes[..offset_size as usize]);

let start_offset_shift = match &self.parent_state {
Contributor

Here is another place where we could use the method and avoid an inline match.

Member Author

fixed

klion26 (Member Author) commented Jul 18, 2025

@alamb Thanks for the detailed review; I will address the comments soon.

klion26 force-pushed the 7899-avoid-extra-allocation-in-object-builder branch from e24d8e3 to deb0782, July 18, 2025 05:11

klion26 (Member Author) left a comment

@alamb Thanks for the detailed review and the suggestions; the code is much cleaner now. I've addressed most of the comments (in commit deb0782) and will add the tests covering the rollback logic after confirmation.

Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Avoid extra allocation in ObjectBuilder
2 participants