[Variant] Avoid extra allocation in object builder #7935
@alamb Please help review this when you have time, thanks.
parquet-variant/src/builder.rs
Outdated
        (state, self.validate_unique_fields)
        let validate_unique_fields = self.validate_unique_fields;

        match &mut self.parent_state {
Did not find a better solution for this. I can change this if there is a better solution.
I think we can use the pattern in this PR: klion26#1
Fixed, it becomes cleaner now.
91cfb73 to 1f2bcc3
Thank you for this PR @klion26 -- I plan to review it carefully tomorrow
This commit reuses the parent buffer for the object builder, which avoids the extra allocation for the object and the later buffer copy.
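To make the difference concrete, here is a minimal sketch of the two strategies using plain `Vec<u8>` buffers. The function names are hypothetical and the real builder in parquet-variant/src/builder.rs is more involved; this only illustrates the idea of recording a start offset and appending in place rather than building in a scratch buffer and copying.

```rust
/// Old approach (sketch): build the object in its own buffer, then copy it
/// into the parent. This costs one allocation and one copy per object.
fn build_with_copy(parent: &mut Vec<u8>, fields: &[u8]) {
    let mut scratch = Vec::new(); // extra allocation
    scratch.extend_from_slice(fields);
    parent.extend_from_slice(&scratch); // later buffer copy
}

/// New approach (sketch): remember where the object starts in the parent
/// buffer and append the field bytes directly to it.
fn build_in_place(parent: &mut Vec<u8>, fields: &[u8]) -> usize {
    let object_start_offset = parent.len();
    parent.extend_from_slice(fields);
    object_start_offset
}
```

Both functions leave the parent buffer with the same bytes; the second one only needs the start offset to find the object's data later.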
1f2bcc3 to 6096566
parquet-variant/src/builder.rs
Outdated
@@ -1064,20 +1084,58 @@ impl<'a> ObjectBuilder<'a> {
        key: &str,
        value: T,
    ) -> Result<(), ArrowError> {
        // Get metadata_builder from parent state
        let metadata_builder = self.parent_state.metadata_builder();
        match &mut self.parent_state {
same as below
Here is a proposal of how to avoid the duplication:
Fixed
@alamb thank you! I've rebased on the main branch and hardened the nested object test.
6096566 to 442c935
Thank you @klion26 -- this looks quite cool.
I think we should try and avoid the replication when possible -- I left a suggestion on how to do that and get the compiler to be happy
Likewise I think it would be nice to add a few more tests. Let me know if it makes sense
        assert_eq!(inner_inner_object_d.len(), 1);
        assert_eq!(inner_inner_object_d.field_name(0).unwrap(), "cc");
        assert_eq!(inner_inner_object_d.field(0).unwrap(), Variant::from("dd"));

        assert_eq!(outer_object.field_name(1).unwrap(), "b");
        assert_eq!(outer_object.field(1).unwrap(), Variant::from(true));
    }
I think we should also add tests for the rollback behavior (as in starting an ObjectBuilder but not calling finish).
Similarly, we should test a list builder rollback too.
Okay, I will add more tests for this. Currently, coverage of a list inside an object depends on the from_json test; a unit test for it in builder.rs is better, so I'll add one.
I'm not sure whether tests like test_xx_no_finish() (such as test_object_builder_to_list_builder_inner_no_finish()) are enough to cover the rollback logic.
We've called drop in the test_xx_no_finish() tests; do we need to call drop if we add tests to cover the rollback logic?
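For readers following the rollback discussion, here is a minimal sketch of the behavior being tested, assuming a heavily simplified builder that writes into a borrowed `Vec<u8>`. The field names `object_start_offset` and `has_been_finished` come from the PR description; the `append` method and everything else about this struct is hypothetical and not the crate's actual code.

```rust
/// Simplified, hypothetical child builder that appends into its parent's buffer.
struct ObjectBuilder<'a> {
    buffer: &'a mut Vec<u8>,
    /// Where this object's bytes begin in the parent buffer.
    object_start_offset: usize,
    /// Set by `finish`; checked in `Drop` to decide whether to roll back.
    has_been_finished: bool,
}

impl<'a> ObjectBuilder<'a> {
    fn new(buffer: &'a mut Vec<u8>) -> Self {
        let object_start_offset = buffer.len();
        Self { buffer, object_start_offset, has_been_finished: false }
    }

    fn append(&mut self, byte: u8) {
        self.buffer.push(byte);
    }

    fn finish(mut self) {
        self.has_been_finished = true;
        // `self` is dropped here; `Drop` sees the flag and keeps the bytes.
    }
}

impl Drop for ObjectBuilder<'_> {
    fn drop(&mut self) {
        if !self.has_been_finished {
            // Roll back: discard everything written since the object started.
            self.buffer.truncate(self.object_start_offset);
        }
    }
}
```

In Rust, `Drop::drop` runs whether the builder is passed to `drop(...)` explicitly or simply goes out of scope, so a rollback test built on a design like this can use either form.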
parquet-variant/src/builder.rs
Outdated
@@ -999,8 +1000,17 @@ impl<'a> ListBuilder<'a> {
        let offset_size = int_size(data_size);

        // Get parent's buffer
        let offset_shift = match &self.parent_state {
It might be nice if this was a function in ParentState, something like
-        let offset_shift = match &self.parent_state {
+        let offset_shift = self.parent_state.object_start_offset();
Fixed
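As an illustration of the suggested refactor, moving an inline match into a method on the enum could look roughly like the sketch below. The variant fields and the choice to return 0 for non-object parents are assumptions made for this example; the thread does not show the real definition of ParentState or what the real method returns.

```rust
/// Hypothetical, simplified stand-in for the parent-state enum.
enum ParentState {
    Variant,
    List,
    Object { object_start_offset: usize },
}

impl ParentState {
    /// One place for the match, so call sites do not repeat it.
    /// Assumed semantics: only an object parent contributes a shift.
    fn object_start_offset(&self) -> usize {
        match self {
            ParentState::Variant | ParentState::List => 0,
            ParentState::Object { object_start_offset } => *object_start_offset,
        }
    }
}
```

A call site then shrinks to a single line such as `let offset_shift = self.parent_state.object_start_offset();`, which is the shape proposed in the review comment.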
parquet-variant/src/builder.rs
Outdated
        let start_offset = match &parent_state {
            ParentState::Variant { buffer, .. } => buffer.offset(),
            ParentState::List { buffer, .. } => buffer.offset(),
            ParentState::Object { buffer, .. } => buffer.offset(),
        };
-        let start_offset = match &parent_state {
-            ParentState::Variant { buffer, .. } => buffer.offset(),
-            ParentState::List { buffer, .. } => buffer.offset(),
-            ParentState::Object { buffer, .. } => buffer.offset(),
-        };
+        let start_offset = parent_state.buffer().offset();
I tried this, but it requires parent_state to be mutable, so I added a function that retrieves the current offset of the buffer instead.
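The borrow issue described here can be reproduced in a small sketch. It assumes each variant holds a `&mut` reference to a buffer type, which matches the `{ buffer, .. }` patterns quoted above; `ValueBuffer` and the method name `buffer_current_offset` are hypothetical, since the thread does not name the function that was added.

```rust
/// Hypothetical buffer type exposing its current write position.
struct ValueBuffer(Vec<u8>);

impl ValueBuffer {
    fn offset(&self) -> usize {
        self.0.len()
    }
}

/// Simplified stand-in: every variant borrows the buffer mutably.
enum ParentState<'a> {
    Variant { buffer: &'a mut ValueBuffer },
    List { buffer: &'a mut ValueBuffer },
    Object { buffer: &'a mut ValueBuffer },
}

impl<'a> ParentState<'a> {
    /// Handing out `&mut ValueBuffer` requires `&mut self`, so a caller
    /// needs a mutable binding just to read the offset through it.
    fn buffer(&mut self) -> &mut ValueBuffer {
        match self {
            ParentState::Variant { buffer }
            | ParentState::List { buffer }
            | ParentState::Object { buffer } => &mut **buffer,
        }
    }

    /// Read-only accessor: works through `&self`, no mutable binding needed.
    fn buffer_current_offset(&self) -> usize {
        match self {
            ParentState::Variant { buffer }
            | ParentState::List { buffer }
            | ParentState::Object { buffer } => buffer.offset(),
        }
    }
}
```

With the read-only accessor, code that only needs the start offset can keep an immutable `parent_state` binding and still avoid repeating the three-arm match at each call site.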
parquet-variant/src/builder.rs
Outdated
        let data_size = self.buffer.offset();
        let num_fields = self.fields.len();
        let is_large = num_fields > u8::MAX as usize;
        let metadata_builder = match &self.parent_state {
it would be nice to put this into a method as well rather than an inline match statement
fixed

        let starting_offset = self.object_start_offset;

        // Shift existing data to make room for the header
As a follow-on PR, we can consider avoiding this extra splice somehow (for example, by preallocating the size). Future work, though.
Filed an issue (#7960) to track this.
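For context on the splice being discussed: when the field bytes are written first and the header is only known at finish time, the header has to be inserted in front of data that is already in the buffer. A minimal sketch with a plain `Vec<u8>` follows; `insert_header` is a hypothetical helper, not a function from the crate.

```rust
/// Insert `header` at `starting_offset`, shifting the bytes that were
/// already written after that position to the right.
fn insert_header(buffer: &mut Vec<u8>, starting_offset: usize, header: &[u8]) {
    // An empty range means nothing is removed; the header bytes are inserted
    // and the tail of the vector is moved to make room.
    buffer.splice(starting_offset..starting_offset, header.iter().copied());
}
```

The cost is the move of everything after `starting_offset`, which is why the review flags it as something a later change might avoid.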
parquet-variant/src/builder.rs
Outdated
        buffer[header_pos..header_pos + offset_size as usize]
            .copy_from_slice(&data_size_bytes[..offset_size as usize]);

        let start_offset_shift = match &self.parent_state {
here is another place we could use the method and avoid an inline match
fixed
@alamb Thanks for the detailed review, I will address the comments soon.
e24d8e3 to deb0782
Which issue does this PR close?
- ObjectBuilder #7899

This PR avoids the extra allocation for the object builder and the later buffer copy.
Rationale for this change
Avoid the extra allocation in the object builder, as described in the issue.
What changes are included in this PR?
- object_start_offset in ObjectBuilder, which describes the start offset in the parent buffer for the current object
- has_been_finished in ObjectBuilder, which describes whether the current object has been finished; it is used in the Drop function
- the new, finish, parent_state, and drop functions, according to the change
Are these changes tested?
The logic is covered by the existing tests.
Are there any user-facing changes?
No