-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Improve variant_get performance on a perfect shredding
#8887
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[Variant] Improve variant_get performance on a perfect shredding
#8887
Conversation
| // Try to return the typed value directly when we have a perfect shredding match. | ||
| if !matches!(as_field.data_type(), DataType::Struct(_)) { | ||
| if let Some(typed_value) = target.typed_value_field() { | ||
| let types_match = typed_value.data_type() == as_field.data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is probably overly restrictive, Utf8, LargeUtf8, and Utf8View should be allowed. Maybe check if cast-able.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the current code is correct unless we're willing to inject a typecast.
Otherwise, the caller could end up with a different type than they requested.
That said, casting does seem reasonable -- it would certainly be faster than a variant builder, and the caller did request a specific type.
|
This test case failed: arrow-rs/parquet-variant-compute/src/variant_get.rs Lines 3754 to 3781 in fea605c
It is because the perfectly shredded array itself is not a valid arrow array. I'm not sure if this is a well-defined behavior, I didn't check carefully, but I feel it is What do you think @alamb @klion26 @friendlymatthew ? My gut feeling is that this data integrity checking during read time can be very expensive; for example, validating utf8 every time we read a string array can be extremely slow. |
|
This test here wants to cover the behavior of CastOptions; I think 1) we can't guarantee that the input is valid in Returning to this test, we might be able to modify the input for |
Makes sense to me, thank you @klion26 , in that case I'll try to make the optimization only enabled for not safe cast case |
|
I have this on my radar to review; I am hoping to ahve some time to devote to variant stuff next week -- this week I have been occupied getting the 57.1.0 release ready |
| // Try to return the typed value directly when we have a perfect shredding match. | ||
| if !matches!(as_field.data_type(), DataType::Struct(_)) { | ||
| if let Some(typed_value) = target.typed_value_field() { | ||
| let types_match = typed_value.data_type() == as_field.data_type(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the current code is correct unless we're willing to inject a typecast.
Otherwise, the caller could end up with a different type than they requested.
That said, casting does seem reasonable -- it would certainly be faster than a variant builder, and the caller did request a specific type.
|
I checked arrow's arrow-rs/arrow-cast/src/cast/mod.rs Lines 751 to 761 in 779e9bd
It doesn't check if the |
Sorry, what do you mean? If the types exactly match then there's nothing to convert and the options shouldn't matter? |
I was debating about whether to perform a sanity check before returning, here's the context: #8887 (comment) I agree that we should not perform any additional conversions, but currently there're two test cases failing it. |
Yes, I agree with this assessment. We shouldn't be relying on In my mind, I expect the cast options to apply when the source is not shredded (aka it is a dynamically typed variant) so "casting" is required as part of Therefore I think we should update the test to either use a non -shredded variant. Here is a test to do so: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @XiangpengHao, @scovich and @klion26 -- I think this behavior makes a lot of sense (and I am surprised we didn't already do this)
I have a suggestion about changing the tests, and I think this PR should have a few more tests, but otherwise I think this PR is ready to go
| } | ||
|
|
||
| #[test] | ||
| fn test_perfect_shredding_returns_same_arc_ptr() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you also please add some other tests? Specifically, tests for:
- The case when the shredded value has all nulls
- The case when the shredded value has some nulls
- The case when the shredded value is a Struct
This pr improves the performance of
variant_geton a perfectly shredded variant, it bypasses the array builder and directly clone the shredded column.For example, if a variant looks like this:
Then if we read
event_typeand we also want to cast it into a string, then we don't have to go through the builder but instead directly clone thetyped_valuearray.Specifically this optimization is safe if:
valueis null (does not exists)typed_valuehas the same data type as the requested data typeI think this is a pretty common case of variant shredding.
====
This PR also has benchmark code. It improves the performance by many many times (of course 😄).
Let me know what you think! (fwiw, this pr is part of the efforts in datafusion-contrib/datafusion-variant#19 (comment))