@XiangpengHao
Contributor

This PR improves the performance of variant_get on a perfectly shredded variant: it bypasses the array builder and directly clones the shredded column.

For example, if a variant looks like this:

optional group event (VARIANT) {
  required binary metadata;
  optional binary value;                
  optional group typed_value {        
    required group event_type {       
      optional binary value;                <- this is null
      optional binary typed_value (STRING);
    }
  }
}

If we read event_type and want it cast to a string, we don't have to go through the builder; we can directly clone the typed_value array instead.

Specifically this optimization is safe if:

  1. value is null (does not exist)
  2. typed_value has the same data type as the requested data type

I think this is a pretty common case of variant shredding.
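
The two conditions above can be sketched with stand-in types. This is a minimal illustration, not the parquet-variant API: `ShreddedField`, `try_fast_path`, and the local `DataType` enum are hypothetical names made up for this example, and the column is modelled as a plain `Arc<Vec<...>>` rather than an Arrow array.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for Arrow's `DataType`; only what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Utf8,
    Int64,
}

/// Hypothetical model of one shredded field: an optional fallback `value`
/// column plus a strongly typed `typed_value` column.
struct ShreddedField {
    /// Per-row variant bytes; `None` means the column is absent.
    value: Option<Vec<Option<Vec<u8>>>>,
    typed_type: DataType,
    typed_value: Arc<Vec<Option<String>>>,
}

/// Returns the shared typed column when both conditions hold, else `None`
/// (meaning the caller would fall back to the row-by-row builder).
fn try_fast_path(
    field: &ShreddedField,
    requested: &DataType,
) -> Option<Arc<Vec<Option<String>>>> {
    // Condition 1: no row carries a fallback `value`.
    let value_all_null = match &field.value {
        None => true,
        Some(rows) => rows.iter().all(|row| row.is_none()),
    };
    // Condition 2: the shredded type is exactly the requested type.
    if value_all_null && field.typed_type == *requested {
        // Cloning the `Arc` only bumps a reference count; no rows are copied.
        Some(Arc::clone(&field.typed_value))
    } else {
        None
    }
}
```

The point of the sketch is that the fast path costs one reference-count increment regardless of row count, while the fallback touches every row.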

====

This PR also has benchmark code. It improves the performance by many many times (of course 😄).

Let me know what you think! (fwiw, this pr is part of the efforts in datafusion-contrib/datafusion-variant#19 (comment))

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Nov 19, 2025

// Try to return the typed value directly when we have a perfect shredding match.
if !matches!(as_field.data_type(), DataType::Struct(_)) {
    if let Some(typed_value) = target.typed_value_field() {
        let types_match = typed_value.data_type() == as_field.data_type();
Contributor Author

@XiangpengHao XiangpengHao Nov 19, 2025

This is probably overly restrictive: Utf8, LargeUtf8, and Utf8View should all be allowed. Maybe check whether the types are castable instead.

Contributor

AFAIK, the current code is correct unless we're willing to inject a typecast.
Otherwise, the caller could end up with a different type than they requested.

That said, casting does seem reasonable -- it would certainly be faster than a variant builder, and the caller did request a specific type.
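
The trade-off discussed here (exact match, cast between string layouts, or fall back to the builder) could be sketched as a three-way decision. The `Plan` enum, `plan` function, and local `DataType` enum below are hypothetical names for illustration only; the PR as shown only implements the exact-match branch.

```rust
/// Hypothetical stand-in for the Arrow types named in the thread.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Utf8,
    LargeUtf8,
    Utf8View,
    Int64,
}

/// What the reader could do with a perfectly shredded column.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Types match exactly: return the column as-is.
    CloneColumn,
    /// Types differ only in string layout: cast the whole column at once.
    CastColumn,
    /// Anything else: go through the row-by-row builder.
    RebuildRows,
}

fn is_string(t: DataType) -> bool {
    matches!(t, DataType::Utf8 | DataType::LargeUtf8 | DataType::Utf8View)
}

fn plan(shredded: DataType, requested: DataType) -> Plan {
    if shredded == requested {
        Plan::CloneColumn
    } else if is_string(shredded) && is_string(requested) {
        // The caller still receives exactly the type they asked for,
        // because the cast converts to `requested`.
        Plan::CastColumn
    } else {
        Plan::RebuildRows
    }
}
```

Keeping the cast as a separate branch preserves the guarantee raised in the review: the caller never ends up with a type other than the one requested.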

@XiangpengHao
Contributor Author

This test case failed:

perfectly_shredded_variant_array_fn!(perfectly_shredded_invalid_time_variant_array, || {
    // 86401000000 is invalid for Time64Microsecond (max is 86400000000)
    Time64MicrosecondArray::from(vec![
        Some(86401000000),
        Some(86401000000),
        Some(86401000000),
    ])
});

#[test]
fn test_variant_get_error_when_cast_failure_and_safe_false() {
    let variant_array = perfectly_shredded_invalid_time_variant_array();
    let field = Field::new("result", DataType::Time64(TimeUnit::Microsecond), true);
    let cast_options = CastOptions {
        safe: false, // Will error on cast failure
        ..Default::default()
    };
    let options = GetOptions::new()
        .with_as_type(Some(FieldRef::from(field)))
        .with_cast_options(cast_options);
    let err = variant_get(&variant_array, options).unwrap_err();
    assert!(
        err.to_string().contains(
            "Cast error: Cast failed at index 0 (array type: Time64(µs)): Invalid microsecond from midnight: 86401000000"
        )
    );
}

It fails because the perfectly shredded array itself is not a valid Arrow array. I'm not sure whether this is well-defined behavior (I didn't check carefully), but I feel it is Time64MicrosecondArray's responsibility to make sure its data is valid.

What do you think @alamb @klion26 @friendlymatthew ?

My gut feeling is that this data integrity checking during read time can be very expensive; for example, validating utf8 every time we read a string array can be extremely slow.
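
To make the cost concrete, here is a rough sketch of what a read-time check for the Time64 case would involve. The function name and error text are hypothetical, and the range `0 <= v < 86_400_000_000` is an assumption based on the comment in the test above; the exact boundary handling in arrow may differ.

```rust
/// Microseconds in one day; assumed exclusive upper bound for a valid value.
const MICROS_PER_DAY: i64 = 86_400_000_000;

/// Hypothetical validator: scans every row and reports the first bad value.
fn validate_time64_micros(values: &[Option<i64>]) -> Result<(), String> {
    for (i, v) in values.iter().enumerate() {
        if let Some(us) = v {
            if *us < 0 || *us >= MICROS_PER_DAY {
                return Err(format!(
                    "invalid microseconds from midnight at index {}: {}",
                    i, us
                ));
            }
        }
    }
    Ok(())
}
```

A check like this is linear in the number of rows, whereas cloning the shredded column is constant time, so validating on every read would remove most of the benefit of the fast path.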

@klion26
Member

klion26 commented Nov 20, 2025

This test is meant to cover the behavior of CastOptions. I think 1) we can't guarantee that the input to variant_get is valid here; 2) we do a type cast in variant_get (type_conversion.rs), the cast may or may not succeed, and CastOptions::safe controls how such cases are handled.

Returning to this test, we might be able to modify the input for variant_get here, but we may still need to ensure that different CastOptions are covered.
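
For readers less familiar with `CastOptions::safe`: in arrow, `safe: true` turns a failed conversion into a null, while `safe: false` returns an error. A toy model of that behavior on unshredded rows might look like the following; the `Variant` enum and `cast_to_i64` are hypothetical stand-ins, and the string-parsing rule is only there to give the cast a way to fail.

```rust
/// Hypothetical, heavily simplified variant value.
#[derive(Debug, Clone, PartialEq)]
enum Variant {
    Int(i64),
    Text(String),
}

/// Casts each row to i64. With `safe == true` a failed row becomes null;
/// with `safe == false` the first failure aborts the whole cast.
fn cast_to_i64(rows: &[Variant], safe: bool) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::with_capacity(rows.len());
    for (i, row) in rows.iter().enumerate() {
        let converted = match row {
            Variant::Int(v) => Some(*v),
            // Illustrative rule only: text converts if it parses as an integer.
            Variant::Text(s) => s.parse::<i64>().ok(),
        };
        match converted {
            Some(v) => out.push(Some(v)),
            None if safe => out.push(None),
            None => return Err(format!("cast failed at index {}", i)),
        }
    }
    Ok(out)
}
```

Both branches only exist because each row is converted individually, which is the path the perfect-shredding optimization skips.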

@XiangpengHao
Contributor Author

> This test is meant to cover the behavior of CastOptions. I think 1) we can't guarantee that the input to variant_get is valid here; 2) we do a type cast in variant_get (type_conversion.rs), the cast may or may not succeed, and CastOptions::safe controls how such cases are handled.
>
> Returning to this test, we might be able to modify the input for variant_get here, but we may still need to ensure that different CastOptions are covered.

Makes sense to me, thank you @klion26. In that case I'll try to enable the optimization only for the non-safe cast case.

@alamb
Contributor

alamb commented Nov 21, 2025

I have this on my radar to review; I am hoping to have some time to devote to variant stuff next week -- this week I have been occupied getting the 57.1.0 release ready.

@XiangpengHao
Contributor Author

I checked arrow's cast_with_options:

pub fn cast_with_options(
    array: &dyn Array,
    to_type: &DataType,
    cast_options: &CastOptions,
) -> Result<ArrayRef, ArrowError> {
    use DataType::*;
    let from_type = array.data_type();
    // clone array if types are the same
    if from_type == to_type {
        return Ok(make_array(array.to_data()));
    }

It doesn't check whether the data in the from_type array is valid, regardless of cast_options. I think we should follow the same convention, since we take the same cast_options from the user 🤔
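
The convention described here, where an identical source and target type returns before the options are ever consulted, can be modelled in a few lines. Everything below is a hypothetical stand-in (`Column`, `cast_sketch`, the two-variant `DataType`, and the Int64 to Time64 range check); it is not arrow's implementation, only an illustration of the control flow.

```rust
use std::sync::Arc;

const MICROS_PER_DAY: i64 = 86_400_000_000;

#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Time64Micros,
}

#[derive(Debug, Clone)]
struct Column {
    data_type: DataType,
    values: Arc<Vec<Option<i64>>>,
}

fn cast_sketch(col: &Column, to: DataType, safe: bool) -> Result<Column, String> {
    // Same type: hand the column back without looking at values or `safe`.
    if col.data_type == to {
        return Ok(col.clone());
    }
    match (col.data_type, to) {
        // Hypothetical checked conversion: only here does `safe` matter.
        (DataType::Int64, DataType::Time64Micros) => {
            let mut out = Vec::with_capacity(col.values.len());
            for (i, v) in col.values.iter().enumerate() {
                match v {
                    Some(us) if *us < 0 || *us >= MICROS_PER_DAY => {
                        if safe {
                            out.push(None);
                        } else {
                            return Err(format!("cast failed at index {}: {}", i, us));
                        }
                    }
                    other => out.push(*other),
                }
            }
            Ok(Column { data_type: to, values: Arc::new(out) })
        }
        (DataType::Time64Micros, DataType::Int64) => Ok(Column {
            data_type: to,
            values: Arc::clone(&col.values),
        }),
        _ => unreachable!("equal types returned above"),
    }
}
```

Under this model an already-typed column holding an out-of-range time passes through untouched even with `safe: false`, which mirrors the failing test discussed in this thread.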

@scovich
Contributor

scovich commented Nov 21, 2025

> I checked arrow's cast_with_options:
>
> pub fn cast_with_options(
>     array: &dyn Array,
>     to_type: &DataType,
>     cast_options: &CastOptions,
> ) -> Result<ArrayRef, ArrowError> {
>     use DataType::*;
>     let from_type = array.data_type();
>     // clone array if types are the same
>     if from_type == to_type {
>         return Ok(make_array(array.to_data()));
>     }
>
> It doesn't check whether the data in the from_type array is valid, regardless of cast_options. I think we should follow the same convention, since we take the same cast_options from the user 🤔

Sorry, what do you mean? If the types exactly match then there's nothing to convert and the options shouldn't matter?

@XiangpengHao
Contributor Author

> Sorry, what do you mean? If the types exactly match then there's nothing to convert and the options shouldn't matter?

I was debating whether to perform a sanity check before returning; here's the context: #8887 (comment)

I agree that we should not perform any additional conversions, but two test cases currently fail because of it.

@alamb
Contributor

alamb commented Nov 25, 2025

> This test case failed:
>
> perfectly_shredded_variant_array_fn!(perfectly_shredded_invalid_time_variant_array, || {
>     // 86401000000 is invalid for Time64Microsecond (max is 86400000000)
>     Time64MicrosecondArray::from(vec![
>         Some(86401000000),
>         Some(86401000000),
>         Some(86401000000),
>     ])
> });
>
> #[test]
> fn test_variant_get_error_when_cast_failure_and_safe_false() {
>     let variant_array = perfectly_shredded_invalid_time_variant_array();
>     let field = Field::new("result", DataType::Time64(TimeUnit::Microsecond), true);
>     let cast_options = CastOptions {
>         safe: false, // Will error on cast failure
>         ..Default::default()
>     };
>     let options = GetOptions::new()
>         .with_as_type(Some(FieldRef::from(field)))
>         .with_cast_options(cast_options);
>     let err = variant_get(&variant_array, options).unwrap_err();
>     assert!(
>         err.to_string().contains(
>             "Cast error: Cast failed at index 0 (array type: Time64(µs)): Invalid microsecond from midnight: 86401000000"
>         )
>     );
> }
>
> It fails because the perfectly shredded array itself is not a valid Arrow array. I'm not sure whether this is well-defined behavior (I didn't check carefully), but I feel it is Time64MicrosecondArray's responsibility to make sure its data is valid.
>
> What do you think @alamb @klion26 @friendlymatthew ?

Yes, I agree with this assessment. We shouldn't be relying on variant_get to detect incorrectly shredded values.

In my mind, I expect the cast options to apply when the source is not shredded (i.e. it is a dynamically typed variant), so "casting" is required as part of variant_get.

Therefore I think we should update the test to use a non-shredded variant. Here is a test to do so:

Contributor

@alamb alamb left a comment

Thank you @XiangpengHao, @scovich and @klion26 -- I think this behavior makes a lot of sense (and I am surprised we didn't already do this)

I have a suggestion about changing the tests, and I think this PR should have a few more tests, but otherwise I think it is ready to go.

}

#[test]
fn test_perfect_shredding_returns_same_arc_ptr() {
Contributor

Can you also please add some other tests? Specifically, tests for:

  1. The case when the shredded value has all nulls
  2. The case when the shredded value has some nulls
  3. The case when the shredded value is a Struct
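
As a rough idea of what the first two cases could assert, here is a sketch built on `Arc::ptr_eq`, which is true only when two `Arc`s share one allocation. The `Column` alias, `fast_path`, and `rebuild` are hypothetical stand-ins rather than code from this PR, and the Struct case is not modelled.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a shredded, typed column.
type Column = Arc<Vec<Option<i64>>>;

/// Models the optimized path: share the existing allocation.
fn fast_path(col: &Column) -> Column {
    Arc::clone(col)
}

/// Models the builder path: same contents, fresh allocation.
fn rebuild(col: &Column) -> Column {
    Arc::new(col.iter().copied().collect::<Vec<_>>())
}

/// Checks that the fast path preserves both contents and pointer identity,
/// while the rebuilt column is equal in value but a different allocation.
fn check(col: &Column) -> bool {
    let shared = fast_path(col);
    let copied = rebuild(col);
    Arc::ptr_eq(&shared, col) && *copied == **col && !Arc::ptr_eq(&copied, col)
}
```

Calling `check` with an all-null column and a partially null column covers cases 1 and 2; pointer equality is what distinguishes a zero-copy result from one that merely has equal contents.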
