[Variant] Define shredding schema for VariantArrayBuilder #7921
Conversation
```rust
// if !value_field.is_nullable() {
//     return Err(ArrowError::InvalidArgumentError(
//         "Expected value field to be nullable".to_string(),
//     ));
// }
```
I did not see anything in the shredding spec that explicitly states `value` cannot be nullable. Same with `typed_value` below.
Spec says that both can be nullable:

```
required group measurement (VARIANT) {
  required binary metadata;
  optional binary value;
  optional int64 typed_value;
}
```
What I don't know is what should happen with `metadata` in a variant that is shredded as a deeply nested struct, where one or more of those struct fields happen to be variant typed (which in turn can also be shredded further, and which can also contain variant fields of their own)?
```
v: VARIANT {
  metadata: BINARY,
  value: BINARY,
  typed_value: {
    a: STRUCT {
      b: STRUCT {
        c: STRUCT {
          w: VARIANT {
            metadata: BINARY,   <<--- ???
            value: BINARY,
            typed_value: STRUCT {
              x: STRUCT {
                y: STRUCT {
                  z: STRUCT {
                    u: VARIANT {
                      metadata: BINARY,   <<--- ???
                      value: BINARY,
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
```
The spec says that:

> Variant metadata is stored in the top-level Variant group in a binary metadata column regardless of whether the Variant value is shredded.
> All value columns within the Variant must use the same metadata. All field names of a Variant, whether shredded or not, must be present in the metadata.
I'm pretty sure that means `w` and `u` must not have `metadata` columns -- because they are still "inside" `v`.

Even if one tried to store the path names of `u` inside all three `metadata` columns, the field ids would disagree unless we forced `u.metadata` and `w.metadata` to be copies of `v.metadata`. Easy enough to do in arrow-rs (all array data are `Arc`-wrapped anyway), but what about the actual parquet file?
I'm not sure the spec is 100% clear on this one, unfortunately. Maybe we need to ask the parquet-variant folks for clarification and/or refinement of the spec's wording?
There should only be one `metadata` field, at the top level. So a `metadata` field at `w` is not allowed.

Also, the intermediate `a.b.c` fields are not valid for a shredded variant. Every shredded field needs a new `typed_value` and/or `value` field directly under it. So the path in parquet for `v:a.b.c.w.x.y.z` should be `a.typed_value.b.typed_value.c.typed_value.w.typed_value.x.typed_value.y.typed_value.z.typed_value.u.value`.
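Spelled out in the same schema notation as the example above, a sketch of what that corrected layout would look like (repetition levels and the deeper `x.y.z.u` nesting are elided for brevity):

```
v: VARIANT {
  metadata: BINARY,        // the only metadata column, at the top level
  value: BINARY,
  typed_value: {
    a: {
      value: BINARY,
      typed_value: {
        b: {
          value: BINARY,
          typed_value: {
            c: {
              value: BINARY,
              typed_value: {
                w: {
                  value: BINARY,   // no metadata column here
                  typed_value: { ... continues the same way for x, y, z, u ... }
                }
              }
            }
          }
        }
      }
    }
  }
}
```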
The relevant paragraph from the spec for my second comment:

> Each shredded field in the typed_value group is represented as a required group that contains optional value and typed_value fields.
> The value field stores the value as Variant-encoded binary when the typed_value cannot represent the field. This layout enables readers to skip data based on the field statistics for value and typed_value.
> The typed_value field may be omitted when not shredding fields as a specific type.
One more slightly pedantic clarification: the types of `w` and `u` in parquet are not VARIANT (i.e. only `v` is annotated with the Variant logical type). `u` and `w` are just shredded fields that happen to not have a `typed_value` field, only `value`.
```rust
if metadata_field.is_nullable() {
    return Err(ArrowError::InvalidArgumentError(
        "Invalid VariantArray: metadata field can not be nullable".to_string(),
    ));
}
```
I make sure to check that `metadata` is not nullable. But I wonder if we should remove this check -- you could imagine a user wanting to reuse the same metadata throughout the entire building process.
I think for variant columns nested inside a shredded variant, we must not have a `metadata` column? See https://github.com/apache/arrow-rs/pull/7921/files#r2204684924 above.
```rust
// this is directly mapped from the spec's parquet physical types
// note, there are more data types we can support
// but for the sake of simplicity, I chose the smallest subset
match typed_value_field.data_type() {
    DataType::Boolean
    | DataType::Int32
    | DataType::Int64
    | DataType::Float32
    | DataType::Float64
    | DataType::BinaryView => {}
    DataType::Union(union_fields, _) => {
        union_fields
            .iter()
            .map(|(_, f)| f.clone())
            .try_for_each(|f| {
                let DataType::Struct(fields) = f.data_type().clone() else {
                    return Err(ArrowError::InvalidArgumentError(
                        "Expected struct".to_string(),
                    ));
                };

                validate_value_and_typed_value(&fields, false)
            })?;
    }
    foreign => {
        return Err(ArrowError::NotYetImplemented(format!(
            "Unsupported VariantArray 'typed_value' field, got {foreign}"
        )))
    }
}
```
I don't love this, but I treat the field `DataType`s as the parquet physical types defined in the specification: https://github.com/apache/parquet-format/blob/master/VariantShredding.md#shredded-value-types. I'm curious to get your thoughts -- maybe we should stick with the Variant type mapping?

One reason the current logic isn't the best: when we go to reconstruct variants, certain variant types like `int8` will get cast to a `DataType::Int32`. This means when we go to encode the values back to variant, we won't know their original types.
I don't think we need to store (logical) int32 in memory just because parquet physically encodes them that way? When reading an int8 column from normal parquet, doesn't it come back as an int8 PrimitiveArray?
```rust
// Create union of different element types
let union_fields = UnionFields::new(
    vec![0, 1],
    vec![
        Field::new("string_element", string_element, true),
        Field::new("int_element", int_element, true),
    ],
);
```
I don't love this; the field names are weird. However, we need a way to support a heterogeneous list of `Field`s. I'm curious if there is a nicer way to represent a group.
```rust
let typed_value_field = Field::new(
    "typed_value",
    DataType::Union(
        UnionFields::new(
            vec![0, 1],
            vec![
                Field::new("event_type", DataType::Struct(element_group_1), true),
                Field::new("event_ts", DataType::Struct(element_group_2), true),
            ],
        ),
        UnionMode::Sparse,
    ),
    false,
);
```
Similar to https://github.com/apache/arrow-rs/pull/7921/files#r2203048613, but this is nicer, since we can treat field names as key names.
Didn't actually review yet, but wanted to at least respond to one comment.
```rust
    }
}

if let Some(typed_value_field) = fields.iter().find(|f| f.name() == TYPED_VALUE) {
```

Suggested change:

```diff
-if let Some(typed_value_field) = fields.iter().find(|f| f.name() == TYPED_VALUE) {
+if let Some(typed_value_field) = typed_value_field_res {
```
```rust
| DataType::Float32
| DataType::Float64
| DataType::BinaryView => {}
DataType::Union(union_fields, _) => {
```
My initial reaction was that I don't think variant data can represent a union type?

I guess this is a way to slightly relax strongly typed data, as long as the union members themselves are all valid variant types, and whichever union member is active becomes the only field+value of a variant object? But how would a reader of that shredded data know to read it back as a union, instead of the (sparse) struct it appears to be? Would it be better to just require a struct from the start?
I recommend we start without support for union type and then add it as we implement additional functionality
Thank you @friendlymatthew -- this is looking quite cool
In order to make progress, I would suggest we try and write up a basic 'end to end' test. Specifically, perhaps something like:

- Create a perfectly shredded Arrow array for a `uint64` value
- Implement the `get_variant` API from #7919 / @Samyak2 for that array and show how it will return `Variant::Int64`

And then we can expand those tests for the more exciting shredding variants afterwards.

I think the tests in this PR for just the schemas are a bit too theoretical -- they would be much closer to the end user if they were used for actual `StructArray`s in tests.
Thank you @friendlymatthew -- I think this is quite close.
```rust
    Ok(())
}

/// Validates that the provided [`Fields`] conform to the Variant shredding specification.
```
One thing I thought of while reviewing this PR was that maybe we could wrap this into its own structure, like

```rust
struct VariantSchema {
    inner: Fields,
}
```

And then all this validation logic could be part of the constructor:

```rust
impl VariantSchema {
    fn try_new(fields: Fields) -> Result<Self> { ... }
    ...
}
```
The benefits of this would be:

- Now we could be sure that a validated schema was always passed to `shred_variant`
- We would then have a place to put methods on -- such as `VariantSchema::type(path: VariantPath)` for retrieving the type of a particular path, perhaps
Which issue does this PR close?

My initial PR is getting too large, so I figured it would be better to split these up.

Rationale for this change

This PR updates the `VariantArrayBuilder` to pass in the desired shredded output schema in the constructor. It also contains validation logic that defines what is a valid schema and what is not. In other words, the schema you define ahead of time gets checked for spec compliance.