Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
- part of [EPIC] [Parquet] Implement Variant type support in Parquet #6736
- This is a follow up to [Variant] Add low level support for shredding and unshredding #7715
As we begin to contemplate how to read and write shredded variants, we will need some way to construct Arrow arrays that contain shredded variants.
Physically, these will be Arrow StructArrays with two or three fields:
- Non shredded: (2 fields)
STRUCT { "metadata": Binary, "value": Binary}
- Shredded: (3 fields)
STRUCT { "metadata": Binary, "value": Binary, "typed_value": STRUCT { ... } }
More information on how to represent Variants as Arrow arrays can be found in the proposal:
- [Format] Add an Arrow Canonical Extension Type for Parquet Variant arrow#46908
- Google Document: https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/edit?usp=sharing
Describe the solution you'd like
I would like some way to construct such shredded arrays easily and efficiently, in idiomatic Rust style.
Describe alternatives you've considered
One idea from @zeroshade (thank you!) is to create a VariantArrayBuilder that is responsible for building the correct StructArrays from variants, including shredding out any columns. To create a shredded output, you would provide the shredded schema up front.
For example (based on the Go implementation and @scovich's comment here), to create a shredded Arrow array that shreds out columns "foo" and "bar" from any variant objects, we would need this schema:
STRUCT {
  metadata: BinaryView,
  value: BinaryView,
  typed_value: STRUCT {
    foo: Int64,
    bar: Int32
  }
}
The code would look like this:
// Pseudocode: the Field constructor arguments are abbreviated (names and types only)
// Create an Arrow Field that describes the desired shredded output schema
let shredded_schema = Field::new_struct(
    vec!["metadata", "value", "typed_value"],
    vec![
        Field::new(DataType::BinaryView),
        Field::new(DataType::BinaryView),
        Field::new_struct(
            vec!["foo", "bar"],
            vec![Field::new(DataType::Int64), Field::new(DataType::Int32)],
        ),
    ],
);
// Create a builder for an array (batch) of Variant values
let mut array_builder = VariantArrayBuilder::new(shredded_schema);
// Append a row to the builder
let mut object = array_builder.new_object();
// ... add appropriate fields ...
// use like a normal ObjectBuilder(??)
object.finish();
// Append a second row (has no foo or bar fields)
array_builder.append_value(43);
// ...
// Finalize the builder
let variant_array: StructArray = array_builder.build()?;
// variant_array is a shredded variant
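To make the intended behavior a bit more concrete, here is a minimal, std-only sketch of what shredding at append time could look like. It is a toy model and not the arrow-rs API: `ToyShreddingBuilder`, `Value`, and the plain `Vec` columns are hypothetical stand-ins for the proposed VariantArrayBuilder, Variant values, and Arrow arrays. It also assumes that only top-level integer fields of objects are shredded, and it treats both shredded columns as 64-bit integers for simplicity.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value (hypothetical, not the real Variant type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Toy stand-in for the proposed builder: one residual `value` column plus
/// one typed column per shredded field name.
struct ToyShreddingBuilder {
    shredded_fields: Vec<String>,
    /// Residual (unshredded) part of each row; `None` when fully shredded.
    value: Vec<Option<Value>>,
    /// `typed_value` columns keyed by field name; `None` where the field is
    /// absent or is not an integer.
    typed_value: BTreeMap<String, Vec<Option<i64>>>,
}

impl ToyShreddingBuilder {
    /// The shredded field names are provided up front, like the schema above.
    fn new(shredded_fields: &[&str]) -> Self {
        let shredded_fields: Vec<String> =
            shredded_fields.iter().map(|s| s.to_string()).collect();
        let typed_value = shredded_fields
            .iter()
            .map(|f| (f.clone(), Vec::new()))
            .collect();
        Self { shredded_fields, value: Vec::new(), typed_value }
    }

    /// Append one row, moving matching integer fields into typed columns.
    fn append_value(&mut self, v: Value) {
        // Only objects have fields to shred; other values stay in `value`.
        let mut residual = match v {
            Value::Object(fields) => fields,
            other => {
                for col in self.typed_value.values_mut() {
                    col.push(None);
                }
                self.value.push(Some(other));
                return;
            }
        };
        let mut shredded_any = false;
        for name in &self.shredded_fields {
            // Shred the field only if it is present and is an integer.
            let typed = match residual.get(name) {
                Some(Value::Int(i)) => Some(*i),
                _ => None,
            };
            if typed.is_some() {
                residual.remove(name);
                shredded_any = true;
            }
            self.typed_value
                .get_mut(name)
                .expect("a column exists for every shredded field")
                .push(typed);
        }
        // Keep whatever was not shredded; `None` means nothing is left over.
        if residual.is_empty() && shredded_any {
            self.value.push(None);
        } else {
            self.value.push(Some(Value::Object(residual)));
        }
    }
}
```

Appending an object with fields `foo` and `baz` to a builder created with `ToyShreddingBuilder::new(&["foo", "bar"])` puts `foo` into its typed column and leaves `baz` in the residual `value` column, while appending a non-object such as `Value::Int(43)` leaves every typed column null for that row. A real implementation would write into Arrow array builders and honor the declared types (Int64 vs Int32) instead of using `Vec<Option<i64>>`.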
I think a VariantArrayBuilder will be helpful for use cases other than Variant, and @harshmotw-db has created some version of one here:
Prior Art
Golang implementation:
- https://github.com/apache/arrow-go/blob/main/arrow/extensions/variant_test.go
- https://github.com/apache/arrow-go/blob/main/arrow/extensions/variant.go
- Here are some examples of it being used: https://github.com/apache/arrow-go/blob/b196d3b316d09f63786f021d4f1baa1fdd7620d2/arrow/extensions/variant_test.go#L363-L391
- Spark variant code: https://github.com/apache/spark/tree/master/common/variant/src/main/java/org/apache/spark/types/variant
Additional context