[VARIANT] Path-based Field Extraction for VariantArray #7946

carpecodeum · 2025-07-16T21:51:40Z

Which issue does this PR close?

This PR implements efficient path-based field extraction and manipulation capabilities for VariantArray, enabling direct access to nested fields without expensive unshredding operations.
Follow-up on #7919

Closes [Variant] Support variant_get kernel for shredded variants #7941

Rationale for this change

This work builds directly on the path navigation concepts introduced in #7919, sharing the fundamental VariantPathElement design with Field and Index variants. While PR #7919 provided a compute kernel approach with a variant_get function, this PR provides instance-based methods directly on VariantArray, with a builder API that uses owned strings rather than PR #7919's vector-based approach.
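As a rough illustration of the Field/Index path idea described above, here is a minimal, self-contained sketch. It assumes a toy in-memory value type; `ToyValue`, `ToyPathElement`, `get_path`, and `sample` are hypothetical names for this example and are not the parquet-variant API, which operates on the binary variant encoding.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a decoded variant value (NOT the parquet-variant `Variant` type).
#[derive(Debug, Clone, PartialEq)]
enum ToyValue {
    Int(i64),
    Str(String),
    List(Vec<ToyValue>),
    Object(BTreeMap<String, ToyValue>),
}

/// One step of a path: an object field name or a list index (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
enum ToyPathElement {
    Field(String),
    Index(usize),
}

/// Walk `path` starting at `root`. Returns `None` as soon as a step does not
/// apply (missing field, out-of-range index, or wrong kind of value).
fn get_path<'a>(root: &'a ToyValue, path: &[ToyPathElement]) -> Option<&'a ToyValue> {
    let mut current = root;
    for element in path {
        current = match (current, element) {
            (ToyValue::Object(fields), ToyPathElement::Field(name)) => fields.get(name)?,
            (ToyValue::List(items), ToyPathElement::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Build a sample document: {"user": {"age": 36, "name": "Ada", "tags": ["a", "b"]}}
fn sample() -> ToyValue {
    let mut user = BTreeMap::new();
    user.insert("age".to_string(), ToyValue::Int(36));
    user.insert("name".to_string(), ToyValue::Str("Ada".to_string()));
    user.insert(
        "tags".to_string(),
        ToyValue::List(vec![
            ToyValue::Str("a".to_string()),
            ToyValue::Str("b".to_string()),
        ]),
    );
    let mut root = BTreeMap::new();
    root.insert("user".to_string(), ToyValue::Object(user));
    ToyValue::Object(root)
}
```

In this toy model, the path `user` / `tags` / index 1 resolves to the string "b", while applying an index step to the root object yields `None`.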

This is still a draft: the changes for #7919 were merged today and I still have to incorporate them. I'm looking forward to reviews and suggestions.

This PR is complementary to #7921, which implements schema-driven shredding during array construction. This PR provides runtime path-based access to both shredded and unshredded data, creating a complete solution for both efficient construction and efficient access of variant data.

Big thanks to @mprammer and @PinkCrow007 for their continued support throughout my Variant exploration.

What changes are included in this PR?

Field removal operations through methods like remove_field and remove_fields enable removal of specific fields from variant data, crucial for shredding operations where temporary or debug fields need to be stripped. field_operations.rs provides direct binary manipulation through functions like get_path_bytes, extract_field_bytes, and remove_field_bytes that operate on raw binary format without constructing intermediate objects. variant_parser.rs supports all variant types with parsers for 17 different primitive types, providing the foundation for efficient binary navigation.

The performance-critical byte operations could serve as the underlying implementation for PR #7919's compute kernel, potentially providing better performance for batch operations by avoiding object construction overhead. The field removal capabilities could extend PR #7919's functionality beyond extraction to comprehensive field manipulation. The instance-based approach provides different ergonomics that complement PR #7919's compute kernel approach.

This PR focuses on runtime access and manipulation rather than construction-time optimization, leaving build-time schema-driven shredding to PR #7921. Future work includes integrating with PR #7919's compute kernel approach, potentially using this PR's byte-level operations as the underlying implementation.

Are these changes tested?

Yes, tests are added

Are there any user-facing changes?

Not yet

carpecodeum · 2025-07-16T21:53:19Z

CC - @alamb @Samyak2 @friendlymatthew @scovich

alamb

Thank you for this PR @carpecodeum

This is very cool

I think there is already a variant_get implementation in https://github.com/apache/arrow-rs/blob/d809f19bc0fe2c3c1968f5111b6afa785d2e8bcd/parquet-variant-compute/src/variant_get.rs#L35-L34 contributed by @Samyak2

To take the next steps and implement shredding I think we will need two things:

A way to create shredded variants
A way to represent shredded variants

The idea of removing fields from Variants is interesting, though I wonder if that is an operation we would ever want to do on a single Variant instance -- it seems like removing fields for shredding will require copying the underlying bytes anyway, so I was thinking we might just want to create an output variant array entirely

Something like

fn variant_shred(input: VariantArray, output: VariantArray, schema: SchemaRef)

Maybe it is worth looking at how the java or go implementations work
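To make the "create an output array entirely" idea concrete, here is a minimal sketch under strong assumptions. It models each row as a plain map and shreds a single integer field into a typed column, copying everything else into a residual. `ToyValue`, `ToyObject`, `ToyShredded`, and `toy_shred_i64` are hypothetical names for illustration only; this is not the proposed `variant_shred` signature and not the arrow-rs API.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value (hypothetical, not the parquet-variant type).
#[derive(Debug, Clone, PartialEq)]
enum ToyValue {
    Int(i64),
    Str(String),
}

/// Toy stand-in for a variant object: field name -> value.
type ToyObject = BTreeMap<String, ToyValue>;

/// Output of the toy shredding step: one typed column plus the leftover objects.
#[derive(Debug, PartialEq)]
struct ToyShredded {
    /// `Some(v)` where the field was present with the expected type.
    typed_column: Vec<Option<i64>>,
    /// Remaining fields per row; the shredded field is absent where it was extracted.
    residual: Vec<ToyObject>,
}

/// Shred one integer field out of every row, building fresh output instead of
/// mutating the input. Rows where the field is missing or not an integer keep
/// all of their fields in the residual.
fn toy_shred_i64(input: &[ToyObject], field: &str) -> ToyShredded {
    let mut typed_column = Vec::with_capacity(input.len());
    let mut residual = Vec::with_capacity(input.len());
    for object in input {
        match object.get(field) {
            Some(ToyValue::Int(v)) => {
                typed_column.push(Some(*v));
                // Removal is implemented as copy-then-drop, mirroring the
                // observation that the bytes must be copied anyway.
                let mut rest = object.clone();
                rest.remove(field);
                residual.push(rest);
            }
            _ => {
                typed_column.push(None);
                residual.push(object.clone());
            }
        }
    }
    ToyShredded { typed_column, residual }
}
```

The point of the sketch is only that field removal falls out of building the output, so a separate in-place removal operation on a single value may not be needed.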

alamb · 2025-07-17T16:14:54Z

parquet-variant-compute/.cargo/config.toml

@@ -0,0 +1,2 @@
+[build]
+rustflags = ["-A", "unknown-lints", "-A", "clippy::transmute-int-to-float"] 


is this required?

alamb · 2025-07-17T16:15:49Z

parquet-variant-compute/examples/field_removal.rs

+    {
+        let mut variant_builder = VariantBuilder::new();
+        {
+            let mut obj = variant_builder.new_object();


Note: I think these tests will be easier to write after the API in

[Variant] Add ObjectBuilder::with_field for convenience #7950

alamb · 2025-07-17T16:17:41Z

parquet-variant-compute/src/field_operations.rs

+use parquet_variant::VariantMetadata;
+use std::collections::HashSet;
+
+/// Represents a path element in a variant path


I think if you merge up from main this code will no longer be required

I agree, a lot of this becomes redundant, but get_field_bytes is able to get field bytes from an object at the byte level. Will this not be required either?

I am not quite sure what you are asking.

I think it might help to move shredded variant forward by writing the tests / examples of how variant_get should work with shredded arrays

I tried to work up a simple example here:

[Variant] WIP Tests for variant_get of shredded variants #7965

I saw your PR, it looks great. I'm trying to work up a few examples myself.

One idea would be to try to create the other examples from https://docs.google.com/document/d/1pw0AWoMQY3SjD7R4LgbPvMjG_xSCtXp3rZHkVp9jpZ4/ in code

alamb · 2025-07-18T19:23:31Z

parquet-variant-compute/src/variant_array.rs

+    /// let path = VariantPath::field("name");
+    /// let name_variant = variant_array.get_path(0, &path);
+    /// ```
+    pub fn get_path(&self, index: usize, path: &VariantPath) -> Option<Variant> {


I wonder how this is different than variant_get 🤔

arrow-rs/parquet-variant-compute/src/variant_get.rs

Line 35 in 99eb1bc

pub fn variant_get(input: &ArrayRef, options: GetOptions) -> Result<ArrayRef> {

It's quite similar. I started working on it and didn't realise until Wednesday that the idea is very similar to PR #7919. I'm working on making a lot of changes here and removing the redundancy between my PR and the variant_get functionality, which might include the tests and examples in this PR itself. I will also try to do the optimisations for variant_get.
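One way to picture the overlap discussed here is that a batch kernel can be expressed as a loop over a per-row accessor. The sketch below assumes a toy array of optional string maps; `ToyRow`, `ToyVariantArray`, `get_field`, and `toy_variant_get` are hypothetical names and do not reflect the actual signatures of `VariantArray::get_path` or `variant_get`.

```rust
use std::collections::BTreeMap;

/// Toy row: field name -> string value (hypothetical stand-in for one variant object).
type ToyRow = BTreeMap<String, String>;

/// Toy array of nullable rows (hypothetical stand-in for a variant array).
struct ToyVariantArray {
    rows: Vec<Option<ToyRow>>,
}

impl ToyVariantArray {
    /// Instance-style access: look up one field in one row.
    /// Returns `None` for an out-of-range index, a null row, or a missing field.
    fn get_field(&self, index: usize, field: &str) -> Option<&String> {
        self.rows.get(index)?.as_ref()?.get(field)
    }
}

/// Kernel-style access: apply the same lookup to every row and return a
/// new column with one entry per input row.
fn toy_variant_get(input: &ToyVariantArray, field: &str) -> Vec<Option<String>> {
    (0..input.rows.len())
        .map(|i| input.get_field(i, field).cloned())
        .collect()
}
```

Seen this way, the two entry points share the same per-row logic and differ mainly in whether the caller or the kernel drives the iteration and owns the output.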

carpecodeum · 2025-07-18T19:32:48Z

Thank you for this PR @carpecodeum

This is very cool

I think there is already a variant_get implementation in https://github.com/apache/arrow-rs/blob/d809f19bc0fe2c3c1968f5111b6afa785d2e8bcd/parquet-variant-compute/src/variant_get.rs#L35-L34 contributed by @Samyak2

To take the next steps and implement shredding I think we will need two things:

A way to create shredded variants

A way to represent shredded variants

The idea of removing fields from Variants is interesting, though I wonder if that is an operation we would ever want to do on a single Variant instance -- it seems like removing fields for shredding will require copying the underlying bytes anyway, so I was thinking we might just want to create an output variant array entirely

Something like

fn variant_shred(input: VariantArray, output: VariantArray, schema: SchemaRef)

Maybe it is worth looking at how the java or go implementations work

Is there any issue for implementing this? I would love to work on it

alamb · 2025-07-18T19:35:50Z

Is there any issue for implementing this? I would love to work on it

I think we are discussing reading shredded variants on

[Variant] Support variant_get kernel for shredded variants #7941

We are discussing creating shredded variants on

[Variant] API to construct Shredded Variant Arrays #7895

I don't think we have enough of an idea of how this will work to break them down into finer grained tasks yet.

carpecodeum force-pushed the variant-shredding branch from 18d88b0 to b1afed1 Compare July 16, 2025 21:56

alamb reviewed Jul 17, 2025


carpecodeum added 7 commits July 18, 2025 14:14

[ADD] Path-based field extraction for VariantArray

08353cb

[FIX] sanitise variant_array file

aeccf2b

[ADD] add hybrid approach for field access

314a599

[FIX] fix variant_array implementation

de9d386

[ADD] add support for path operations on different data types

8e4b034

[FIX] minor fixes

be50708

[FIX] fix formatting issues

c712747

carpecodeum force-pushed the variant-shredding branch 2 times, most recently from 9a616b5 to c712747 Compare July 18, 2025 18:39

alamb reviewed Jul 18, 2025


carpecodeum added 4 commits July 20, 2025 18:31

[FIX] remove redundancy

0607431

[FIX] improve the tests

852d6cd

[FIX] refactor code for modularity

e95b0d5

[FIX] fix issues with the spec

7c671f5
