Variant shredding #2

Open · wants to merge 12 commits into main

Conversation

@carpecodeum commented Jul 16, 2025

Path-based Field Extraction for VariantArray

This PR implements efficient path-based field extraction and manipulation capabilities for VariantArray, enabling direct access to nested fields without expensive unshredding operations. The implementation provides both high-level convenience methods and low-level byte operations to support various analytical workloads on variant data.

Relationship to Concurrent PRs

This work builds directly on the path navigation concepts introduced in apache#7919 and shares its fundamental VariantPathElement design with Field and Index variants. Where apache#7919 exposes a compute kernel through a variant_get function, this PR adds instance methods directly on VariantArray, driven by a fluent builder API that uses owned strings rather than apache#7919's vector-based paths.
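
The Field/Index path model and fluent, owned-string builder can be pictured with a minimal, self-contained stand-in like the one below. The names mirror this PR's description; the definitions themselves are illustrative, not the PR's actual code.

```rust
// Illustrative stand-ins for the shared path model; the real definitions live in
// this PR and in apache#7919 and may differ in detail.
#[derive(Debug, Clone)]
pub enum VariantPathElement {
    Field(String), // owned field name, per this PR's owned-string design
    Index(usize),  // list position
}

#[derive(Debug, Default)]
pub struct VariantPath(Vec<VariantPathElement>);

impl VariantPath {
    pub fn field(mut self, name: impl Into<String>) -> Self {
        self.0.push(VariantPathElement::Field(name.into()));
        self
    }
    pub fn index(mut self, index: usize) -> Self {
        self.0.push(VariantPathElement::Index(index));
        self
    }
}

fn main() {
    // Builds the path $.user[0].email with the fluent builder.
    let path = VariantPath::default().field("user").index(0).field("email");
    println!("{path:?}");
}
```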

This PR is complementary to apache#7921, which implements schema-driven shredding during array construction. Together they cover both halves of the problem: apache#7921 handles efficient construction, while this PR provides runtime path-based access to both shredded and unshredded data.

What This PR Contributes

This PR introduces three capabilities that appear in neither concurrent PR (a sketch of the byte-level entry points follows this list):

- Field removal operations: methods such as `remove_field` and `remove_fields` strip specific fields from variant data, which matters for shredding workflows where temporary or debug fields need to be dropped.
- A byte-level operations module (`field_operations.rs`): functions such as `get_path_bytes`, `extract_field_bytes`, and `remove_field_bytes` work directly on the raw binary format without constructing intermediate objects.
- A binary parser (`variant_parser.rs`): covers all variant types, with specialized parsers for 17 primitive types, and provides the foundation for efficient binary navigation.
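
To make the byte-level tier concrete, here is a skeleton of what the `field_operations` entry points might look like. The `remove_field_bytes` shape matches how it is called in the diff later in this conversation; `extract_field_bytes` is written by analogy and is an assumption, and both bodies are intentionally stubbed.

```rust
use arrow_schema::ArrowError;

/// Hypothetical skeleton of the byte-level module described above.
pub struct FieldOperations;

impl FieldOperations {
    /// Return the raw value bytes for `field_name`, or `None` if the field is
    /// absent. (Signature assumed by analogy with `remove_field_bytes`.)
    pub fn extract_field_bytes(
        _metadata: &[u8],
        _value: &[u8],
        _field_name: &str,
    ) -> Result<Option<Vec<u8>>, ArrowError> {
        unimplemented!("navigate the variant binary format without building objects")
    }

    /// Return a new value buffer with `field_name` removed, or `None` if the
    /// field was not present.
    pub fn remove_field_bytes(
        _metadata: &[u8],
        _value: &[u8],
        _field_name: &str,
    ) -> Result<Option<Vec<u8>>, ArrowError> {
        unimplemented!("navigate the variant binary format without building objects")
    }
}
```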

How This Benefits PR apache#7919

The performance-critical byte operations could serve as the underlying implementation for apache#7919's compute kernel, potentially speeding up batch operations by avoiding object-construction overhead. The field removal capabilities would extend that kernel beyond extraction to full field manipulation, and the instance-based methods offer ergonomics that complement the kernel-based API.

Implementation Details

The implementation follows a three-tier architecture: high-level instance methods returning Variant objects for convenient manipulation, mid-level path operations using VariantPath and VariantPathElement types for type-safe nested access, and low-level byte operations for maximum performance where object construction overhead is prohibitive. This directly addresses the performance concerns identified in PR apache#7919 by providing direct binary navigation without full object reconstruction, enabling efficient batch operations, and implementing selective field access that prevents the quadratic work patterns identified in the original performance analysis.

What Remains Pending

This PR focuses on runtime access and manipulation rather than construction-time optimization, leaving build-time schema-driven shredding to PR apache#7921. Future work could explore integration with PR apache#7919's compute kernel approach, potentially using this PR's byte-level operations as the underlying implementation.

alamb and others added 4 commits July 16, 2025 13:38
# Which issue does this PR close?

- Related to apache#7395 
- Closes apache#7495
- Closes apache#7377

# Rationale for this change

Let's update tonic to the latest

Given the open and unresolved questions on @rmn-boiko's PR
apache#7377 from @Xuanwo and @sundy-li,
I thought a new PR would result in a faster resolution.

# What changes are included in this PR?

This PR is based on apache#7495 from
@MichaelScofield -- I resolved some merge conflicts and updated
Cargo.toml in the integration tests

# Are these changes tested?

Yes, by CI
# Are there any user-facing changes?
New dependency version

---------

Co-authored-by: LFC <[email protected]>
…pache#7922)

# Which issue does this PR close?

- Part of apache#7896

# Rationale for this change

In apache#7896, we saw that inserting a large number of field names takes a long time -- in that case, ~45s to insert 2**24 field names. The bulk of this time is spent just allocating the strings, but we also see quite a bit of time spent reallocating the `IndexSet` that we're inserting into.

`with_field_names` is an optimization to declare the field names upfront
which avoids having to reallocate and rehash the entire `IndexSet`
during field name insertion. Using this method requires at least 2
string allocations for each field name -- 1 to declare field names
upfront and 1 to insert the actual field name during object building.

This PR adds a new method, `with_field_name_capacity`, which lets you reserve space in the metadata builder without allocating the field names themselves upfront. With it, we see a modest performance improvement when inserting the field names during object building.
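
A usage sketch of the two approaches, assuming both methods are chainable on `VariantBuilder` and that `with_field_names` accepts an iterator of `&str`; the exact receivers are assumptions based on this commit message.

```rust
use parquet_variant::VariantBuilder;

fn main() {
    let field_names: Vec<String> = (0..1_000).map(|i| format!("field_{i}")).collect();

    // Existing optimization: declare every field name upfront.
    // This costs at least two string allocations per name (declare + insert).
    let _upfront = VariantBuilder::new()
        .with_field_names(field_names.iter().map(String::as_str));

    // New in this PR: only reserve capacity in the metadata builder; names are
    // still interned as they are first used, avoiding the rehash/realloc cost.
    let _reserved = VariantBuilder::new().with_field_name_capacity(field_names.len());
}
```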

Before: [benchmark screenshot]

After: [benchmark screenshot]
…che#7914)

# Which issue does this PR close?

- Fixes apache#7907

# Rationale for this change

Trying to append a `VariantObject` or `VariantList` directly on the `VariantBuilder` currently panics.


# Changes to the public API

`VariantBuilder` now has these additional methods (see the usage sketch after this list):

- `append_object`: panics if shallow validation fails or the object has duplicate field names
- `try_append_object`: performs full validation on the object before appending
- `append_list`: panics if shallow validation fails
- `try_append_list`: performs full validation on the list before appending
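
A usage sketch of the new methods. The object-construction calls reflect the parquet-variant builder API as I understand it; passing the object by value and getting a `Result` back from `try_append_object` are assumptions.

```rust
use parquet_variant::{Variant, VariantBuilder};

fn main() {
    // Build a small variant object to use as the source value.
    let mut source = VariantBuilder::new();
    let mut obj = source.new_object();
    obj.insert("id", 1i64);
    obj.finish(); // finalize the nested object
    let (metadata, value) = source.finish();

    // Re-parse it and append it to another builder with the new API.
    let variant = Variant::try_new(&metadata, &value).expect("valid variant");
    if let Variant::Object(object) = variant {
        let mut builder = VariantBuilder::new();
        // `try_append_object` performs full validation before appending;
        // `append_object` (not shown) would instead panic if shallow validation
        // fails or the object has duplicate field names.
        builder
            .try_append_object(object)
            .expect("object failed full validation");
        let (_meta, _val) = builder.finish();
    }
}
```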

---------

Co-authored-by: Andrew Lamb <[email protected]>
# Which issue does this PR close?

- Closes apache#7893

# What changes are included in this PR?

In parquet-variant:
- Add a new function `Variant::get_path`: this traverses the path to
create a new Variant (does not cast any of it).
- Add a new module `parquet_variant::path`: adds structs/enums to define
a path to access a variant value deeply.

In parquet-variant-compute:
- Add a new compute kernel `variant_get`: does the path traversal over a
`VariantArray`. In the future, this would also cast the values to a
specified type.
- Includes some basic unit tests. Not comprehensive.
- Includes a simple micro-benchmark for reference.

Current limitations:
- It can only return another VariantArray. Casts are not implemented
yet.
- Only top-level object/list access is supported. It panics on finding a
nested object/list. Needs apache#7914
to fix this.
- Perf is a TODO.

# Are these changes tested?

Some basic unit tests are added.

# Are there any user-facing changes?

Yes

---------

Co-authored-by: Andrew Lamb <[email protected]>
@alamb commented Jul 16, 2025

woohoo!

alamb and others added 7 commits July 16, 2025 16:08
…he#7774)

# Which issue does this PR close?


- Part of apache#7762


# Rationale for this change

As part of apache#7762 I want to
optimize applying filters by adding a new code path.

To ensure that works well, let's ensure the filtered code path is well
covered with tests


# What changes are included in this PR?

1. Add tests for filtering batches with 0.01%, 1%, 10%, and 90% filters and varying data types


# Are these changes tested?
Only tests, no functional changes


# Are there any user-facing changes?
@scovich left a comment

Not quite sure the right way to review this -- is it better to wait for it to become an arrow-rs PR instead?

@@ -149,6 +150,154 @@ impl VariantArray {
// spec says fields order is not guaranteed, so we search by name
self.inner.column_by_name("value").unwrap()
}

/// Get the metadata bytes for a specific index
pub fn metadata(&self, index: usize) -> &[u8] {

Suggested change
pub fn metadata(&self, index: usize) -> &[u8] {
pub fn metadata_bytes(&self, index: usize) -> &[u8] {

?

Comment on lines +179 to +190
for element in path.elements() {
match element {
crate::field_operations::VariantPathElement::Field(field_name) => {
current_variant = current_variant.get_object_field(field_name)?;
}
crate::field_operations::VariantPathElement::Index(idx) => {
current_variant = current_variant.get_list_element(*idx)?;
}
}
}

Some(current_variant)

I think this is just an Iterator::try_fold?
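
For reference, a sketch of the loop collapsed into `try_fold`, assuming `path.elements()` yields a slice of `VariantPathElement` (imported here without the full module path) and that both accessors return `Option<Variant>`, as the `?` in the original implies. The trailing `Some(current_variant)` becomes unnecessary because `try_fold` already produces an `Option`:

```rust
path.elements().iter().try_fold(current_variant, |variant, element| match element {
    VariantPathElement::Field(field_name) => variant.get_object_field(field_name),
    VariantPathElement::Index(idx) => variant.get_list_element(*idx),
})
```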

Comment on lines +256 to +268
match FieldOperations::remove_field_bytes(
self.metadata(i),
self.value_bytes(i),
field_name,
)? {
Some(new_value) => {
builder.append_variant_buffers(self.metadata(i), &new_value);
}
None => {
// Field didn't exist, use original value
builder.append_variant_buffers(self.metadata(i), self.value_bytes(i));
}
}

I think there's some redundancy here?

Suggested change
match FieldOperations::remove_field_bytes(
self.metadata(i),
self.value_bytes(i),
field_name,
)? {
Some(new_value) => {
builder.append_variant_buffers(self.metadata(i), &new_value);
}
None => {
// Field didn't exist, use original value
builder.append_variant_buffers(self.metadata(i), self.value_bytes(i));
}
}
let new_value = FieldOperations::remove_field_bytes(
self.metadata(i),
self.value_bytes(i),
field_name,
)?;
// Use original value if the field didn't exist
let new_value = new_value.as_ref().unwrap_or_else(|| self.value_bytes(i));
builder.append_variant_buffers(self.metadata(i), new_value);

(again below)


/// Primitive type variants
#[derive(Debug, Clone, PartialEq)]
pub enum PrimitiveType {

It seems like several of the types here are similar to similar ones defined elsewhere?
Is there a way to harmonize them?


(a lot of redundant logic as well)

Comment on lines +128 to +130
match primitive_type {
0 => Ok(PrimitiveType::Null),
1 => Ok(PrimitiveType::True),

Honest question: Is it more readable to have a bunch of Ok(...)? Or to pull out the result and wrap it once?

        let result = match primitive_type {
            0 => PrimitiveType::Null,
            1 => PrimitiveType::True,
              ...
            _ => {
                return Err(ArrowError::InvalidArgumentError(format!(...)));
            }
        };
        Ok(result)

Comment on lines +157 to +160
if length > 13 {
return Err(ArrowError::InvalidArgumentError(format!(
"Short string length {} exceeds maximum of 13",
length

Isn't the string length a 6-bit value? The spec says:

The "short string" basic type may be used as an optimization to fold string length into the type byte for strings less than 64 bytes.

| PrimitiveType::TimestampNtz
| PrimitiveType::TimestampLtz => 8,
PrimitiveType::Decimal16 => 16,
PrimitiveType::Binary | PrimitiveType::String => 0, // Variable length, need to read from data

I wonder if this method should return Option<usize> to distinguish null/true/false from binary/string?
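
A sketch of that suggestion, using a trimmed-down stand-in for the PR's `PrimitiveType`: returning `Option<usize>` lets callers distinguish genuinely zero-sized values (null/true/false) from variable-length ones whose size has to be read from the data.

```rust
// Trimmed-down stand-in for the PR's PrimitiveType, for illustration only.
enum PrimitiveType {
    Null,
    True,
    False,
    TimestampNtz,
    TimestampLtz,
    Decimal16,
    Binary,
    String,
}

/// Fixed encoded size in bytes, or `None` for variable-length types.
fn fixed_size(ty: &PrimitiveType) -> Option<usize> {
    match ty {
        PrimitiveType::Null | PrimitiveType::True | PrimitiveType::False => Some(0),
        PrimitiveType::TimestampNtz | PrimitiveType::TimestampLtz => Some(8),
        PrimitiveType::Decimal16 => Some(16),
        // Variable length: the size must be read from the value data itself.
        PrimitiveType::Binary | PrimitiveType::String => None,
    }
}
```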
