[Variant] Impl `PartialEq` for VariantObject #7943

friendlymatthew · 2025-07-16T20:34:32Z

Rationale for this change

Closes [Variant] Impl PartialEq for VariantObject #7943 #7948

This PR introduces a custom implementation of PartialEq for variant objects.

According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values.

Instead of comparing the raw bytes of two variant objects, this implementation recursively checks whether the field values are equal, regardless of their order.
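As an illustration of the idea described above, here is a minimal sketch of order-insensitive, recursive equality. It uses a toy `Value` enum rather than the real `parquet-variant` types; `Value`, `objects_equal`, and the assumption that field names are unique within an object are all hypothetical and are not taken from this PR's diff.

```rust
use std::collections::HashMap;

/// Toy stand-in for a variant value (hypothetical, not the parquet-variant type).
#[derive(Debug, Clone)]
enum Value {
    Null,
    Int(i64),
    Str(String),
    /// Fields are kept in insertion order, which may differ between two
    /// logically identical objects.
    Object(Vec<(String, Value)>),
}

impl PartialEq for Value {
    fn eq(&self, other: &Self) -> bool {
        match (self, other) {
            (Value::Null, Value::Null) => true,
            (Value::Int(a), Value::Int(b)) => a == b,
            (Value::Str(a), Value::Str(b)) => a == b,
            // Nested objects recurse through `objects_equal`.
            (Value::Object(a), Value::Object(b)) => objects_equal(a, b),
            _ => false,
        }
    }
}

/// Order-insensitive comparison, assuming field names are unique per object.
fn objects_equal(a: &[(String, Value)], b: &[(String, Value)]) -> bool {
    // With unique names, equal lengths plus "every field of `a` has an equal
    // counterpart in `b`" implies the two field sets are identical.
    if a.len() != b.len() {
        return false;
    }
    let b_fields: HashMap<&str, &Value> =
        b.iter().map(|(name, value)| (name.as_str(), value)).collect();
    a.iter().all(|(name, value)| {
        b_fields
            .get(name.as_str())
            .map_or(false, |other| value == *other)
    })
}
```

Two toy objects that list the same fields in a different order compare equal, while changing any field value makes them unequal.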

friendlymatthew · 2025-07-16T20:36:20Z

parquet-variant/src/builder.rs

+            Variant::ShortString(s) => self.append_short_string(s),
+            Variant::Object(obj) => self.append_object(metadata_builder, obj),
+            Variant::List(list) => self.append_list(metadata_builder, list),
+        }


It is unfortunate to have to copy the same match statement as try_append_variant, but ValueBuffer::append_object and ValueBuffer::try_append_object call different iterators and append methods

I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs

Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)

I also left a suggestion above about how to potentially unify the code paths

friendlymatthew · 2025-07-16T20:54:02Z

I'm going to do a full pass and write some comments for the new methods I introduced tomorrow

alamb

Thank you @friendlymatthew -- this is looking great.

Can we please add some tests -- both positive and negative, for the partialeq implementation?

Like show creating two objects with the same fields but differently populated metadata are still equal

And also show that two objects created with the same metadata/bytes will still be equal

Also that two objects created from two different builders but the same call sequence are equal

And that two objects with different fields are not equal

alamb · 2025-07-16T20:56:49Z

parquet-variant/src/variant/object.rs

+            && self.first_value_byte == other.first_value_byte
+            && self.validated == other.validated;
+
+        // value validation


👍

We can probably add a fast path for the case when the metadata is actually the same and then we can just compare the field ids

But that could also be done in a different PR

Hm, I am not sure if we can compare by field names.

Here's a test case where two variants have the same field names but differing values. The two objects will have the same metadata, but it will fail the logical comparison.

    fn foo() {
        let mut b = VariantBuilder::new();
        let mut o = b.new_object();
        o.insert("a", ());
        o.insert("b", 4.3);
        o.finish().unwrap();
        let (m, v) = b.finish();
        let v1 = Variant::try_new(&m, &v).unwrap();

        // second object, same field name but different values
        let mut b = VariantBuilder::new();
        let mut o = b.new_object();
        o.insert("a", ());
        let mut inner_o = o.new_object("b");
        inner_o.insert("a", 3.3);
        inner_o.finish().unwrap();
        o.finish().unwrap();
        let (m, v) = b.finish();
        let v2 = Variant::try_new(&m, &v).unwrap();

        let m1 = v1.metadata().unwrap();
        let m2 = v2.metadata().unwrap();

        // metadata would be equal since they contain the same keys
        assert_eq!(m1, m2);

        // but the objects are not equal
        assert_ne!(v1, v2);
    }

cc @scovich, I'm curious if this looks reasonable to you? I may be misunderstanding something..

See #7961 (comment)

For example, depending on what we want to achieve with these logical comparisons, two logically equivalent objects need not have the same header byte.

I'm not sure I understand why the above example is wrong tho? The metadata dictionary being the same says very little about any two objects that happen to use it?

parquet-variant/src/variant/object.rs

friendlymatthew · 2025-07-17T08:06:13Z

Like show creating two objects with the same fields but differently populated metadata are still equal

Hm, I see what you mean. I guess whether or not the metadata is sorted shouldn't matter when doing a logical comparison, i.e. the ordering of field ids does not matter

friendlymatthew · 2025-07-17T08:51:16Z

Unrelated but these public methods on VariantMetadata are redundant:

arrow-rs/parquet-variant/src/variant/metadata.rs

Lines 210 to 213 in 03a837e

    
    /// The number of metadata dictionary entries
    pub fn len(&self) -> usize {
        self.dictionary_size()
    }

arrow-rs/parquet-variant/src/variant/metadata.rs

Lines 288 to 291 in 03a837e

    
    /// Get the dictionary size
    pub const fn dictionary_size(&self) -> usize {
        self.dictionary_size as _
    }

alamb · 2025-07-17T11:40:58Z

Unrelated but these public methods on VariantMetadata are redundant:

Filed an issue to track: [Variant] remove VariantMetadata::dictionary_size #7947

alamb

Thank you @friendlymatthew -- I left some comments for your consideration but I also think we could address them as follow ons

alamb · 2025-07-17T11:44:00Z

parquet-variant/src/builder.rs

+            Variant::ShortString(s) => self.append_short_string(s),
+            Variant::Object(obj) => self.append_object(metadata_builder, obj),
+            Variant::List(list) => self.append_list(metadata_builder, list),
+        }


I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs

Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)

alamb · 2025-07-17T11:45:30Z

parquet-variant/src/variant/metadata.rs

+impl<'m> PartialEq for VariantMetadata<'m> {
+    fn eq(&self, other: &Self) -> bool {
+        let mut is_equal = self.is_empty() == other.is_empty()
+            && self.is_fully_validated() == other.is_fully_validated()


It seems like the validated and is_fully_validated flags don't need to be part of a logical equality check? Two variants can be equal by value even if one is fully validated and one is not

I would expect the following to pass for all variants and metadata

    let variant1 = Variant::new(metadata, buffers);
    let variant2 = Variant::new(metadata, buffers).with_full_validation();
    assert_eq!(variant1, variant2)

Filed validated and is_fully_validated flags doesn't need to be part of PartialEq #7952 to track
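The expectation in this thread can be sketched with a toy type. `ToyVariant` and its fields are hypothetical stand-ins, not the real `Variant` API; the sketch only shows a hand-written `PartialEq` that leaves the validation flag out of the comparison.

```rust
/// Hypothetical stand-in for a variant view: raw bytes plus a flag recording
/// whether full validation has already run.
#[derive(Debug, Clone)]
struct ToyVariant {
    bytes: Vec<u8>,
    validated: bool,
}

impl ToyVariant {
    fn new(bytes: Vec<u8>) -> Self {
        Self { bytes, validated: false }
    }

    /// Marks the value as validated. A real implementation would inspect the
    /// bytes here; this toy version only flips the flag.
    fn with_full_validation(mut self) -> Self {
        self.validated = true;
        self
    }
}

impl PartialEq for ToyVariant {
    fn eq(&self, other: &Self) -> bool {
        // `validated` is deliberately ignored: it records how much checking
        // has been done, not what the value is.
        self.bytes == other.bytes
    }
}
```

Deriving `PartialEq` instead would compare every field, including `validated`, which is the behavior this comment argues against.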

alamb · 2025-07-17T12:50:27Z

parquet-variant/src/variant/object.rs

+
+        // create another object pre-filled with field names, b and a
+        // but insert the fields in the order of a, b
+        let mut b = VariantBuilder::new().with_field_names(["b", "a"].into_iter());


alamb · 2025-07-17T12:53:26Z

parquet-variant/src/variant/object.rs

+            && self.num_elements == other.num_elements
+            && self.first_field_offset_byte == other.first_field_offset_byte
+            && self.first_value_byte == other.first_value_byte
+            && self.validated == other.validated;


same comment here about validation

alamb · 2025-07-17T12:55:19Z

parquet-variant/src/builder.rs

+    ) -> Result<(), ArrowError> {
+        let mut object_builder = self.new_object(metadata_builder);
+
+        for res in obj.iter_try() {


You might be able to consolidate the APIs if you checked is_validated dynamically

For example, have a single try_append_object and internally try

    /// if source variant is already validated, use faster APIs
    if obj.is_validated() {
        for (field_name, value) in obj.iter() {
            object_builder.insert(field_name, value);
        }
    } else {
        for res in obj.iter_try() {
            let (field_name, value) = res?;
            object_builder.try_insert(field_name, value)?;
        }
    }

Except append_object is infallible... so even if we "consolidate" by having try_append_object fall back to its infallible partner, both similar-but-different for loops will still exist.

alamb · 2025-07-17T12:55:34Z

parquet-variant/src/builder.rs

+            Variant::ShortString(s) => self.append_short_string(s),
+            Variant::Object(obj) => self.append_object(metadata_builder, obj),
+            Variant::List(list) => self.append_list(metadata_builder, list),
+        }


I also left a suggestion above about how to potentially unify the code paths

alamb · 2025-07-17T12:56:46Z

parquet-variant/src/variant/metadata.rs

+        let is_equal = self.is_empty() == other.is_empty()
+            && self.is_fully_validated() == other.is_fully_validated()
+            && self.first_value_byte == other.first_value_byte
+            && self.validated == other.validated;


You could break early here if is_equal is false and not check the fields

alamb · 2025-07-17T12:58:10Z

parquet-variant/src/variant/metadata.rs

@@ -332,6 +334,30 @@ impl<'m> VariantMetadata<'m> {
    }
 }

+// According to the spec, metadata dictionaries are not required to be in a specific order,


Do we need a special is_equal check for VariantMetadata? It seems like now since VariantObject doesn't include a check for the metadata being equal we could avoid a special equality 🤔

alamb · 2025-07-17T12:59:20Z

parquet-variant/src/variant/object.rs

+
+        // objects are still logically equal
+        assert_eq!(v1, v2);
+    }


I think we should also add some other tests eventually too -- like lists and primitives

alamb · 2025-07-17T15:23:57Z

What I plan to do is merge this PR and then write some more tests for a few more cases

alamb · 2025-07-17T16:14:08Z

Here is a follow on PR:

[Test] Add tests for VariantList equality #7953

scovich · 2025-07-17T20:19:41Z

parquet-variant/src/variant/metadata.rs

+        for field_name in self.iter() {
+            if !other_field_names.contains(field_name) {
+                return false;


This seems one-sided? Don't we need to prove the symmetric difference is empty?

Yes. You are right. I think @friendlymatthew has a fix for it in

[Variant] Revisit VariantMetadata and Object equality #7961
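To make the one-sidedness concrete, here is a small sketch over plain string slices. The function names `one_sided_eq` and `symmetric_eq` are hypothetical and this is not the code from #7961; it only contrasts a containment check with a true set comparison.

```rust
use std::collections::HashSet;

/// One-sided check: only proves that every name in `a` also appears in `b`.
fn one_sided_eq(a: &[&str], b: &[&str]) -> bool {
    let b_names: HashSet<&str> = b.iter().copied().collect();
    a.iter().all(|name| b_names.contains(name))
}

/// Symmetric check: the two sets of names must be identical, so the
/// symmetric difference is empty.
fn symmetric_eq(a: &[&str], b: &[&str]) -> bool {
    let a_names: HashSet<&str> = a.iter().copied().collect();
    let b_names: HashSet<&str> = b.iter().copied().collect();
    a_names == b_names
}
```

With `a = ["a"]` and `b = ["a", "b"]`, the one-sided check reports equality even though `b` has an extra entry, while the symmetric check rejects the pair.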

scovich · 2025-07-17T20:24:26Z

parquet-variant/src/variant/metadata.rs

+// Instead of comparing the raw bytes of 2 variant metadata instances, this implementation
+// checks whether the dictionary entries are equal -- regardless of their sorting order
+impl<'m> PartialEq for VariantMetadata<'m> {
+    fn eq(&self, other: &Self) -> bool {


Given the cost of constructing hash maps etc, is it worth adding the following quick-check in case the dictionaries are both sorted?

    if self.is_ordered() && other.is_ordered() {
        if self.len() != other.len() {
            return false;
        }
        let self_value_bytes = ... all string value bytes ...;
        let other_value_bytes = ... all string value bytes ...;
        return self_value_bytes == other_value_bytes;
    }

Before trusting a single long string compare tho, we would need to convince ourselves that there's no way two dictionaries with different offsets can have identical value bytes. Otherwise, we'd have to loop over the two sets of strings manually.

Very nice. I will think about this and push up a PR.

Follow up: https://github.com/apache/arrow-rs/pull/7961/files
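The caveat in this thread about trusting a single long byte compare can be illustrated with a toy model. `ToyDictionary` is hypothetical and much simpler than the real metadata encoding; it assumes entries are stored back to back with an offsets array delimiting them. In that model, two sorted dictionaries of the same length can share identical value bytes yet split them at different offsets.

```rust
/// Hypothetical model of a metadata dictionary: string bytes stored back to
/// back, plus the offsets that delimit each entry.
struct ToyDictionary {
    offsets: Vec<usize>,
    value_bytes: Vec<u8>,
}

impl ToyDictionary {
    fn from_entries(entries: &[&str]) -> Self {
        let mut offsets = vec![0];
        let mut value_bytes = Vec::new();
        for entry in entries {
            value_bytes.extend_from_slice(entry.as_bytes());
            offsets.push(value_bytes.len());
        }
        Self { offsets, value_bytes }
    }

    /// Number of dictionary entries.
    fn len(&self) -> usize {
        self.offsets.len() - 1
    }
}

/// Compares entry count and concatenated bytes only; not sufficient by itself.
fn bytes_only_eq(a: &ToyDictionary, b: &ToyDictionary) -> bool {
    a.len() == b.len() && a.value_bytes == b.value_bytes
}

/// Also compares offsets, which pins down where each entry starts and ends.
fn offsets_and_bytes_eq(a: &ToyDictionary, b: &ToyDictionary) -> bool {
    a.offsets == b.offsets && a.value_bytes == b.value_bytes
}
```

For example, `["a", "bc"]` and `["ab", "c"]` are both sorted, both have two entries, and both concatenate to `abc`, so in this toy model a bytes-only fast path would also need to compare the offsets.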

# Which issue does this PR close?
- Follow on to #7943
- Part of #7948

# Rationale for this change
I found a few more tests I would like to have seen while reviewing #7943

# What changes are included in this PR?
Add some list equality tests

# Are these changes tested?
It is only tests, no functionality changes

# Are there any user-facing changes?
No

scovich · 2025-07-18T16:32:10Z

parquet-variant/src/variant/object.rs

+            match other_fields.get(field_name as &str) {
+                Some(other_variant) => {
+                    is_equal = is_equal && variant == *other_variant;
+                }
+                None => return false,
+            }


post-hoc nit: we should short circuit

    if other_fields.get(field_name as &str).is_none_or(|other| variant != *other) {
        return false;
    }

... but the check is actually incomplete because it fails to prove the symmetric difference is empty.

github-actions bot added the parquet Changes to the parquet crate label Jul 16, 2025

friendlymatthew commented Jul 16, 2025


friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 3 times, most recently from 3f2c734 to 8e8769c Compare July 16, 2025 20:53

friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 2 times, most recently from a961b17 to 18b3ca4 Compare July 16, 2025 20:59

alamb reviewed Jul 16, 2025


friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 3 times, most recently from b67b6c7 to 892ad1d Compare July 17, 2025 07:58

friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch from 892ad1d to 29010e4 Compare July 17, 2025 08:47

alamb mentioned this pull request Jul 17, 2025

[Variant] remove VariantMetadata::dictionary_size #7947

Closed

friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 2 times, most recently from 28b3785 to 2fda8df Compare July 17, 2025 11:48

Impl PartialEq for VariantObject

5b97872

friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch from 2fda8df to 5b97872 Compare July 17, 2025 12:17

alamb approved these changes Jul 17, 2025


alamb merged commit d0fa24e into apache:main Jul 17, 2025
13 checks passed

This was referenced Jul 17, 2025

validated and is_fully_validated flags doesn't need to be part of PartialEq #7952

Open

[Test] Add tests for VariantList equality #7953

Merged

alamb mentioned this pull request Jul 17, 2025

[Variant] Support appending complex variants in VariantBuilder #7914

Merged

scovich reviewed Jul 17, 2025


scovich reviewed Jul 18, 2025
