Skip to content

[Variant] Impl PartialEq for VariantObject #7943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jul 16, 2025

Rationale for this change

This PR introduces a custom implementation of PartialEq for variant objects.

According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values.

Instead of comparing the raw bytes of 2 variant objects, this implementation recursively checks whether the field values are equal -- regardless of their order

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 16, 2025
Variant::ShortString(s) => self.append_short_string(s),
Variant::Object(obj) => self.append_object(metadata_builder, obj),
Variant::List(list) => self.append_list(metadata_builder, list),
}
Copy link
Contributor Author

@friendlymatthew friendlymatthew Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unfortunate to have to copy the same match statement as try_append_variant, but ValueBuffer::append_object and ValueBuffer::try_append_object call different iterators and append methods

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs

Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also left a suggestion above about how to potentially unify the code paths

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 3 times, most recently from 3f2c734 to 8e8769c Compare July 16, 2025 20:53
@friendlymatthew
Copy link
Contributor Author

I'm going to do a full pass and write some comments for the new methods I introduced tomorrow

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 2 times, most recently from a961b17 to 18b3ca4 Compare July 16, 2025 20:59
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- this is looking great.

Can we please add some tests -- both positive and negative, for the partialeq implementation?

Like show creating two objects with the same fields but differnetly populated metadata are still equal

And also show that two objects creatd with the same metadata/bytes will still be equal

Also that two objects created from two different builders but the same vall sequence are are eual

And that two objects with different fields are not equal

&& self.first_value_byte == other.first_value_byte
&& self.validated == other.validated;

// value validation
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

We can probably add a fast path for the case when the metadata is actually the same and then we can just compare the field ids

But that could also be done in a different PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I am not sure if we can compare by field names.

Here's a test case where two variants have the same field names but differing values. The two objects will have the same metadata, but it will fail the logical comparison.

fn foo() {
    let mut b = VariantBuilder::new();
    let mut o = b.new_object();

    o.insert("a", ());
    o.insert("b", 4.3);

    o.finish().unwrap();

    let (m, v) = b.finish();

    let v1 = Variant::try_new(&m, &v).unwrap();

    // second object, same field name but different values
    let mut b = VariantBuilder::new();
    let mut o = b.new_object();

    o.insert("a", ());
    let mut inner_o = o.new_object("b");
    inner_o.insert("a", 3.3);
    inner_o.finish().unwrap();
    o.finish().unwrap();

    let (m, v) = b.finish();

    let v2 = Variant::try_new(&m, &v).unwrap();

    let m1 = v1.metadata().unwrap();
    let m2 = v2.metadata().unwrap();

    // metadata would be equal since they contain the same keys
    assert_eq!(m1, m2);

    // but the objects are not equal
    assert_ne!(v1, v2);
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @scovich, I'm curious if this looks reasonable to you? I may be misunderstanding something..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #7961 (comment)

For example, depending on what we want to achieve with these logical comparisons, two logically equivalent objects need not have the same header byte.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why the above example is wrong tho? The metadata dictionary being the same says very little about any two objects that happen to use it?

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 3 times, most recently from b67b6c7 to 892ad1d Compare July 17, 2025 07:58
@friendlymatthew
Copy link
Contributor Author

Like show creating two objects with the same fields but differnetly populated metadata are still equal

Hm, I see what you mean. I guess whether or not the metadata is sorted shouldn't matter when doing a logical comparison. i.e. field ids ordering do not matter

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch from 892ad1d to 29010e4 Compare July 17, 2025 08:47
@friendlymatthew
Copy link
Contributor Author

Unrelated but these public methods on VariantMetadata are redundant:

/// The number of metadata dictionary entries
pub fn len(&self) -> usize {
self.dictionary_size()
}

/// Get the dictionary size
pub const fn dictionary_size(&self) -> usize {
self.dictionary_size as _
}

@alamb
Copy link
Contributor

alamb commented Jul 17, 2025

Unrelated but these public methods on VariantMetadata are redundant:

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch 2 times, most recently from 28b3785 to 2fda8df Compare July 17, 2025 11:48
@friendlymatthew friendlymatthew force-pushed the friendlymatthew/partial-eq-variant-obj branch from 2fda8df to 5b97872 Compare July 17, 2025 12:17
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @friendlymatthew -- I left some comments for your consideration but I also think we could address them as follow ons

Variant::ShortString(s) => self.append_short_string(s),
Variant::Object(obj) => self.append_object(metadata_builder, obj),
Variant::List(list) => self.append_list(metadata_builder, list),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs

Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)

impl<'m> PartialEq for VariantMetadata<'m> {
fn eq(&self, other: &Self) -> bool {
let mut is_equal = self.is_empty() == other.is_empty()
&& self.is_fully_validated() == other.is_fully_validated()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the validated and is_fully_validated flags doesn't need to be part of a logical type check? Like two variants can be equal by value even if one is fully validated and one is not

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the following to pass for all variants and metadata

let variant1 = Variant::new(metadata, buffers);
let variant2 = Variant::new(metadata, buffers).with_full_validation();
assert_eq!(variant1, variant2)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


// create another object pre-filled with field names, b and a
// but insert the fields in the order of a, b
let mut b = VariantBuilder::new().with_field_names(["b", "a"].into_iter());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

&& self.num_elements == other.num_elements
&& self.first_field_offset_byte == other.first_field_offset_byte
&& self.first_value_byte == other.first_value_byte
&& self.validated == other.validated;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here about validation

) -> Result<(), ArrowError> {
let mut object_builder = self.new_object(metadata_builder);

for res in obj.iter_try() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to consolidate the apis if you checked the is_validated dynamically

For example, have a single try_append_object and internally try

/// if source variant is already validated, use faster APIs
if obj.is_validated() {
        for (field_name, value) in obj.iter() {
            object_builder.insert(field_name, value);
        }
} else {
        for res in obj.iter_try() {
             let (field_name, value) = res?;
            object_builder.try_insert(field_name, value)?;
        }
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except append_object is infallible... so even if we "consolidate" by having try_append_object fall back to its infallible partner, both similar-but-different for loops will still exist.

Variant::ShortString(s) => self.append_short_string(s),
Variant::Object(obj) => self.append_object(metadata_builder, obj),
Variant::List(list) => self.append_list(metadata_builder, list),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also left a suggestion above about how to potentially unify the code paths

let is_equal = self.is_empty() == other.is_empty()
&& self.is_fully_validated() == other.is_fully_validated()
&& self.first_value_byte == other.first_value_byte
&& self.validated == other.validated;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could break early here if is_equal is false and not check the fields

@@ -332,6 +334,30 @@ impl<'m> VariantMetadata<'m> {
}
}

// According to the spec, metadata dictionaries are not required to be in a specific order,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a special is_equal check for VariantMetadata? It seems like now since VariantObject doesn't include a check for the metadata being equal we could avoid a special equality 🤔


// objects are still logically equal
assert_eq!(v1, v2);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also add some other tests eventually too -- like lists and primitives

@alamb
Copy link
Contributor

alamb commented Jul 17, 2025

What I plan to do is merge this PR and then write some more tests for a few more cases

@alamb alamb merged commit d0fa24e into apache:main Jul 17, 2025
13 checks passed
@alamb
Copy link
Contributor

alamb commented Jul 17, 2025

Here is a follow on PR:

Comment on lines +351 to +353
for field_name in self.iter() {
if !other_field_names.contains(field_name) {
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems one-sided? Don't we need to prove the symmetric difference is empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. You are right. I think @friendlymatthew has a fix for it in

// Instead of comparing the raw bytes of 2 variant metadata instances, this implementation
// checks whether the dictionary entries are equal -- regardless of their sorting order
impl<'m> PartialEq for VariantMetadata<'m> {
fn eq(&self, other: &Self) -> bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the cost of constructing hash maps etc, is it worth adding the following quick-check in case the dictionaries are both sorted?

if self.is_ordered() && other.is_ordered() {
    if self.len() != other.len() {
        return false;
    }
    let self_value_bytes = ... all string value bytes ...;
    let other_value_bytes = ... all string value bytes ...;
    return self_value_bytes == other_value_bytes;
}

Before trusting a single long string compare tho, we would need to convince ourselves that there's no way two dictionaries with different offsets can have identical value bytes. Otherwise, we'd have to loop over the two sets of strings manually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice. I will think about this and push up a PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alamb added a commit that referenced this pull request Jul 18, 2025
# Which issue does this PR close?
- Follow on to #7943
- Part of #7948

# Rationale for this change

I found a few more tests I would like to have seen while reviewing
#7943

# What changes are included in this PR?

Add some list equality tests

# Are these changes tested?

It is only tests, no functionality changes

# Are there any user-facing changes?

No
Comment on lines +412 to +417
match other_fields.get(field_name as &str) {
Some(other_variant) => {
is_equal = is_equal && variant == *other_variant;
}
None => return false,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

post-hoc nit: we should short circuit

            if other_fields.get(field_name as &str).is_none_or(|other| variant != *other) {
                return false;
            }

... but the check is actually incomplete because it fails to prove the symmetric difference is empty.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Variant] Impl PartialEq for VariantObject #7943
3 participants