-
Notifications
You must be signed in to change notification settings - Fork 974
[Variant] Impl PartialEq
for VariantObject
#7943
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Impl PartialEq
for VariantObject
#7943
Conversation
Variant::ShortString(s) => self.append_short_string(s), | ||
Variant::Object(obj) => self.append_object(metadata_builder, obj), | ||
Variant::List(list) => self.append_list(metadata_builder, list), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unfortunate to have to copy the same match statement as try_append_variant
, but ValueBuffer::append_object
and ValueBuffer::try_append_object
call different iterators and append methods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs
Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also left a suggestion above about how to potentially unify the code paths
3f2c734
to
8e8769c
Compare
I'm going to do a full pass and write some comments for the new methods I introduced tomorrow |
a961b17
to
18b3ca4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- this is looking great.
Can we please add some tests -- both positive and negative, for the partialeq implementation?
Like show creating two objects with the same fields but differnetly populated metadata are still equal
And also show that two objects creatd with the same metadata/bytes will still be equal
Also that two objects created from two different builders but the same vall sequence are are eual
And that two objects with different fields are not equal
&& self.first_value_byte == other.first_value_byte | ||
&& self.validated == other.validated; | ||
|
||
// value validation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
We can probably add a fast path for the case when the metadata is actually the same and then we can just compare the field ids
But that could also be done in a different PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I am not sure if we can compare by field names.
Here's a test case where two variants have the same field names but differing values. The two objects will have the same metadata, but it will fail the logical comparison.
fn foo() {
let mut b = VariantBuilder::new();
let mut o = b.new_object();
o.insert("a", ());
o.insert("b", 4.3);
o.finish().unwrap();
let (m, v) = b.finish();
let v1 = Variant::try_new(&m, &v).unwrap();
// second object, same field name but different values
let mut b = VariantBuilder::new();
let mut o = b.new_object();
o.insert("a", ());
let mut inner_o = o.new_object("b");
inner_o.insert("a", 3.3);
inner_o.finish().unwrap();
o.finish().unwrap();
let (m, v) = b.finish();
let v2 = Variant::try_new(&m, &v).unwrap();
let m1 = v1.metadata().unwrap();
let m2 = v2.metadata().unwrap();
// metadata would be equal since they contain the same keys
assert_eq!(m1, m2);
// but the objects are not equal
assert_ne!(v1, v2);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @scovich, I'm curious if this looks reasonable to you? I may be misunderstanding something..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #7961 (comment)
For example, depending on what we want to achieve with these logical comparisons, two logically equivalent objects need not have the same header byte.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why the above example is wrong tho? The metadata dictionary being the same says very little about any two objects that happen to use it?
b67b6c7
to
892ad1d
Compare
Hm, I see what you mean. I guess whether or not the metadata is sorted shouldn't matter when doing a logical comparison. i.e. field ids ordering do not matter |
892ad1d
to
29010e4
Compare
Unrelated but these public methods on arrow-rs/parquet-variant/src/variant/metadata.rs Lines 210 to 213 in 03a837e
arrow-rs/parquet-variant/src/variant/metadata.rs Lines 288 to 291 in 03a837e
|
|
28b3785
to
2fda8df
Compare
2fda8df
to
5b97872
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew -- I left some comments for your consideration but I also think we could address them as follow ons
Variant::ShortString(s) => self.append_short_string(s), | ||
Variant::Object(obj) => self.append_object(metadata_builder, obj), | ||
Variant::List(list) => self.append_list(metadata_builder, list), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see -- I think it is the unfortunate consequence of having the distinction between the "validated" and "non validated" variant APIs
Maybe we can add a few comments explaining the practical difference (performance vs extra error checking)
impl<'m> PartialEq for VariantMetadata<'m> { | ||
fn eq(&self, other: &Self) -> bool { | ||
let mut is_equal = self.is_empty() == other.is_empty() | ||
&& self.is_fully_validated() == other.is_fully_validated() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the validated
and is_fully_validated
flags doesn't need to be part of a logical type check? Like two variants can be equal by value even if one is fully validated and one is not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect the following to pass for all variants and metadata
let variant1 = Variant::new(metadata, buffers);
let variant2 = Variant::new(metadata, buffers).with_full_validation();
assert_eq!(variant1, variant2)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
// create another object pre-filled with field names, b and a | ||
// but insert the fields in the order of a, b | ||
let mut b = VariantBuilder::new().with_field_names(["b", "a"].into_iter()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
&& self.num_elements == other.num_elements | ||
&& self.first_field_offset_byte == other.first_field_offset_byte | ||
&& self.first_value_byte == other.first_value_byte | ||
&& self.validated == other.validated; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment here about validation
) -> Result<(), ArrowError> { | ||
let mut object_builder = self.new_object(metadata_builder); | ||
|
||
for res in obj.iter_try() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be able to consolidate the apis if you checked the is_validated
dynamically
For example, have a single try_append_object and internally try
/// if source variant is already validated, use faster APIs
if obj.is_validated() {
for (field_name, value) in obj.iter() {
object_builder.insert(field_name, value);
}
} else {
for res in obj.iter_try() {
let (field_name, value) = res?;
object_builder.try_insert(field_name, value)?;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Except append_object
is infallible... so even if we "consolidate" by having try_append_object
fall back to its infallible partner, both similar-but-different for loops will still exist.
Variant::ShortString(s) => self.append_short_string(s), | ||
Variant::Object(obj) => self.append_object(metadata_builder, obj), | ||
Variant::List(list) => self.append_list(metadata_builder, list), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also left a suggestion above about how to potentially unify the code paths
let is_equal = self.is_empty() == other.is_empty() | ||
&& self.is_fully_validated() == other.is_fully_validated() | ||
&& self.first_value_byte == other.first_value_byte | ||
&& self.validated == other.validated; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could break early here if is_equal is false and not check the fields
@@ -332,6 +334,30 @@ impl<'m> VariantMetadata<'m> { | |||
} | |||
} | |||
|
|||
// According to the spec, metadata dictionaries are not required to be in a specific order, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a special is_equal check for VariantMetadata? It seems like now since VariantObject doesn't include a check for the metadata being equal we could avoid a special equality 🤔
|
||
// objects are still logically equal | ||
assert_eq!(v1, v2); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should also add some other tests eventually too -- like lists and primitives
What I plan to do is merge this PR and then write some more tests for a few more cases |
Here is a follow on PR: |
for field_name in self.iter() { | ||
if !other_field_names.contains(field_name) { | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems one-sided? Don't we need to prove the symmetric difference is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. You are right. I think @friendlymatthew has a fix for it in
// Instead of comparing the raw bytes of 2 variant metadata instances, this implementation | ||
// checks whether the dictionary entries are equal -- regardless of their sorting order | ||
impl<'m> PartialEq for VariantMetadata<'m> { | ||
fn eq(&self, other: &Self) -> bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the cost of constructing hash maps etc, is it worth adding the following quick-check in case the dictionaries are both sorted?
if self.is_ordered() && other.is_ordered() {
if self.len() != other.len() {
return false;
}
let self_value_bytes = ... all string value bytes ...;
let other_value_bytes = ... all string value bytes ...;
return self_value_bytes == other_value_bytes;
}
Before trusting a single long string compare tho, we would need to convince ourselves that there's no way two dictionaries with different offsets can have identical value bytes. Otherwise, we'd have to loop over the two sets of strings manually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice. I will think about this and push up a PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Which issue does this PR close? - Follow on to #7943 - Part of #7948 # Rationale for this change I found a few more tests I would like to have seen while reviewing #7943 # What changes are included in this PR? Add some list equality tests # Are these changes tested? It is only tests, no functionality changes # Are there any user-facing changes? No
match other_fields.get(field_name as &str) { | ||
Some(other_variant) => { | ||
is_equal = is_equal && variant == *other_variant; | ||
} | ||
None => return false, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
post-hoc nit: we should short circuit
if other_fields.get(field_name as &str).is_none_or(|other| variant != *other) {
return false;
}
... but the check is actually incomplete because it fails to prove the symmetric difference is empty.
Rationale for this change
This PR introduces a custom implementation of
PartialEq
for variant objects.According to the spec, field values are not required to be in the same order as the field IDs, to enable flexibility when constructing Variant values.
Instead of comparing the raw bytes of 2 variant objects, this implementation recursively checks whether the field values are equal -- regardless of their order