Skip to content

[Variant] Avoid collecting offset iterator #7934

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
58 changes: 36 additions & 22 deletions parquet-variant/src/variant/metadata.rs
Original file line number Diff line number Diff line change
Expand Up @@ -234,44 +234,58 @@ impl<'m> VariantMetadata<'m> {
self.header.first_offset_byte() as _..self.first_value_byte as _,
)?;

let offsets =
map_bytes_to_offsets(offset_bytes, self.header.offset_size).collect::<Vec<_>>();

// Verify the string values in the dictionary are UTF-8 encoded strings.
let value_buffer =
string_from_slice(self.bytes, 0, self.first_value_byte as _..self.bytes.len())?;

let mut offsets_iter = map_bytes_to_offsets(offset_bytes, self.header.offset_size);
let mut current_offset = offsets_iter.next().unwrap_or(0);

if self.header.is_sorted {
// Validate the dictionary values are unique and lexicographically sorted
//
// Since we use the offsets to access dictionary values, this also validates
// offsets are in-bounds and monotonically increasing
let are_dictionary_values_unique_and_sorted = (1..offsets.len())
.map(|i| {
let field_range = offsets[i - 1]..offsets[i];
value_buffer.get(field_range)
})
.is_sorted_by(|a, b| match (a, b) {
(Some(a), Some(b)) => a < b,
_ => false,
});

if !are_dictionary_values_unique_and_sorted {
return Err(ArrowError::InvalidArgumentError(
"dictionary values are not unique and ordered".to_string(),
));
let mut prev_value: Option<&str> = None;

for next_offset in offsets_iter {
if next_offset <= current_offset {
return Err(ArrowError::InvalidArgumentError(
"offsets not monotonically increasing".to_string(),
));
}

let current_value =
value_buffer
.get(current_offset..next_offset)
.ok_or_else(|| {
ArrowError::InvalidArgumentError("offset out of bounds".to_string())
})?;

if let Some(prev_val) = prev_value {
if current_value <= prev_val {
return Err(ArrowError::InvalidArgumentError(
"dictionary values are not unique and ordered".to_string(),
));
}
}

prev_value = Some(current_value);
current_offset = next_offset;
}
} else {
// Validate offsets are in-bounds and monotonically increasing
//
// Since shallow validation ensures the first and last offsets are in bounds,
// we can also verify all offsets are in-bounds by checking if
// offsets are monotonically increasing
let are_offsets_monotonic = offsets.is_sorted_by(|a, b| a < b);
if !are_offsets_monotonic {
return Err(ArrowError::InvalidArgumentError(
"offsets not monotonically increasing".to_string(),
));
for next_offset in offsets_iter {
if next_offset <= current_offset {
return Err(ArrowError::InvalidArgumentError(
"offsets not monotonically increasing".to_string(),
));
}
current_offset = next_offset;
Comment on lines -270 to +288
Copy link
Contributor

@scovich scovich Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Belated comment but -- AFAIK this specific snippet of original code was perfectly fine? This PR would have just changed it from https://doc.rust-lang.org/std/primitive.slice.html#method.is_sorted_by to Iterator::is_sorted_by?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, but this PR unconditionally consumes the first offset outside the if/else; that part would need to move inside the if so this else can keep doing what it always did.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe worth a follow on PR

Copy link
Contributor

@scovich scovich Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! There's also a bug: Neither variant nor JSON forbids the empty string as a field name. So we have to allow empty ranges. In case the dictionary is sorted, "" compares less-than every other string, and would be the first string; the sortedness check would naturally catch any later empty strings as breaking the sort order.

I'll throw up a quick PR for this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

}
}

Expand Down
58 changes: 36 additions & 22 deletions parquet-variant/src/variant/object.rs
Original file line number Diff line number Diff line change
Expand Up @@ -217,23 +217,31 @@ impl<'m, 'v> VariantObject<'m, 'v> {
self.header.field_ids_start_byte() as _..self.first_field_offset_byte as _,
)?;

let field_ids = map_bytes_to_offsets(field_id_buffer, self.header.field_id_size)
.collect::<Vec<_>>();

let mut field_ids_iter =
map_bytes_to_offsets(field_id_buffer, self.header.field_id_size);
// Validate all field ids exist in the metadata dictionary and the corresponding field names are lexicographically sorted
if self.metadata.is_sorted() {
// Since the metadata dictionary has unique and sorted field names, we can also guarantee this object's field names
// are lexicographically sorted by their field id ordering
if !field_ids.is_sorted() {
return Err(ArrowError::InvalidArgumentError(
"field names not sorted".to_string(),
));
}
let dictionary_size = self.metadata.dictionary_size();

if let Some(mut current_id) = field_ids_iter.next() {
for next_id in field_ids_iter {
if current_id >= dictionary_size {
return Err(ArrowError::InvalidArgumentError(
"field id is not valid".to_string(),
));
}

if next_id <= current_id {
return Err(ArrowError::InvalidArgumentError(
"field names not sorted".to_string(),
));
}
current_id = next_id;
}

// Since field ids are sorted, if the last field is smaller than the dictionary size,
// we also know all field ids are smaller than the dictionary size and in-bounds.
if let Some(&last_field_id) = field_ids.last() {
if last_field_id >= self.metadata.dictionary_size() {
if current_id >= dictionary_size {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It took me a few times to figure out why this (redundant) check was still needed

I am not sure if there is some way to refactor the loop to avoid this (perhaps by keeping previous_id: Option<u32> as you did in the loop above 🤔

No changes needed, I just figured I would point it out

Copy link
Contributor Author

@codephage2020 codephage2020 Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I didn't expect this to be confusing. This loop has considered many ways of writing, and in the end, I referred to https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/variant/list.rs#L225-L232. I think this is an elegant implementation. Perhaps I need to add comments or use the following writing style to make the code more readable.

let mut previous_id: Option<u32> = None;

for field_id in field_ids_iter {
    if field_id >= dictionary_size {
        return Err(ArrowError::InvalidArgumentError(
            "field id is not valid".to_string(),
        ));
    }
    if let Some(prev_id) = previous_id {
        if field_id <= prev_id {
            return Err(ArrowError::InvalidArgumentError(
                "field names not sorted".to_string(),
            ));
        }
    }
    
    previous_id = Some(field_id);
}

return Err(ArrowError::InvalidArgumentError(
"field id is not valid".to_string(),
));
Expand All @@ -244,16 +252,22 @@ impl<'m, 'v> VariantObject<'m, 'v> {
// to check lexicographical order
//
// Since we are probing the metadata dictionary by field id, this also verifies field ids are in-bounds
let are_field_names_sorted = field_ids
.iter()
.map(|&i| self.metadata.get(i))
.collect::<Result<Vec<_>, _>>()?
.is_sorted();

if !are_field_names_sorted {
return Err(ArrowError::InvalidArgumentError(
"field names not sorted".to_string(),
));
let mut current_field_name = match field_ids_iter.next() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wasn't added in this PR, but this check for the field names being sorted doesn't seem right to me -- I thought the only requirement on an object's fields were that the field_ids were sorted (so lookup by field_id can be fast) but the corresponding names of the fields don't have to be sorted

Maybe @friendlymatthew can help

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right. VariantEncoding.md is described like this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, this is how I understood the spec

The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary

If we have an object with a sorted dictionary, the field ids are already ordered by the lexicographical order of field names.

If we don't have a sorted dictionary, we iterate through the field ids and probe the object for the corresponding field name. If the field names are lexicographically ordered, we can also verify that the field ids are in lexicographical order.

Copy link
Contributor

@alamb alamb Jul 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it -- I re-read the spec and I was confused and agree this check is doign the right thing

The field ids and field offsets must be in lexicographical order of the corresponding field names in the metadata dictionary. However, the actual value entries do not need to be in any particular order. This implies that the field_offset values may not be monotonically increasing. For example, for the following object:

Some(field_id) => Some(self.metadata.get(field_id)?),
None => None,
};

for field_id in field_ids_iter {
let next_field_name = self.metadata.get(field_id)?;

if let Some(current_name) = current_field_name {
if next_field_name <= current_name {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, I caught this post-merge but in this branch we can't assume the field names in the metadata dictionary are unique. So when we perform the probing mentioned in https://github.com/apache/arrow-rs/pull/7934/files#r2213356637, we should only check if next_field_name < current_name.

fixed in #7961

return Err(ArrowError::InvalidArgumentError(
"field names not sorted".to_string(),
));
}
}
current_field_name = Some(next_field_name);
}
}

Expand Down
Loading