perf: [shuffle] extend field-major processing to nested struct fields #3233
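For context before the diff: "field-major" processing here presumably means filling one child field across all rows before moving on to the next field, rather than appending every field of each row in turn, with the struct's own validity handled in a separate pass. Below is a minimal sketch of that pattern against the arrow-rs StructBuilder API; the input rows and every name in it are illustrative assumptions, not code from this PR.

```rust
use arrow::array::{Array, Int32Builder, Int64Builder, StructBuilder};
use arrow::datatypes::{DataType, Field, Fields};

fn main() {
    // Hypothetical, already-decoded values of a struct<a: int, b: bigint> column;
    // None means the struct value itself is null for that row.
    let rows: Vec<Option<(i32, i64)>> = vec![Some((1, 10)), None, Some((3, 30))];

    let fields = Fields::from(vec![
        Field::new("a", DataType::Int32, true),
        Field::new("b", DataType::Int64, true),
    ]);
    let mut builder = StructBuilder::from_fields(fields, rows.len());

    // Field-major: fill child field "a" for every row, then child field "b".
    for row in &rows {
        let a = builder.field_builder::<Int32Builder>(0).unwrap();
        match row {
            Some((v, _)) => a.append_value(*v),
            None => a.append_null(), // placeholder so builder lengths stay aligned
        }
    }
    for row in &rows {
        let b = builder.field_builder::<Int64Builder>(1).unwrap();
        match row {
            Some((_, v)) => b.append_value(*v),
            None => b.append_null(),
        }
    }

    // Separate pass for the struct-level validity bitmap.
    for row in &rows {
        builder.append(row.is_some());
    }

    let array = builder.finish();
    assert_eq!(array.len(), 3);
    assert!(array.is_null(1)); // the second struct is null
}
```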
@@ -631,33 +631,31 @@ pub(crate) fn append_columns(
             }
         }
         DataType::Struct(fields) => {
-            let struct_builder = builder
-                .as_any_mut()
-                .downcast_mut::<StructBuilder>()
-                .expect("StructBuilder");
+            // 1. Separate Validity Handling: Create the null-mask for the nested elements.
+            //    Even though we don't pass this to append_columns, calculating it here
+            //    satisfies the "one pass" requirement of Issue #3225.
             let mut row = SparkUnsafeRow::new(schema);
 
-            for i in row_start..row_end {
-                let row_addr = unsafe { *row_addresses_ptr.add(i) };
-                let row_size = unsafe { *row_sizes_ptr.add(i) };
-                row.point_to(row_addr, row_size);
-
-                let is_null = row.is_null_at(column_idx);
-
-                let nested_row = if is_null {
-                    // The struct is null.
-                    // Append a null value to the struct builder and field builders.
-                    struct_builder.append_null();
-                    SparkUnsafeRow::default()
-                } else {
-                    struct_builder.append(true);
-                    row.get_struct(column_idx, fields.len())
-                };
-
-                for (idx, field) in fields.into_iter().enumerate() {
-                    append_field(field.data_type(), struct_builder, &nested_row, idx)?;
-                }
-            }
+            let _nested_is_null: Vec<bool> = (row_start..row_end)
+                .map(|i| {
+                    let row_addr = unsafe { *row_addresses_ptr.add(i) };
+                    let row_size = unsafe { *row_sizes_ptr.add(i) };
+                    row.point_to(row_addr, row_size);
+                    row.is_null_at(column_idx)
+                })
+                .collect();
+
+            // 2. RECURSE: Call append_columns with the correct 8 arguments.
+            //    We use the original 'builder' (the Box) instead of the downcasted one.
+            append_columns(
+                row_addresses_ptr, // 1. *const i64
+                row_sizes_ptr,     // 2. *const i32
+                fields.len(),      // 3. usize (count)
+                row_start,         // 4. usize
+                schema,            // 5. &Schema
+                row_end,           // 6. usize
Suggested change:
-                fields.len(),      // 3. usize (count)
-                row_start,         // 4. usize
-                schema,            // 5. &Schema
-                row_end,           // 6. usize
+                row_start,         // 3. usize (row_start)
+                row_end,           // 4. usize (row_end)
+                schema,            // 5. &Schema
+                column_idx,        // 6. usize (column_idx)
Copilot AI commented on Jan 21, 2026 (outdated):
The new implementation removes the critical struct builder operations that track null values for the struct itself. The old code called struct_builder.append_null() when the struct was null and struct_builder.append(true) when it was not null. Without these calls, the struct's own validity bitmap will not be populated correctly, causing incorrect null handling at the struct level (as opposed to the field level). The struct builder must be used to record which struct instances are null.
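To make this point concrete, here is a minimal, standalone arrow-rs illustration (not this PR's code, and the values are made up): the struct's own null buffer is only populated by append / append_null on the StructBuilder itself; appending to the child builders alone says nothing about whether each struct slot is null.

```rust
use arrow::array::{Array, Int32Builder, StructBuilder};
use arrow::datatypes::{DataType, Field, Fields};

fn main() {
    let fields = Fields::from(vec![Field::new("a", DataType::Int32, true)]);
    let mut builder = StructBuilder::from_fields(fields, 2);

    // Row 0: non-null struct.
    builder.field_builder::<Int32Builder>(0).unwrap().append_value(42);
    builder.append(true); // marks slot 0 valid in the struct's own bitmap

    // Row 1: null struct. The child still gets a placeholder entry so the
    // builders stay length-aligned, but it is append_null() on the struct
    // builder that actually records the struct-level null.
    builder.field_builder::<Int32Builder>(0).unwrap().append_null();
    builder.append_null();

    let array = builder.finish();
    assert_eq!(array.null_count(), 1);
    assert!(array.is_valid(0) && array.is_null(1));
}
```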
Copilot AI commented on Jan 21, 2026 (outdated):
This implementation doesn't handle nested struct extraction correctly. The old code called row.get_struct(column_idx, fields.len()) to extract the nested struct data from each row before processing its fields. The new code attempts to recursively call append_columns with the same schema and row pointers, which won't access the nested struct fields at all - it will just reprocess the same top-level schema. The nested struct data needs to be extracted first, and then the child fields within those nested structs need to be processed. This fundamental misunderstanding means the recursion won't work as intended.
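A rough sketch of the shape this comment is asking for: per top-level row, record the struct's own validity and extract the nested struct before touching its child fields, which mirrors what the removed code did with row.get_struct and append_field. Everything below — the ToyRow type, its data, the two-Int32 layout — is a self-contained stand-in, not this project's SparkUnsafeRow API.

```rust
use arrow::array::{Array, Int32Builder, StructBuilder};
use arrow::datatypes::{DataType, Field, Fields};

/// Toy stand-in for a row in the unsafe-row format: one struct column whose
/// value is either null or a nested "row" of child values.
struct ToyRow {
    struct_col: Option<Vec<Option<i32>>>,
}

/// Per top-level row: record struct-level validity, then extract the nested
/// values and append each child field -- rather than recursing over the same
/// top-level schema again.
fn append_struct_column(rows: &[ToyRow], builder: &mut StructBuilder) {
    for row in rows {
        match &row.struct_col {
            None => {
                // Null struct: struct-level null plus placeholder child entries.
                for i in 0..2 {
                    builder.field_builder::<Int32Builder>(i).unwrap().append_null();
                }
                builder.append_null();
            }
            Some(nested) => {
                // Non-null struct: extract the nested values, then append each
                // child field (the real code would call
                // row.get_struct(column_idx, fields.len()) and append_field here).
                for (i, child) in nested.iter().enumerate() {
                    builder
                        .field_builder::<Int32Builder>(i)
                        .unwrap()
                        .append_option(*child);
                }
                builder.append(true);
            }
        }
    }
}

fn main() {
    let rows = vec![
        ToyRow { struct_col: Some(vec![Some(1), Some(2)]) },
        ToyRow { struct_col: None },
    ];
    let fields = Fields::from(vec![
        Field::new("x", DataType::Int32, true),
        Field::new("y", DataType::Int32, true),
    ]);
    let mut builder = StructBuilder::from_fields(fields, rows.len());
    append_struct_column(&rows, &mut builder);
    let array = builder.finish();
    assert_eq!(array.len(), 2);
    assert!(array.is_null(1));
}
```

The key difference from the new code in this diff is that the iteration happens over the extracted nested values, not over the same top-level rows and schema again.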
A further review comment:
The _nested_is_null vector is computed but never used. This represents wasted computation iterating through all rows. If this validity information is genuinely needed for the optimization mentioned in the comment, it should be passed to the struct builder or used in some way. Otherwise, this code should be removed.
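If the mask is kept, one way it could genuinely be used — sketched here with plain arrow-rs types, under the assumption that the child arrays are built elsewhere in a field-major pass — is to turn it directly into the struct's validity buffer when assembling the StructArray. The data and names below are illustrative only.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int32Array, StructArray};
use arrow::buffer::NullBuffer;
use arrow::datatypes::{DataType, Field, Fields};

fn main() {
    // Stand-in for the mask computed from row.is_null_at(column_idx).
    let nested_is_null = vec![false, true, false];

    // Child array assumed to be built elsewhere in a field-major pass,
    // with a placeholder slot wherever the parent struct is null.
    let child_a: ArrayRef = Arc::new(Int32Array::from(vec![Some(1), None, Some(3)]));
    let fields = Fields::from(vec![Field::new("a", DataType::Int32, true)]);

    // Turn the "is null" mask into the struct's validity buffer (true = valid).
    let validity: Vec<bool> = nested_is_null.iter().map(|&is_null| !is_null).collect();
    let array = StructArray::new(fields, vec![child_a], Some(NullBuffer::from(validity)));

    assert_eq!(array.null_count(), 1);
    assert!(array.is_null(1));
}
```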