perf[arrow-select]: add specialized REE interleave #9856

Merged
alamb merged 1 commit into apache:main from polarsignals:asubiotto/specializedreeinterleave on May 7, 2026

@asubiotto
Contributor

Benchmarks for this PR are in #9849. They have been separated out so we can compare this PR to main once the benchmarks have merged.

The specialized interleave preserves run ends as much as possible: it coalesces groups of adjacent logical indices that point to the same source and then calls interleave on the run end values.

Future work could additionally coalesce values across sources, but this requires a value equality check.
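The coalescing described above can be illustrated with a minimal sketch. It uses a toy run-end-encoded type in plain Rust rather than the arrow-rs `RunArray`, so `ToyRee`, `physical_index`, and `interleave_toy` are hypothetical names and this is not the code from the PR. It only shows the idea: adjacent output rows that resolve to the same `(source, physical run)` pair extend one output run, and equal values from different sources are deliberately not merged.

```rust
/// Toy run-end-encoded array (hypothetical, not the arrow-rs type):
/// `values[i]` repeats for logical rows `run_ends[i - 1]..run_ends[i]`,
/// with an implicit 0 before the first run.
struct ToyRee {
    run_ends: Vec<usize>,
    values: Vec<i32>,
}

impl ToyRee {
    /// Logical length is the last run end (0 when empty).
    fn len(&self) -> usize {
        self.run_ends.last().copied().unwrap_or(0)
    }

    /// Physical index of the run containing `logical`, or None if out of bounds.
    fn physical_index(&self, logical: usize) -> Option<usize> {
        if logical >= self.len() {
            return None;
        }
        // First run whose end is strictly greater than the logical row.
        Some(self.run_ends.partition_point(|&end| end <= logical))
    }
}

/// Interleave rows given as `(source, logical_row)`; returns None on a bad index.
fn interleave_toy(sources: &[ToyRee], indices: &[(usize, usize)]) -> Option<ToyRee> {
    let mut run_ends: Vec<usize> = Vec::new();
    // One `(source, physical)` pick per output run.
    let mut picks: Vec<(usize, usize)> = Vec::new();
    for (out_pos, &(src, row)) in indices.iter().enumerate() {
        let phys = sources.get(src)?.physical_index(row)?;
        if picks.last() == Some(&(src, phys)) {
            // Same source run as the previous output row: extend the run.
            *run_ends.last_mut()? = out_pos + 1;
        } else {
            picks.push((src, phys));
            run_ends.push(out_pos + 1);
        }
    }
    // Stand-in for interleaving the sources' values using the picks.
    let values = picks.iter().map(|&(s, p)| sources[s].values[p]).collect();
    Some(ToyRee { run_ends, values })
}
```

With sources `a = [0, 0, 0, 1, 1, 0, 0, 1, 1, 1]` and `b = [2, 2, 1, 1, 1, 0, 1, 0, 0, 0]`, selecting rows `(0, 3), (0, 4), (1, 2), (1, 3), (1, 4)` yields two runs in this sketch (run ends `[2, 5]`, values `[1, 1]`), because the value `1` crosses a source boundary and no value equality check is performed.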

Which issue does this PR close?

  • None

Rationale for this change

interleave_fallback on REE arrays is slow.

What changes are included in this PR?

A specialized REE interleave implementation

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

github-actions bot added the arrow label (Changes to the arrow crate) on Apr 30, 2026
@asubiotto
Contributor Author

cc @alamb or any other maintainer that owns this code

Comment thread arrow-select/src/interleave.rs
Comment thread arrow-select/src/interleave.rs Outdated
for (out_pos, &(arr, row)) in indices.iter().enumerate() {
let row = R::Native::from_usize(row).ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"interleave_run_end: row index {row} out of range"
Contributor

Similarly here, I don't think this check looks correct

Contributor Author

This is the same as above: it checks that the usize from interleave is in fact a valid index into the input arrays. Since the input arrays are REE<R>, this check must pass; otherwise the interleave indices have been incorrectly formed.

Contributor

Rich-T-kid commented May 5, 2026

I'm also a bit confused about this check. You're checking whether the row is out of bounds, but couldn't you do that by checking the length of the array, like this?

    let current_array = &values_arrays[arr];
    if row >= current_array.len() {
        return Err(ArrowError::InvalidArgumentError(format!("row index {row} out of range")));
    }

For example:

    let mut builder = PrimitiveRunBuilder::<Int16Type, Int16Type>::new();
    builder.extend([0, 0, 0, 1, 1, 0, 0, 1, 1, 1].into_iter().map(Some));
    let a = builder.finish();

    let mut builder = PrimitiveRunBuilder::<Int16Type, Int16Type>::new();
    builder.extend([2, 2, 1, 1, 1, 0, 1, 0, 0, 0].into_iter().map(Some));
    let b = builder.finish();

    // logical: [1, 1, 1, 1, 1] across an a→b boundary; should compact to one run.
    // Row 32766 is far past the end of `a` (10 rows) but still fits in i16 (max 32767).
    let result = interleave(&[&a, &b], &[(0, 32766), (0, 4), (1, 2), (1, 3), (1, 4)]).unwrap();
    let result = result.as_run::<Int16Type>();

This code returns an error, but the error comes from the call to get_physical_indices():

    let phys = runs[arr_idx].get_physical_indices(&logical_rows)?;

not from the validation step that you're doing within the loop.

Contributor Author

Yeah, I think the confusion comes from the error message; I will change it. What I'm really doing here is a usize->R conversion, which I need so that I can pass the row to get_physical_indices below, and erroring if the conversion fails. I'm checking whether the index is even representable in the array's type, not whether the index is out of bounds on the input.
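The distinction between "not representable" and "out of bounds" can be sketched in plain Rust. This is an illustration only: it uses `i16::try_from` from the standard library as a stand-in for the `R::Native::from_usize` call in the review snippet, and `RowError` and `checked_row` are hypothetical names. In the PR the bounds check is left to get_physical_indices; the sketch folds it into one function just to show the two failure modes side by side.

```rust
/// The two distinct ways a logical row can be rejected (hypothetical type).
#[derive(Debug, PartialEq)]
enum RowError {
    /// The usize does not fit in the run-end native type at all.
    NotRepresentable(usize),
    /// The value fits in the type but lies past the array's logical length.
    OutOfBounds { row: usize, len: usize },
}

/// Convert `row` to an i16 run-end value, then check it against `logical_len`.
fn checked_row(row: usize, logical_len: usize) -> Result<i16, RowError> {
    // Step 1: representability in the run-end type (i16 in this sketch).
    let native = i16::try_from(row).map_err(|_| RowError::NotRepresentable(row))?;
    // Step 2: bounds check against the logical length of the array.
    if row >= logical_len {
        return Err(RowError::OutOfBounds { row, len: logical_len });
    }
    Ok(native)
}
```

For a 10-row array, row 32766 passes the conversion and fails only the bounds check, while row 40000 already fails the conversion, which is why an error message saying "out of range" for the first step is misleading.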


// Coalesce by physical-pair equality only: emit a new run when the
// (array_idx, physical_idx) pair changes between adjacent output rows.
// TODO: We could perform an equality check across sources to extend the
Contributor

I suppose this is what #9865 (and its issue #7710) are meant to address?

Contributor Author

Yes, exactly. That PR would make sense in this block so we don't compact in the interleave fallback. This also means that the equality cost is only paid when interleave pairs select from different input run arrays (the assumption is that input run arrays are well formed). I'm concerned about the per-row slicing cost, though. I think ideally you would have a cache of comparators, but I believe that would require some crate readjusting.
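A rough sketch of paying the equality cost only at source boundaries, in plain Rust with `i32` values so that `==` stands in for whatever comparator the real implementation would use. The function name `merge_across_sources` and its inputs are hypothetical: `run_ends` and `picks` are assumed to be the output runs and their `(source, physical index)` pairs after same-source coalescing, and `sources` holds each source's values. Runs that come from the same source are never compared here, on the assumption stated above that the inputs are well formed.

```rust
/// Merge adjacent output runs with equal values, comparing values only when
/// the two runs come from different sources (hypothetical helper).
fn merge_across_sources(
    run_ends: &[usize],
    picks: &[(usize, usize)],
    sources: &[Vec<i32>],
) -> (Vec<usize>, Vec<i32>) {
    let mut out_ends: Vec<usize> = Vec::new();
    let mut out_values: Vec<i32> = Vec::new();
    let mut prev_src: Option<usize> = None;
    for (&end, &(src, phys)) in run_ends.iter().zip(picks) {
        let value = sources[src][phys];
        // Only a change of source triggers the value comparison.
        let crosses = matches!(prev_src, Some(p) if p != src);
        if crosses && out_values.last() == Some(&value) {
            // Equal value across a source boundary: extend the previous run.
            // `crosses` implies at least one run was already emitted.
            *out_ends.last_mut().unwrap() = end;
        } else {
            out_ends.push(end);
            out_values.push(value);
        }
        prev_src = Some(src);
    }
    (out_ends, out_values)
}
```

Given runs ending at `[2, 5]` that pick value `1` from source 0 and value `1` from source 1, this collapses them into a single run ending at 5. A production version would need a type-erased comparison instead of `==`, which is where the comparator cache mentioned above would come in.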

@Rich-T-kid
Contributor

Hey 😃, I'm working on #9865, which aims to resolve #7710. I added a test case from my branch that currently fails on this branch. I'm going to pull your changes down and push up a revised branch.

@Rich-T-kid Rich-T-kid mentioned this pull request May 5, 2026
@Rich-T-kid
Contributor

@asubiotto I made a PR could you check it out? #9919

@asubiotto
Contributor Author

> @asubiotto I made a PR could you check it out? #9919

Hi, thanks for pushing that up. While I think we eventually want to do this, I would prefer an incremental approach, since even the first step (deduping logical runs within the same source) is already much better than what we have today. Although we should eventually dedup based on values, I'm not too keen on the slice cost per value, and I think we can probably work out a much more performant approach by building and reusing a comparator to reduce the dynamic dispatch overhead. This is why I think we should decouple the two changes: 1) merge the specialized REE interleave and 2) optimize the interleave by deduplicating values across sources.

@asubiotto asubiotto force-pushed the asubiotto/specializedreeinterleave branch from f103e47 to 6fd0803 Compare May 6, 2026 10:29
@asubiotto
Contributor Author

I'm also seeing a need for better value equality checks for Dict<Struct> interning to merge dictionaries (a similar use case to what you're thinking of for REE). I think we can kill two birds with one stone.

@Rich-T-kid
Contributor

@asubiotto Yeah, that makes sense to me. I'll close the PR.

@Jefffrey
Contributor

Jefffrey commented May 7, 2026

run benchmark interleave_kernels

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4393751531-2043-flqs4 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing asubiotto/specializedreeinterleave (6fd0803) to b114241 (merge-base) diff
BENCH_NAME=interleave_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench interleave_kernels
BENCH_FILTER=
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                                                                        asubiotto_specializedreeinterleave     main
-----                                                                                        ----------------------------------     ----
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]                                   1.00    641.3±3.84ns        ? ?/sec    1.00    642.7±6.80ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.01   1855.0±7.86ns        ? ?/sec    1.00   1843.8±7.86ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00   1816.0±6.54ns        ? ?/sec    1.01   1830.8±9.70ns        ? ?/sec
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]                                   1.01   1032.7±4.09ns        ? ?/sec    1.00   1018.2±7.44ns        ? ?/sec
interleave dict_distinct 100                                                                 1.00      2.1±0.01µs        ? ?/sec    1.02      2.1±0.01µs        ? ?/sec
interleave dict_distinct 1024                                                                1.00      2.1±0.01µs        ? ?/sec    1.03      2.1±0.01µs        ? ?/sec
interleave dict_distinct 2048                                                                1.00      2.1±0.01µs        ? ?/sec    1.02      2.1±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]                            1.00   1537.4±4.90ns        ? ?/sec    1.01   1556.1±5.52ns        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                  1.01      3.0±0.01µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]                           1.01      2.7±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]                            1.00   1940.5±5.59ns        ? ?/sec    1.00   1947.9±8.46ns        ? ?/sec
interleave i32(0.0) 100 [0..100, 100..230, 450..1000]                                        1.02    217.0±1.78ns        ? ?/sec    1.00    211.9±1.79ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00    950.5±2.41ns        ? ?/sec    1.00    951.0±2.23ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]                                       1.06   1007.7±2.59ns        ? ?/sec    1.00    950.3±3.50ns        ? ?/sec
interleave i32(0.0) 400 [0..100, 100..230, 450..1000]                                        1.03    532.1±3.19ns        ? ?/sec    1.00    516.2±2.36ns        ? ?/sec
interleave i32(0.5) 100 [0..100, 100..230, 450..1000]                                        1.00    443.0±3.91ns        ? ?/sec    1.00    442.8±3.80ns        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00      2.9±0.02µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]                                       1.01      3.0±0.03µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 400 [0..100, 100..230, 450..1000]                                        1.00  1274.1±11.68ns        ? ?/sec    1.04   1330.0±5.42ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                           1.00   1625.3±2.79ns        ? ?/sec    1.02   1663.4±2.79ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     14.9±0.02µs        ? ?/sec    1.00     14.9±0.05µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                          1.00     14.8±0.05µs        ? ?/sec    1.01     15.0±0.02µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                           1.00      6.1±0.02µs        ? ?/sec    1.00      6.1±0.01µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                           1.00      3.9±0.03µs        ? ?/sec    1.00      3.9±0.02µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     32.0±0.10µs        ? ?/sec    1.01     32.2±0.14µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                          1.00     32.5±0.12µs        ? ?/sec    1.00     32.5±0.20µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                           1.00     12.9±0.04µs        ? ?/sec    1.01     13.0±0.06µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                      1.00      3.6±0.01µs        ? ?/sec    1.00      3.6±0.01µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.01     17.6±0.04µs        ? ?/sec    1.00     17.4±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                     1.01     17.2±0.04µs        ? ?/sec    1.00     17.0±0.04µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                      1.02      8.3±0.02µs        ? ?/sec    1.00      8.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                      1.03      6.4±0.01µs        ? ?/sec    1.00      6.2±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.00     34.9±0.05µs        ? ?/sec    1.00     34.7±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                     1.01     35.1±0.06µs        ? ?/sec    1.00     34.9±0.04µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                      1.02     15.7±0.02µs        ? ?/sec    1.00     15.4±0.03µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 100 [0..100, 100..230, 450..1000]              1.00      4.0±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.01     22.3±0.05µs        ? ?/sec    1.00     22.1±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000]             1.01     22.6±0.06µs        ? ?/sec    1.00     22.4±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 400 [0..100, 100..230, 450..1000]              1.02     10.4±0.01µs        ? ?/sec    1.00     10.2±0.02µs        ? ?/sec
interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]                                    1.00    607.8±1.35ns        ? ?/sec    1.00    607.6±1.33ns        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.02µs        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                   1.00      4.6±0.01µs        ? ?/sec    1.00      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]                                    1.00   1887.5±4.71ns        ? ?/sec    1.02   1924.8±7.08ns        ? ?/sec
interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]                                    1.00    747.5±1.84ns        ? ?/sec    1.00    746.0±3.23ns        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.01      6.0±0.02µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]                                   1.00      5.9±0.02µs        ? ?/sec    1.00      5.9±0.02µs        ? ?/sec
interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]                                    1.01      2.5±0.01µs        ? ?/sec    1.00      2.5±0.01µs        ? ?/sec
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]                                   1.02    622.0±1.83ns        ? ?/sec    1.00    607.7±1.46ns        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1218.2±1.49ns        ? ?/sec    1.02   1237.8±1.90ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 100 [0..100, 100..230, 450..1000]                       1.00    635.7±7.28ns        ? ?/sec    1.02    646.1±7.41ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]             1.00      2.2±0.01µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000]                      1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 400 [0..100, 100..230, 450..1000]                       1.00   1141.1±7.61ns        ? ?/sec    1.01   1147.7±7.02ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 100 [0..100, 100..230, 450..1000]                   1.05   1077.9±5.69ns        ? ?/sec    1.00   1028.3±4.77ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]         1.00      5.9±0.01µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                  1.01      5.8±0.02µs        ? ?/sec    1.00      5.8±0.02µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 400 [0..100, 100..230, 450..1000]                   1.02      2.7±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 100 [0..100, 100..230, 450..1000]              1.03   1443.6±5.01ns        ? ?/sec    1.00   1403.2±3.32ns        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.03      9.9±0.02µs        ? ?/sec    1.00      9.6±0.02µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000]             1.01      9.6±0.02µs        ? ?/sec    1.00      9.5±0.03µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 400 [0..100, 100..230, 450..1000]              1.01      4.1±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec

Resource Usage

Metric          base (merge-base)     branch
------          -----------------     ------
Wall time       610.1s                605.1s
Peak memory     3.0 GiB               3.0 GiB
Avg memory      3.0 GiB               3.0 GiB
CPU user        604.3s                601.5s
CPU sys         0.7s                  0.1s
Peak spill      0 B                   0 B

@Jefffrey
Contributor

Jefffrey commented May 7, 2026

@asubiotto could you merge up from main so we can compare the benchmark?

The specialized interleave works by preserving run ends as much as possible by
coalescing groups of adjacent logical indices pointing to the same source and
calling interleave on the run end values.

Future work could additionally coalesce values across sources, but this
requires a value equality check.

Signed-off-by: Alfonso Subiotto Marques <[email protected]>
@asubiotto asubiotto force-pushed the asubiotto/specializedreeinterleave branch from 6fd0803 to b8165b1 Compare May 7, 2026 08:34
@asubiotto
Contributor Author

Oops, done.

@Jefffrey
Contributor

Jefffrey commented May 7, 2026

run benchmark interleave_kernels

@adriangbot

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4395554475-2046-jsqxr 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

Comparing asubiotto/specializedreeinterleave (b8165b1) to 97ff198 (merge-base) diff
BENCH_NAME=interleave_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench interleave_kernels
BENCH_FILTER=
Results will be posted here when complete


@adriangbot

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                                                                                        asubiotto_specializedreeinterleave     main
-----                                                                                        ----------------------------------     ----
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]                                   1.00    634.1±4.86ns        ? ?/sec    1.00    634.4±3.18ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.00   1831.6±6.61ns        ? ?/sec    1.01   1849.4±7.83ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00   1802.8±9.62ns        ? ?/sec    1.00   1802.9±9.23ns        ? ?/sec
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1011.9±4.47ns        ? ?/sec    1.01   1017.9±4.34ns        ? ?/sec
interleave dict_distinct 100                                                                 1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave dict_distinct 1024                                                                1.00      2.1±0.01µs        ? ?/sec    1.01      2.1±0.01µs        ? ?/sec
interleave dict_distinct 2048                                                                1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]                            1.01   1523.6±5.47ns        ? ?/sec    1.00   1515.0±6.63ns        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                  1.00      3.0±0.01µs        ? ?/sec    1.01      3.0±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]                           1.00      2.7±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]                            1.01   1935.3±6.22ns        ? ?/sec    1.00   1922.3±4.08ns        ? ?/sec
interleave i32(0.0) 100 [0..100, 100..230, 450..1000]                                        1.00    212.4±2.54ns        ? ?/sec    1.01    213.7±2.70ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00    962.7±3.97ns        ? ?/sec    1.01    968.4±3.74ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]                                       1.00    939.2±2.28ns        ? ?/sec    1.07   1001.3±4.04ns        ? ?/sec
interleave i32(0.0) 400 [0..100, 100..230, 450..1000]                                        1.00    452.9±2.25ns        ? ?/sec    1.16    524.4±4.19ns        ? ?/sec
interleave i32(0.5) 100 [0..100, 100..230, 450..1000]                                        1.00    443.4±3.75ns        ? ?/sec    1.01    449.8±5.65ns        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00      2.9±0.02µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]                                       1.00      3.0±0.02µs        ? ?/sec    1.02      3.0±0.02µs        ? ?/sec
interleave i32(0.5) 400 [0..100, 100..230, 450..1000]                                        1.00   1256.6±9.25ns        ? ?/sec    1.06   1329.4±7.54ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                           1.00   1627.3±2.87ns        ? ?/sec    1.02   1656.7±5.27ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     14.9±0.11µs        ? ?/sec    1.00     14.9±0.05µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                          1.00     14.8±0.04µs        ? ?/sec    1.01     14.9±0.03µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                           1.00      6.1±0.01µs        ? ?/sec    1.00      6.1±0.02µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                           1.00      3.9±0.01µs        ? ?/sec    1.01      3.9±0.01µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     32.0±0.14µs        ? ?/sec    1.00     32.1±0.16µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                          1.00     32.4±0.14µs        ? ?/sec    1.00     32.3±0.13µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                           1.01     13.1±0.06µs        ? ?/sec    1.00     13.0±0.04µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                      1.00      3.5±0.01µs        ? ?/sec    1.01      3.6±0.01µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.00     17.4±0.04µs        ? ?/sec    1.00     17.4±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                     1.01     17.3±0.04µs        ? ?/sec    1.00     17.1±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                      1.01      8.3±0.02µs        ? ?/sec    1.00      8.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                      1.00      6.0±0.01µs        ? ?/sec    1.02      6.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.01     34.8±0.04µs        ? ?/sec    1.00     34.6±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                     1.00     34.6±0.06µs        ? ?/sec    1.00     34.6±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                      1.01     15.5±0.05µs        ? ?/sec    1.00     15.4±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 100 [0..100, 100..230, 450..1000]              1.01      4.1±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.00     22.2±0.05µs        ? ?/sec    1.01     22.3±0.08µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000]             1.00     22.4±0.06µs        ? ?/sec    1.00     22.5±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 400 [0..100, 100..230, 450..1000]              1.00     10.3±0.01µs        ? ?/sec    1.00     10.3±0.02µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 100 [0..100, 100..230, 450..1000]                1.00      4.3±0.01µs        ? ?/sec    2.39     10.3±0.04µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 1024 [0..100, 100..230, 450..1000, 0..1000]      1.00     20.6±0.07µs        ? ?/sec    4.06     83.5±0.11µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 1024 [0..100, 100..230, 450..1000]               1.00     20.1±0.07µs        ? ?/sec    3.87     77.8±0.10µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 400 [0..100, 100..230, 450..1000]                1.00      9.5±0.02µs        ? ?/sec    3.37     32.1±0.04µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 100 [0..100, 100..230, 450..1000]                           1.00      3.3±0.01µs        ? ?/sec    2.80      9.1±0.04µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     19.1±0.07µs        ? ?/sec    4.21     80.4±0.09µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 1024 [0..100, 100..230, 450..1000]                          1.00     18.6±0.07µs        ? ?/sec    4.00     74.4±0.10µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 400 [0..100, 100..230, 450..1000]                           1.00      8.3±0.03µs        ? ?/sec    3.62     30.1±0.05µs        ? ?/sec
interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]                                    1.00    601.3±1.44ns        ? ?/sec    1.00    600.2±1.40ns        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                   1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]                                    1.00   1888.9±4.02ns        ? ?/sec    1.00   1895.1±7.63ns        ? ?/sec
interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]                                    1.00    747.1±0.78ns        ? ?/sec    1.00    749.2±1.17ns        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      6.0±0.02µs        ? ?/sec    1.00      5.9±0.02µs        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]                                   1.00      5.9±0.03µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]                                    1.01      2.5±0.01µs        ? ?/sec    1.00      2.5±0.01µs        ? ?/sec
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]                                   1.03    575.9±9.14ns        ? ?/sec    1.00    559.3±0.87ns        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.01      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]                                  1.01      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1226.0±8.73ns        ? ?/sec    1.01   1241.2±1.58ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 100 [0..100, 100..230, 450..1000]                       1.00    647.8±7.91ns        ? ?/sec    1.01    655.1±7.88ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]             1.00      2.1±0.01µs        ? ?/sec    1.02      2.2±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000]                      1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 400 [0..100, 100..230, 450..1000]                       1.00   1152.4±7.32ns        ? ?/sec    1.00   1151.4±7.07ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 100 [0..100, 100..230, 450..1000]                   1.00   1032.6±5.63ns        ? ?/sec    1.01   1042.3±5.78ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]         1.00      5.8±0.01µs        ? ?/sec    1.00      5.8±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                  1.00      5.8±0.01µs        ? ?/sec    1.00      5.8±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 400 [0..100, 100..230, 450..1000]                   1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 100 [0..100, 100..230, 450..1000]              1.00   1399.8±4.73ns        ? ?/sec    1.01   1414.6±5.06ns        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.00      9.6±0.02µs        ? ?/sec    1.00      9.6±0.02µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000]             1.00      9.5±0.02µs        ? ?/sec    1.00      9.5±0.03µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 400 [0..100, 100..230, 450..1000]              1.00      4.0±0.01µs        ? ?/sec    1.00      4.1±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

| Metric | Value |
| --- | --- |
| Wall time | 690.1s |
| Peak memory | 3.0 GiB |
| Avg memory | 3.0 GiB |
| CPU user | 685.6s |
| CPU sys | 0.7s |
| Peak spill | 0 B |

branch

| Metric | Value |
| --- | --- |
| Wall time | 685.2s |
| Peak memory | 3.0 GiB |
| Avg memory | 3.0 GiB |
| CPU user | 681.9s |
| CPU sys | 0.2s |
| Peak spill | 0 B |

@Jefffrey Jefffrey left a comment

CI failure unrelated:

@alamb alamb merged commit 3c71d92 into apache:main May 7, 2026
25 of 26 checks passed

alamb commented May 7, 2026

Thanks @asubiotto and @Jefffrey

@asubiotto asubiotto deleted the asubiotto/specializedreeinterleave branch May 8, 2026 10:30