perf[arrow-select]: add specialized REE interleave by asubiotto · Pull Request #9856 · apache/arrow-rs

asubiotto · 2026-04-30T12:20:13Z

Benchmarks for this PR are in #9849. They have been separated out so we can compare this PR to main once the benchmarks have merged.

The specialized interleave preserves run ends as much as possible: it coalesces groups of adjacent logical indices that point to the same source and calls interleave on the run end values.

Future work could additionally coalesce values across sources, but this requires a value equality check.
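
As a rough illustration of the coalescing idea described above, here is a minimal std-only sketch. It is not the PR's implementation: `Runs`, `physical_index`, and `coalesce_runs` are hypothetical names, and a source is modelled only by its cumulative run ends. Adjacent output rows that resolve to the same (source, physical run) pair extend the current output run; any change in the pair starts a new one.

```rust
/// Simplified stand-in for a run-end encoded array (hypothetical type):
/// only the cumulative run ends are modelled, so run `i` covers the
/// logical rows `run_ends[i - 1]..run_ends[i]`.
struct Runs {
    run_ends: Vec<usize>,
}

impl Runs {
    /// Map a logical row to the physical run that contains it,
    /// or `None` if the row is past the last run end.
    fn physical_index(&self, logical: usize) -> Option<usize> {
        // Index of the first run whose end is strictly greater than the row.
        let idx = self.run_ends.partition_point(|&end| end <= logical);
        (idx < self.run_ends.len()).then_some(idx)
    }
}

/// Returns the output run ends plus the (source, physical) pairs that
/// would be handed to an interleave over the sources' values arrays.
fn coalesce_runs(
    sources: &[Runs],
    indices: &[(usize, usize)],
) -> Option<(Vec<usize>, Vec<(usize, usize)>)> {
    let mut out_run_ends: Vec<usize> = Vec::new();
    let mut value_picks: Vec<(usize, usize)> = Vec::new();
    for (out_pos, &(src, row)) in indices.iter().enumerate() {
        let phys = sources.get(src)?.physical_index(row)?;
        if value_picks.last() == Some(&(src, phys)) {
            // Same source run as the previous output row: extend that run.
            *out_run_ends.last_mut()? = out_pos + 1;
        } else {
            // The pair changed: start a new output run.
            value_picks.push((src, phys));
            out_run_ends.push(out_pos + 1);
        }
    }
    Some((out_run_ends, value_picks))
}
```

With two sources whose run ends are `[3, 5, 7, 10]` and `[2, 5, 6, 7, 10]`, the indices `[(0, 3), (0, 4), (1, 2), (1, 3), (1, 4)]` collapse to two output runs in this sketch. They are not merged into one even if both runs hold the same value, because that would need the value equality check mentioned as future work.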

Which issue does this PR close?

None

Rationale for this change

interleave_fallback on REE arrays is slow

What changes are included in this PR?

A specialized REE interleave implementation

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

asubiotto · 2026-04-30T12:58:04Z

cc @alamb or any other maintainer that owns this code

Jefffrey · 2026-05-05T05:11:10Z

+    for (out_pos, &(arr, row)) in indices.iter().enumerate() {
+        let row = R::Native::from_usize(row).ok_or_else(|| {
+            ArrowError::InvalidArgumentError(format!(
+                "interleave_run_end: row index {row} out of range"


Similarly here, I don't think this check looks correct

This is the same as above: it checks that the usize from interleave is in fact a valid index into the input arrays. Since the input arrays are REE<R>, this check must pass; otherwise the interleave indices have been incorrectly formed.

I'm also a bit confused about this check. You're checking whether the row is out of bounds, but couldn't you do that by comparing against the length of the array, like
let current_array = values_arrays[arr];
if row >= current_array.len() { return Err(ArrowError::InvalidArgumentError(format!("row index {row} out of range"))); }
For example:
```rust
let mut builder = PrimitiveRunBuilder::<Int16Type, Int16Type>::new();
builder.extend([0, 0, 0, 1, 1, 0, 0, 1, 1, 1].into_iter().map(Some));
let a = builder.finish();

let mut builder = PrimitiveRunBuilder::<Int16Type, Int16Type>::new();
builder.extend([2, 2, 1, 1, 1, 0, 1, 0, 0, 0].into_iter().map(Some));
let b = builder.finish();
// logical: [1, 1, 1, 1, 1] across an a→b boundary; should compact to one run.
// greater than int16::max
let result = interleave(&[&a, &b], &[(0, 32766), (0, 4), (1, 2), (1, 3), (1, 4)]).unwrap();
let result = result.as_run::<Int16Type>();
```

This code returns an error, but the error comes from the call to get_physical_indices(),
let phys = runs[arr_idx].get_physical_indices(&logical_rows)?;
not from the validation step that you're doing within the loop.

Yeah, I think the confusion is the error message. I will change that. What I'm really doing here is a usize->R conversion, as needed, so that I can use the result in get_physical_indices below, and erroring if the conversion fails. I'm checking whether the index is even representable in the array's type, not whether the index is out of bounds on the input.
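
To make the distinction in this thread concrete, here is a small std-only sketch. Both helpers are hypothetical and only illustrate the two kinds of failure. The PR converts with `R::Native::from_usize`; the sketch uses `i16::try_from` from the standard library as a stand-in for an `Int16Type` run-end type.

```rust
/// Hypothetical helper: is the logical row representable in the run-end
/// type at all? (Stand-in for the usize -> R conversion discussed above.)
fn to_run_end_i16(row: usize) -> Result<i16, String> {
    i16::try_from(row)
        .map_err(|_| format!("row index {row} is not representable as i16"))
}

/// Hypothetical helper: the separate question of whether the row falls
/// inside an array of the given logical length.
fn check_in_bounds(row: usize, logical_len: usize) -> Result<(), String> {
    if row < logical_len {
        Ok(())
    } else {
        Err(format!("row index {row} out of range for length {logical_len}"))
    }
}
```

Under these assumptions, row 32766 converts to `i16` without error yet still fails a bounds check against a 10-row array. That matches the observation above that the error in the example surfaces later, from the physical index lookup, rather than from the conversion.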

Jefffrey · 2026-05-05T05:11:48Z

+
+    // Coalesce by physical-pair equality only: emit a new run when the
+    // (array_idx, physical_idx) pair changes between adjacent output rows.
+    // TODO: We could perform an equality check across sources to extend the


I suppose this is what #9865 (and its issue #7710) are meant to address?

Yes, exactly. That PR would make sense in this block so we don't compact in the interleave fallback. This also means that the equality cost is only paid when interleave pairs select from different input run arrays (the assumption is that input run arrays are well formed). I'm concerned about the per-row slicing cost, though. Ideally you would have a cache of comparators, but I believe that requires some crate readjusting.

Rich-T-kid · 2026-05-05T14:33:51Z

Hey 😃, I'm working on #9865, which aims to resolve #7710. I added a test case from my branch that isn't working on this branch currently. I'm going to pull your changes down and push up a revised branch.

Rich-T-kid · 2026-05-05T17:03:37Z

@asubiotto I made a PR, could you check it out? #9919

asubiotto · 2026-05-06T09:52:54Z

@asubiotto I made a PR, could you check it out? #9919

Hi, thanks for pushing that up. While I think we eventually want to do this I would prefer an incremental approach which is already much better than what we have today (deduping logical runs within the same source). The reason is that while I think we should eventually dedup based on values, I'm not too keen on the slice cost per value and I think we can probably work on a much more performant approach by building and reusing a comparator to reduce the dynamic dispatch overhead. This is why I think we should decouple the two changes: 1) Merge the specialized REE interleave and 2) Optimize the interleave by value deduplication across sources
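
For readers following the second step, here is a speculative std-only sketch of what value deduplication across sources could look like once a reusable equality check exists. It is not code from either PR: `merge_equal_runs` and the `values_equal` closure are hypothetical, and the closure merely stands in for a comparator that is built once and reused for every adjacent pair.

```rust
/// Merge adjacent output runs whose values compare equal.
/// `picks[i]` is the (source, physical) pair backing output run `i`, and
/// `values_equal` is a comparator built once by the caller (hypothetical).
fn merge_equal_runs<F>(
    run_ends: &[usize],
    picks: &[(usize, usize)],
    values_equal: F,
) -> (Vec<usize>, Vec<(usize, usize)>)
where
    F: Fn((usize, usize), (usize, usize)) -> bool,
{
    let mut out_ends: Vec<usize> = Vec::new();
    let mut out_picks: Vec<(usize, usize)> = Vec::new();
    for (&end, &pick) in run_ends.iter().zip(picks) {
        // Compare against the last run that was kept, not the last input run.
        let mergeable = out_picks
            .last()
            .map_or(false, |&prev| values_equal(prev, pick));
        if mergeable {
            // Equal value: extend the previous run instead of emitting a new one.
            if let Some(last_end) = out_ends.last_mut() {
                *last_end = end;
            }
        } else {
            out_ends.push(end);
            out_picks.push(pick);
        }
    }
    (out_ends, out_picks)
}
```

In this sketch the comparator is only consulted once per pair of adjacent output runs, so its cost scales with the number of runs rather than the number of logical rows.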

asubiotto · 2026-05-06T12:06:46Z

I'm also seeing a need for better value equality checks for Dict<Struct> interning to merge dictionaries (similar use case to what you're thinking of for REE). I think we can kill two birds with one stone

Rich-T-kid · 2026-05-06T13:48:21Z

@asubiotto Yeah, that makes sense to me. I'll close the PR.

Jefffrey · 2026-05-07T02:27:49Z

run benchmark interleave_kernels

adriangbot · 2026-05-07T02:31:53Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4393751531-2043-flqs4 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing asubiotto/specializedreeinterleave (6fd0803) to b114241 (merge-base) diff
BENCH_NAME=interleave_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench interleave_kernels
BENCH_FILTER=
Results will be posted here when complete


adriangbot · 2026-05-07T02:53:11Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Details

group                                                                                        asubiotto_specializedreeinterleave     main
-----                                                                                        ----------------------------------     ----
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]                                   1.00    641.3±3.84ns        ? ?/sec    1.00    642.7±6.80ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.01   1855.0±7.86ns        ? ?/sec    1.00   1843.8±7.86ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00   1816.0±6.54ns        ? ?/sec    1.01   1830.8±9.70ns        ? ?/sec
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]                                   1.01   1032.7±4.09ns        ? ?/sec    1.00   1018.2±7.44ns        ? ?/sec
interleave dict_distinct 100                                                                 1.00      2.1±0.01µs        ? ?/sec    1.02      2.1±0.01µs        ? ?/sec
interleave dict_distinct 1024                                                                1.00      2.1±0.01µs        ? ?/sec    1.03      2.1±0.01µs        ? ?/sec
interleave dict_distinct 2048                                                                1.00      2.1±0.01µs        ? ?/sec    1.02      2.1±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]                            1.00   1537.4±4.90ns        ? ?/sec    1.01   1556.1±5.52ns        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                  1.01      3.0±0.01µs        ? ?/sec    1.00      3.0±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]                           1.01      2.7±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]                            1.00   1940.5±5.59ns        ? ?/sec    1.00   1947.9±8.46ns        ? ?/sec
interleave i32(0.0) 100 [0..100, 100..230, 450..1000]                                        1.02    217.0±1.78ns        ? ?/sec    1.00    211.9±1.79ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00    950.5±2.41ns        ? ?/sec    1.00    951.0±2.23ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]                                       1.06   1007.7±2.59ns        ? ?/sec    1.00    950.3±3.50ns        ? ?/sec
interleave i32(0.0) 400 [0..100, 100..230, 450..1000]                                        1.03    532.1±3.19ns        ? ?/sec    1.00    516.2±2.36ns        ? ?/sec
interleave i32(0.5) 100 [0..100, 100..230, 450..1000]                                        1.00    443.0±3.91ns        ? ?/sec    1.00    442.8±3.80ns        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00      2.9±0.02µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]                                       1.01      3.0±0.03µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 400 [0..100, 100..230, 450..1000]                                        1.00  1274.1±11.68ns        ? ?/sec    1.04   1330.0±5.42ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                           1.00   1625.3±2.79ns        ? ?/sec    1.02   1663.4±2.79ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     14.9±0.02µs        ? ?/sec    1.00     14.9±0.05µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                          1.00     14.8±0.05µs        ? ?/sec    1.01     15.0±0.02µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                           1.00      6.1±0.02µs        ? ?/sec    1.00      6.1±0.01µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                           1.00      3.9±0.03µs        ? ?/sec    1.00      3.9±0.02µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     32.0±0.10µs        ? ?/sec    1.01     32.2±0.14µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                          1.00     32.5±0.12µs        ? ?/sec    1.00     32.5±0.20µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                           1.00     12.9±0.04µs        ? ?/sec    1.01     13.0±0.06µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                      1.00      3.6±0.01µs        ? ?/sec    1.00      3.6±0.01µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.01     17.6±0.04µs        ? ?/sec    1.00     17.4±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                     1.01     17.2±0.04µs        ? ?/sec    1.00     17.0±0.04µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                      1.02      8.3±0.02µs        ? ?/sec    1.00      8.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                      1.03      6.4±0.01µs        ? ?/sec    1.00      6.2±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.00     34.9±0.05µs        ? ?/sec    1.00     34.7±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                     1.01     35.1±0.06µs        ? ?/sec    1.00     34.9±0.04µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                      1.02     15.7±0.02µs        ? ?/sec    1.00     15.4±0.03µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 100 [0..100, 100..230, 450..1000]              1.00      4.0±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.01     22.3±0.05µs        ? ?/sec    1.00     22.1±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000]             1.01     22.6±0.06µs        ? ?/sec    1.00     22.4±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 400 [0..100, 100..230, 450..1000]              1.02     10.4±0.01µs        ? ?/sec    1.00     10.2±0.02µs        ? ?/sec
interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]                                    1.00    607.8±1.35ns        ? ?/sec    1.00    607.6±1.33ns        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.02µs        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                   1.00      4.6±0.01µs        ? ?/sec    1.00      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]                                    1.00   1887.5±4.71ns        ? ?/sec    1.02   1924.8±7.08ns        ? ?/sec
interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]                                    1.00    747.5±1.84ns        ? ?/sec    1.00    746.0±3.23ns        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.01      6.0±0.02µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]                                   1.00      5.9±0.02µs        ? ?/sec    1.00      5.9±0.02µs        ? ?/sec
interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]                                    1.01      2.5±0.01µs        ? ?/sec    1.00      2.5±0.01µs        ? ?/sec
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]                                   1.02    622.0±1.83ns        ? ?/sec    1.00    607.7±1.46ns        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1218.2±1.49ns        ? ?/sec    1.02   1237.8±1.90ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 100 [0..100, 100..230, 450..1000]                       1.00    635.7±7.28ns        ? ?/sec    1.02    646.1±7.41ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]             1.00      2.2±0.01µs        ? ?/sec    1.00      2.2±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000]                      1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 400 [0..100, 100..230, 450..1000]                       1.00   1141.1±7.61ns        ? ?/sec    1.01   1147.7±7.02ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 100 [0..100, 100..230, 450..1000]                   1.05   1077.9±5.69ns        ? ?/sec    1.00   1028.3±4.77ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]         1.00      5.9±0.01µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                  1.01      5.8±0.02µs        ? ?/sec    1.00      5.8±0.02µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 400 [0..100, 100..230, 450..1000]                   1.02      2.7±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 100 [0..100, 100..230, 450..1000]              1.03   1443.6±5.01ns        ? ?/sec    1.00   1403.2±3.32ns        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.03      9.9±0.02µs        ? ?/sec    1.00      9.6±0.02µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000]             1.01      9.6±0.02µs        ? ?/sec    1.00      9.5±0.03µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 400 [0..100, 100..230, 450..1000]              1.01      4.1±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec

Resource Usage

base (merge-base)

Metric	Value
Wall time	610.1s
Peak memory	3.0 GiB
Avg memory	3.0 GiB
CPU user	604.3s
CPU sys	0.7s
Peak spill	0 B

branch

Metric	Value
Wall time	605.1s
Peak memory	3.0 GiB
Avg memory	3.0 GiB
CPU user	601.5s
CPU sys	0.1s
Peak spill	0 B


Jefffrey · 2026-05-07T08:32:03Z

@asubiotto could you merge up from main so we can compare the benchmark?

The specialized interleave works by preserving run ends as much as possible by coalescing groups of adjacent logical indices pointing to the same source and calling interleave on the run end values.

Future work could additionally coalesce values across sources, but this requires a value equality check.

Signed-off-by: Alfonso Subiotto Marques <[email protected]>

asubiotto · 2026-05-07T08:34:28Z

Oops, done.

Jefffrey · 2026-05-07T08:42:41Z

run benchmark interleave_kernels

adriangbot · 2026-05-07T08:45:33Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4395554475-2046-jsqxr 6.12.68+ #1 SMP Wed Apr 1 02:23:28 UTC 2026 aarch64 GNU/Linux


Comparing asubiotto/specializedreeinterleave (b8165b1) to 97ff198 (merge-base) diff
BENCH_NAME=interleave_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench interleave_kernels
BENCH_FILTER=
Results will be posted here when complete


adriangbot · 2026-05-07T09:09:29Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Details

group                                                                                        asubiotto_specializedreeinterleave     main
-----                                                                                        ----------------------------------     ----
interleave dict(20, 0.0) 100 [0..100, 100..230, 450..1000]                                   1.00    634.1±4.86ns        ? ?/sec    1.00    634.4±3.18ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.00   1831.6±6.61ns        ? ?/sec    1.01   1849.4±7.83ns        ? ?/sec
interleave dict(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                  1.00   1802.8±9.62ns        ? ?/sec    1.00   1802.9±9.23ns        ? ?/sec
interleave dict(20, 0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1011.9±4.47ns        ? ?/sec    1.01   1017.9±4.34ns        ? ?/sec
interleave dict_distinct 100                                                                 1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave dict_distinct 1024                                                                1.00      2.1±0.01µs        ? ?/sec    1.01      2.1±0.01µs        ? ?/sec
interleave dict_distinct 2048                                                                1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 100 [0..100, 100..230, 450..1000]                            1.01   1523.6±5.47ns        ? ?/sec    1.00   1515.0±6.63ns        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                  1.00      3.0±0.01µs        ? ?/sec    1.01      3.0±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 1024 [0..100, 100..230, 450..1000]                           1.00      2.7±0.01µs        ? ?/sec    1.00      2.7±0.01µs        ? ?/sec
interleave dict_sparse(20, 0.0) 400 [0..100, 100..230, 450..1000]                            1.01   1935.3±6.22ns        ? ?/sec    1.00   1922.3±4.08ns        ? ?/sec
interleave i32(0.0) 100 [0..100, 100..230, 450..1000]                                        1.00    212.4±2.54ns        ? ?/sec    1.01    213.7±2.70ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00    962.7±3.97ns        ? ?/sec    1.01    968.4±3.74ns        ? ?/sec
interleave i32(0.0) 1024 [0..100, 100..230, 450..1000]                                       1.00    939.2±2.28ns        ? ?/sec    1.07   1001.3±4.04ns        ? ?/sec
interleave i32(0.0) 400 [0..100, 100..230, 450..1000]                                        1.00    452.9±2.25ns        ? ?/sec    1.16    524.4±4.19ns        ? ?/sec
interleave i32(0.5) 100 [0..100, 100..230, 450..1000]                                        1.00    443.4±3.75ns        ? ?/sec    1.01    449.8±5.65ns        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                              1.00      2.9±0.02µs        ? ?/sec    1.00      2.9±0.01µs        ? ?/sec
interleave i32(0.5) 1024 [0..100, 100..230, 450..1000]                                       1.00      3.0±0.02µs        ? ?/sec    1.02      3.0±0.02µs        ? ?/sec
interleave i32(0.5) 400 [0..100, 100..230, 450..1000]                                        1.00   1256.6±9.25ns        ? ?/sec    1.06   1329.4±7.54ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                           1.00   1627.3±2.87ns        ? ?/sec    1.02   1656.7±5.27ns        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     14.9±0.11µs        ? ?/sec    1.00     14.9±0.05µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                          1.00     14.8±0.04µs        ? ?/sec    1.01     14.9±0.03µs        ? ?/sec
interleave list<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                           1.00      6.1±0.01µs        ? ?/sec    1.00      6.1±0.02µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                           1.00      3.9±0.01µs        ? ?/sec    1.01      3.9±0.01µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     32.0±0.14µs        ? ?/sec    1.00     32.1±0.16µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                          1.00     32.4±0.14µs        ? ?/sec    1.00     32.3±0.13µs        ? ?/sec
interleave list<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                           1.01     13.1±0.06µs        ? ?/sec    1.00     13.0±0.04µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 100 [0..100, 100..230, 450..1000]                      1.00      3.5±0.01µs        ? ?/sec    1.01      3.6±0.01µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.00     17.4±0.04µs        ? ?/sec    1.00     17.4±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 1024 [0..100, 100..230, 450..1000]                     1.01     17.3±0.04µs        ? ?/sec    1.00     17.1±0.03µs        ? ?/sec
interleave list_view<i64>(0.0,0.0,20) 400 [0..100, 100..230, 450..1000]                      1.01      8.3±0.02µs        ? ?/sec    1.00      8.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 100 [0..100, 100..230, 450..1000]                      1.00      6.0±0.01µs        ? ?/sec    1.02      6.1±0.02µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000, 0..1000]            1.01     34.8±0.04µs        ? ?/sec    1.00     34.6±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 1024 [0..100, 100..230, 450..1000]                     1.00     34.6±0.06µs        ? ?/sec    1.00     34.6±0.05µs        ? ?/sec
interleave list_view<i64>(0.1,0.1,20) 400 [0..100, 100..230, 450..1000]                      1.01     15.5±0.05µs        ? ?/sec    1.00     15.4±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 100 [0..100, 100..230, 450..1000]              1.01      4.1±0.01µs        ? ?/sec    1.00      4.0±0.01µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.00     22.2±0.05µs        ? ?/sec    1.01     22.3±0.08µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 1024 [0..100, 100..230, 450..1000]             1.00     22.4±0.06µs        ? ?/sec    1.00     22.5±0.05µs        ? ?/sec
interleave list_view_overlapping<i64>(80x,20) 400 [0..100, 100..230, 450..1000]              1.00     10.3±0.01µs        ? ?/sec    1.00     10.3±0.02µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 100 [0..100, 100..230, 450..1000]                1.00      4.3±0.01µs        ? ?/sec    2.39     10.3±0.04µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 1024 [0..100, 100..230, 450..1000, 0..1000]      1.00     20.6±0.07µs        ? ?/sec    4.06     83.5±0.11µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 1024 [0..100, 100..230, 450..1000]               1.00     20.1±0.07µs        ? ?/sec    3.87     77.8±0.10µs        ? ?/sec
interleave ree_i32<dict<u32,utf8>>(64 runs) 400 [0..100, 100..230, 450..1000]                1.00      9.5±0.02µs        ? ?/sec    3.37     32.1±0.04µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 100 [0..100, 100..230, 450..1000]                           1.00      3.3±0.01µs        ? ?/sec    2.80      9.1±0.04µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 1024 [0..100, 100..230, 450..1000, 0..1000]                 1.00     19.1±0.07µs        ? ?/sec    4.21     80.4±0.09µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 1024 [0..100, 100..230, 450..1000]                          1.00     18.6±0.07µs        ? ?/sec    4.00     74.4±0.10µs        ? ?/sec
interleave ree_i32<i64>(64 runs) 400 [0..100, 100..230, 450..1000]                           1.00      8.3±0.03µs        ? ?/sec    3.62     30.1±0.05µs        ? ?/sec
interleave str(20, 0.0) 100 [0..100, 100..230, 450..1000]                                    1.00    601.3±1.44ns        ? ?/sec    1.00    600.2±1.40ns        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                                   1.00      4.6±0.01µs        ? ?/sec    1.01      4.6±0.01µs        ? ?/sec
interleave str(20, 0.0) 400 [0..100, 100..230, 450..1000]                                    1.00   1888.9±4.02ns        ? ?/sec    1.00   1895.1±7.63ns        ? ?/sec
interleave str(20, 0.5) 100 [0..100, 100..230, 450..1000]                                    1.00    747.1±0.78ns        ? ?/sec    1.00    749.2±1.17ns        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000, 0..1000]                          1.00      6.0±0.02µs        ? ?/sec    1.00      5.9±0.02µs        ? ?/sec
interleave str(20, 0.5) 1024 [0..100, 100..230, 450..1000]                                   1.00      5.9±0.03µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
interleave str(20, 0.5) 400 [0..100, 100..230, 450..1000]                                    1.01      2.5±0.01µs        ? ?/sec    1.00      2.5±0.01µs        ? ?/sec
interleave str_view(0.0) 100 [0..100, 100..230, 450..1000]                                   1.03    575.9±9.14ns        ? ?/sec    1.00    559.3±0.87ns        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]                         1.01      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 1024 [0..100, 100..230, 450..1000]                                  1.01      2.6±0.01µs        ? ?/sec    1.00      2.6±0.00µs        ? ?/sec
interleave str_view(0.0) 400 [0..100, 100..230, 450..1000]                                   1.00   1226.0±8.73ns        ? ?/sec    1.01   1241.2±1.58ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 100 [0..100, 100..230, 450..1000]                       1.00    647.8±7.91ns        ? ?/sec    1.01    655.1±7.88ns        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]             1.00      2.1±0.01µs        ? ?/sec    1.02      2.2±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 1024 [0..100, 100..230, 450..1000]                      1.00      2.1±0.01µs        ? ?/sec    1.00      2.1±0.01µs        ? ?/sec
interleave struct(i32(0.0), i32(0.0) 400 [0..100, 100..230, 450..1000]                       1.00   1152.4±7.32ns        ? ?/sec    1.00   1151.4±7.07ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 100 [0..100, 100..230, 450..1000]                   1.00   1032.6±5.63ns        ? ?/sec    1.01   1042.3±5.78ns        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000, 0..1000]         1.00      5.8±0.01µs        ? ?/sec    1.00      5.8±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 1024 [0..100, 100..230, 450..1000]                  1.00      5.8±0.01µs        ? ?/sec    1.00      5.8±0.01µs        ? ?/sec
interleave struct(i32(0.0), str(20, 0.0) 400 [0..100, 100..230, 450..1000]                   1.00      2.6±0.00µs        ? ?/sec    1.01      2.6±0.01µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 100 [0..100, 100..230, 450..1000]              1.00   1399.8±4.73ns        ? ?/sec    1.01   1414.6±5.06ns        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000, 0..1000]    1.00      9.6±0.02µs        ? ?/sec    1.00      9.6±0.02µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 1024 [0..100, 100..230, 450..1000]             1.00      9.5±0.02µs        ? ?/sec    1.00      9.5±0.03µs        ? ?/sec
interleave struct(str(20, 0.0), str(20, 0.0)) 400 [0..100, 100..230, 450..1000]              1.00      4.0±0.01µs        ? ?/sec    1.00      4.1±0.01µs        ? ?/sec
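
Reading the ratio columns: each row appears to report, for both sides of the comparison, the mean time divided by the faster side's mean time, so the faster side shows 1.00. This matches the convention of tools such as critcmp, but it is an assumption here, since the runner's output format is not described in this thread. A minimal sketch, using a hypothetical helper `relative_to_fastest` and the mean times from the `ree_i32<i64>(64 runs) 1024 [0..100, 100..230, 450..1000]` row:

```rust
/// Hypothetical helper: ratio of each mean time to the fastest mean time
/// in the same row. The fastest entry therefore maps to 1.0.
fn relative_to_fastest(times: &[f64]) -> Vec<f64> {
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times.iter().map(|t| t / fastest).collect()
}

fn main() {
    // Mean times in microseconds, copied from the table row named above.
    let row = [18.6, 74.4];
    let ratios = relative_to_fastest(&row);
    println!("{:.2} {:.2}", ratios[0], ratios[1]);
}
```

The printed ratios are presumably computed from unrounded measurements, so recomputing them from the rounded times shown in the table can differ in the last digit (for example, 83.5 / 20.6 rounds to 4.05 while the table prints 4.06).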

Resource Usage

base (merge-base)

Metric	Value
Wall time	690.1s
Peak memory	3.0 GiB
Avg memory	3.0 GiB
CPU user	685.6s
CPU sys	0.7s
Peak spill	0 B

branch

Metric	Value
Wall time	685.2s
Peak memory	3.0 GiB
Avg memory	3.0 GiB
CPU user	681.9s
CPU sys	0.2s
Peak spill	0 B

Jefffrey

CI failure is unrelated: #9938

alamb · 2026-05-07T18:12:11Z

Thanks @asubiotto and @Jefffrey

github-actions bot added the arrow label (Changes to the arrow crate) Apr 30, 2026

Jefffrey mentioned this pull request May 5, 2026

Combine overlapping runs in REE (take kernel) #9865 (Open)

Jefffrey reviewed May 5, 2026

Rich-T-kid mentioned this pull request May 5, 2026

Revised pr 9856 #9919 (Closed)

asubiotto force-pushed the asubiotto/specializedreeinterleave branch from f103e47 to 6fd0803 on May 6, 2026 10:29

asubiotto force-pushed the asubiotto/specializedreeinterleave branch from 6fd0803 to b8165b1 on May 7, 2026 08:34

Jefffrey approved these changes May 7, 2026

alamb merged commit 3c71d92 into apache:main May 7, 2026
25 of 26 checks passed

asubiotto deleted the asubiotto/specializedreeinterleave branch May 8, 2026 10:30
