Update to `arrow`, `parquet` to `57.1.0` #18820

alamb · 2025-11-19T14:50:44Z

Which issue does this PR close?

related to Release arrow-rs / parquet Minor version 57.1.0 (November 2025) arrow-rs#8464

Rationale for this change

Get latest and greatest code from arrow

What changes are included in this PR?

Update to Arrow 57.1.0
Update for API changes (comments inline)

Are these changes tested?

Yes, by CI

Are there any user-facing changes?

No

alamb · 2025-11-19T20:07:34Z

datafusion-cli/src/main.rs

-        | alltypes_plain.parquet            | 1851            | 6957                | 2    | page_index=false |
-        | alltypes_tiny_pages.parquet       | 454233          | 267014              | 2    | page_index=true  |
-        | lz4_raw_compressed_larger.parquet | 380836          | 996                 | 2    | page_index=false |
+        | alltypes_plain.parquet            | 1851            | 8882                | 2    | page_index=false |


the metadata didn't actually get bigger, we just actually included the encryption information (better reporting)

[Parquet] Account for FileDecryptor in ParquetMetaData heap size calculation arrow-rs#8671

Actually I looked into it more and I think the size growth is a bug. See

Parquet metadata size accounting overcounts Arc<T> arrow-rs#8897

Update: the size is correct. As @etseidl says "the truth hurts"

alamb · 2025-11-20T20:12:45Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57.1.0 (840487e) to 6d9ab45 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-11-20T21:11:15Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57.1.0
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57.1 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2669.04 ms │               2708.77 ms │    no change │
│ QQuery 1     │  1235.58 ms │               1311.33 ms │ 1.06x slower │
│ QQuery 2     │  2388.97 ms │               2475.47 ms │    no change │
│ QQuery 3     │  1206.27 ms │               1200.33 ms │    no change │
│ QQuery 4     │  2326.86 ms │               2244.02 ms │    no change │
│ QQuery 5     │ 28556.87 ms │              28558.86 ms │    no change │
│ QQuery 6     │  4095.23 ms │               3958.01 ms │    no change │
│ QQuery 7     │  3903.82 ms │               3868.19 ms │    no change │
└──────────────┴─────────────┴──────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 46382.65ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 46324.97ms │
│ Average Time (HEAD)                     │  5797.83ms │
│ Average Time (alamb_upgrade_arrow_57.1) │  5790.62ms │
│ Queries Faster                          │          0 │
│ Queries Slower                          │          1 │
│ Queries with No Change                  │          7 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57.1 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.11 ms │                  2.51 ms │  1.19x slower │
│ QQuery 1     │    49.63 ms │                 49.65 ms │     no change │
│ QQuery 2     │   137.94 ms │                134.20 ms │     no change │
│ QQuery 3     │   163.23 ms │                154.87 ms │ +1.05x faster │
│ QQuery 4     │  1087.29 ms │               1111.76 ms │     no change │
│ QQuery 5     │  1490.99 ms │               1535.73 ms │     no change │
│ QQuery 6     │     2.22 ms │                  2.19 ms │     no change │
│ QQuery 7     │    54.53 ms │                 54.27 ms │     no change │
│ QQuery 8     │  1489.20 ms │               1499.13 ms │     no change │
│ QQuery 9     │  1864.60 ms │               1878.47 ms │     no change │
│ QQuery 10    │   375.25 ms │                386.99 ms │     no change │
│ QQuery 11    │   428.15 ms │                440.85 ms │     no change │
│ QQuery 12    │  1369.91 ms │               1379.43 ms │     no change │
│ QQuery 13    │  2122.32 ms │               2132.63 ms │     no change │
│ QQuery 14    │  1291.97 ms │               1313.81 ms │     no change │
│ QQuery 15    │  1261.86 ms │               1267.39 ms │     no change │
│ QQuery 16    │  2719.83 ms │               2737.71 ms │     no change │
│ QQuery 17    │  2710.92 ms │               2742.06 ms │     no change │
│ QQuery 18    │  5919.25 ms │               5077.25 ms │ +1.17x faster │
│ QQuery 19    │   126.87 ms │                120.96 ms │     no change │
│ QQuery 20    │  2104.85 ms │               1933.69 ms │ +1.09x faster │
│ QQuery 21    │  2406.78 ms │               2211.99 ms │ +1.09x faster │
│ QQuery 22    │  4076.54 ms │               3818.94 ms │ +1.07x faster │
│ QQuery 23    │ 12929.23 ms │              12607.37 ms │     no change │
│ QQuery 24    │   211.40 ms │                207.20 ms │     no change │
│ QQuery 25    │   484.28 ms │                477.07 ms │     no change │
│ QQuery 26    │   222.10 ms │                205.52 ms │ +1.08x faster │
│ QQuery 27    │  2839.44 ms │               2746.85 ms │     no change │
│ QQuery 28    │ 23650.44 ms │              23486.62 ms │     no change │
│ QQuery 29    │   970.57 ms │                986.64 ms │     no change │
│ QQuery 30    │  1358.29 ms │               1355.65 ms │     no change │
│ QQuery 31    │  1399.94 ms │               1375.33 ms │     no change │
│ QQuery 32    │  5469.96 ms │               4955.15 ms │ +1.10x faster │
│ QQuery 33    │  6330.13 ms │               5881.95 ms │ +1.08x faster │
│ QQuery 34    │  6616.67 ms │               6409.89 ms │     no change │
│ QQuery 35    │  2074.80 ms │               2052.74 ms │     no change │
│ QQuery 36    │   119.25 ms │                116.62 ms │     no change │
│ QQuery 37    │    51.77 ms │                 51.71 ms │     no change │
│ QQuery 38    │   119.58 ms │                115.50 ms │     no change │
│ QQuery 39    │   199.66 ms │                189.24 ms │ +1.06x faster │
│ QQuery 40    │    44.56 ms │                 42.01 ms │ +1.06x faster │
│ QQuery 41    │    38.38 ms │                 38.30 ms │     no change │
│ QQuery 42    │    31.94 ms │                 31.93 ms │     no change │
└──────────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 98418.61ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 95319.78ms │
│ Average Time (HEAD)                     │  2288.80ms │
│ Average Time (alamb_upgrade_arrow_57.1) │  2216.74ms │
│ Queries Faster                          │         10 │
│ Queries Slower                          │          1 │
│ Queries with No Change                  │         32 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57.1 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 139.21 ms │                130.93 ms │ +1.06x faster │
│ QQuery 2     │  29.26 ms │                 29.13 ms │     no change │
│ QQuery 3     │  38.71 ms │                 34.79 ms │ +1.11x faster │
│ QQuery 4     │  29.41 ms │                 28.65 ms │     no change │
│ QQuery 5     │  87.42 ms │                 88.35 ms │     no change │
│ QQuery 6     │  19.55 ms │                 20.18 ms │     no change │
│ QQuery 7     │ 228.31 ms │                227.00 ms │     no change │
│ QQuery 8     │  34.20 ms │                 32.54 ms │     no change │
│ QQuery 9     │  97.79 ms │                110.99 ms │  1.14x slower │
│ QQuery 10    │  64.27 ms │                 63.16 ms │     no change │
│ QQuery 11    │  17.23 ms │                 17.26 ms │     no change │
│ QQuery 12    │  52.89 ms │                 51.90 ms │     no change │
│ QQuery 13    │  46.75 ms │                 46.20 ms │     no change │
│ QQuery 14    │  14.19 ms │                 13.74 ms │     no change │
│ QQuery 15    │  25.06 ms │                 24.65 ms │     no change │
│ QQuery 16    │  25.08 ms │                 25.22 ms │     no change │
│ QQuery 17    │ 147.82 ms │                153.69 ms │     no change │
│ QQuery 18    │ 307.83 ms │                284.87 ms │ +1.08x faster │
│ QQuery 19    │  37.51 ms │                 38.95 ms │     no change │
│ QQuery 20    │  49.58 ms │                 49.73 ms │     no change │
│ QQuery 21    │ 334.74 ms │                321.67 ms │     no change │
│ QQuery 22    │  20.67 ms │                 20.62 ms │     no change │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 1847.47ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 1814.23ms │
│ Average Time (HEAD)                     │   83.98ms │
│ Average Time (alamb_upgrade_arrow_57.1) │   82.46ms │
│ Queries Faster                          │         3 │
│ Queries Slower                          │         1 │
│ Queries with No Change                  │        18 │
│ Queries with Failure                    │         0 │
└─────────────────────────────────────────┴───────────┘

alamb · 2025-11-20T21:11:20Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/upgrade_arrow_57.1.0 (fafb102) to 7d8b860 diff using: clickbench_pushdown
Results will be posted here when complete

alamb · 2025-11-20T20:51:57Z

datafusion/sqllogictest/test_files/array.slt

 query TTT
 select arrow_typeof(column1), arrow_typeof(column2), arrow_typeof(column3) from arrays;
 ----
-List(nullable List(nullable Int64)) List(nullable Float64) List(nullable Utf8)


Previously the DataType parsing code did not handle this syntax (it only supported List(Float64)). We have now made the display and parsing consistent, see apache/arrow-rs#8649 (comment) for background and details

alamb · 2025-11-20T21:24:14Z

datafusion-cli/src/main.rs

-        | alltypes_plain.parquet            | 1851            | 6957                | 2    | page_index=false |
-        | alltypes_tiny_pages.parquet       | 454233          | 267014              | 2    | page_index=true  |
-        | lz4_raw_compressed_larger.parquet | 380836          | 996                 | 2    | page_index=false |
+        | alltypes_plain.parquet            | 1851            | 8882                | 2    | page_index=false |


Actually I looked into it more and I think the size growth is a bug. See

Parquet metadata size accounting overcounts Arc<T> arrow-rs#8897

alamb · 2025-11-20T21:42:32Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57.1.0
--------------------
Benchmark clickbench_pushdown.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57.1 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.19 ms │                  2.67 ms │  1.22x slower │
│ QQuery 1     │    53.32 ms │                 51.49 ms │     no change │
│ QQuery 2     │   141.71 ms │                134.25 ms │ +1.06x faster │
│ QQuery 3     │   167.28 ms │                156.44 ms │ +1.07x faster │
│ QQuery 4     │  1084.35 ms │               1106.65 ms │     no change │
│ QQuery 5     │  1556.95 ms │               1490.85 ms │     no change │
│ QQuery 6     │     2.16 ms │                  2.32 ms │  1.07x slower │
│ QQuery 7     │    74.61 ms │                 66.99 ms │ +1.11x faster │
│ QQuery 8     │  1486.06 ms │               1416.15 ms │     no change │
│ QQuery 9     │  1893.82 ms │               1877.58 ms │     no change │
│ QQuery 10    │   480.16 ms │                496.91 ms │     no change │
│ QQuery 11    │   557.20 ms │                549.42 ms │     no change │
│ QQuery 12    │  1608.14 ms │               1537.86 ms │     no change │
│ QQuery 13    │  2579.34 ms │               2322.88 ms │ +1.11x faster │
│ QQuery 14    │  1693.59 ms │               1457.10 ms │ +1.16x faster │
│ QQuery 15    │  1298.92 ms │               1255.58 ms │     no change │
│ QQuery 16    │  2729.63 ms │               2662.03 ms │     no change │
│ QQuery 17    │  2739.17 ms │               2653.09 ms │     no change │
│ QQuery 18    │  5301.07 ms │               4998.57 ms │ +1.06x faster │
│ QQuery 19    │   149.56 ms │                139.48 ms │ +1.07x faster │
│ QQuery 20    │  2047.96 ms │               1894.37 ms │ +1.08x faster │
│ QQuery 21    │  2453.34 ms │               2307.63 ms │ +1.06x faster │
│ QQuery 22    │  4124.78 ms │               3992.67 ms │     no change │
│ QQuery 23    │  1144.82 ms │               1083.87 ms │ +1.06x faster │
│ QQuery 24    │   258.15 ms │                248.11 ms │     no change │
│ QQuery 25    │   676.36 ms │                648.55 ms │     no change │
│ QQuery 26    │   358.14 ms │                343.48 ms │     no change │
│ QQuery 27    │  3125.24 ms │               3006.62 ms │     no change │
│ QQuery 28    │ 23975.91 ms │              23762.45 ms │     no change │
│ QQuery 29    │   961.14 ms │                989.91 ms │     no change │
│ QQuery 30    │  2163.07 ms │               1380.73 ms │ +1.57x faster │
│ QQuery 31    │  2089.33 ms │               1351.52 ms │ +1.55x faster │
│ QQuery 32    │  4853.46 ms │               4935.63 ms │     no change │
│ QQuery 33    │  6051.64 ms │               5677.56 ms │ +1.07x faster │
│ QQuery 34    │  6305.89 ms │               5969.09 ms │ +1.06x faster │
│ QQuery 35    │  1936.33 ms │               1863.02 ms │     no change │
│ QQuery 36    │    26.21 ms │                 26.35 ms │     no change │
│ QQuery 37    │    26.08 ms │                 25.69 ms │     no change │
│ QQuery 38    │    25.20 ms │                 25.25 ms │     no change │
│ QQuery 39    │    25.50 ms │                 26.15 ms │     no change │
│ QQuery 40    │    26.74 ms │                 27.07 ms │     no change │
│ QQuery 41    │    25.65 ms │                 26.27 ms │     no change │
│ QQuery 42    │    25.35 ms │                 25.87 ms │     no change │
└──────────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 88305.52ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 84016.17ms │
│ Average Time (HEAD)                     │  2053.62ms │
│ Average Time (alamb_upgrade_arrow_57.1) │  1953.86ms │
│ Queries Faster                          │         14 │
│ Queries Slower                          │          2 │
│ Queries with No Change                  │         27 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘

alamb · 2025-11-21T14:39:47Z

datafusion-cli/src/main.rs

-        | alltypes_plain.parquet            | 1851            | 6957                | 2    | page_index=false |
-        | alltypes_tiny_pages.parquet       | 454233          | 267014              | 2    | page_index=true  |
-        | lz4_raw_compressed_larger.parquet | 380836          | 996                 | 2    | page_index=false |
+        | alltypes_plain.parquet            | 1851            | 8882                | 2    | page_index=false |


Update: the size is correct. As @etseidl says "the truth hurts"

alamb · 2025-11-21T14:42:18Z

datafusion/common/src/config.rs

        /// the filters are applied in the same order as written in the query
        pub reorder_filters: bool, default = false

+        /// (reading) Force the use of RowSelections for filter results, when


This is an escape valve if we find some issue when using the new adaptive filter from @hhhizzz

[Parquet] Adaptive Parquet Predicate Pushdown arrow-rs#8733

alamb · 2025-11-21T14:43:04Z

datafusion/core/src/physical_planner.rs

        assert_contains!(
            &e,
-            r#"Error during planning: Can not find compatible types to compare Boolean with [Struct("foo": Boolean), Utf8]"#
+            r#"Error during planning: Can not find compatible types to compare Boolean with [Struct("foo": non-null Boolean), Utf8]"#


these are due the changes from apache/arrow-rs#8648 to clean up datatype display. It is a nice improvement in my mind

alamb · 2025-11-21T14:43:30Z

datafusion/core/tests/parquet/filter_pushdown.rs

    // The cache is on by default, and used when filter pushdown is enabled
+    PredicateCacheTest {
+        expected_inner_records: 8,
+        expected_records: 7, // reads more than necessary from the cache as then another bitmap is applied


this behavior changed due to adaptive filtering. I added a new test that turns off adaptive filtering to show doing so restores the old behavior

Dandandan · 2025-11-22T06:34:36Z

🤖: Benchmark completed

Details

Comparing HEAD and alamb_upgrade_arrow_57.1.0
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57.1 ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2669.04 ms │               2708.77 ms │    no change │
│ QQuery 1     │  1235.58 ms │               1311.33 ms │ 1.06x slower │
│ QQuery 2     │  2388.97 ms │               2475.47 ms │    no change │
│ QQuery 3     │  1206.27 ms │               1200.33 ms │    no change │
│ QQuery 4     │  2326.86 ms │               2244.02 ms │    no change │
│ QQuery 5     │ 28556.87 ms │              28558.86 ms │    no change │
│ QQuery 6     │  4095.23 ms │               3958.01 ms │    no change │
│ QQuery 7     │  3903.82 ms │               3868.19 ms │    no change │
└──────────────┴─────────────┴──────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 46382.65ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 46324.97ms │
│ Average Time (HEAD)                     │  5797.83ms │
│ Average Time (alamb_upgrade_arrow_57.1) │  5790.62ms │
│ Queries Faster                          │          0 │
│ Queries Slower                          │          1 │
│ Queries with No Change                  │          7 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ alamb_upgrade_arrow_57.1 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.11 ms │                  2.51 ms │  1.19x slower │
│ QQuery 1     │    49.63 ms │                 49.65 ms │     no change │
│ QQuery 2     │   137.94 ms │                134.20 ms │     no change │
│ QQuery 3     │   163.23 ms │                154.87 ms │ +1.05x faster │
│ QQuery 4     │  1087.29 ms │               1111.76 ms │     no change │
│ QQuery 5     │  1490.99 ms │               1535.73 ms │     no change │
│ QQuery 6     │     2.22 ms │                  2.19 ms │     no change │
│ QQuery 7     │    54.53 ms │                 54.27 ms │     no change │
│ QQuery 8     │  1489.20 ms │               1499.13 ms │     no change │
│ QQuery 9     │  1864.60 ms │               1878.47 ms │     no change │
│ QQuery 10    │   375.25 ms │                386.99 ms │     no change │
│ QQuery 11    │   428.15 ms │                440.85 ms │     no change │
│ QQuery 12    │  1369.91 ms │               1379.43 ms │     no change │
│ QQuery 13    │  2122.32 ms │               2132.63 ms │     no change │
│ QQuery 14    │  1291.97 ms │               1313.81 ms │     no change │
│ QQuery 15    │  1261.86 ms │               1267.39 ms │     no change │
│ QQuery 16    │  2719.83 ms │               2737.71 ms │     no change │
│ QQuery 17    │  2710.92 ms │               2742.06 ms │     no change │
│ QQuery 18    │  5919.25 ms │               5077.25 ms │ +1.17x faster │
│ QQuery 19    │   126.87 ms │                120.96 ms │     no change │
│ QQuery 20    │  2104.85 ms │               1933.69 ms │ +1.09x faster │
│ QQuery 21    │  2406.78 ms │               2211.99 ms │ +1.09x faster │
│ QQuery 22    │  4076.54 ms │               3818.94 ms │ +1.07x faster │
│ QQuery 23    │ 12929.23 ms │              12607.37 ms │     no change │
│ QQuery 24    │   211.40 ms │                207.20 ms │     no change │
│ QQuery 25    │   484.28 ms │                477.07 ms │     no change │
│ QQuery 26    │   222.10 ms │                205.52 ms │ +1.08x faster │
│ QQuery 27    │  2839.44 ms │               2746.85 ms │     no change │
│ QQuery 28    │ 23650.44 ms │              23486.62 ms │     no change │
│ QQuery 29    │   970.57 ms │                986.64 ms │     no change │
│ QQuery 30    │  1358.29 ms │               1355.65 ms │     no change │
│ QQuery 31    │  1399.94 ms │               1375.33 ms │     no change │
│ QQuery 32    │  5469.96 ms │               4955.15 ms │ +1.10x faster │
│ QQuery 33    │  6330.13 ms │               5881.95 ms │ +1.08x faster │
│ QQuery 34    │  6616.67 ms │               6409.89 ms │     no change │
│ QQuery 35    │  2074.80 ms │               2052.74 ms │     no change │
│ QQuery 36    │   119.25 ms │                116.62 ms │     no change │
│ QQuery 37    │    51.77 ms │                 51.71 ms │     no change │
│ QQuery 38    │   119.58 ms │                115.50 ms │     no change │
│ QQuery 39    │   199.66 ms │                189.24 ms │ +1.06x faster │
│ QQuery 40    │    44.56 ms │                 42.01 ms │ +1.06x faster │
│ QQuery 41    │    38.38 ms │                 38.30 ms │     no change │
│ QQuery 42    │    31.94 ms │                 31.93 ms │     no change │
└──────────────┴─────────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 98418.61ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 95319.78ms │
│ Average Time (HEAD)                     │  2288.80ms │
│ Average Time (alamb_upgrade_arrow_57.1) │  2216.74ms │
│ Queries Faster                          │         10 │
│ Queries Slower                          │          1 │
│ Queries with No Change                  │         32 │
│ Queries with Failure                    │          0 │
└─────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ alamb_upgrade_arrow_57.1 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 139.21 ms │                130.93 ms │ +1.06x faster │
│ QQuery 2     │  29.26 ms │                 29.13 ms │     no change │
│ QQuery 3     │  38.71 ms │                 34.79 ms │ +1.11x faster │
│ QQuery 4     │  29.41 ms │                 28.65 ms │     no change │
│ QQuery 5     │  87.42 ms │                 88.35 ms │     no change │
│ QQuery 6     │  19.55 ms │                 20.18 ms │     no change │
│ QQuery 7     │ 228.31 ms │                227.00 ms │     no change │
│ QQuery 8     │  34.20 ms │                 32.54 ms │     no change │
│ QQuery 9     │  97.79 ms │                110.99 ms │  1.14x slower │
│ QQuery 10    │  64.27 ms │                 63.16 ms │     no change │
│ QQuery 11    │  17.23 ms │                 17.26 ms │     no change │
│ QQuery 12    │  52.89 ms │                 51.90 ms │     no change │
│ QQuery 13    │  46.75 ms │                 46.20 ms │     no change │
│ QQuery 14    │  14.19 ms │                 13.74 ms │     no change │
│ QQuery 15    │  25.06 ms │                 24.65 ms │     no change │
│ QQuery 16    │  25.08 ms │                 25.22 ms │     no change │
│ QQuery 17    │ 147.82 ms │                153.69 ms │     no change │
│ QQuery 18    │ 307.83 ms │                284.87 ms │ +1.08x faster │
│ QQuery 19    │  37.51 ms │                 38.95 ms │     no change │
│ QQuery 20    │  49.58 ms │                 49.73 ms │     no change │
│ QQuery 21    │ 334.74 ms │                321.67 ms │     no change │
│ QQuery 22    │  20.67 ms │                 20.62 ms │     no change │
└──────────────┴───────────┴──────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                       ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                       │ 1847.47ms │
│ Total Time (alamb_upgrade_arrow_57.1)   │ 1814.23ms │
│ Average Time (HEAD)                     │   83.98ms │
│ Average Time (alamb_upgrade_arrow_57.1) │   82.46ms │
│ Queries Faster                          │         3 │
│ Queries Slower                          │         1 │
│ Queries with No Change                  │        18 │
│ Queries with Failure                    │         0 │
└─────────────────────────────────────────┴───────────┘

It seems to be quite a bit faster even without filter pushdown 🚀

alamb · 2025-11-22T10:52:46Z

It seems to be quite a bit faster even without filter pushdown 🚀

It is like someone has been optimizing low level filter kernels 😆 (but seriously I think major credit is due to you and @rluvaton )

rluvaton · 2025-11-22T15:48:05Z

Thank you, I have some more in my sleeve.

…7.1.0

rluvaton · 2025-11-26T09:12:15Z

datafusion/sqllogictest/test_files/array.slt

 [1, 2, 3, 4, 5] [h, e, l, l, o]

-# TODO: Enable once arrow_cast supports ListView types.
+# TODO: Enable once array_slice supports LargeListView types.


It seems like you copy pasted from below, but this test ListView and not LargeListView

Suggested change

# TODO: Enable once array_slice supports LargeListView types.

# TODO: Enable once array_slice supports ListView types.

rluvaton · 2025-11-26T09:14:08Z

datafusion/sqllogictest/test_files/array.slt

 # ----
 # [1, 2, 3, 4, 5] [h, e, l, l, o]
-query error DataFusion error: Execution error: Unsupported type 'ListView\(Int64\)'. Must be a supported arrow type name such as 'Int32' or 'Timestamp\(ns\)'. Error unknown token: ListView
+query error Failed to coerce arguments to satisfy a call to 'array_slice' function:


Might not be related to this PR but now I don't know what arguments are the problematic ones with this change

before I could see it's ListView but now I have no idea what arguments is invalid and what their type that is not supported

rluvaton · 2025-11-26T09:24:42Z

datafusion/datasource-parquet/src/opener.rs

+    /// Should we force the reader to use RowSelections for filtering
+    pub force_filter_selections: bool,


What do you think of making it an enum instead to allow for future additions without breaking changes?

(that enum should also be non exhaustive to avoid adding a variant a breaking change)

I also see that the with_row_selection_policy already accept enum.

making it an enum also allow to force mask or configure the threshold in the auto policy. this is also useful for testing to force specific path when creating a reproduction test for a bug

This is a good idea -- I was secretly hoping no one would use this flag and added it as an "escape" valve to go back to the arrow 57.0.0 reader behavior

rluvaton

left some comments, other than that LGTM

what do you think of splitting to 2 PRs
one for the actual upgrade and one for the new flag in parquet reader

not because the PR are large but because they are not required to be in the same PR for the upgrade to be made.

If you decide not to, please update the title and the description so the commit message will include that change making it easier to find later

alamb · 2025-11-26T14:29:24Z

what do you think of splitting to 2 PRs one for the actual upgrade and one for the new flag in parquet reader

not because the PR are large but because they are not required to be in the same PR for the upgrade to be made.

I will do this.

In my mind the config flag is required in order to allow people to opt out of the new behavior

rluvaton · 2025-11-26T15:04:23Z

In my mind the config flag is required in order to allow people to opt out of the new behavior

so if the behavior changed, we want by default to opt out of it for this release, no? or only for this pr,

rluvaton · 2025-11-26T15:05:21Z

if you split the PR it would also be easier for others to create PR with the new arrow version while we discuss this

alamb · 2025-11-26T18:56:18Z

In my mind the config flag is required in order to allow people to opt out of the new behavior

so if the behavior changed, we want by default to opt out of it for this release, no? or only for this pr,

I am not sure

The default behavior of the parquet reader has changed in arrow-rs (in theory it will always be better).

The only usecase I have is adding an "escape valve" so that if someone hit an issue with the new code, there was a way to turn if off without requiring a fork

I don't (yet) have any reason to believe the new behavior isn't always better nor any usecase for tuning the row selector policy from DataFusion

rluvaton

LGTM, I'm eager to create PR that using the new version so approving, if you decide to merge without the enum change

alamb marked this pull request as draft November 19, 2025 14:50

alamb commented Nov 19, 2025

View reviewed changes

tobixdev mentioned this pull request Nov 20, 2025

[DRAFT] Extension Type Registry Draft #18552

Draft

alamb mentioned this pull request Nov 20, 2025

Release DataFusion 51.1.0 (Minor) #18843

Open

13 tasks

github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 20, 2025

alamb commented Nov 20, 2025

View reviewed changes

This was referenced Nov 20, 2025

Parquet metadata size accounting overcounts Arc<T> apache/arrow-rs#8897

Closed

Fix Parquet metadata heap size accounting apache/arrow-rs#8898

Closed

alamb force-pushed the alamb/upgrade_arrow_57.1.0 branch from fafb102 to 191db07 Compare November 21, 2025 13:43

github-actions bot added optimizer Optimizer rules core Core DataFusion crate common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels Nov 21, 2025

tobixdev mentioned this pull request Nov 21, 2025

ScalarValue::to_array_of_size Does Not Correctly Allocate a Values Buffer for ScalarValue::FixedSizeBinary #18870

Open

alamb force-pushed the alamb/upgrade_arrow_57.1.0 branch from 191db07 to 5a91551 Compare November 21, 2025 14:33

github-actions bot added the documentation Improvements or additions to documentation label Nov 21, 2025

alamb force-pushed the alamb/upgrade_arrow_57.1.0 branch from 5a91551 to 9eab2c3 Compare November 21, 2025 14:36

alamb mentioned this pull request Nov 21, 2025

Potential Improved multiple column aggregation performance by using bitmasks rather than Vec<bool> #18676

Open

alamb commented Nov 21, 2025

View reviewed changes

This was referenced Nov 21, 2025

TEST: enable pushdown_filters and reorder_filters by default #18873

Draft

TEST: enable pushdown_filters by default #18874

Draft

alamb force-pushed the alamb/upgrade_arrow_57.1.0 branch from 9eab2c3 to a81c0d3 Compare November 24, 2025 22:50

Update to arrow/parquet 57.1.0

eda1b53

alamb force-pushed the alamb/upgrade_arrow_57.1.0 branch from a81c0d3 to eda1b53 Compare November 24, 2025 22:52

alamb marked this pull request as ready for review November 24, 2025 22:52

alamb changed the title ~~[WIP] Update to arrow, parquet 57.1.0~~ Update to arrow, parquet 57.1.0 Nov 24, 2025

alamb changed the title ~~Update to arrow, parquet 57.1.0~~ Update to arrow, parquet to 57.1.0 Nov 24, 2025

alamb added 2 commits November 25, 2025 15:03

Merge remote-tracking branch 'apache/main' into alamb/upgrade_arrow_5…

b04e50e

…7.1.0

Update versions

7c6d89b

rluvaton reviewed Nov 26, 2025

View reviewed changes

rluvaton approved these changes Nov 27, 2025

View reviewed changes

	# TODO: Enable once array_slice supports LargeListView types.
	# TODO: Enable once array_slice supports ListView types.

		/// Should we force the reader to use RowSelections for filtering
		pub force_filter_selections: bool,

Update to arrow, parquet to 57.1.0 #18820

Are you sure you want to change the base?

Update to arrow, parquet to 57.1.0 #18820

Uh oh!

Conversation

alamb commented Nov 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 20, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Dandandan commented Nov 22, 2025

Uh oh!

alamb commented Nov 22, 2025

Uh oh!

rluvaton commented Nov 22, 2025

Uh oh!

rluvaton Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rluvaton Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rluvaton left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Nov 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rluvaton commented Nov 26, 2025

Uh oh!

rluvaton commented Nov 26, 2025

Uh oh!

alamb commented Nov 26, 2025

Uh oh!

rluvaton left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

Update to `arrow`, `parquet` to `57.1.0` #18820

Update to `arrow`, `parquet` to `57.1.0` #18820

alamb commented Nov 19, 2025 •

edited

Loading

rluvaton Nov 26, 2025 •

edited

Loading

rluvaton Nov 26, 2025 •

edited

Loading

rluvaton left a comment •

edited

Loading

alamb commented Nov 26, 2025 •

edited

Loading