March 17, 2025: This week(s) in DataFusion #15269

alamb · 2025-03-17T12:21:07Z

Introduction

A weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community.

Side note: I am depressed with the number of great PRs that are open, but waiting on someone to help push them along. I spent some time trying to summarize them / listing them below in hopes of getting others excited. I purposly listed them with the ones that need more help at the top (and ones I am helping at the bottom)

Ongoing Projects

There are several substantial projects in various states. It would be great to get some more community eyes on these PRs -- both to help review, as well as to help figure out which to prioritize

Google Summer Of Code (@oznur-synnada)

We are hosting a Google Summer of Code project which has brought many new people to the community

More info Apache DataFusion Google Summer of Code (GSoC) 2025 Application Guidelines #14577

`async` user defined functions (@goldmedal )

Imagine calling llm functions or network from functions

select ... from my_table where ask_gpt(city, 'this city is in asia');

-- fetch data from a remote URL
select wget(url) from (select distinct url from log);

More info Introduce Async User Defined Functions #14837

Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )

more infor: Deprecate and eventually remove ScalarUDF::invoke_batch #14652
chore: remove deprecated variants of UDF's invoke (invoke, invoke_no_args, invoke_batch) #15123 / fix: mark ScalarUDFImpl::invoke_batch as deprecated #15049
@Omega359 adding config information to the functions: feat: Add ConfigOptions to ScalarFunctionArgs #13527

🔥 Spark Functions (@andygrove , @shehabgamin )

A bunch of DataFusion users (Sail, @Omega359 , Comet, etc) want to have spark compatibile functions. We are working on getting the basics in place so we can collaborate / maintain such a library togeter.

More info [DISCUSSION] Add separate crate to cover spark builtin functions #5600
Initial PR: feat: Add datafusion-spark crate #15168

Hardening sorting larger-than-memory datasets (@2010YOUY01 @Kontinuation @zhuqi-lucas )

It seems like more and more people are (re) sorting large datasets (seems common for reorganization).

porting tests to use insta (@blaginin)

Imagine: update expected tests as easily as sqllogictests (just run cargo insta review) ❤
@blaginin setup the basic infrastructure, has filed a bunch of tickets, and rallied the community which is now hard at work cranking out the code

More info: [Epic] Add snapshot tests (migrate to insta for tests) #15178

Expression pushdown @adriangb

Some file formats / systems can efficiently push down expression evaluation to the table format (e.g. Vortex, or json). DataFusion doesn't know how to do this yet, but it will!

Metadata columns (@chenkovsky )

Imagine adding synthetic columns to your data source (like row number)

more info feat: metadata columns #14057

Better integration with distributed tracing (@geoffreyclaude)

When using DataFusion in a distributed environment passing through context down to the IO is important for performance analysis. @geoffreyclaude has a PR up to help thread this down

New IO interface (@Xuanwo)

@Xuanwo is thinking of an API for IO in datafusion that is not tied to object_store.

More info discuss: Introduce datafusion-storage as datafusion's own storage interface #14854

Better Error Messages (@eliaperantoni )

Imagine error messages that showed you where in the query the problem was 🤯

More info [EPIC] Attach Diagnostic to more errors #14429

Changing default mapping `VARCHAR` --> `Utf8View` (rather than `Utf8`)

Imagine CREATE TABLE foo(x varchar) will use Utf8View for x.

more info: Change mapping of SQL VARCHAR from Utf8 to Utf8View #15096 (comment)

Predicate pushdown by default (@XiangpengHao)

Long standing feature in parquet reader. This gets 10-20% performance improvement for some queries

Beautiful expalin plans (@irenjj)

Imagine: duckdb style explain plans:

> create table foo(x int) as values (1);
0 row(s) fetched.
Elapsed 0.013 seconds.

> explain format tree select x from foo where x > 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │      predicate: x > 5     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.

More info: [EPIC] Complete SQL EXPLAIN Tree Rendering #14914

TPCH data generator (@clflushopt )

Imagine (with the correct column names):

-- Generate TPCH directly from datafusion-cli
select * from tpch_table('lineitem', 1)
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11  | column_12  | column_13  | column_14         | column_15 | column_16                                  | column_17 |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| 5007686  | 192006   | 4526     | 4        | 22       | 24156.0  | 0.09     | 0.01     | A        | F         | 1994-02-19 | 1994-03-23 | 1994-03-04 | NONE              | MAIL      | quickly permanent excuses according to the | NULL      |
| 5007686  | 32944    | 7951     | 5        | 20       | 37538.8  | 0.04     | 0.02     | A        | F         | 1994-03-04 | 1994-03-27 | 1994-03-14 | NONE
...

Also, I am going to scratch an itch I have had for 10+ years and generate tpch data with ALL THE CORES really fast so I don't have to wait around anymore. FYI @lmwnshn

More info:

Looking to get more involved? Please help review code! 🎣

DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.

We have docs about reviews. TLDR is: look for test coverage, if the change is understandable and well documented, and if the code can be improved. When you think the PR looks good to merge, try @ mentioning one of the committers.

Help wanted

I would love to see the community offer additional help performance testing, triaging bugs helping to make DataFusion a more stable foundation for building systems

Please feel leave your own comments on this ticket if you are looking for help

Community

Weekly Call
Slack/Discord: info links

Upcoming meetups:

Help schedule some!

The text was updated successfully, but these errors were encountered:

alamb · 2025-03-17T14:37:46Z

Also, huge thanks to @xudong963 for running the release process

🙏

alamb · 2025-03-17T18:45:22Z

Oh, and of course @timsaucer is cranking out FFI bindings like

alamb · 2025-03-18T08:31:00Z

@milenkovicm has become a committer : https://lists.apache.org/thread/zzqdq8rfbwyqr9zloqt4y89ntml6pq45 🎉

alamb · 2025-03-20T16:50:16Z

New blog post by @XiangpengHao on parquet predicate evaluation:

https://datafusion.apache.org/blog/2025/03/20/parquet-pruning/

alamb · 2025-03-24T16:20:39Z

Another blog post by @XiangpengHao about how to build S3 select in 400 lines of Rust (and FDAP)
https://blog.xiangpeng.systems/posts/build-s3-select/

alamb mentioned this issue Mar 17, 2025

March 4, 2025: This week(s) in DataFusion #15005

Closed

alamb pinned this issue Mar 17, 2025

alamb mentioned this issue Mar 17, 2025

Weekly Plan (Andrew Lamb) March 17, 2025 #15274

Closed

10 tasks

Xuanwo mentioned this issue Mar 18, 2025

Generate weekly summary by using LLM apache/opendal#5802

Closed

alamb mentioned this issue Mar 24, 2025

Weekly Plan (Andrew Lamb) March 24, 2025 #15393

Open

9 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

March 17, 2025: This week(s) in DataFusion #15269

March 17, 2025: This week(s) in DataFusion #15269

alamb commented Mar 17, 2025 •

edited

Loading

alamb commented Mar 17, 2025

alamb commented Mar 17, 2025 •

edited

Loading

alamb commented Mar 18, 2025

alamb commented Mar 20, 2025

alamb commented Mar 24, 2025

March 17, 2025: This week(s) in DataFusion #15269

March 17, 2025: This week(s) in DataFusion #15269

Comments

alamb commented Mar 17, 2025 • edited Loading

Introduction

Ongoing Projects

Google Summer Of Code (@oznur-synnada)

async user defined functions (@goldmedal )

Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )

🔥 Spark Functions (@andygrove , @shehabgamin )

Hardening sorting larger-than-memory datasets (@2010YOUY01 @Kontinuation @zhuqi-lucas )

porting tests to use insta (@blaginin)

Expression pushdown @adriangb

Metadata columns (@chenkovsky )

Better integration with distributed tracing (@geoffreyclaude)

New IO interface (@Xuanwo)

Better Error Messages (@eliaperantoni )

Changing default mapping VARCHAR --> Utf8View (rather than Utf8)

Predicate pushdown by default (@XiangpengHao)

Beautiful expalin plans (@irenjj)

TPCH data generator (@clflushopt )

Looking to get more involved? Please help review code! 🎣

Help wanted

Community

Upcoming meetups:

alamb commented Mar 17, 2025

alamb commented Mar 17, 2025 • edited Loading

alamb commented Mar 18, 2025

alamb commented Mar 20, 2025

alamb commented Mar 24, 2025

alamb commented Mar 17, 2025 •

edited

Loading

`async` user defined functions (@goldmedal )

Changing default mapping `VARCHAR` --> `Utf8View` (rather than `Utf8`)

alamb commented Mar 17, 2025 •

edited

Loading