Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

March 17, 2025: This week(s) in DataFusion #15269

Open
alamb opened this issue Mar 17, 2025 · 5 comments
Open

March 17, 2025: This week(s) in DataFusion #15269

alamb opened this issue Mar 17, 2025 · 5 comments

Comments

@alamb
Copy link
Contributor

alamb commented Mar 17, 2025

Introduction

A weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community.

Side note: I am depressed with the number of great PRs that are open, but waiting on someone to help push them along. I spent some time trying to summarize them / listing them below in hopes of getting others excited. I purposly listed them with the ones that need more help at the top (and ones I am helping at the bottom)

Ongoing Projects

There are several substantial projects in various states. It would be great to get some more community eyes on these PRs -- both to help review, as well as to help figure out which to prioritize

Google Summer Of Code (@oznur-synnada)

We are hosting a Google Summer of Code project which has brought many new people to the community

async user defined functions (@goldmedal )

Imagine calling llm functions or network from functions

select ... from my_table where ask_gpt(city, 'this city is in asia');
-- fetch data from a remote URL
select wget(url) from (select distinct url from log);

Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )

🔥 Spark Functions (@andygrove , @shehabgamin )

A bunch of DataFusion users (Sail, @Omega359 , Comet, etc) want to have spark compatibile functions. We are working on getting the basics in place so we can collaborate / maintain such a library togeter.

Hardening sorting larger-than-memory datasets (@2010YOUY01 @Kontinuation @zhuqi-lucas )

It seems like more and more people are (re) sorting large datasets (seems common for reorganization).

porting tests to use insta (@blaginin)

Imagine: update expected tests as easily as sqllogictests (just run cargo insta review) ❤
@blaginin setup the basic infrastructure, has filed a bunch of tickets, and rallied the community which is now hard at work cranking out the code

Expression pushdown @adriangb

Some file formats / systems can efficiently push down expression evaluation to the table format (e.g. Vortex, or json). DataFusion doesn't know how to do this yet, but it will!

Metadata columns (@chenkovsky )

Imagine adding synthetic columns to your data source (like row number)

Better integration with distributed tracing (@geoffreyclaude)

When using DataFusion in a distributed environment passing through context down to the IO is important for performance analysis. @geoffreyclaude has a PR up to help thread this down

New IO interface (@Xuanwo)

@Xuanwo is thinking of an API for IO in datafusion that is not tied to object_store.

Better Error Messages (@eliaperantoni )

Imagine error messages that showed you where in the query the problem was 🤯

Changing default mapping VARCHAR --> Utf8View (rather than Utf8)

Imagine CREATE TABLE foo(x varchar) will use Utf8View for x.

Predicate pushdown by default (@XiangpengHao)

Long standing feature in parquet reader. This gets 10-20% performance improvement for some queries

Beautiful expalin plans (@irenjj)

Imagine: duckdb style explain plans:

> create table foo(x int) as values (1);
0 row(s) fetched.
Elapsed 0.013 seconds.

> explain format tree select x from foo where x > 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │    CoalesceBatchesExec    │ |
|               | │    --------------------   │ |
|               | │     target_batch_size:    │ |
|               | │            8192           │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │         FilterExec        │ |
|               | │    --------------------   │ |
|               | │      predicate: x > 5     │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         bytes: 112        │ |
|               | │       format: memory      │ |
|               | │          rows: 1          │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.004 seconds.

TPCH data generator (@clflushopt )

Imagine (with the correct column names):

-- Generate TPCH directly from datafusion-cli
select * from tpch_table('lineitem', 1)
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| column_1 | column_2 | column_3 | column_4 | column_5 | column_6 | column_7 | column_8 | column_9 | column_10 | column_11  | column_12  | column_13  | column_14         | column_15 | column_16                                  | column_17 |
+----------+----------+----------+----------+----------+----------+----------+----------+----------+-----------+------------+------------+------------+-------------------+-----------+--------------------------------------------+-----------+
| 5007686  | 192006   | 4526     | 4        | 22       | 24156.0  | 0.09     | 0.01     | A        | F         | 1994-02-19 | 1994-03-23 | 1994-03-04 | NONE              | MAIL      | quickly permanent excuses according to the | NULL      |
| 5007686  | 32944    | 7951     | 5        | 20       | 37538.8  | 0.04     | 0.02     | A        | F         | 1994-03-04 | 1994-03-27 | 1994-03-14 | NONE
...

Also, I am going to scratch an itch I have had for 10+ years and generate tpch data with ALL THE CORES really fast so I don't have to wait around anymore. FYI @lmwnshn

More info:

Looking to get more involved? Please help review code! 🎣

DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.

We have docs about reviews. TLDR is: look for test coverage, if the change is understandable and well documented, and if the code can be improved. When you think the PR looks good to merge, try @ mentioning one of the committers.

Help wanted

  • I would love to see the community offer additional help performance testing, triaging bugs helping to make DataFusion a more stable foundation for building systems

Please feel leave your own comments on this ticket if you are looking for help

Community

Upcoming meetups:

  • Help schedule some!
@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2025

Also, huge thanks to @xudong963 for running the release process

🙏

@alamb
Copy link
Contributor Author

alamb commented Mar 17, 2025

Oh, and of course @timsaucer is cranking out FFI bindings like

@alamb
Copy link
Contributor Author

alamb commented Mar 18, 2025

@milenkovicm has become a committer : https://lists.apache.org/thread/zzqdq8rfbwyqr9zloqt4y89ntml6pq45 🎉

@alamb
Copy link
Contributor Author

alamb commented Mar 20, 2025

New blog post by @XiangpengHao on parquet predicate evaluation:

@alamb
Copy link
Contributor Author

alamb commented Mar 24, 2025

Another blog post by @XiangpengHao about how to build S3 select in 400 lines of Rust (and FDAP)
https://blog.xiangpeng.systems/posts/build-s3-select/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant