March 17, 2025: This week(s) in DataFusion #15269
Comments
Also, huge thanks to @xudong963 for running the release process 🙏
Oh, and of course @timsaucer is cranking out FFI bindings like
@milenkovicm has become a committer: https://lists.apache.org/thread/zzqdq8rfbwyqr9zloqt4y89ntml6pq45 🎉
New blog post by @XiangpengHao on parquet predicate evaluation:
Another blog post by @XiangpengHao about how to build S3 select in 400 lines of Rust (and FDAP) |
Introduction
A weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community.
Side note: I am depressed with the number of great PRs that are open, but waiting on someone to help push them along. I spent some time trying to summarize them / listing them below in hopes of getting others excited. I purposely listed them with the ones that need more help at the top (and the ones I am helping with at the bottom).
Ongoing Projects
There are several substantial projects in various states. It would be great to get some more community eyes on these PRs -- both to help review, as well as to help figure out which to prioritize
Google Summer Of Code (@oznur-synnada)
We are hosting a Google Summer of Code project which has brought many new people to the community
async user defined functions (@goldmedal)
Imagine calling LLM functions or the network from functions
Better user defined function interface (@Blizzara @shehabgamin @jayzhan211 )
ScalarUDF::invoke_batch #14652
🔥 Spark Functions (@andygrove, @shehabgamin)
A bunch of DataFusion users (Sail, @Omega359, Comet, etc.) want to have Spark-compatible functions. We are working on getting the basics in place so we can collaborate on and maintain such a library together.
datafusion-spark crate #15168
Hardening sorting larger-than-memory datasets (@2010YOUY01 @Kontinuation @zhuqi-lucas)
It seems like more and more people are (re)sorting large datasets (this seems common for reorganization).
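For readers new to the topic, here is a minimal sketch of the general idea behind sorting data that does not fit in memory: sort bounded-size runs, then merge them. It is not DataFusion's implementation. The function name `external_sort` and the `run_len` parameter are made up for illustration, and the sorted runs stay in memory here where a real engine would spill them to disk.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Sort `input` while only ever sorting `run_len` values at a time.
/// Each sorted run stands in for a spill file; in this sketch the runs
/// stay in memory to keep the example self-contained.
fn external_sort(input: &[i64], run_len: usize) -> Vec<i64> {
    assert!(run_len > 0, "run_len must be positive");

    // Phase 1: split the input into bounded-size runs and sort each one.
    let runs: Vec<Vec<i64>> = input
        .chunks(run_len)
        .map(|chunk| {
            let mut run = chunk.to_vec();
            run.sort_unstable();
            run
        })
        .collect();

    // Phase 2: k-way merge. The heap holds one (value, run index, offset)
    // entry per run; `Reverse` turns the max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (run_idx, run) in runs.iter().enumerate() {
        if let Some(&first) = run.first() {
            heap.push(Reverse((first, run_idx, 0usize)));
        }
    }

    let mut out = Vec::with_capacity(input.len());
    while let Some(Reverse((value, run_idx, offset))) = heap.pop() {
        out.push(value);
        // Refill from the run that just produced the smallest value.
        if let Some(&next) = runs[run_idx].get(offset + 1) {
            heap.push(Reverse((next, run_idx, offset + 1)));
        }
    }
    out
}
```

Calling `external_sort(&data, n)` returns the values in ascending order. The merge phase only needs one element per run in the heap at a time, which is why spilled runs can be streamed back from disk instead of being loaded whole.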
max_temp_directory_size to limit max disk usage for spilling queries #14975
Porting tests to use insta (@blaginin)
Imagine: update expected tests as easily as sqllogictests (just run cargo insta review) ❤
@blaginin set up the basic infrastructure, has filed a bunch of tickets, and rallied the community, which is now hard at work cranking out the code
insta for tests) #15178
Expression pushdown (@adriangb)
Some file formats / systems can efficiently push down expression evaluation to the table format (e.g. Vortex, or json). DataFusion doesn't know how to do this yet, but it will!
TableProviders #14993
Metadata columns (@chenkovsky)
Imagine adding synthetic columns to your data source (like row number)
Better integration with distributed tracing (@geoffreyclaude)
When using DataFusion in a distributed environment, passing context down to the IO layer is important for performance analysis. @geoffreyclaude has a PR up to help thread this down
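To illustrate why context has to be threaded explicitly, here is a small sketch using plain `std` threads. It is not the API from the PR: `TRACE_ID`, `set_trace_id`, `current_trace_id`, and `spawn_traced` are hypothetical names, and DataFusion spawns async tasks where this sketch spawns OS threads. The point is only that thread-local context is not inherited by spawned work unless the spawner captures and reinstalls it.

```rust
use std::cell::RefCell;
use std::thread;

thread_local! {
    // Hypothetical per-thread trace context; real tracing libraries keep
    // richer state than a single id.
    static TRACE_ID: RefCell<Option<String>> = RefCell::new(None);
}

fn set_trace_id(id: Option<String>) {
    TRACE_ID.with(|slot| *slot.borrow_mut() = id);
}

fn current_trace_id() -> Option<String> {
    TRACE_ID.with(|slot| slot.borrow().clone())
}

/// Spawn `f` on a new thread with the caller's trace id installed first.
fn spawn_traced<F, T>(f: F) -> thread::JoinHandle<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    // Capture on the parent thread, before moving into the new thread.
    let parent = current_trace_id();
    thread::spawn(move || {
        set_trace_id(parent);
        f()
    })
}
```

A plain `thread::spawn` sees no trace id in the child, while `spawn_traced` makes the parent's id visible there. A hook at the spawn site is what lets IO deep in the stack be attributed to the originating query.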
JoinSetTracer trait for tracing context propagation in spawned tasks #14547
New IO interface (@Xuanwo)
@Xuanwo is thinking of an API for IO in datafusion that is not tied to object_store.
datafusion-storage as datafusion's own storage interface #14854
Better Error Messages (@eliaperantoni)
Imagine error messages that showed you where in the query the problem was 🤯
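As a rough picture of what such an error could carry, here is a minimal sketch of a message paired with a source location and rendered with a caret underline. The `Span` and `Diagnostic` types below are hypothetical stand-ins written for this example and are not DataFusion's actual types. The sketch assumes a single-line ASCII query so that byte offsets equal columns.

```rust
/// Byte range into the original SQL text (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Span {
    start: usize,
    end: usize,
}

/// An error message plus the location it refers to.
#[derive(Debug)]
struct Diagnostic {
    message: String,
    span: Span,
}

impl Diagnostic {
    /// Render the message, the query, and a caret underline below the span.
    /// Assumes a single-line ASCII query so byte offsets equal columns.
    fn render(&self, sql: &str) -> String {
        let width = self.span.end.saturating_sub(self.span.start).max(1);
        format!(
            "error: {}\n  {}\n  {}{}",
            self.message,
            sql,
            " ".repeat(self.span.start),
            "^".repeat(width)
        )
    }
}

/// Find the first occurrence of `needle` in `sql` and return its span.
fn span_of(sql: &str, needle: &str) -> Option<Span> {
    sql.find(needle).map(|start| Span {
        start,
        end: start + needle.len(),
    })
}
```

Rendering a diagnostic for a misspelled column prints the query followed by a line of carets under the offending identifier. In a real planner the span would come from the SQL parser's token locations, not from a string search.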
Diagnostic to more errors #14429
Changing default mapping VARCHAR --> Utf8View (rather than Utf8)
Imagine CREATE TABLE foo(x varchar) will use Utf8View for x.
VARCHAR from Utf8 to Utf8View #15096 (comment)
Predicate pushdown by default (@XiangpengHao)
Long-standing feature in the parquet reader. Turning it on by default gets a 10-20% performance improvement for some queries
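A toy sketch of where that kind of gain can come from, assuming "predicate pushdown" here means evaluating the filter inside the scan so that other columns are only decoded for rows that pass. This is not the parquet reader's code: `LazyColumn` and `scan_with_pushdown` are made-up names, and the `decoded` counter merely stands in for decode cost.

```rust
/// A column whose values are only "decoded" on demand. `decoded` counts
/// how many values were materialized, standing in for decode cost.
struct LazyColumn {
    encoded: Vec<i64>,
    decoded: usize,
}

impl LazyColumn {
    fn new(encoded: Vec<i64>) -> Self {
        Self { encoded, decoded: 0 }
    }

    /// Decode every row.
    fn decode_all(&mut self) -> Vec<i64> {
        self.decoded += self.encoded.len();
        self.encoded.clone()
    }

    /// Decode only the rows whose indices are in `selection`.
    fn decode_selected(&mut self, selection: &[usize]) -> Vec<i64> {
        self.decoded += selection.len();
        selection.iter().map(|&i| self.encoded[i]).collect()
    }
}

/// Evaluate `predicate` on the filter column first, then decode the
/// payload column only for the rows that passed.
fn scan_with_pushdown(
    filter_col: &mut LazyColumn,
    payload_col: &mut LazyColumn,
    predicate: impl Fn(i64) -> bool,
) -> Vec<i64> {
    let selection: Vec<usize> = filter_col
        .decode_all()
        .into_iter()
        .enumerate()
        .filter(|&(_, v)| predicate(v))
        .map(|(i, _)| i)
        .collect();
    payload_col.decode_selected(&selection)
}
```

With a selective predicate, `payload_col.decoded` ends up much smaller than the row count. With a predicate that matches nearly everything, the extra selection step buys little, which fits the observation that only some queries benefit.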
Beautiful explain plans (@irenjj)
Imagine: DuckDB-style explain plans:
SQL EXPLAIN Tree Rendering #14914
TPCH data generator (@clflushopt)
Imagine (with the correct column names):
Also, I am going to scratch an itch I have had for 10+ years and generate tpch data with ALL THE CORES really fast so I don't have to wait around anymore. FYI @lmwnshn
More info:
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR is: look for test coverage, if the change is understandable and well documented, and if the code can be improved. When you think the PR looks good to merge, try @mentioning one of the committers.
Help wanted
Please feel free to leave your own comments on this ticket if you are looking for help
Community
Upcoming meetups: