[branch-46] feat: introduce JoinSetTracer trait for tracing context propagation in spawned tasks #9

Merged
merged 3 commits into from
Mar 20, 2025

geoffreyclaude

Which issue does this PR close?

Relates to apache#9415, but does not fully close it. It lays groundwork for optional instrumentation of async tasks in DataFusion.
Approved upstream PR: apache#14547

Rationale for this change

This PR introduces a general mechanism enabling DataFusion to propagate user-defined context (such as tracing spans, logging, or metrics) across thread boundaries without depending on any specific instrumentation library.

Previously, tasks spawned on new threads (for example, those performing repartitioning or Parquet file reads) lost thread-local context, which made instrumentation difficult for users. This PR closes that gap by letting users inject custom instrumentation through the new JoinSetTracer trait. Context is then carried into spawned tasks, and DataFusion stays lightweight because it takes on no direct instrumentation dependencies.

What changes are included in this PR?

  • New JoinSetTracer trait: Defines how to instrument futures or blocking closures when tasks are spawned on threads.
  • Global tracer registration: Adds a set_join_set_tracer function for registering a custom tracer at startup. If no tracer is set, a no-op implementation is used by default.
  • Refactored JoinSet: Introduces a wrapper around Tokio's JoinSet that leverages the registered tracer (if available) to instrument spawned tasks transparently.
  • Integration Example: Provides an illustrative example in datafusion-examples/examples/tracing.rs, demonstrating how users can integrate their tracing implementations. This example does not impose any direct tracing dependency on DataFusion users.
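
To make the pieces above concrete, here is a minimal, std-only sketch of the same pattern: a tracer trait, a set-once global registration with a no-op fallback, and a spawn wrapper that consults the tracer. It is a simplified model, not the code in this PR. The names `BlockTracer`, `set_tracer`, `spawn_traced`, and `TaggingTracer` are hypothetical, only blocking closures are covered (no futures), and plain threads stand in for Tokio's JoinSet.

```rust
use std::sync::OnceLock;
use std::thread;

// Simplified, hypothetical model: a job is a blocking closure returning a String.
type Job = Box<dyn FnOnce() -> String + Send + 'static>;

/// Stand-in for the tracer trait: given a job, return an instrumented job.
pub trait BlockTracer: Send + Sync + 'static {
    fn trace_block(&self, job: Job) -> Job;
}

/// Default used when nothing is registered: returns the job unchanged.
struct NoopTracer;

impl BlockTracer for NoopTracer {
    fn trace_block(&self, job: Job) -> Job {
        job
    }
}

static NOOP: NoopTracer = NoopTracer;
static GLOBAL_TRACER: OnceLock<&'static dyn BlockTracer> = OnceLock::new();

/// Registers a tracer once; later calls fail and leave the first one in place.
pub fn set_tracer(tracer: &'static dyn BlockTracer) -> Result<(), &'static str> {
    GLOBAL_TRACER
        .set(tracer)
        .map_err(|_| "a tracer is already registered")
}

fn current_tracer() -> &'static dyn BlockTracer {
    match GLOBAL_TRACER.get() {
        Some(tracer) => *tracer,
        None => &NOOP,
    }
}

/// Spawn wrapper: every job passes through the current tracer before it runs.
pub fn spawn_traced(job: Job) -> thread::JoinHandle<String> {
    thread::spawn(current_tracer().trace_block(job))
}

/// Illustrative tracer that tags each job's output.
struct TaggingTracer;

impl BlockTracer for TaggingTracer {
    fn trace_block(&self, job: Job) -> Job {
        Box::new(move || format!("[traced] {}", job()))
    }
}

static TAGGING: TaggingTracer = TaggingTracer;

fn main() {
    // No tracer registered yet: the no-op path runs the job untouched.
    let plain = spawn_traced(Box::new(|| "scan".to_string())).join().unwrap();
    println!("{plain}");

    set_tracer(&TAGGING).expect("first registration succeeds");
    let traced = spawn_traced(Box::new(|| "scan".to_string())).join().unwrap();
    println!("{traced}");
}
```

In this sketch, call sites only ever use `spawn_traced`, so they do not need to know whether a tracer was registered; that mirrors the "transparent" instrumentation described for the refactored JoinSet.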

Are these changes tested?

Yes, by example. There are no dedicated unit tests for the tracer injection, but the example in datafusion-examples/examples/tracing.rs shows a working end-to-end setup using the tracing crate. Running that example confirms that, when a tracer is registered, tasks spawned on multiple threads inherit whichever span is active at the moment they are spawned.

Are there any user-facing changes?

  • Users who do not register a tracer see no differences (and incur no overhead). Everything works as before.
  • Users who do want instrumentation can implement JoinSetTracer and call set_join_set_tracer(...). This approach is fully optional.
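
As an illustration of the second case, the sketch below shows what a user-side tracer might do: capture thread-local context on the spawning thread and restore it on the worker thread. It is a std-only approximation under stated assumptions, not DataFusion's actual API. `BlockTracer`, `ContextTracer`, `spawn_job`, and the `CONTEXT` thread-local are hypothetical stand-ins for JoinSetTracer, a user's implementation, the JoinSet wrapper, and a tracing span.

```rust
use std::cell::RefCell;
use std::sync::OnceLock;
use std::thread;

// Hypothetical, simplified job type: a blocking closure returning a String.
type Job = Box<dyn FnOnce() -> String + Send + 'static>;

// Hypothetical stand-in for the tracer trait (blocking closures only).
trait BlockTracer: Send + Sync + 'static {
    fn trace_block(&self, job: Job) -> Job;
}

// Set-once global slot, standing in for the registration function.
static TRACER: OnceLock<&'static dyn BlockTracer> = OnceLock::new();

thread_local! {
    // Stand-in for a tracing span or request id held in thread-local state.
    static CONTEXT: RefCell<Option<String>> = RefCell::new(None);
}

fn current_context() -> Option<String> {
    CONTEXT.with(|ctx| ctx.borrow().clone())
}

struct ContextTracer;

impl BlockTracer for ContextTracer {
    fn trace_block(&self, job: Job) -> Job {
        // Runs on the spawning thread: capture whatever context is active now.
        let captured = current_context();
        Box::new(move || {
            // Runs on the worker thread: restore the context, then run the job.
            CONTEXT.with(|ctx| *ctx.borrow_mut() = captured);
            job()
        })
    }
}

static CONTEXT_TRACER: ContextTracer = ContextTracer;

// Spawns a job, wrapping it first if a tracer has been registered.
fn spawn_job(job: Job) -> thread::JoinHandle<String> {
    let job = match TRACER.get() {
        Some(tracer) => tracer.trace_block(job),
        None => job,
    };
    thread::spawn(job)
}

// The "work": report which context the current thread sees.
fn report() -> String {
    current_context().unwrap_or_else(|| "<no context>".to_string())
}

fn main() {
    CONTEXT.with(|ctx| *ctx.borrow_mut() = Some("query-42".to_string()));

    // Without a tracer, the worker thread starts with empty thread-local state.
    let before = spawn_job(Box::new(report)).join().unwrap();
    println!("without tracer: {before}");

    // After registration, the context active at spawn time reaches the worker.
    let _ = TRACER.set(&CONTEXT_TRACER);
    let after = spawn_job(Box::new(report)).join().unwrap();
    println!("with tracer: {after}");
}
```

The capture happens inside `trace_block`, which runs at spawn time, so the worker sees the context that was active when the task was spawned rather than whatever is active later.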

The upshot is that DataFusion now provides a pluggable way to connect with tracing or other instrumentation without pulling those dependencies into DataFusion by default.

@geoffreyclaude geoffreyclaude merged commit a96503b into branch-46 Mar 20, 2025