You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
[SPARK-50559][SQL] Store Except, Intersect and Union's outputs as lazy vals
### What changes were proposed in this pull request?
Store `Except`, `Intersect` and `Union`'s outputs as lazy vals.
### Why are the changes needed?
Currently `Union`'s (same is for `Except` and `Intersect`) `output` is a `def`. This creates performance issues for queries with large number of stacked `UNION`s because of rules like `WidenSetOperationTypes` that traverse the logical plan and call `output` on each `Union` node. This has quadratic complexity: O(number_of_unions * (1 + 2 + 3 + ... + number_of_unions)).
Profile:


[flamegraph.tar.gz](https://github.com/user-attachments/files/18118260/flamegraph.tar.gz)
The improvement in parsing + analysis wall-clock time for a query with 500 UNIONs over 30 columns each is 13x (5.5s -> 400ms):

Repro:
```
def genValues(num: Int) = s"VALUES (${(0 until num).mkString(", ")})"
def genUnions(numUnions: Int, numValues: Int) = (0 until numUnions).map(_ => genValues(numValues)).mkString(" UNION ALL ")
spark.time { spark.sql(s"SELECT * FROM ${genUnions(numUnions = 500, numValues = 30)}").queryExecution.analyzed }
```
For `EXCEPT` the perf difference is not that noticeable. Perhaps because it reuses the same `Seq` (it just calls `left.output`).
### Does this PR introduce _any_ user-facing change?
No, this is an optimization.
### How was this patch tested?
- Ran the async-profiler
- Ran the benchmark in spark-shell.
- Existing tests.
### Was this patch authored or co-authored using generative AI tooling?
copilot.nvim.
Closesapache#49166 from vladimirg-db/vladimirg-db/store-union-output-as-lazy-val.
Authored-by: Vladimir Golubev <[email protected]>
Signed-off-by: Max Gekk <[email protected]>
0 commit comments