
[DO-NOT-MERGE] Pythonic approach of setting Spark SQL configurations #49297

Draft · wants to merge 2 commits into base: master
Conversation

@HyukjinKwon (Member) commented Dec 26, 2024

What changes were proposed in this pull request?

This PR proposes a Pythonic approach to setting Spark SQL configurations, as shown below.

Get/set/unset the configurations

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled = "true"
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
'true'
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> del spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'true'
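
For illustration, here is a minimal sketch of how dict-style and attribute-style access could be layered on top of the existing spark.conf.get/set/unset methods. This is not the implementation in this PR; ConfigView is a hypothetical name.

# Illustrative sketch only -- not the implementation in this PR. "ConfigView" is a
# hypothetical wrapper that maps dict-style and attribute-style access onto the
# existing RuntimeConfig methods (spark.conf.get/set/unset).
class ConfigView:
    def __init__(self, conf, prefix=""):
        self._conf = conf        # the underlying spark.conf (RuntimeConfig)
        self._prefix = prefix    # e.g. "spark.sql.optimizer"

    def _full(self, key):
        return f"{self._prefix}.{key}" if self._prefix else key

    def __getitem__(self, key):
        return self._conf.get(self._full(key))

    def __setitem__(self, key, value):
        self._conf.set(self._full(key), value)

    def __delitem__(self, key):
        self._conf.unset(self._full(key))  # reverts to the default value

    def __getattr__(self, name):
        # Return a nested view so chained access like view.spark.sql.optimizer...
        # keeps extending the key prefix.
        return ConfigView(self._conf, self._full(name))

    def __setattr__(self, name, value):
        if name.startswith("_"):           # internal fields set in __init__
            object.__setattr__(self, name, value)
        else:                              # attribute-style assignment sets the config
            self._conf.set(self._full(name), value)

With this sketch, usage would look like ConfigView(spark.conf)["spark.sql.shuffle.partitions"]; the PR instead exposes this behavior directly on spark.conf.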

List sub-configurations

>>> dir(spark.conf["spark.sql.optimizer"])
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
>>> dir(spark.conf.spark.sql.optimizer)
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
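
A rough sketch of how the names listed by dir() could be derived from the full set of known configuration keys (again an assumption, not this PR's code; how the keys are enumerated is left abstract here):

# Sketch only: derive the names listed by dir() for a given prefix.
# `all_keys` is assumed to be an iterable of fully qualified configuration names.
def sub_configurations(all_keys, prefix):
    prefix = prefix + "."
    return sorted(key[len(prefix):] for key in all_keys if key.startswith(prefix))

# e.g. sub_configurations(all_keys, "spark.sql.optimizer")
#      -> ['collapseProjectAlwaysInline', 'excludedRules', 'runtime.bloomFilter.enabled', ...]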

Get the documentation of a configuration

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"].desc()
"Enables runtime group filtering for group-based row-level operations. Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten."
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled.version()
'3.4.0'
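
The examples above show a value that both prints like a plain string ('false') and exposes desc()/version(). One way to model that, purely as an assumed design rather than this PR's actual code, is a string subclass that carries the entry's metadata:

# Assumed design sketch, not necessarily what this PR does: a string subclass
# that carries the config entry's metadata, so the value prints as 'false' while
# still offering desc() and version().
class ConfigValue(str):
    def __new__(cls, value, doc="", since=""):
        obj = super().__new__(cls, value)
        obj._doc = doc       # documentation text of the entry
        obj._since = since   # Spark version that introduced the entry
        return obj

    def desc(self):
        return self._doc

    def version(self):
        return self._since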

Why are the changes needed?

To provide a Pythonic way of setting options. pandas supports a similar interface, which serves as a reference (https://pandas.pydata.org/docs/user_guide/options.html).

This should be especially useful for interactive shell users: they can browse configurations and read their documentation directly from the shell, without opening the SQL configuration documentation separately.
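
For comparison, pandas already offers both function-style and attribute-style access to its options (all of the calls below are existing pandas APIs):

import pandas as pd

pd.set_option("display.max_rows", 100)      # function-style set
pd.get_option("display.max_rows")           # -> 100
pd.options.display.max_rows = 100           # attribute-style set
pd.describe_option("display.max_rows")      # prints the option's documentation
pd.reset_option("display.max_rows")         # revert to the default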

Does this PR introduce any user-facing change?

Yes, it provides users with a more Pythonic way of setting SQL configurations, as demonstrated above.

How was this patch tested?

TBD

Was this patch authored or co-authored using generative AI tooling?

No.
