
[DO-NOT-MERGE] Pythonic approach of setting Spark SQL configurations #49297

Draft · wants to merge 2 commits into base: master
Conversation

@HyukjinKwon (Member) commented Dec 26, 2024

What changes were proposed in this pull request?

This PR proposes a Pythonic approach to setting Spark SQL configurations, as shown below.

Get/set/unset the configurations

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled = "true"
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled
'true'
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"] = "false"
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'false'
>>> del spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"]
'true'
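
For illustration, here is a minimal sketch of how dict-style and attribute-style access could be layered on top of the existing spark.conf.get/set/unset methods. This is not the implementation in this PR; ConfigView is a hypothetical name.

# Illustrative sketch only -- not the implementation in this PR. "ConfigView" is a
# hypothetical wrapper that maps dict-style and attribute-style access onto the
# existing RuntimeConfig methods (spark.conf.get/set/unset).
class ConfigView:
    def __init__(self, conf, prefix=""):
        self._conf = conf        # the underlying spark.conf (RuntimeConfig)
        self._prefix = prefix    # e.g. "spark.sql.optimizer"

    def _full(self, key):
        return f"{self._prefix}.{key}" if self._prefix else key

    def __getitem__(self, key):
        return self._conf.get(self._full(key))

    def __setitem__(self, key, value):
        self._conf.set(self._full(key), value)

    def __delitem__(self, key):
        self._conf.unset(self._full(key))  # reverts to the default value

    def __getattr__(self, name):
        # Return a nested view so chained access like view.spark.sql.optimizer...
        # keeps extending the key prefix.
        return ConfigView(self._conf, self._full(name))

    def __setattr__(self, name, value):
        if name.startswith("_"):           # internal fields set in __init__
            object.__setattr__(self, name, value)
        else:                              # attribute-style assignment sets the config
            self._conf.set(self._full(name), value)

With this sketch, usage would look like ConfigView(spark.conf)["spark.sql.shuffle.partitions"]; the PR instead exposes this behavior directly on spark.conf.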

List sub-configurations

>>> dir(spark.conf["spark.sql.optimizer"])
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
>>> dir(spark.conf.spark.sql.optimizer)
['avoidCollapseUDFWithExpensiveExpr', 'collapseProjectAlwaysInline', 'dynamicPartitionPruning.enabled', 'enableCsvExpressionOptimization', 'enableJsonExpressionOptimization', 'excludedRules', 'ptimizer.excludedRules', 'runtime.bloomFilter.applicationSideScanSizeThreshold', 'runtime.bloomFilter.creationSideThreshold', 'runtime.bloomFilter.enabled', 'runtime.bloomFilter.expectedNumItems', 'runtime.bloomFilter.maxNumBits', 'runtime.bloomFilter.maxNumItems', 'runtime.bloomFilter.numBits', 'runtime.rowLevelOperationGroupFilter.enabled', 'runtimeFilter.number.threshold']
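
A rough sketch of how the names listed by dir() could be derived from the full set of known configuration keys (again an assumption, not this PR's code; how the keys are enumerated is left abstract here):

# Sketch only: derive the names listed by dir() for a given prefix.
# `all_keys` is assumed to be an iterable of fully qualified configuration names.
def sub_configurations(all_keys, prefix):
    prefix = prefix + "."
    return sorted(key[len(prefix):] for key in all_keys if key.startswith(prefix))

# e.g. sub_configurations(all_keys, "spark.sql.optimizer")
#      -> ['collapseProjectAlwaysInline', 'excludedRules', 'runtime.bloomFilter.enabled', ...]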

Get the documentation of a configuration

>>> spark.conf["spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled"].desc()
"Enables runtime group filtering for group-based row-level operations. Data sources that replace groups of data (e.g. files, partitions) may prune entire groups using provided data source filters when planning a row-level operation scan. However, such filtering is limited as not all expressions can be converted into data source filters and some expressions can only be evaluated by Spark (e.g. subqueries). Since rewriting groups is expensive, Spark can execute a query at runtime to find what records match the condition of the row-level operation. The information about matching records will be passed back to the row-level operation scan, allowing data sources to discard groups that don't have to be rewritten."
>>> spark.conf.spark.sql.optimizer.runtime.rowLevelOperationGroupFilter.enabled.version()
'3.4.0'
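
The examples above show a value that both prints like a plain string ('false') and exposes desc()/version(). One way to model that, purely as an assumed design rather than this PR's actual code, is a string subclass that carries the entry's metadata:

# Assumed design sketch, not necessarily what this PR does: a string subclass
# that carries the config entry's metadata, so the value prints as 'false' while
# still offering desc() and version().
class ConfigValue(str):
    def __new__(cls, value, doc="", since=""):
        obj = super().__new__(cls, value)
        obj._doc = doc       # documentation text of the entry
        obj._since = since   # Spark version that introduced the entry
        return obj

    def desc(self):
        return self._doc

    def version(self):
        return self._since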

Why are the changes needed?

To provide a Pythonic way of setting options. pandas supports a similar interface, which serves as a reference (https://pandas.pydata.org/docs/user_guide/options.html).

This should be especially useful for interactive shell users: they can browse configurations and read their documentation directly from the shell, without opening the SQL configuration documentation separately.
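
For comparison, pandas already offers both function-style and attribute-style access to its options (all of the calls below are existing pandas APIs):

import pandas as pd

pd.set_option("display.max_rows", 100)      # function-style set
pd.get_option("display.max_rows")           # -> 100
pd.options.display.max_rows = 100           # attribute-style set
pd.describe_option("display.max_rows")      # prints the option's documentation
pd.reset_option("display.max_rows")         # revert to the default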

Does this PR introduce any user-facing change?

Yes, it provides users with a more Pythonic way of setting SQL configurations, as demonstrated above.

How was this patch tested?

TBD

Was this patch authored or co-authored using generative AI tooling?

No.
