feat: add few missing SparkLikeExpr methods #1721

Merged

Conversation

@Dhanunjaya-Elluri Dhanunjaya-Elluri commented Jan 4, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Added the following missing methods in SparkLikeExpr:
abs, median, clip, is_between, is_duplicated, is_finite, is_in, is_nan, is_unique, len, n_unique, round, skew
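
As a rough usage sketch (not part of the PR itself; the frame construction and column names below are assumptions), these methods are reached through the narwhals expression API on a PySpark-backed frame:

import narwhals as nw
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
native_df = spark.createDataFrame([(1.0,), (-2.0,), (None,)], schema="a double")
df = nw.from_native(native_df)

result = df.select(
    abs_a=nw.col("a").abs(),           # newly added
    finite_a=nw.col("a").is_finite(),  # newly added
)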

@Dhanunjaya-Elluri (Contributor Author)

Hi @FBruzzesi, I've added some missing methods to SparkLikeExpr related to #1714. I would appreciate feedback if any changes are required!

@FBruzzesi FBruzzesi (Member) left a comment

Seriously great effort @Dhanunjaya-Elluri! Thanks a ton for this, I am very excited 🚀

I was in a bit of a rush but managed to leave some feedback.
I think the main critical point for many methods is the null and/or NaN cases, where relevant. The best way to catch differing behaviour of these methods is to test against data containing nulls and/or NaNs, by adapting the existing tests with the pyspark_constructor.
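
For instance, a minimal sketch of such an adaptation, using the same test helpers (Constructor, pyspark_constructor, assert_equal_data) as the existing spark_like tests; the method, data, and expected values here are illustrative, not taken from the PR's final tests:

def test_abs_with_nulls(pyspark_constructor: Constructor) -> None:
    data = {"a": [-1.0, None, 3.0]}
    df = nw.from_native(pyspark_constructor(data))
    result = df.select(nw.col("a").abs())
    # abs is expected to leave nulls untouched
    expected = {"a": [1.0, None, 3.0]}
    assert_equal_data(result, expected)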

narwhals/_spark_like/expr.py (outdated review thread, resolved)
Comment on lines +292 to +260
def clip(
self,
lower_bound: Any | None = None,
upper_bound: Any | None = None,
) -> Self:
Member:

We recently introduced support for lower_bound and upper_bound to be other Expr's.
I am OK to keep that as a follow-up, but it is definitely something we would look forward to.
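
For reference, a hedged sketch of what that follow-up usage would look like at the narwhals API level (the column names here are made up):

df.select(
    clipped=nw.col("a").clip(
        lower_bound=nw.col("lo"),
        upper_bound=nw.col("hi"),
    )
)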

narwhals/_spark_like/expr.py (three more outdated review threads, resolved)
@FBruzzesi FBruzzesi added the enhancement New feature or request label Jan 4, 2025
@FBruzzesi FBruzzesi changed the title Feat(spark): add few missing Expr methods feat: add few missing SparkLikeExpr methods Jan 4, 2025
@Dhanunjaya-Elluri
Copy link
Contributor Author

Hi @FBruzzesi, I've made some changes now. Just one question about something requested in the review.

@EdAbati EdAbati (Collaborator) left a comment

Thank you very much from me too πŸ™πŸΌ

I just had a brief look from mobile for now, it's great work!

@@ -78,6 +78,29 @@ where `YOUR-GITHUB-USERNAME` will be your GitHub user name.

Here's how you can set up your local development environment to contribute.

#### Prerequisites for PySpark tests
Collaborator:

Should narwhals suggest (and maintain :)) a guide on installing Java for pyspark?
Or should we just add a note saying that pyspark needs Java, with a link to the pyspark documentation?

There may be different ways one wants to install Java on their machine.
For example, on macOS I prefer using openjdk installed via homebrew.

What do you think?

Contributor Author:

I think it would be simpler to just say that pyspark needs Java installed and add a link to the pyspark documentation.

tests/spark_like_test.py (outdated review thread, resolved)
@FBruzzesi FBruzzesi (Member) left a comment

Thanks for adjusting, @Dhanunjaya-Elluri! We are getting closer and closer 👌
I just have a couple of comments that need to be addressed.

Comment on lines 1082 to 1080
data = {
"a": [1.0, None, None, 3.0],
"b": [1.0, None, 4, 5.0],
}
df = nw.from_native(pyspark_constructor(data))
result = df.select(
a=nw.col("a").n_unique(),
b=nw.col("b").n_unique(),
)
expected = {"a": [2], "b": [3]}
assert_equal_data(result, expected)
Member:

None and NaN should also count toward n_unique: see polars.Expr.n_unique.
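
Under those polars semantics (a null counts as one distinct value), the data quoted above would instead be expected to give:

# "a": [1.0, None, None, 3.0] -> {1.0, None, 3.0}      -> 3 unique
# "b": [1.0, None, 4, 5.0]    -> {1.0, None, 4.0, 5.0} -> 4 unique
expected = {"a": [3], "b": [4]}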

tests/spark_like_test.py (outdated review thread, resolved)


# copied from tests/expr_and_series/is_duplicated_test.py
def test_is_duplicated(pyspark_constructor: Constructor) -> None:
Member:

Can we test this one and is_unique with None values as well?

The expected behaviour is:

data = {"a": [1, 1, 2, None], "b": [1, 2, None, None], "level_0": [0, 1, 2, 3]}
pl.DataFrame(data).select(pl.col("a", "b").is_duplicated())
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a     ┆ b     β”‚
β”‚ ---   ┆ ---   β”‚
β”‚ bool  ┆ bool  β”‚
β•žβ•β•β•β•β•β•β•β•ͺ═══════║
β”‚ true  ┆ false β”‚
β”‚ true  ┆ false β”‚
β”‚ false ┆ true  β”‚
β”‚ false ┆ true  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Contributor Author:

Done

@Dhanunjaya-Elluri (Contributor Author)

Hey @FBruzzesi, I've made the changes now. Let me know if any further changes are needed.

@FBruzzesi FBruzzesi commented Jan 7, 2025

Hey @Dhanunjaya-Elluri, I just took another look at this.

For is_duplicated, the following should be all we need:

def is_duplicated(self) -> Self:
    def _is_duplicated(_input: Column) -> Column:
        from pyspark.sql import Window
        from pyspark.sql import functions as F  # noqa: N812

        # Create a window spec that treats each value separately.
        return F.count("*").over(Window.partitionBy(_input)) > 1

    return self._from_call(
        _is_duplicated, "is_duplicated", returns_scalar=self._returns_scalar
    )

Similarly for is_unique.
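
A minimal sketch of that is_unique counterpart, assuming the same _from_call pattern used for is_duplicated above:

def is_unique(self) -> Self:
    def _is_unique(_input: Column) -> Column:
        from pyspark.sql import Window
        from pyspark.sql import functions as F  # noqa: N812

        # A value is unique when its partition contains exactly one row.
        return F.count("*").over(Window.partitionBy(_input)) == 1

    return self._from_call(
        _is_unique, "is_unique", returns_scalar=self._returns_scalar
    )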

However, n_unique is definitely more tricky (and we can consider postponing it for now), as F.count_distinct ignores nulls.
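
One possible workaround, sketched here as an assumption rather than what this PR ended up merging, is to add one to count_distinct whenever any null is present:

def _n_unique(_input: Column) -> Column:
    from pyspark.sql import functions as F  # noqa: N812

    # count_distinct skips nulls, so compensate with a flag that is 1
    # if the column contains at least one null and 0 otherwise.
    has_null = F.max(_input.isNull().cast("int"))
    return F.count_distinct(_input) + has_null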

@Dhanunjaya-Elluri Dhanunjaya-Elluri commented Jan 7, 2025

Hi @FBruzzesi, thanks for taking a deeper look. This looks much simpler 👍. I'll push these changes for now, then.

@FBruzzesi FBruzzesi (Member) left a comment

Nice one! Amazing work @Dhanunjaya-Elluri πŸ™ŒπŸΌ


@FBruzzesi FBruzzesi merged commit 46a030a into narwhals-dev:main Jan 7, 2025
23 checks passed
Labels
enhancement New feature or request
3 participants