feat: add few missing SparkLikeExpr methods #1721

Merged

Conversation

@Dhanunjaya-Elluri Dhanunjaya-Elluri commented Jan 4, 2025

What type of PR is this? (check all applicable)

  • πŸ’Ύ Refactor
  • ✨ Feature
  • πŸ› Bug Fix
  • πŸ”§ Optimization
  • πŸ“ Documentation
  • βœ… Test
  • 🐳 Other

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Added the following missing methods in SparkLikeExpr:
abs, median, clip, is_between, is_duplicated, is_finite, is_in, is_nan, is_unique, len, n_unique, round, skew
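
As a rough usage sketch (not part of the PR itself; the frame construction and column names below are assumptions), these methods are reached through the narwhals expression API on a PySpark-backed frame:

import narwhals as nw
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
native_df = spark.createDataFrame([(1.0,), (-2.0,), (None,)], schema="a double")
df = nw.from_native(native_df)

result = df.select(
    abs_a=nw.col("a").abs(),           # newly added
    finite_a=nw.col("a").is_finite(),  # newly added
)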

@Dhanunjaya-Elluri (Contributor Author)

Hi @FBruzzesi, I've added some missing methods to SparkLikeExpr related to #1714. I would appreciate feedback if any changes are required!

@FBruzzesi FBruzzesi (Member) left a comment

Seriously great effort @Dhanunjaya-Elluri! Thanks a ton for this, I am very excited 🚀

I was in a bit of a rush but managed to leave some feedback.
I think the main critical point for many methods is the null and/or NaN cases, where relevant. The best way to catch differing behaviour of these methods is to test against data containing nulls and/or NaNs, by adapting the existing tests with the pyspark_constructor.
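
For instance, a minimal sketch of such an adaptation, using the same test helpers (Constructor, pyspark_constructor, assert_equal_data) as the existing spark_like tests; the method, data, and expected values here are illustrative, not taken from the PR's final tests:

def test_abs_with_nulls(pyspark_constructor: Constructor) -> None:
    data = {"a": [-1.0, None, 3.0]}
    df = nw.from_native(pyspark_constructor(data))
    result = df.select(nw.col("a").abs())
    # abs is expected to leave nulls untouched
    expected = {"a": [1.0, None, 3.0]}
    assert_equal_data(result, expected)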

narwhals/_spark_like/expr.py (outdated review thread, resolved)
Comment on lines +292 to +260
def clip(
self,
lower_bound: Any | None = None,
upper_bound: Any | None = None,
) -> Self:
Member:

We recently introduced support for lower_bound and upper_bound to be other Expr's.
I am OK to keep that as a follow-up, but it is definitely something we would look forward to.
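
For reference, a hedged sketch of what that follow-up usage would look like at the narwhals API level (the column names here are made up):

df.select(
    clipped=nw.col("a").clip(
        lower_bound=nw.col("lo"),
        upper_bound=nw.col("hi"),
    )
)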

narwhals/_spark_like/expr.py (three more outdated review threads, resolved)
@FBruzzesi FBruzzesi added the enhancement New feature or request label Jan 4, 2025
@FBruzzesi FBruzzesi changed the title Feat(spark): add few missing Expr methods feat: add few missing SparkLikeExpr methods Jan 4, 2025
@Dhanunjaya-Elluri
Copy link
Contributor Author

Hi @FBruzzesi, I've made some changes now. Just one question about something requested in the review.

@EdAbati EdAbati (Collaborator) left a comment

Thank you very much from me too πŸ™πŸΌ

I just had a brief look from mobile for now, it's great work!

@@ -78,6 +78,29 @@ where `YOUR-GITHUB-USERNAME` will be your GitHub user name.

Here's how you can set up your local development environment to contribute.

#### Prerequisites for PySpark tests
Collaborator:

Should narwhals suggest (and maintain :)) a guide on installing Java for pyspark?
Or should we just add a note saying that pyspark needs Java, with a link to the pyspark documentation?

There may be different ways one wants to install Java on their machine.
For example, on macOS I prefer using openjdk installed via homebrew.

What do you think?

Contributor Author:

I think it would be simpler to just say that pyspark needs Java installed and add a link to the pyspark documentation.

tests/spark_like_test.py (outdated review thread, resolved)
@FBruzzesi FBruzzesi (Member) left a comment

Thanks for adjusting, @Dhanunjaya-Elluri! We are getting closer and closer 👌
I just have a couple of comments that need to be addressed.

Comment on lines 1082 to 1080
data = {
"a": [1.0, None, None, 3.0],
"b": [1.0, None, 4, 5.0],
}
df = nw.from_native(pyspark_constructor(data))
result = df.select(
a=nw.col("a").n_unique(),
b=nw.col("b").n_unique(),
)
expected = {"a": [2], "b": [3]}
assert_equal_data(result, expected)
Member:

None and NaN should also count toward n_unique: see polars.Expr.n_unique.
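
Under those polars semantics (a null counts as one distinct value), the data quoted above would instead be expected to give:

# "a": [1.0, None, None, 3.0] -> {1.0, None, 3.0}      -> 3 unique
# "b": [1.0, None, 4, 5.0]    -> {1.0, None, 4.0, 5.0} -> 4 unique
expected = {"a": [3], "b": [4]}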

tests/spark_like_test.py (outdated review thread, resolved)


# copied from tests/expr_and_series/is_duplicated_test.py
def test_is_duplicated(pyspark_constructor: Constructor) -> None:
Member:

Can we test this one and is_unique with None values as well?

The expected behaviour is:

data = {"a": [1, 1, 2, None], "b": [1, 2, None, None], "level_0": [0, 1, 2, 3]}
pl.DataFrame(data).select(pl.col("a", "b").is_duplicated())
shape: (4, 2)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”
β”‚ a     ┆ b     β”‚
β”‚ ---   ┆ ---   β”‚
β”‚ bool  ┆ bool  β”‚
β•žβ•β•β•β•β•β•β•β•ͺ═══════║
β”‚ true  ┆ false β”‚
β”‚ true  ┆ false β”‚
β”‚ false ┆ true  β”‚
β”‚ false ┆ true  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜

Contributor Author:

Done

@Dhanunjaya-Elluri (Contributor Author)

Hey @FBruzzesi, I've made the changes now. Let me know if any further changes are needed.

@FBruzzesi FBruzzesi commented Jan 7, 2025

Hey @Dhanunjaya-Elluri, I just took another look at this.

For is_duplicated, the following should be all we need:

def is_duplicated(self) -> Self:
    def _is_duplicated(_input: Column) -> Column:
        from pyspark.sql import Window
        from pyspark.sql import functions as F  # noqa: N812

        # Create a window spec that treats each value separately.
        return F.count("*").over(Window.partitionBy(_input)) > 1

    return self._from_call(
        _is_duplicated, "is_duplicated", returns_scalar=self._returns_scalar
    )

Similarly for is_unique.
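
A minimal sketch of that is_unique counterpart, assuming the same _from_call pattern used for is_duplicated above:

def is_unique(self) -> Self:
    def _is_unique(_input: Column) -> Column:
        from pyspark.sql import Window
        from pyspark.sql import functions as F  # noqa: N812

        # A value is unique when its partition contains exactly one row.
        return F.count("*").over(Window.partitionBy(_input)) == 1

    return self._from_call(
        _is_unique, "is_unique", returns_scalar=self._returns_scalar
    )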

However, n_unique is definitely more tricky (and we can consider postponing it for now), as F.count_distinct ignores nulls.
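
One possible workaround, sketched here as an assumption rather than what this PR ended up merging, is to add one to count_distinct whenever any null is present:

def _n_unique(_input: Column) -> Column:
    from pyspark.sql import functions as F  # noqa: N812

    # count_distinct skips nulls, so compensate with a flag that is 1
    # if the column contains at least one null and 0 otherwise.
    has_null = F.max(_input.isNull().cast("int"))
    return F.count_distinct(_input) + has_null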

@Dhanunjaya-Elluri Dhanunjaya-Elluri commented Jan 7, 2025

Hi @FBruzzesi, thanks for taking a deeper look. This looks much simpler 👍. I'll push these changes for now, then.

@FBruzzesi FBruzzesi (Member) left a comment

Nice one! Amazing work @Dhanunjaya-Elluri πŸ™ŒπŸΌ


@FBruzzesi FBruzzesi merged commit 46a030a into narwhals-dev:main Jan 7, 2025
23 checks passed
Labels
enhancement New feature or request
3 participants