Fix binary operations on attrs for Series and DataFrame #59636

Merged
mroeschke merged 37 commits into pandas-dev:main from series-sum-attrs
Apr 3, 2025

Conversation

Member

@WillAyd WillAyd left a comment

Small change to prefer fixtures to writing out our own binop implementations, but generally lgtm. I don't think current CI failures are related.

@mroeschke any thoughts here?

df_2 = DataFrame({"A": [-3, 9]})
attrs = {"info": "DataFrame"}
df_1.attrs = attrs
assert (df_1 + df_2).attrs == attrs
Member

Rather than doing this you can just use the all_binary_operators fixture from conftest.py (I think)
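A minimal sketch of the idea behind this suggestion, not the pandas test itself: rather than spelling out `df_1 + df_2` for a single operator, the same assertion runs over a list of operator functions. The `Tagged` class, the `BINARY_OPERATORS` list and `check_attrs_propagate` are hypothetical stand-ins (for a DataFrame, for the `all_binary_operators` fixture and for the test body); in the real test suite the fixture would be requested as a test argument instead of the explicit loop.

```python
import operator
from copy import deepcopy


class Tagged:
    """Hypothetical stand-in for a DataFrame: a value plus an attrs dict."""

    def __init__(self, value, attrs=None):
        self.value = value
        self.attrs = dict(attrs or {})

    def _binop(self, other, op):
        out = Tagged(op(self.value, other.value))
        # Take attrs from whichever operand has them, left operand first.
        source = self if self.attrs else other
        out.attrs = deepcopy(source.attrs)
        return out

    def __add__(self, other):
        return self._binop(other, operator.add)

    def __sub__(self, other):
        return self._binop(other, operator.sub)

    def __mul__(self, other):
        return self._binop(other, operator.mul)

    def __truediv__(self, other):
        return self._binop(other, operator.truediv)


# Stand-in for the fixture: every operator the test should cover.
BINARY_OPERATORS = [operator.add, operator.sub, operator.mul, operator.truediv]


def check_attrs_propagate(op):
    attrs = {"info": "DataFrame"}
    left = Tagged(1.0, attrs)
    right = Tagged(-3.0)
    # attrs should survive no matter which side carries them.
    assert op(left, right).attrs == attrs
    assert op(right, left).attrs == attrs


for op in BINARY_OPERATORS:
    check_attrs_propagate(op)
```

Looping over the operators keeps the assertion in one place, so adding an operator to the list extends coverage without another hand-written test.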

Contributor Author

I made the change.

Member

@mroeschke mroeschke left a comment

I think attrs propagation logic should only be handled by __finalize__, so these binary operations should dispatch to that method
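To illustrate the design being asked for, here is a toy model (not pandas code) in which the binary operation never touches attrs directly and instead hands its result to a single `__finalize__`-style hook. `Frame`, its `finalize` method and this `_construct_result` are hypothetical names for the sketch; the hook follows the rule discussed in this thread that attrs are copied only when the other object's attrs are non-empty.

```python
from copy import deepcopy


class Frame:
    """Hypothetical container: a number plus metadata stored in attrs."""

    def __init__(self, value, attrs=None):
        self.value = value
        self.attrs = dict(attrs or {})

    def finalize(self, other):
        # The single place where metadata propagation is decided.
        if getattr(other, "attrs", None):
            self.attrs = deepcopy(other.attrs)
        return self

    def _construct_result(self, value, other):
        out = Frame(value).finalize(self)
        if not out.attrs:
            # Fall back to the right operand only if the left had no attrs.
            out = out.finalize(other)
        return out

    def __add__(self, other):
        return self._construct_result(self.value + other.value, other)


a = Frame(1, {"info": "left"})
b = Frame(2)
assert (a + b).attrs == {"info": "left"}
assert (b + a).attrs == {"info": "left"}
```

Because every operator would route through `_construct_result`, the result no longer depends on which operand happens to carry the metadata.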

@mroeschke mroeschke added the "metadata" (_metadata, .attrs) and "Numeric Operations" (Arithmetic, Comparison, and Logical operations) labels Sep 25, 2024
@fbourgey
Contributor Author

@mroeschke should everything be rewritten using finalize then?

@mroeschke
Member

Yes, or __finalize__ probably needs to be called somewhere

Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 29, 2024
@fbourgey
Contributor Author

fbourgey commented Dec 2, 2024

@mroeschke, @WillAyd, I tried using __finalize__ instead. What do you think?

@WillAyd WillAyd removed the Stale label Dec 3, 2024
@WillAyd
Member

WillAyd commented Dec 3, 2024

I think it looks good but will defer to @mroeschke

@@ -7875,13 +7875,19 @@ class DataFrame
def _cmp_method(self, other, op):
axis: Literal[1] = 1 # only relevant for Series other case

if not getattr(self, "attrs", None) and getattr(other, "attrs", None):
Member

I don't think we should need these anymore here since this should be handled in _construct_result

Contributor Author

It seems that sometimes

self, other = self._align_for_op(other, axis, flex=False, level=None)

resets other.attrs to {}.

This is why I kept it.

Member

Is it because other is getting overridden here? Otherwise, _align_for_op should also preserve the attrs of other.

@@ -8212,6 +8208,9 @@ def to_series(right):
)
right = left._maybe_align_series_as_frame(right, axis)
Contributor Author

@fbourgey fbourgey Apr 2, 2025

I think this resets the attrs of right

Member

I would consider that a bug. attrs should be preserved in this function.

Contributor Author

Should I fix it in this PR or raise a different issue?

Member

You can fix it in this PR

Contributor Author

Suggested something below

@@ -8283,6 +8285,8 @@ def _construct_result(self, result) -> DataFrame:
-------
DataFrame
"""
if not getattr(self, "attrs", None) and getattr(other, "attrs", None):
self.__finalize__(other)
Member

Can we do out = out.__finalize__(other) instead?


def _construct_result(self, result) -> DataFrame:
def _construct_result(self, result, other=None) -> DataFrame:
Member

Suggested change
def _construct_result(self, result, other=None) -> DataFrame:
def _construct_result(self, result, other) -> DataFrame:

Might as well make this required

Comment on lines 8289 to 8290
if not getattr(self, "attrs", None) and getattr(other, "attrs", None):
out = out.__finalize__(other)
Member

Suggested change
if not getattr(self, "attrs", None) and getattr(other, "attrs", None):
out = out.__finalize__(other)
out = out.__finalize__(other)

It appears __finalize__ already skips the attrs copy when other has no populated attrs

Contributor Author

Doing this breaks the following tests:

FAILED pandas/tests/generic/test_duplicate_labels.py::TestPreserves::test_binops[other1-True-add] - AssertionError
FAILED pandas/tests/generic/test_duplicate_labels.py::TestPreserves::test_binops[other1-True-sub] - AssertionError

Member

I think this line in __finalize__ needs a fix:

self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels

It should prioritize False if either self.flags.allows_duplicate_labels or other.flags.allows_duplicate_labels is False.

Contributor Author

What about doing this in __finalize__:

if isinstance(other, NDFrame):
    if other.attrs:
        # We want attrs propagation to have minimal performance
        # impact if attrs are not used; i.e. attrs is an empty dict.
        # One could make the deepcopy unconditionally, but a deepcopy
        # of an empty dict is 50x more expensive than the empty check.
        self.attrs = deepcopy(other.attrs)
    self.flags.allows_duplicate_labels = (
        self.flags.allows_duplicate_labels
        and other.flags.allows_duplicate_labels
    )

Member

Yup that's the correct location to fix this
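The effect of the agreed change can be checked on a toy model. This is only a sketch under stated assumptions: `Flags` and `Obj` are hypothetical dataclasses standing in for the pandas flags object and an NDFrame, and `finalize` mirrors just the two steps from the snippet in this thread (copy non-empty attrs, then combine the flag with `and`).

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Flags:
    """Hypothetical stand-in for the flags object."""

    allows_duplicate_labels: bool = True


@dataclass
class Obj:
    """Hypothetical stand-in for an NDFrame with attrs and flags."""

    attrs: dict = field(default_factory=dict)
    flags: Flags = field(default_factory=Flags)

    def finalize(self, other):
        if other.attrs:
            # Skip the deepcopy entirely when there is nothing to copy.
            self.attrs = deepcopy(other.attrs)
        # False wins: the result allows duplicates only if both sides do.
        self.flags.allows_duplicate_labels = (
            self.flags.allows_duplicate_labels
            and other.flags.allows_duplicate_labels
        )
        return self


# Walk the full truth table for the flag combination.
for mine in (True, False):
    for theirs in (True, False):
        out = Obj(flags=Flags(mine)).finalize(Obj(flags=Flags(theirs)))
        assert out.flags.allows_duplicate_labels == (mine and theirs)
```

With the earlier plain assignment, an object that disallowed duplicate labels would have the flag switched back to True after being finalized against one that allowed them, which is consistent with the test_duplicate_labels failures reported above.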


@final
def _construct_result(self, result, name):
def _construct_result(self, result, name, other=None):
Member

Suggested change
def _construct_result(self, result, name, other=None):
def _construct_result(self, result, name, other):

self,
result: ArrayLike | tuple[ArrayLike, ArrayLike],
name: Hashable,
other: AnyArrayLike | DataFrame | None = None,
Member

Suggested change
other: AnyArrayLike | DataFrame | None = None,
other: AnyArrayLike | DataFrame,

@@ -5943,6 +5947,7 @@ def _construct_result(
----------
result : ndarray or ExtensionArray
name : Label
other : Series, DataFrame or array-like, default None
Member

Suggested change
other : Series, DataFrame or array-like, default None
other : Series, DataFrame or array-like

Comment on lines 5973 to 5974
if getattr(other, "attrs", None):
out.__finalize__(other)
Member

Suggested change
if getattr(other, "attrs", None):
out.__finalize__(other)
out = out.__finalize__(other)

Contributor Author

@fbourgey fbourgey Apr 3, 2025

Doing this breaks:

FAILED pandas/tests/generic/test_duplicate_labels.py::TestPreserves::test_binops[other1-False-add] - AssertionError
FAILED pandas/tests/generic/test_duplicate_labels.py::TestPreserves::test_binops[other1-False-sub] - AssertionError

It seems to have something to do with flags.allows_duplicate_labels.


def _construct_result(self, result, name):
def _construct_result(self, result, name, other=None):
Member

Suggested change
def _construct_result(self, result, name, other=None):
def _construct_result(self, result, name, other):

@@ -8101,6 +8100,7 @@ def _align_for_op(
left : DataFrame
right : Any
"""

Member

Suggested change

@@ -8200,15 +8200,13 @@ def to_series(right):
"`left, right = left.align(right, axis=1)` "
"before operating."
)

left, right = left.align(
Member

Suggested change
left, right = left.align(
left, right = left.align(

@mroeschke mroeschke added this to the 3.0 milestone Apr 3, 2025
@mroeschke mroeschke merged commit 04356be into pandas-dev:main Apr 3, 2025
42 checks passed
@mroeschke
Member

Thanks for sticking with this @fbourgey!

@fbourgey fbourgey deleted the series-sum-attrs branch April 3, 2025 19:17
@fbourgey
Contributor Author

fbourgey commented Apr 3, 2025

Thanks for the help @mroeschke!

Labels
metadata (_metadata, .attrs)
Numeric Operations (Arithmetic, Comparison, and Logical operations)
Development

Successfully merging this pull request may close these issues.

BUG: binary operations don't propogate attrs depending on order with Series and/or DataFrame/Series
4 participants