-
-
Notifications
You must be signed in to change notification settings - Fork 19k
ENH: adds columnwise fillna support #62393
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is my first PR on an open source project, so I'm still a little unfamiliar with the work flow. I may have merged in too many changes by mistake; I intended to only update generic.py, the relevant tests, and the what's new doc. If I need to update anything to reduce the scope I'm happy to make whatever changes are needed. |
Hey @a18rhodes no worries - we can fix this up. Looking at the commit history, I think you accidentally picked up commits from other users, which is causing the extra files to show as modified. Here are the errant commit hashes: Essentially you will want to remove those from your branch. This link shows two ways of achieving that: https://www.abrahamberg.com/blog/git-remove-commits-from-branch-after-push-reset-revert-or-rebase/ I'll leave it up to you to try either approach. Once you have those commits fixed, you will want to pull the changes from the main branch of pandas, which you can do with: git remote add upstream [email protected]:pandas-dev/pandas.git # only need to do once, can also use https
git pull upstream main while on your working branch. In terms of resolving any conflicts, just try to play close attention to what git is telling you. It can definitely be intimidating, but usually a conflict needs your input. git will tell you "hey you made changes X,Y,Z while another user made different changes to that same file." There is never a single answer on how to resolve that, so just give it your best shot and if you have questions you can highlight them here |
c0044cf
to
9fde541
Compare
Thanks @WillAyd! I think I resolved the issue, it looks like I have the correct 3 files now and have dropped the other user's commits. |
9fde541
to
5380528
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work overall
94ce8be
to
f3d404f
Compare
I did some benchmarking to find the best approach. As it turns out, frames with nrows >> ncols perform better with
|
pandas/core/generic.py
Outdated
f"(value.dtype={value_dtype} vs {dtypes[0]})" | ||
) | ||
nrows, ncols = self.shape | ||
if nrows > 1000 * ncols: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate what you are trying to do here but this heuristic is pretty arbitrary; its better to just keep with the transpose method (at smaller data volumes, the difference is likely negligible anyway)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I posted this before seeing your benchmarks. Even still, I don't think its that important to optimize for extremely wide dataframes like that. We can leave that to another PR if we even cared for it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point, as you mentioned, the heuristic is pretty arbitrarily based on my machine performance, etc.
I'll update using just the transpose method.
f3d404f
to
9974aad
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally this lgtm @mroeschke any feedback?
df.fillna(df.max(axis=1), axis=1, inplace=True) | ||
expected = DataFrame( | ||
{ | ||
"a": [1.0, 1, 2, 3, 4], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ultra nit but can you represent these all as float values? I had to double take why only the first row of data would be floats, but in actuality all of the data is float and this just relies on implicit conversion.
I think it would be less confusing to just label all values as 1.0, 2.0, 3.0, 3.0, 4.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, not sure why I got lazy on this of all things. I will update it.
9974aad
to
e5c55a5
Compare
pandas/core/generic.py
Outdated
dtypes = [result[col].dtype for col in result.columns] | ||
if len(set(dtypes)) > 1: | ||
raise ValueError( | ||
"All columns must have the same dtype, but got dtypes: " | ||
f"{dict(zip(result.columns, dtypes))}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtypes = [result[col].dtype for col in result.columns] | |
if len(set(dtypes)) > 1: | |
raise ValueError( | |
"All columns must have the same dtype, but got dtypes: " | |
f"{dict(zip(result.columns, dtypes))}" | |
unique_dtypes = np.unique(self._mgr.get_dtypes()) | |
if len(unique_dtypes) > 1: | |
raise ValueError( | |
"All columns must have the same dtype, but got dtypes: " | |
f"{list(unique_dtypes)}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, not sure how I missed this suggestion. will implement in my next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On further inspection, the np.unique
approach fails for non-homogeneous column dtypes due to"<" comparsion between mixed types (see test "test_fillna_dict_series_axis_1_mismatch_cols").
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
np.unique approach fails
Pandas has its own unique function, check if it works.
see test "test_fillna_dict_series_axis_1_mismatch_cols"
When you amend and force push, makes it hard to see previous CI runs results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll give that a shot.
Yeah, that's a good point, I've been trying to keep a clean commit history since I have had so many, but it probably adds to the confusion in some cases. That said, I never committed the failing code, just pointing out a new unit test I added that fails when trying np.unique.
pandas/core/generic.py
Outdated
if (value_dtype := np.asarray(value).dtype) != dtypes[0]: | ||
raise ValueError( | ||
"Dtype mismatch for value " | ||
f"(value.dtype={value_dtype} vs {dtypes[0]})" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try removing this case and seeing if a similar validation still hits? I'm hopping fillna
after the transpose enforces this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checked in to this. Dropping this check results in no fill happening at all, rather than an exception of any type being raised.
(Note these results are without the check in question)
def test_fillna_dict_series_axis_1_value_mismatch_with_cols(self):
df = DataFrame(
{
"a": [np.nan, 1, 2, np.nan, np.nan],
"b": [1, 2, 3, np.nan, np.nan],
"c": [np.nan, 1, 2, 3, 4],
}
)
with pytest.raises(Exception):
print()
print(df.fillna(Series({"a": "abc", "b": "def", "c": "hij"}), axis=1))
pytest ./pandas/tests/frame/methods/test_fillna.py --ff -x
+ /root/virtualenvs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
==================================== test session starts ====================================
platform linux -- Python 3.11.13, pytest-8.4.2, pluggy-1.6.0
PyQt5 5.15.11 -- Qt runtime 5.15.17 -- Qt compiled 5.15.14
rootdir: /workspaces/pandas
configfile: pyproject.toml
plugins: hypothesis-6.139.2, localserver-0.9.0.post0, anyio-4.10.0, qt-4.5.0, cov-7.0.0, xdist-3.8.0, cython-0.3.1
collected 65 items
run-last-failure: rerun previous 1 failure first
pandas/tests/frame/methods/test_fillna.py
a b c
0 NaN 1.0 NaN
1 1.0 2.0 1.0
2 2.0 3.0 2.0
3 NaN NaN 3.0
4 NaN NaN 4.0
F
========================================= FAILURES ==========================================
____________ TestFillNA.test_fillna_dict_series_axis_1_value_mismatch_with_cols _____________
self = <pandas.tests.frame.methods.test_fillna.TestFillNA object at 0x70da9893a790>
def test_fillna_dict_series_axis_1_value_mismatch_with_cols(self):
df = DataFrame(
{
"a": [np.nan, 1, 2, np.nan, np.nan],
"b": [1, 2, 3, np.nan, np.nan],
"c": [np.nan, 1, 2, 3, 4],
}
)
> with pytest.raises(Exception):
E Failed: DID NOT RAISE <class 'Exception'>
pandas/tests/frame/methods/test_fillna.py:503: Failed
------------------- generated xml file: /workspaces/pandas/test-data.xml --------------------
=================================== slowest 30 durations ====================================
0.03s call pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols
0.01s teardown pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols
(1 durations < 0.005s hidden. Use -vv to show these durations.)
================================== short test summary info ==================================
FAILED pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols - Failed: DID NOT RAISE <class 'Exception'>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
===================================== 1 failed in 0.62s =====================================
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @jbrockmendel do you know of a function that achieves something equivalent to the check above (maybe can_hold_element
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yah can_hold_element seems like the right thing here (though i also like the idea of "just let it fall through"). Potential trouble is that for EAs can_hold_element always returns True.
e5c55a5
to
6abc1ff
Compare
6abc1ff
to
54f5395
Compare
Adds support for columnwise fillna.
Only supports homogeneous column and replacement value dtypes, raises ValueError otherwise. Tests added.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.