Conversation

a18rhodes

Adds support for columnwise fillna.

Only supports homogeneous column and replacement value dtypes, raises ValueError otherwise. Tests added.
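For context, the row-wise fill this PR enables can already be approximated with the transpose workaround discussed later in this thread. A minimal sketch (the frame and fill values here are made up for illustration; all columns are float to satisfy the homogeneous-dtype requirement):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# Fill each row's NaNs with a per-row value by transposing,
# filling column-wise (which pandas already supports), and
# transposing back.
row_fill = {0: 10.0, 1: 20.0}  # fill value keyed by row label
result = df.T.fillna(value=row_fill).T
# row 0 NaNs become 10.0, row 1 NaNs become 20.0
```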

@a18rhodes
Author

This is my first PR on an open source project, so I'm still a little unfamiliar with the workflow.

I may have merged in too many changes by mistake; I intended to only update generic.py, the relevant tests, and the what's new doc. If I need to update anything to reduce the scope I'm happy to make whatever changes are needed.

@WillAyd
Member

WillAyd commented Sep 22, 2025

Hey @a18rhodes no worries - we can fix this up.

Looking at the commit history, I think you accidentally picked up commits from other users, which is causing the extra files to show as modified. Here are the errant commit hashes:

e817930
2b25842
08d21d7

Essentially you will want to remove those from your branch. This link shows two ways of achieving that:

https://www.abrahamberg.com/blog/git-remove-commits-from-branch-after-push-reset-revert-or-rebase/

I'll leave it up to you to try either approach.
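For illustration, the revert route can be sketched in a throwaway repo (the commits here are made up; on the real branch you would revert the three hashes above instead of `HEAD`):

```shell
set -e
cd "$(mktemp -d)"
git init -q demo && cd demo
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=a@b.c
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=a@b.c
echo one > file.txt && git add file.txt && git commit -qm "good"
echo two >> file.txt && git add file.txt && git commit -qm "errant"

# Option 1: revert the errant commit (adds a history-preserving "undo" commit)
git revert --no-edit HEAD

# Option 2 would instead drop the commit entirely with `git rebase -i HEAD~2`
# (delete the errant commit's line in the editor) and then update the remote
# branch with `git push --force-with-lease`.
git log --oneline
```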

Once you have those commits fixed, you will want to pull the changes from the main branch of pandas, which you can do with:

git remote add upstream git@github.com:pandas-dev/pandas.git  # only need to do once, can also use https
git pull upstream main

while on your working branch.

In terms of resolving any conflicts, just try to pay close attention to what git is telling you. It can definitely be intimidating, but usually a conflict just needs your input: git will tell you "hey, you made changes X, Y, Z while another user made different changes to that same file." There is never a single answer on how to resolve that, so just give it your best shot, and if you have questions you can highlight them here.

@a18rhodes
Author

Thanks @WillAyd! I think I resolved the issue, it looks like I have the correct 3 files now and have dropped the other user's commits.

Member

@WillAyd left a comment


Nice work overall

@a18rhodes force-pushed the fillna-colwise-4515 branch 2 times, most recently from 94ce8be to f3d404f on September 24, 2025 05:03
@a18rhodes
Author

I did some benchmarking to find the best approach. As it turns out, frames with nrows >> ncols perform better with apply; otherwise, the transpose approach is better (@kdebrab pointed out the performance issues of the transpose approach in the issue, with the apply approach as the suggested solution).

import pandas as pd
import numpy as np

def approach_transpose(df_to_fill, value):
    """vectorized approach using transpose."""
    return df_to_fill.T.fillna(value=value).T

def approach_apply(df_to_fill, value):
    """DataFrame.apply() on columns (axis=0)."""
    return df_to_fill.apply(lambda col: col.fillna(value=value))

for num_rows, num_cols in ((10, 10_000), (1_000, 1_000), (10_000, 10)):
    print(f"Setting up sample data for {num_rows:_}x{num_cols:_}...")
    data = np.random.rand(num_rows, num_cols)
    data[data < 0.25] = np.nan  # Make 25% of the data NaN
    df = pd.DataFrame(data)
    update_data = {i: float(i) for i in range(num_rows)}  # just use the row number for fill
    print(f"\tCreated a {num_rows:_}x{num_cols:_} DataFrame and a fill dictionary.")
    print(f"\t----Testing the .T (transpose) approach on {num_rows:_}x{num_cols:_}----", end="\n\t")
    %timeit c = approach_transpose(df, update_data)
    print(f"\t----Testing the df.apply() approach on {num_rows:_}x{num_cols:_}----", end="\n\t")
    %timeit c = approach_apply(df, update_data)
    print()
Setting up sample data for 10x10_000...
        Created a 10x10_000 DataFrame and a fill dictionary.
        ----Testing the .T (transpose) approach on 10x10_000----
        2.23 ms ± 226 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
        ----Testing the df.apply() approach on 10x10_000----
        1.8 s ± 44.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Setting up sample data for 1_000x1_000...
        Created a 1_000x1_000 DataFrame and a fill dictionary.
        ----Testing the .T (transpose) approach on 1_000x1_000----
        147 ms ± 6.15 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
        ----Testing the df.apply() approach on 1_000x1_000----
        296 ms ± 14.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Setting up sample data for 10_000x10...
        Created a 10_000x10 DataFrame and a fill dictionary.
        ----Testing the .T (transpose) approach on 10_000x10----
        1.88 s ± 78.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
        ----Testing the df.apply() approach on 10_000x10----
        13.5 ms ± 273 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

f"(value.dtype={value_dtype} vs {dtypes[0]})"
)
nrows, ncols = self.shape
if nrows > 1000 * ncols:
Member

I appreciate what you are trying to do here, but this heuristic is pretty arbitrary; it's better to just keep with the transpose method (at smaller data volumes, the difference is likely negligible anyway).

Member

Ah, I posted this before seeing your benchmarks. Even still, I don't think it's that important to optimize for extremely wide dataframes like that. We can leave that to another PR if we even care about it.

Author

Fair point; as you mentioned, the heuristic is pretty arbitrary, based on my machine's performance, etc.

I'll update to use just the transpose method.

Member

@WillAyd left a comment

Generally this lgtm. @mroeschke any feedback?

df.fillna(df.max(axis=1), axis=1, inplace=True)
expected = DataFrame(
{
"a": [1.0, 1, 2, 3, 4],
Member

Ultra nit, but can you represent these all as float values? I had to do a double take on why only the first row of data would be floats, when in actuality all of the data is float and this just relies on implicit conversion.

I think it would be less confusing to just write all values as 1.0, 2.0, 3.0, 3.0, 4.0.

Author

Good point, not sure why I got lazy on this of all things. I will update it.

Comment on lines 7124 to 7128
dtypes = [result[col].dtype for col in result.columns]
if len(set(dtypes)) > 1:
raise ValueError(
"All columns must have the same dtype, but got dtypes: "
f"{dict(zip(result.columns, dtypes))}"
Member

Suggested change
dtypes = [result[col].dtype for col in result.columns]
if len(set(dtypes)) > 1:
raise ValueError(
"All columns must have the same dtype, but got dtypes: "
f"{dict(zip(result.columns, dtypes))}"
unique_dtypes = np.unique(self._mgr.get_dtypes())
if len(unique_dtypes) > 1:
raise ValueError(
"All columns must have the same dtype, but got dtypes: "
f"{list(unique_dtypes)}"

Author

sorry, not sure how I missed this suggestion. will implement in my next commit.

Author

On further inspection, the np.unique approach fails for non-homogeneous column dtypes due to a "<" comparison between mixed types (see the test "test_fillna_dict_series_axis_1_mismatch_cols").

Contributor

np.unique approach fails

Pandas has its own unique function; check if that works.

see test "test_fillna_dict_series_axis_1_mismatch_cols"

Also, when you amend and force-push, it makes it hard to see the results of previous CI runs.
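A quick illustration of the suggestion above (hypothetical data): np.unique sorts its input, so mixed dtypes hit the "<" comparison failure, while pd.unique is hash-based and preserves order of appearance.

```python
import numpy as np
import pandas as pd

mixed = np.array([1, "a", 1, "a"], dtype=object)

# np.unique sorts, so comparing int and str raises TypeError
try:
    np.unique(mixed)
    raised = False
except TypeError:
    raised = True

# pd.unique hashes instead of sorting, so mixed dtypes are fine
print(raised, pd.unique(mixed))
```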

Author

I'll give that a shot.

Yeah, that's a good point. I've been trying to keep a clean commit history since I have had so many commits, but it probably adds to the confusion in some cases. That said, I never committed the failing code; I was just pointing out that a new unit test I added fails when trying np.unique.

Comment on lines 7130 to 7134
if (value_dtype := np.asarray(value).dtype) != dtypes[0]:
raise ValueError(
"Dtype mismatch for value "
f"(value.dtype={value_dtype} vs {dtypes[0]})"
)
Member

Can you try removing this case and seeing if a similar validation still hits? I'm hoping the fillna after the transpose enforces this.

Author

I checked into this. Dropping this check results in no fill happening at all, rather than an exception of any type being raised.
(Note these results are without the check in question.)

    def test_fillna_dict_series_axis_1_value_mismatch_with_cols(self):
        df = DataFrame(
            {
                "a": [np.nan, 1, 2, np.nan, np.nan],
                "b": [1, 2, 3, np.nan, np.nan],
                "c": [np.nan, 1, 2, 3, 4],
            }
        )
        with pytest.raises(Exception):
            print()
            print(df.fillna(Series({"a": "abc", "b": "def", "c": "hij"}), axis=1))
pytest ./pandas/tests/frame/methods/test_fillna.py --ff -x
+ /root/virtualenvs/pandas-dev/bin/ninja
[1/1] Generating write_version_file with a custom command
==================================== test session starts ====================================
platform linux -- Python 3.11.13, pytest-8.4.2, pluggy-1.6.0
PyQt5 5.15.11 -- Qt runtime 5.15.17 -- Qt compiled 5.15.14
rootdir: /workspaces/pandas
configfile: pyproject.toml
plugins: hypothesis-6.139.2, localserver-0.9.0.post0, anyio-4.10.0, qt-4.5.0, cov-7.0.0, xdist-3.8.0, cython-0.3.1
collected 65 items                                                                          
run-last-failure: rerun previous 1 failure first

pandas/tests/frame/methods/test_fillna.py 
     a    b    c
0  NaN  1.0  NaN
1  1.0  2.0  1.0
2  2.0  3.0  2.0
3  NaN  NaN  3.0
4  NaN  NaN  4.0
F

========================================= FAILURES ==========================================
____________ TestFillNA.test_fillna_dict_series_axis_1_value_mismatch_with_cols _____________

self = <pandas.tests.frame.methods.test_fillna.TestFillNA object at 0x70da9893a790>

    def test_fillna_dict_series_axis_1_value_mismatch_with_cols(self):
        df = DataFrame(
            {
                "a": [np.nan, 1, 2, np.nan, np.nan],
                "b": [1, 2, 3, np.nan, np.nan],
                "c": [np.nan, 1, 2, 3, 4],
            }
        )
>       with pytest.raises(Exception):
E       Failed: DID NOT RAISE <class 'Exception'>

pandas/tests/frame/methods/test_fillna.py:503: Failed
------------------- generated xml file: /workspaces/pandas/test-data.xml --------------------
=================================== slowest 30 durations ====================================
0.03s call     pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols
0.01s teardown pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols

(1 durations < 0.005s hidden.  Use -vv to show these durations.)
================================== short test summary info ==================================
FAILED pandas/tests/frame/methods/test_fillna.py::TestFillNA::test_fillna_dict_series_axis_1_value_mismatch_with_cols - Failed: DID NOT RAISE <class 'Exception'>
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
===================================== 1 failed in 0.62s =====================================

Member

cc @jbrockmendel do you know of a function that achieves something equivalent to the check above (maybe can_hold_element)?

Member

yah, can_hold_element seems like the right thing here (though I also like the idea of "just let it fall through"). Potential trouble is that for EAs, can_hold_element always returns True.
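For reference, a minimal sketch of how can_hold_element behaves on numpy-backed data (it is an internal pandas helper, so the import path may shift between versions; the array here is made up for illustration):

```python
import numpy as np
from pandas.core.dtypes.cast import can_hold_element

arr = np.array([1.0, 2.0, np.nan])  # float64 column data

# A float fits losslessly into a float64 array
print(can_hold_element(arr, 3.5))

# A string does not, which is the dtype-mismatch case discussed above
print(can_hold_element(arr, "abc"))
```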

Successfully merging this pull request may close these issues.

column-wise fillna with Series/dict NotImplemented