GroupBy.idxmin and idxmax are very broken #584

Open
@phofl

The implementation returns wrong results as soon as the max or min value is not in the first partition.

This has been broken for years, so it is only being added for API compatibility now, but we should fix it.

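import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq


def test_df_groupby_idxmax():
    # Five rows, so the index needs five labels (0..4)
    pdf = pd.DataFrame(
        {"idx": list(range(5)), "group": [1, 1, 2, 2, 1], "value": [10, 20, 20, 10, 40]}
    ).set_index("idx")

    # Three partitions, so the max of group 1 (value 40 at idx 4) is not in the first one
    ddf = dd.from_pandas(pdf, npartitions=3)

    expected = pd.DataFrame({"group": [1, 2], "value": [4, 2]}).set_index("group")

    result_pd = pdf.groupby("group").idxmax()
    result_dd = ddf.groupby("group").idxmax()

    assert_eq(result_pd, result_dd)
    assert_eq(expected, result_dd)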

This is a very simple reproducer for the underlying issue.
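
For illustration, here is a minimal pandas-only sketch of one way a per-partition reduction could stay correct when the max is in a later partition: each partition reports both its max value and the label where it occurs, and the combine step picks the label belonging to the overall max. This is not the dask implementation. The helper names `chunk_idxmax` and `combine_idxmax` and the hand-made `partitions` list are hypothetical and only stand in for what a partitioned reduction might do.

```python
import pandas as pd

pdf = pd.DataFrame(
    {"idx": list(range(5)), "group": [1, 1, 2, 2, 1], "value": [10, 20, 20, 10, 40]}
).set_index("idx")

# Hand-made stand-ins for partitions (assumption: rows split in index order)
partitions = [pdf.iloc[0:2], pdf.iloc[2:4], pdf.iloc[4:5]]


def chunk_idxmax(part):
    """Per partition: keep the max value AND the index label where it occurs."""
    grouped = part.groupby("group")["value"]
    return pd.DataFrame({"max": grouped.max(), "label": grouped.idxmax()})


def combine_idxmax(chunks):
    """Across partitions: for each group, take the label attached to the largest max."""
    candidates = pd.concat(chunks).reset_index()  # columns: group, max, label
    best_rows = candidates.groupby("group")["max"].idxmax()
    return candidates.loc[best_rows].set_index("group")["label"].rename("value")


result = combine_idxmax([chunk_idxmax(p) for p in partitions])
print(result)
```

Carrying the value alongside the label is the point of the sketch: a label on its own cannot be compared across partitions, so a combine step that only sees labels has nothing to rank them by.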
