shuffle as boolean while converting dataset to dataclass #1213

mrastgoo · 2025-01-12T20:19:35Z

Adds shuffling as a boolean parameter in _fetch_dataset_as_dataclass and sets it to True in fetch_employee_salaries.

Vincent-Maladiere

Hey @mrastgoo, thank you for this PR!

Your tests fail because of network issues with OpenML. This is unrelated to your PR, so I reran them.

Vincent-Maladiere · 2025-01-13T08:42:31Z

skrub/datasets/_fetching.py

@@ -702,6 +703,8 @@ def _fetch_dataset_as_dataclass(
            df = pd.read_parquet(info["path"])
        else:
            df = pd.read_csv(info["path"], **read_csv_kwargs)
+        if shuffling:
+            df = shuffle(df, random_state=42).reset_index(drop=True)


I think the random state should be a function parameter: it would allow testing with multiple seeds and give finer-grained control over reproducibility.

I'm not 100% sure we should reset the index, because it could be surprising for users. WDYT?
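
To make the two points above concrete, here is a minimal sketch rather than the PR's actual code. The helper name `shuffle_rows` and its `reset_index` flag are hypothetical, the toy DataFrame is made up, and pandas' `DataFrame.sample` is swapped in for the `shuffle` call in the diff. It exposes the seed as an argument and shows what resetting the index changes.

```python
import pandas as pd


def shuffle_rows(df, random_state=None, reset_index=False):
    """Hypothetical helper: shuffle rows with a caller-chosen seed."""
    # frac=1 samples every row without replacement, i.e. a permutation.
    shuffled = df.sample(frac=1, random_state=random_state)
    if reset_index:
        # Drop the original labels and renumber rows 0..n-1.
        shuffled = shuffled.reset_index(drop=True)
    return shuffled


# Toy data standing in for a fetched dataset (assumption, not real data).
df = pd.DataFrame({"salary": [10, 20, 30, 40, 50]})

kept = shuffle_rows(df, random_state=0)  # original labels travel with the rows
reset = shuffle_rows(df, random_state=0, reset_index=True)  # labels renumbered
```

Without the reset, `kept.loc[0]` is still the first row of the original file; with it, `reset.loc[0]` is whichever row the shuffle placed first, which is the behaviour that might surprise users.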

jeromedockes · 2025-01-13T12:08:04Z

Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher; at least it is not required for addressing #1178. We could shuffle in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

GaelVaroquaux · 2025-01-15T17:58:59Z

> I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing #1178. we could shuffle it in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself

I'd rather keep the code examples as simple as possible (every character counts), and shuffle in the fetcher, by default, in a reproducible way. For this, an easy approach would be to cut and reorder the dataset at a fixed index, like "cutting" a deck of cards.

mrastgoo · 2025-01-15T21:29:50Z

> Hey @mrastgoo, thank you for this PR!
>
> Your tests fail because of network issues with OpenML. This is unrelated to your PR, so I reran them.

> Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing #1178. we could shuffle it in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

I can see why you prefer that: this way the fetcher returns the original data as it is, with the same index. However, it will make the example longer.

mrastgoo · 2025-01-15T21:31:41Z

> > I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing #1178. we could shuffle it in the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself
>
> I'd rather keep the code examples as simple as possible (every character counts), and shuffle in the fetcher, by default, in a reproducible way. And for this, an easy way of doing things would be to cut and reorder the dataset at a fixed index, like "cutting" a deck of cards.

I am not sure I understand the purpose of "cutting", @GaelVaroquaux. How many cuts do we do? Do we want to shuffle as well? And should we keep the original index or reorder it?

GaelVaroquaux · 2025-01-15T21:37:27Z

> I am not sure I understand the purpose of "cutting", @GaelVaroquaux.

The whole goal of the modification is to make the first few lines less nasty.

> How many cuts do we do?

Let's try one.

> Do we want to shuffle as well?

No, that way it's stable and reproducible (random number generators are not always reproducible across hardware).

> And should we keep the original index or reorder it?

Not reorder, but reset, to avoid something looking strange.
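
The answers above (one cut, no random shuffle, index reset) could be sketched as follows. This is only an illustration under assumptions: `cut_rows`, the `cut_index` argument, and the toy DataFrame are hypothetical, and the thread does not say which index the cut should use.

```python
import pandas as pd


def cut_rows(df, cut_index):
    """Hypothetical helper: move the first `cut_index` rows to the end."""
    # One cut, like cutting a deck of cards: the tail goes on top of the head.
    reordered = pd.concat([df.iloc[cut_index:], df.iloc[:cut_index]])
    # Reset so the index reads 0..n-1 instead of starting at `cut_index`.
    return reordered.reset_index(drop=True)


# Toy data standing in for a fetched dataset (assumption, not real data).
df = pd.DataFrame({"row": ["a", "b", "c", "d", "e"]})
cut = cut_rows(df, cut_index=2)
print(cut["row"].tolist())  # ['c', 'd', 'e', 'a', 'b']
```

Because no random number generator is involved, the result depends only on the data and the chosen index, which is the reproducibility argument made above.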

jeromedockes · 2025-01-16T08:56:27Z

And IIUC, the reordering is not optional and no new parameter is exposed to the user, right?

GaelVaroquaux · 2025-01-16T09:00:12Z

> and IIUC the reordering is not optional and no new parameter is exposed to the user right?

As you wish

jeromedockes · 2025-01-16T09:11:05Z

> > and IIUC the reordering is not optional and no new parameter is exposed to the user right?
>
> As you wish

OK, in that case I'd rather not add a parameter and just do the transformation every time.

Vincent-Maladiere · 2025-01-16T16:56:30Z

Then what about editing the dataset, putting the reordered version on Figshare and fetching that directly?

GaelVaroquaux · 2025-01-16T17:20:29Z

> Then what about editing the dataset, putting the reordered version on Figshare and fetching that directly?

I always like having some traceability of the data, so I would prefer not to change the upstream data.

shuffle as boolean while converting dataset to dataclass

9be7245

Vincent-Maladiere reviewed Jan 13, 2025
