Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shuffle as boolean while converting dataset to dataclass #1213

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mrastgoo
Copy link
Contributor

closes #1178

adding shuffling as boolean in _fetch_dataset_as_dataclass and setting it to True in fetch_employee_salaries

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @mrastgoo, thank you for this PR!

Your tests fail because of network issues with OpenML. This is unrelated to your PR, so I reran them.

@@ -702,6 +703,8 @@ def _fetch_dataset_as_dataclass(
df = pd.read_parquet(info["path"])
else:
df = pd.read_csv(info["path"], **read_csv_kwargs)
if shuffling:
df = shuffle(df, random_state=42).reset_index(drop=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the random state should be a function parameter, as it would allow for control for multiple seeds during testing and more fine-grained control on reproducibility.

I'm not 100% sure we should reset indices, because it could be surprising for users. WDYT?

@jeromedockes
Copy link
Member

jeromedockes commented Jan 13, 2025

Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing #1178 . we could shuffle it in the the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 15, 2025 via email

@mrastgoo
Copy link
Contributor Author

Hey @mrastgoo, thank you for this PR!

Your tests fail because of network issues with OpenML. This is unrelated to your PR, so I reran them.

Thanks @mrastgoo! I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing #1178 . we could shuffle it in the the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself. WDYT?

I can see why you prefer that, in this way the fetcher will return the original data as it is, with the same index. However it will make the example longer.

@mrastgoo
Copy link
Contributor Author

I'm not sure we need to add a parameter to the fetcher, at least it is not required for addressing [1]#1178 . we could shuffle it in the the example code (which by default is folded for the tabular_learner example, and which is not shown for the tablereport) rather than in the fetcher itself
I'd rather keep the code examples as simple as possible (every character counts), and shuffle in the fetcher, by default, in a reproducible way. And for this, an easy way of doing things would be to cut and reorder the dataset at a fix index, like "cutting" a deck of cards.

I am not sure to understand the purpose of "cutting" @GaelVaroquaux, how many cuts do we do ? Do we want to shuffle as well ? and should we keep the orginal index or reorder them ?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 15, 2025 via email

@jeromedockes
Copy link
Member

and IIUC the reordering is not optional and no new parameter is exposed to the user right?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 16, 2025 via email

@jeromedockes
Copy link
Member

jeromedockes commented Jan 16, 2025 via email

@Vincent-Maladiere
Copy link
Member

Then what about editing the dataset, putting the reordered version on Figshare and fetching that directly?

@GaelVaroquaux
Copy link
Member

GaelVaroquaux commented Jan 16, 2025 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Example TableReport in the home page
4 participants