ENH introduce SKRUB_DATA_DIRECTORY envar to control the data directory #1216

Open
thomass-dev opened this issue Jan 14, 2025 · 10 comments · May be fixed by #1215

Comments

@thomass-dev

It would be great to control the data cache directory from somewhere other than the Python code itself, especially in a CI context, where we need to ensure that the data is in the right place.

I propose adding an environment variable as another way to define the data directory.
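
A minimal sketch of what such a lookup could look like. The variable name `SKRUB_DATA_DIRECTORY` comes from the issue title; the helper `resolve_data_directory` and its `default` argument are hypothetical and do not describe skrub's actual implementation.

```python
import os
from pathlib import Path

# Variable name proposed in this issue.
ENV_VAR = "SKRUB_DATA_DIRECTORY"


def resolve_data_directory(default, environ=None):
    """Hypothetical helper: use the env variable if set, else `default`."""
    environ = os.environ if environ is None else environ
    value = environ.get(ENV_VAR)
    # An unset or empty variable is treated as "not configured".
    if not value:
        return Path(default).expanduser()
    return Path(value).expanduser()
```

With something like this in place, a CI job or a user's shell profile could export the variable once, and scripts would pick it up without any change to the Python code.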

@Vincent-Maladiere
Member

Hey @thomass-dev, thank you for the suggestion. Although this is not a big change, I'm not 100% sure we need this, as we use tempfile.TemporaryDirectory() in our own CI. Could you motivate this a little bit? Is this something you need for a project, e.g., skore? I don't have a strong opinion on this. WDYT @jeromedockes?

@jeromedockes
Member

I think that's definitely useful even for individual users (actually, I would have guessed we already had that). Some users will want to choose the location of the data directory, e.g. to put it in a location shared with other users, on some different storage, somewhere that is not backed up, or simply to avoid cluttering their home directory. As this is a preference of the user / local machine, allowing it to be controlled from the Python code is not a solution; it should be a configuration file or an env variable (I prefer the env variable).

@Vincent-Maladiere
Member

> as this is a preference of the user / local machine, allowing it to be controlled from the Python code is not a solution

I'm not sure I understand this bit, what do you mean?

@jeromedockes
Member

I mean the Python code may be shared with other developers, whereas the directory is a user-specific preference. For example, if I want to run the skrub examples but have them store the data in ~/whatever/datasets/skrub/, I cannot hardcode that location in the Python scripts; I want to specify it through some other channel.

@Vincent-Maladiere
Member

Ok, I understand your point. Just for the sake of argument, couldn't the shared python code also accept a directory parameter from the user and route that to skrub?
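
For illustration, that alternative could look roughly like the sketch below. The `--data-directory` flag and the `load_dataset` stand-in are hypothetical; `load_dataset` is not a real skrub function, it only marks the place where the value would be routed to skrub.

```python
import argparse


def load_dataset(data_directory):
    """Stand-in for a call into skrub (hypothetical, not a real skrub function)."""
    return f"dataset cached under {data_directory}"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Example script using skrub")
    # Hypothetical flag: every shared script would have to expose it.
    parser.add_argument("--data-directory", default=None)
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The user-supplied directory is forwarded explicitly.
    return load_dataset(args.data_directory)
```

The cost of this approach is that each script needs the extra argument, and the user has to pass it on every invocation.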

@jeromedockes
Member

Yes, you are right. Basically, the options for passing that info to a program could be a command-line argument, an env variable, or a config file. As a user, I want to set it once and forget about it, not pass it every time I invoke a script that uses skrub. As a developer, I don't want to add boilerplate to all my scripts to expose that argument (for example, the skrub examples don't have it, and if they did, sphinx wouldn't pass it when building the doc).
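
The three channels can coexist. The sketch below shows one possible precedence order (command-line value, then env variable, then config file, then a default). The function name, the `[skrub]` section, and the `data_directory` key are all hypothetical and chosen only for illustration; only the variable name `SKRUB_DATA_DIRECTORY` comes from this issue.

```python
import configparser
import os
from pathlib import Path


def pick_data_directory(default, cli_value=None, environ=None, config_path=None):
    """Hypothetical resolver: CLI value, then env variable, then config file, then default."""
    environ = os.environ if environ is None else environ
    if cli_value:
        return Path(cli_value).expanduser()
    env_value = environ.get("SKRUB_DATA_DIRECTORY")
    if env_value:
        return Path(env_value).expanduser()
    if config_path is not None:
        config = configparser.ConfigParser()
        config.read(config_path)  # a missing file is silently ignored
        # Hypothetical section and key names.
        file_value = config.get("skrub", "data_directory", fallback=None)
        if file_value:
            return Path(file_value).expanduser()
    return Path(default).expanduser()
```

Only the env variable and the config file satisfy the "set it once and forget about it" requirement; the command-line value has to be repeated on each run.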

@Vincent-Maladiere
Member

Ok, that makes sense, thanks for detailing your thoughts.

@GaelVaroquaux
Member

GaelVaroquaux commented Jan 15, 2025 via email

@thomass-dev
Author

thomass-dev commented Jan 15, 2025

> I want to run the skrub examples but have them store the data in ~/whatever/datasets/skrub/ I cannot hardcode that location in the python scripts, I want to specify it through some other channel

This is exactly the purpose of this issue. Thanks @jeromedockes for reading my mind 😄.

To be explicit, my use-case is:

In skore, we built example scripts using skrub that are intended to be shared with users. I want fine control over the data location, especially in CI so the data can be cached between pipelines (and on my personal computer), without hard-coding the location directly in the example scripts.

@Vincent-Maladiere
Copy link
Member

Nice, could you share the link to this dev?
