Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH introduce SKRUB_DATA_DIRECTORY envar to control the data directory #1215

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

thomass-dev
Copy link

@thomass-dev thomass-dev commented Jan 14, 2025

Closes #1216.

Adding a new way to control the location of the data directory, using envar.

It can be useful when you want to set the data cache directory from outside the code, especially in CI.
In addition, fix the expansion of ~ when passing programmatically an explicit path to get_data_home.

@thomass-dev thomass-dev force-pushed the control-skrub-data-home-using-envvar branch from 39c9f22 to ceeb5ec Compare January 14, 2025 22:30
@thomass-dev
Copy link
Author

Can i instantiate a logger from stdlib?
I've seen the use of loguru in the benchmark folder, so i reproduced, but i see that it's not a dependency.

@glemaitre
Copy link
Member

loguru will not be a dependency for the main package (the benchmark lives on its own).

I see that we are using logger:

https://github.com/probabl-ai/skore/blob/418b72ca6711ef467c95f67416f2162382c645a7/skore/src/skore/__init__.py#L21-L23

I did not recall because in scikit-learn, we using some verbose parameter + print function. But since, the logger is here, it should be used I assume.

@jeromedockes can probably confirm it

@glemaitre
Copy link
Member

Whoops, it is actually late and I'm confusing the sk*** pacakge. So we don't use logger which is more what I thought. I would expect that we use verbose + print.

@jeromedockes could confirm

@glemaitre
Copy link
Member

we will need a small test to check that we write in the good directory as well. I think that we only need to test the get_data_home for that.

Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

from loguru import logger

DATA_HOME_ENVAR_NAME = "SKRUB_DATA_DIRECTORY"
DATA_HOME_ENVAR = environ.get(DATA_HOME_ENVAR_NAME)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to do this in the fetching function rather than upon import. for one thing it will make testing easier

Copy link
Author

@thomass-dev thomass-dev Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I knew I'd get a comment like that ^^.

Indeed the test is a bit more complicated, because we need to reload the module to reset the value of the global variables. But my intention was to set these values only once during the import, to avoid side-effect if the user changes the envar at runtime. I wanted 2 successive and identical calls to be coherent and consistent.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You decide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not so easy to change the environment of a running process from outside after it has started that users would be likely do it inadvertently. If they change it from the python code itself, they probably intend the change to be reflected by the behavior of the fetcher -- otherwise there's not reason to change it. either way in the vast majority of cases it will probably stay the same so let's do what is easiest for us and read it in the function :) that's also what scikit-learn does

skrub/datasets/_utils.py Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
skrub/datasets/_utils.py Outdated Show resolved Hide resolved
@@ -10,6 +23,9 @@ def get_data_home(data_home=None):
By default the data directory is set to a folder named 'skrub_data' in the
user home folder.

You can even customize the default data directory by setting in your environment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

users do not use that function directly only through the fetcher functions so they will not see this. maybe in addition we could mention the env variable in the "getting started" example:

https://github.com/skrub-data/skrub/blob/main/examples/00_getting_started.py#L17

I guess we could mention it as well in the fetchers' docstrings but not sure that's necessary, or it could be done in another PR as at the moment they don't even mention the location of the data directory

@jeromedockes
Copy link
Member

the tests failure are not related to your PR but to #1217 :/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH introduce SKRUB_DATA_DIRECTORY envar to control the data directory
3 participants