ENH introduce `SKRUB_DATA_DIRECTORY` envar to control the data directory #1215

thomass-dev · 2025-01-14T22:27:04Z

Closes #1216.

This adds a new way to control the location of the data directory: the `SKRUB_DATA_DIRECTORY` environment variable.

It is useful when you want to set the data cache directory from outside the code, especially in CI.
In addition, this fixes the expansion of `~` when an explicit path is passed programmatically to `get_data_home`.
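
A minimal sketch of the lookup described above, not the code from this PR. It assumes that an explicit argument wins over the environment variable, which in turn wins over the default `skrub_data` folder in the user's home. The helper name `resolve_data_home` is hypothetical.

```python
import os
from pathlib import Path

DATA_HOME_ENVAR_NAME = "SKRUB_DATA_DIRECTORY"


def resolve_data_home(data_home=None):
    """Hypothetical helper: choose the data directory and expand a leading '~'."""
    if data_home is None:
        # Assumption: the environment variable is only consulted when no
        # explicit path is given.
        data_home = os.environ.get(DATA_HOME_ENVAR_NAME)
    if data_home is None:
        # Default location: a 'skrub_data' folder in the user's home.
        data_home = Path.home() / "skrub_data"
    # expanduser() replaces a leading '~' with the home directory;
    # resolve() then makes the path absolute.
    return Path(data_home).expanduser().resolve()
```

In CI, the variable would typically be exported before the Python process starts, so no code change is needed to redirect the cache.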


thomass-dev · 2025-01-14T22:33:46Z

Can I instantiate a logger from the stdlib?
I've seen loguru used in the benchmark folder, so I did the same, but I see that it's not a dependency.

glemaitre · 2025-01-14T22:50:24Z

loguru will not be a dependency for the main package (the benchmark lives on its own).

I see that we are using logger:

https://github.com/probabl-ai/skore/blob/418b72ca6711ef467c95f67416f2162382c645a7/skore/src/skore/__init__.py#L21-L23

I did not recall because in scikit-learn we use a verbose parameter + the print function. But since the logger is here, I assume it should be used.

@jeromedockes can probably confirm it

glemaitre · 2025-01-14T22:55:33Z

Whoops, it is actually late and I'm confusing the sk*** packages. So we don't use a logger, which is more what I thought. I would expect us to use verbose + print.

@jeromedockes could confirm

glemaitre · 2025-01-14T23:00:57Z

We will also need a small test to check that we write to the right directory. I think we only need to test `get_data_home` for that.
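
Such a test could look roughly like the sketch below. It uses only the standard library, and the `get_data_home` defined here is a stand-in so the example runs on its own. It is an assumption about the behavior (reads the variable, expands `~`, creates the folder), not skrub's implementation.

```python
import os
import tempfile
from pathlib import Path
from unittest import mock


def get_data_home(data_home=None):
    """Stand-in for the function under test (not skrub's implementation)."""
    if data_home is None:
        data_home = os.environ.get("SKRUB_DATA_DIRECTORY", "~/skrub_data")
    path = Path(data_home).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    return path


def test_get_data_home_uses_envar():
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "custom"
        # patch.dict restores os.environ when the block exits, so the
        # test does not leak the variable into other tests.
        with mock.patch.dict(os.environ, {"SKRUB_DATA_DIRECTORY": str(target)}):
            result = get_data_home()
        assert result == target.resolve()
        assert result.is_dir()
```

With pytest, the `monkeypatch.setenv` and `tmp_path` fixtures would serve the same purpose as `mock.patch.dict` and `tempfile` here.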

jeromedockes

thanks!

jeromedockes · 2025-01-15T09:32:24Z

skrub/datasets/_utils.py

+from loguru import logger
+
+DATA_HOME_ENVAR_NAME = "SKRUB_DATA_DIRECTORY"
+DATA_HOME_ENVAR = environ.get(DATA_HOME_ENVAR_NAME)


I would prefer to do this in the fetching function rather than upon import. For one thing, it will make testing easier.

I knew I'd get a comment like that ^^.

Indeed the test is a bit more complicated, because we need to reload the module to reset the value of the global variables. But my intention was to set these values only once, at import, to avoid side effects if the user changes the envar at runtime. I wanted two successive, identical calls to give consistent results.

You decide.

It's not so easy to change the environment of a running process from outside after it has started that users would be likely to do it inadvertently. If they change it from the Python code itself, they probably intend the change to be reflected in the behavior of the fetcher; otherwise there's no reason to change it. Either way, in the vast majority of cases it will stay the same, so let's do what is easiest for us and read it in the function :) That's also what scikit-learn does.
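
The difference between the two approaches discussed in this thread can be illustrated with a small sketch. The names `FROZEN_AT_IMPORT`, `read_at_import` and `read_at_call` are hypothetical and exist only for this comparison.

```python
import os

ENVAR = "SKRUB_DATA_DIRECTORY"

# Import-time read: the value is captured once, when this line runs.
FROZEN_AT_IMPORT = os.environ.get(ENVAR)


def read_at_import():
    # Always returns the value captured above, even if os.environ
    # is modified later in the same process.
    return FROZEN_AT_IMPORT


def read_at_call():
    # Call-time read: later changes to os.environ are picked up.
    return os.environ.get(ENVAR)
```

If the code later sets `os.environ["SKRUB_DATA_DIRECTORY"]`, `read_at_call()` returns the new value while `read_at_import()` keeps returning the one captured at import. A test of the import-time variant therefore has to reload the module, which is the extra complexity mentioned above.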

skrub/datasets/_utils.py

jeromedockes · 2025-01-15T16:12:45Z

skrub/datasets/_utils.py

@@ -10,6 +23,9 @@ def get_data_home(data_home=None):
    By default the data directory is set to a folder named 'skrub_data' in the
    user home folder.

+    You can even customize the default data directory by setting in your environment


Users do not call that function directly, only through the fetcher functions, so they will not see this. Maybe in addition we could mention the env variable in the "getting started" example:

https://github.com/skrub-data/skrub/blob/main/examples/00_getting_started.py#L17

I guess we could mention it in the fetchers' docstrings as well, but I'm not sure that's necessary. It could also be done in another PR, as at the moment they don't even mention the location of the data directory.

jeromedockes · 2025-01-15T16:24:45Z

the test failures are not related to your PR but to #1217 :/

ENH introduce SKRUB_DATA_DIRECTORY envar to control the skrub data …

ceeb5ec

…directory

thomass-dev force-pushed the control-skrub-data-home-using-envvar branch from 39c9f22 to ceeb5ec Compare January 14, 2025 22:30

Add tests

6f9cafd

jeromedockes reviewed Jan 15, 2025


thomass-dev added 3 commits January 15, 2025 15:23

Use print instead of logger

0380dc8

Use Path.home() instead of Path("~")

f4b564c

Fix linter

20bbd5d

jeromedockes reviewed Jan 15, 2025
