-
Notifications
You must be signed in to change notification settings - Fork 109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH introduce SKRUB_DATA_DIRECTORY
envar to control the data directory
#1215
base: main
Are you sure you want to change the base?
ENH introduce SKRUB_DATA_DIRECTORY
envar to control the data directory
#1215
Conversation
39c9f22
to
ceeb5ec
Compare
Can i instantiate a logger from stdlib? |
I see that we are using I did not recall because in scikit-learn, we using some verbose parameter + @jeromedockes can probably confirm it |
Whoops, it is actually late and I'm confusing the @jeromedockes could confirm |
we will need a small test to check that we write in the good directory as well. I think that we only need to test the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
from loguru import logger | ||
|
||
DATA_HOME_ENVAR_NAME = "SKRUB_DATA_DIRECTORY" | ||
DATA_HOME_ENVAR = environ.get(DATA_HOME_ENVAR_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to do this in the fetching function rather than upon import. for one thing it will make testing easier
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I knew I'd get a comment like that ^^.
Indeed the test is a bit more complicated, because we need to reload the module to reset the value of the global variables. But my intention was to set these values only once during the import, to avoid side-effect if the user changes the envar at runtime. I wanted 2 successive and identical calls to be coherent and consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You decide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not so easy to change the environment of a running process from outside after it has started that users would be likely do it inadvertently. If they change it from the python code itself, they probably intend the change to be reflected by the behavior of the fetcher -- otherwise there's not reason to change it. either way in the vast majority of cases it will probably stay the same so let's do what is easiest for us and read it in the function :) that's also what scikit-learn does
@@ -10,6 +23,9 @@ def get_data_home(data_home=None): | |||
By default the data directory is set to a folder named 'skrub_data' in the | |||
user home folder. | |||
|
|||
You can even customize the default data directory by setting in your environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
users do not use that function directly only through the fetcher functions so they will not see this. maybe in addition we could mention the env variable in the "getting started" example:
https://github.com/skrub-data/skrub/blob/main/examples/00_getting_started.py#L17
I guess we could mention it as well in the fetchers' docstrings but not sure that's necessary, or it could be done in another PR as at the moment they don't even mention the location of the data directory
the tests failure are not related to your PR but to #1217 :/ |
Closes #1216.
Adding a new way to control the location of the data directory, using envar.
It can be useful when you want to set the data cache directory from outside the code, especially in CI.
In addition, fix the expansion of
~
when passing programmatically an explicit path toget_data_home
.