build: make NumPy, pandas, and Arrow deps optional #152
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #152      +/-   ##
==========================================
+ Coverage   84.94%   85.24%   +0.29%
==========================================
  Files          26       27       +1
  Lines        1920     1938      +18
==========================================
+ Hits         1631     1652      +21
+ Misses        289      286       -3
We should not force the user to install the DuckDB backend when IbisML can work without it. Option 1 is a better answer, but it still forces the user to install NumPy and pandas; do they need these dependencies unless they are using pandas or NumPy inputs or scikit-learn? The correct answer is probably to move the imports into the scope where they're used (see how it's done for something like Polars).
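For illustration, moving an import into the scope where it is used might look like the sketch below. The function names are hypothetical and are not taken from the IbisML code base; the point is only that a module-level `import numpy` is replaced by an import inside the one function that needs it.

```python
def column_mean(values):
    """Pure-Python path: works with no optional dependency installed."""
    return sum(values) / len(values)


def to_numpy_array(values):
    """NumPy path: the import happens only when this function is called."""
    # Imported here rather than at module level, so importing this module
    # succeeds even when NumPy is not installed.
    import numpy as np

    return np.asarray(values)
```

With this layout, a user who never calls `to_numpy_array` never needs NumPy; a user who does call it without NumPy installed gets the usual `ModuleNotFoundError` at call time.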
local import numpy pandas pyarrow
Minor change request. Can you also add a test case to make sure things work without the pandas/NumPy dependency?
@@ -181,8 +190,11 @@ def _(X, y=None, maintain_order=False):
     return ibis.memtable(table), targets, index


-@normalize_table.register(pa.Table)
+@normalize_table.register("pa.Table")
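The change above swaps a class for a string when registering a handler, which lets the handler be declared without importing the library at module load. The dispatcher IbisML actually uses is not shown in this thread, so the `LazyDispatch` class below is a hypothetical stand-in that sketches the general idea: string keys are resolved to real classes only once their module appears in `sys.modules`. It assumes the string is a full `module.ClassName` path and uses the standard-library `fractions.Fraction` in place of an Arrow table.

```python
import sys


class LazyDispatch:
    """Hypothetical minimal dispatcher; not the Ibis implementation."""

    def __init__(self, default):
        self._default = default
        self._by_type = {}  # class -> handler
        self._by_name = {}  # "module.ClassName" -> handler, resolved later

    def register(self, key):
        def decorator(func):
            registry = self._by_name if isinstance(key, str) else self._by_type
            registry[key] = func
            return func

        return decorator

    def _resolve_imported(self):
        # Turn string keys into real classes once their module is loaded.
        for key in list(self._by_name):
            module_name, _, attr = key.rpartition(".")
            module = sys.modules.get(module_name)
            if module is not None:
                self._by_type[getattr(module, attr)] = self._by_name.pop(key)

    def __call__(self, obj, *args, **kwargs):
        self._resolve_imported()
        for cls in type(obj).__mro__:
            if cls in self._by_type:
                return self._by_type[cls](obj, *args, **kwargs)
        return self._default(obj, *args, **kwargs)


@LazyDispatch
def describe(obj):
    return "unknown"


@describe.register("fractions.Fraction")  # registered without importing fractions
def _(obj):
    return "fraction"


@describe.register(int)
def _(obj):
    return "int"
```

A caller can only pass a `Fraction` after importing `fractions`, so by the time the handler is needed the string key can always be resolved.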
The changes for PyArrow are not necessary, as it is a required dependency. Can you undo them?
I did not find it listed as a required dependency in either Ibis or IbisML.
https://github.com/ibis-project/ibis/blob/main/pyproject.toml#L49 marks it as optional.
OK, makes sense; I forgot about ibis-project/ibis#9552.
Guess this is fine then, to avoid having to maintain PyArrow bounds.
It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.
It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.
Do you mean we should explicitly include the dependency in IbisML? Otherwise users can import IbisML successfully, but when they use it they will get a ModuleNotFoundError and have to install the package.
@deepyaman do you have any comment?
We don't explicitly require pandas/NumPy; we don't need to include those dependencies, unless a user is using a backend/ML framework that requires it. Those are specified in extras.
For PyArrow... trying to think, can you reasonably use IbisML without PyArrow? I took a look at the linked PR in Ibis, and it seems there was a use case from people using Ibis internally in their product who did not need the PyArrow dependency. Is that also possible with IbisML? I wasn't completely sure...
In this case, I'm personally OK with leaving it as is, since IbisML isn't just getting a transitive dependency from Ibis—it completely relies on Ibis under the hood.
To unblock the release, let's keep it as it is now. We could change this later if we have new findings.
should we be raising something if the relevant libraries aren't installed? or is the import error a user would get fine? idk what the standard practice would be
Import error is fine IMO for now. I'm not sure how far we're getting realistically without a backend.
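The thread settles on the plain import error. Purely as an illustration of the alternative raised in the question, a more descriptive error could be produced by a small wrapper like the one below. `require_optional` and its `extra` parameter are hypothetical and not part of IbisML, and the install hint assumes the package name matches the module name, which is not always true.

```python
import importlib


def require_optional(module_name, extra=None):
    """Import an optional dependency or raise an ImportError with a hint.

    Hypothetical helper for illustration; not an IbisML function.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Assumes the installable package is named like the module.
        hint = f"pip install {module_name}"
        if extra is not None:
            hint += f" (or install the '{extra}' extra)"
        raise ImportError(
            f"{module_name!r} is required for this feature; try: {hint}"
        ) from exc


# An installed module is returned as usual.
json = require_optional("json")

# A missing module produces the more descriptive message.
try:
    require_optional("surely_not_installed_pkg")
except ImportError as err:
    message = str(err)
```

The trade-off is an extra helper to maintain in exchange for a message that names the feature and suggests a fix; the default `ModuleNotFoundError` already names the missing module.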
add test_import IbisML
6858224 to 9e3e466
Added tests:
tests/test_init.py
Outdated
Why do you do this in a separate file instead of, say, test_core.py? Is there a practical reason?
If it's in a separate file, I can't see how if "ibis_ml" in sys.modules can even be True entering the tests. Did this not work properly in test_core.py, or you didn't try?
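The test under review is not reproduced in this thread. As a rough sketch of how such a check can be written so that it does not depend on which test file runs first, the helpers below evict any cached copy of the package from `sys.modules`, hide the optional modules, and then import again. All three helper names are hypothetical; for IbisML the call would presumably be something like `imports_without("ibis_ml", "numpy", "pandas")`, which is an assumption and is not run here.

```python
import importlib
import sys
from contextlib import contextmanager

_MISSING = object()


@contextmanager
def hide_modules(*names):
    """Make `import name` fail for each name while the block is active."""
    saved = {name: sys.modules.get(name, _MISSING) for name in names}
    for name in names:
        # A None entry in sys.modules makes the import system raise ImportError.
        sys.modules[name] = None
    try:
        yield
    finally:
        for name, module in saved.items():
            if module is _MISSING:
                sys.modules.pop(name, None)
            else:
                sys.modules[name] = module


def import_fresh(package):
    """Drop any cached copy of `package` and import it again."""
    for name in list(sys.modules):
        if name == package or name.startswith(package + "."):
            del sys.modules[name]
    return importlib.import_module(package)


def imports_without(package, *optional):
    """Return True if `package` imports while `optional` modules are hidden."""
    with hide_modules(*optional):
        try:
            import_fresh(package)
        except ImportError:
            return False
    return True
```

Because `import_fresh` removes the cached module first, the result is the same whether or not the package was already imported earlier in the session. The sketch does not handle hiding a submodule of the package under test.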
I did not try putting it in test_core.py. I put it in a new file because 1) some other files, such as _discretize.py, also depend on NumPy, and 2) it tests how the dependencies impact IbisML loading. I named it test_dependenct.py
before; I am not sure if it is a good name.
The default Ibis installation removed NumPy and pandas as required dependencies: ibis-project/ibis#9564
I do not have experience with package dependency management. I have two options
I have no idea which one is better, or whether there are other, better ones.
I took option 2; please take a look @lostmygithubaccount @deepyaman
If it is not urgent, we could wait for Deepyaman to be back.
----update---
The final decision is to import NumPy, pandas, and PyArrow in function scope.
resolve #151