Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: make NumPy, pandas, and Arrow deps optional #152

Merged
merged 11 commits into from
Sep 13, 2024
3 changes: 2 additions & 1 deletion ibis_ml/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -44,11 +44,12 @@ def _auto_patch_skorch() -> None:
return

import ibis.expr.types as ir
import numpy as np

old_fit = skorch.net.NeuralNet.fit

def fit(self, X, y=None, **fit_params):
import numpy as np

if isinstance(y, ir.Column):
y = np.asarray(y)

Expand Down
32 changes: 26 additions & 6 deletions ibis_ml/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,14 @@
import ibis
import ibis.expr.operations as ops
import ibis.expr.types as ir
import numpy as np
import pandas as pd
import pyarrow as pa
from ibis.common.dispatch import lazy_singledispatch

if TYPE_CHECKING:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
import xgboost as xgb
from sklearn.utils._estimator_html_repr import _VisualBlock

Expand All @@ -45,6 +45,9 @@ def _ibis_table_to_numpy(table: ir.Table) -> np.ndarray:

def _y_as_dataframe(y: Any) -> pd.DataFrame:
"""Coerce `y` to a pandas dataframe"""
import numpy as np
import pandas as pd

if isinstance(y, pd.DataFrame):
return y
elif isinstance(y, pd.Series):
Expand Down Expand Up @@ -144,8 +147,11 @@ def _(X, y=None, maintain_order=False):
return table, tuple(y.columns), None


@normalize_table.register(pd.DataFrame)
@normalize_table.register("pd.DataFrame")
def _(X, y=None, maintain_order=False):
import numpy as np
import pandas as pd

if y is not None:
y = _y_as_dataframe(y)
table = pd.concat([X, y], axis=1)
Expand All @@ -162,8 +168,11 @@ def _(X, y=None, maintain_order=False):
return ibis.memtable(table), targets, index


@normalize_table.register(np.ndarray)
@normalize_table.register("np.ndarray")
def _(X, y=None, maintain_order=False):
import numpy as np
import pandas as pd

X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[-1])])
if y is not None:
y = _y_as_dataframe(y)
Expand All @@ -181,8 +190,11 @@ def _(X, y=None, maintain_order=False):
return ibis.memtable(table), targets, index


@normalize_table.register(pa.Table)
@normalize_table.register("pa.Table")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes for PyArrow are not necessary, as it is a required dependency. Can you undo them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find it is required dependency in Ibis and IbisML

https://github.com/ibis-project/ibis/blob/main/pyproject.toml#L49 was marked as optional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense; I forgot about ibis-project/ibis#9552.

Guess this is fine then, to avoid having to maintain PyArrow bounds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Do you mean we should explicitly include the dependency in IbisML. It is convenient users successfully imported IbisML, but when they use it, they'll get ModuleNotFound error, have to install it.

@deepyaman do you have any comment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't explicitly require pandas/NumPy; we don't need to include those dependencies, unless a user is using a backend/ML framework that requires it. Those are specified in extras.

For PyArrow... trying to think, can you reasonably use IbisML without PyArrow? I took some looks at the linked PR in Ibis, and it seems there was some use case from people using Ibis internally in their product that could not need the PyArrow dependency. Is that also possible with IbisML? I wasn't completely sure...

In this case, I'm personally OK with leaving it as is, since IbisML isn't just getting a transitive dependency from Ibis—it completely relies on Ibis under the hood.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unblock the release, let's keep it as it is now. We could change this later if we have new findings.

def _(X, y=None, maintain_order=False):
import numpy as np
import pyarrow as pa

if y is not None:
if isinstance(y, (pa.ChunkedArray, pa.Array)):
y = pa.Table.from_pydict({"y": y})
Expand Down Expand Up @@ -246,6 +258,8 @@ def get_categories(self, column: str) -> pa.Array | None:
return self.categories.get(column)

def set_categories(self, column: str, values: pa.Array | list[Any]) -> None:
import pyarrow as pa

self.categories[column] = pa.array(values)

def drop_categories(self, column: str) -> None:
Expand All @@ -255,6 +269,8 @@ def drop_categories(self, column: str) -> None:
def _categorize_wrap_reader(
reader: pa.RecordBatchReader, categories: dict[str, pa.Array]
) -> Iterable[pa.RecordBatch]:
import pyarrow as pa

for batch in reader:
out = {}
for name, col in zip(batch.schema.names, batch.columns):
Expand Down Expand Up @@ -620,6 +636,8 @@ def _categorize_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
return df

def _categorize_pyarrow(self, table: pa.Table) -> pa.Table:
import pyarrow as pa

if not self.metadata_.categories:
return table

Expand All @@ -645,6 +663,8 @@ def _categorize_dask_dataframe(self, ddf: dd.DataFrame) -> dd.DataFrame:
def _categorize_pyarrow_batches(
self, reader: pa.RecordBatchReader
) -> pa.RecordBatchReader:
import pyarrow as pa

if not self.metadata_.categories:
return reader

Expand Down
5 changes: 4 additions & 1 deletion ibis_ml/steps/_discretize.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@

import ibis
import ibis.expr.types as ir
import numpy as np

from ibis_ml.core import Metadata, Step
from ibis_ml.select import SelectionType, selector
Expand Down Expand Up @@ -94,6 +93,8 @@ def fit_table(self, table: ir.Table, metadata: Metadata) -> None:
def _fit_uniform_strategy(
self, table: ir.Table, columns: list[str]
) -> dict[str, list[float]]:
import numpy as np

aggs = []
for col_name in columns:
col = table[col_name]
Expand All @@ -117,6 +118,8 @@ def _fit_uniform_strategy(
def _fit_quantile_strategy(
self, table: ir.Table, columns: list[str]
) -> dict[str, list[float]]:
import numpy as np

aggs = []
percentiles = np.linspace(0, 1, self.n_bins + 1)
for col_name in columns:
Expand Down
25 changes: 25 additions & 0 deletions tests/test_init.py
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you do this in a separate file instead of, say, test_core.py? Is there a practical reason?

If it's in a separate file, I can't see how if "ibis_ml" in sys.modules can even be True entering the tests. Did this not work properly in test_core.py, or you didn't try?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not try puting it in the test_core, the reason I put it in the new file becuase 1) some other files, such as _discretize.py also dependes on numpy 2) It is testing how the dependency impact the IbisML loading. I named it test_dependenct.py before, I am not sure if it is a good name.

Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
import sys
from importlib import import_module, reload
from unittest import mock

import pytest


@pytest.mark.parametrize("optional_dep", ["pandas", "numpy", "pyarrow"])
def test_optional_dependencies(optional_dep, caplog):
jitingxu1 marked this conversation as resolved.
Show resolved Hide resolved
with mock.patch.dict(sys.modules):
sys.modules[optional_dep] = None
if "ibis_ml" in sys.modules:
reload(sys.modules["ibis_ml"])
else:
import_module("ibis_ml")

assert "ibis_ml" in sys.modules
jitingxu1 marked this conversation as resolved.
Show resolved Hide resolved


def test_import():
try:
import_module("ibis_ml")
except AttributeError:
raise ImportError("cannot import IbisML") from None
assert "ibis_ml" in sys.modules
jitingxu1 marked this conversation as resolved.
Show resolved Hide resolved
Loading