Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build: make NumPy, pandas, and Arrow deps optional #152

Merged
merged 11 commits into from
Sep 13, 2024
32 changes: 26 additions & 6 deletions ibis_ml/core.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,14 @@
import ibis
import ibis.expr.operations as ops
import ibis.expr.types as ir
import numpy as np
import pandas as pd
import pyarrow as pa
from ibis.common.dispatch import lazy_singledispatch

if TYPE_CHECKING:
import dask.dataframe as dd
import numpy as np
import pandas as pd
import polars as pl
import pyarrow as pa
import xgboost as xgb
from sklearn.utils._estimator_html_repr import _VisualBlock

Expand All @@ -45,6 +45,9 @@ def _ibis_table_to_numpy(table: ir.Table) -> np.ndarray:

def _y_as_dataframe(y: Any) -> pd.DataFrame:
"""Coerce `y` to a pandas dataframe"""
import numpy as np
import pandas as pd

if isinstance(y, pd.DataFrame):
return y
elif isinstance(y, pd.Series):
Expand Down Expand Up @@ -144,8 +147,11 @@ def _(X, y=None, maintain_order=False):
return table, tuple(y.columns), None


@normalize_table.register(pd.DataFrame)
@normalize_table.register("pd.DataFrame")
def _(X, y=None, maintain_order=False):
import numpy as np
import pandas as pd

if y is not None:
y = _y_as_dataframe(y)
table = pd.concat([X, y], axis=1)
Expand All @@ -162,8 +168,11 @@ def _(X, y=None, maintain_order=False):
return ibis.memtable(table), targets, index


@normalize_table.register(np.ndarray)
@normalize_table.register("np.ndarray")
def _(X, y=None, maintain_order=False):
import numpy as np
import pandas as pd

X = pd.DataFrame(X, columns=[f"x{i}" for i in range(X.shape[-1])])
if y is not None:
y = _y_as_dataframe(y)
Expand All @@ -181,8 +190,11 @@ def _(X, y=None, maintain_order=False):
return ibis.memtable(table), targets, index


@normalize_table.register(pa.Table)
@normalize_table.register("pa.Table")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes for PyArrow are not necessary, as it is a required dependency. Can you undo them?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find it is required dependency in Ibis and IbisML

https://github.com/ibis-project/ibis/blob/main/pyproject.toml#L49 was marked as optional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, makes sense; I forgot about ibis-project/ibis#9552.

Guess this is fine then, to avoid having to maintain PyArrow bounds.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bad idea to not explicitly include dependencies that you are explicitly importing. You shouldn't really ever depend on another project to have specific dependencies.

Do you mean we should explicitly include the dependency in IbisML. It is convenient users successfully imported IbisML, but when they use it, they'll get ModuleNotFound error, have to install it.

@deepyaman do you have any comment?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't explicitly require pandas/NumPy; we don't need to include those dependencies, unless a user is using a backend/ML framework that requires it. Those are specified in extras.

For PyArrow... trying to think, can you reasonably use IbisML without PyArrow? I took some looks at the linked PR in Ibis, and it seems there was some use case from people using Ibis internally in their product that could not need the PyArrow dependency. Is that also possible with IbisML? I wasn't completely sure...

In this case, I'm personally OK with leaving it as is, since IbisML isn't just getting a transitive dependency from Ibis—it completely relies on Ibis under the hood.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unblock the release, let's keep it as it is now. We could change this later if we have new findings.

def _(X, y=None, maintain_order=False):
import numpy as np
import pyarrow as pa

if y is not None:
if isinstance(y, (pa.ChunkedArray, pa.Array)):
y = pa.Table.from_pydict({"y": y})
Expand Down Expand Up @@ -246,6 +258,8 @@ def get_categories(self, column: str) -> pa.Array | None:
return self.categories.get(column)

def set_categories(self, column: str, values: pa.Array | list[Any]) -> None:
import pyarrow as pa

self.categories[column] = pa.array(values)

def drop_categories(self, column: str) -> None:
Expand All @@ -255,6 +269,8 @@ def drop_categories(self, column: str) -> None:
def _categorize_wrap_reader(
reader: pa.RecordBatchReader, categories: dict[str, pa.Array]
) -> Iterable[pa.RecordBatch]:
import pyarrow as pa

for batch in reader:
out = {}
for name, col in zip(batch.schema.names, batch.columns):
Expand Down Expand Up @@ -620,6 +636,8 @@ def _categorize_pandas(self, df: pd.DataFrame) -> pd.DataFrame:
return df

def _categorize_pyarrow(self, table: pa.Table) -> pa.Table:
import pyarrow as pa

if not self.metadata_.categories:
return table

Expand All @@ -645,6 +663,8 @@ def _categorize_dask_dataframe(self, ddf: dd.DataFrame) -> dd.DataFrame:
def _categorize_pyarrow_batches(
self, reader: pa.RecordBatchReader
) -> pa.RecordBatchReader:
import pyarrow as pa

if not self.metadata_.categories:
return reader

Expand Down
5 changes: 4 additions & 1 deletion ibis_ml/steps/_discretize.py
Original file line number Diff line number Diff line change
Expand Up @@ -4,7 +4,6 @@

import ibis
import ibis.expr.types as ir
import numpy as np

from ibis_ml.core import Metadata, Step
from ibis_ml.select import SelectionType, selector
Expand Down Expand Up @@ -94,6 +93,8 @@ def fit_table(self, table: ir.Table, metadata: Metadata) -> None:
def _fit_uniform_strategy(
self, table: ir.Table, columns: list[str]
) -> dict[str, list[float]]:
import numpy as np

aggs = []
for col_name in columns:
col = table[col_name]
Expand All @@ -117,6 +118,8 @@ def _fit_uniform_strategy(
def _fit_quantile_strategy(
self, table: ir.Table, columns: list[str]
) -> dict[str, list[float]]:
import numpy as np

aggs = []
percentiles = np.linspace(0, 1, self.n_bins + 1)
for col_name in columns:
Expand Down
Loading