Allow MapieRegressor to use K-fold iterator variants with stratification and groups. #393

Merged
Commits (45 total; changes shown from 19 commits)
ebfb6be
Add support for stratified and group cross validation
pidefrem Jan 3, 2024
cc15afc
Fix subsample get_n_splits with new args
pidefrem Jan 3, 2024
609d164
Update estimator.interface with group arg
pidefrem Jan 3, 2024
f376205
Add test
pidefrem Jan 3, 2024
bf81194
Add groups arg at the end
pidefrem Jan 3, 2024
7411e64
Update HISTORY
pidefrem Jan 3, 2024
1d0c0b9
Make sure new staff awagent everyone localaccounts _appserverusr admi…
pidefrem Jan 3, 2024
9fe7fe9
Update HISTORY.rst
pidefrem Jan 3, 2024
c1fd023
Update HISTORY.rst
pidefrem Jan 3, 2024
b81cc9e
Update HISTORY.rst
pidefrem Jan 3, 2024
b1b93de
Update HISTORY.rst
pidefrem Jan 3, 2024
d332dba
Update HISTORY.rst
pidefrem Jan 3, 2024
824518d
Update HISTORY.rst
pidefrem Jan 3, 2024
98fadea
Update AUTHORS
pidefrem Jan 3, 2024
87e6684
Add :meth: keyword in HISTORY
pidefrem Jan 3, 2024
99e1133
Fix typo in utils.py
pidefrem Jan 3, 2024
e2bbb5f
Fix change log
pidefrem Jan 3, 2024
1b589b1
Merge branch 'master' into 202-estimator-groupkfold-split-strategy
pidefrem Jan 3, 2024
4f4305d
Merge branch 'master' into 202-estimator-groupkfold-split-strategy
pidefrem Jan 3, 2024
592e151
Update mapie/estimator/estimator.py
pidefrem Jan 11, 2024
8ee2bfe
Update mapie/subsample.py
pidefrem Jan 11, 2024
4084e81
Update HISTORY.rst
pidefrem Jan 11, 2024
69cce60
Update AUTHORS
pidefrem Jan 11, 2024
2104ffa
Update HISTORY.rst
pidefrem Jan 11, 2024
b6dd81b
Merge branch 'master' into 202-estimator-groupkfold-split-strategy
pidefrem Jan 12, 2024
c009bbc
Fix merge, silent mypy error in quantile reg, start new test
pidefrem Jan 12, 2024
67bbdfb
Fix merge, silent mypy error in quantile reg, start new test
pidefrem Jan 12, 2024
97815a6
Continue test
pidefrem Jan 12, 2024
cda63e3
Continue test
pidefrem Jan 12, 2024
5777331
Finish test in test_regression
pidefrem Jan 12, 2024
7ddf6b8
Start test for classif
pidefrem Jan 12, 2024
abc80a0
Start test for classif
pidefrem Jan 13, 2024
160e225
Start test for classif
pidefrem Jan 13, 2024
d36f06f
Start test for classif
pidefrem Jan 13, 2024
7609cbd
Fix lint
pidefrem Jan 13, 2024
39e23a5
Fix type-check
pidefrem Jan 13, 2024
c4d0444
Continue test for classif
pidefrem Jan 13, 2024
065a4cc
Continue test for classif
pidefrem Jan 13, 2024
33766df
Continue test for classif
pidefrem Jan 13, 2024
891517f
Fix typo
pidefrem Jan 13, 2024
85bea5c
Update HISTORY
pidefrem Jan 13, 2024
f240e8b
Update docstring
pidefrem Jan 13, 2024
d448546
Update docstring
pidefrem Jan 13, 2024
e32c154
Update docstring
pidefrem Jan 13, 2024
b9f25fa
Rm .venv
pidefrem Feb 8, 2024
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -36,5 +36,6 @@ Contributors
* Arthur Phan <[email protected]>
* Rafael Saraiva <[email protected]>
* Mehdi Elion <[email protected]>
* Pierre de Fréminville <[email protected]>

To be continued ...
3 changes: 3 additions & 0 deletions HISTORY.rst
@@ -4,6 +4,9 @@ History

##### (##########)
------------------
* Allow the use of `y` and `groups` arguments in cross validator methods `get_n_splits`
and `split` to enable more cv-split variants for :class:`MapieRegressor`
(e.g. :class:`GroupKFold`, stratified continuous split).

0.8.0 (2024-01-03)
------------------
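The changelog entry above mentions group-aware splitters such as :class:`GroupKFold`. The guarantee such a splitter provides — no group label ever appears on both sides of a split — can be sketched with a minimal, illustrative round-robin implementation (this is neither MAPIE's nor scikit-learn's actual code; the function name and fold-assignment scheme are assumptions for illustration):

```python
from collections import defaultdict

def group_kfold_split(groups, n_splits):
    """Yield (train_idx, test_idx) pairs such that no group label
    appears in both the train and the test side of a split."""
    # Map each distinct group label to the sample indices it owns.
    by_group = defaultdict(list)
    for idx, label in enumerate(groups):
        by_group[label].append(idx)
    # Deal whole groups to folds round-robin.
    folds = [[] for _ in range(n_splits)]
    for k, label in enumerate(by_group):
        folds[k % n_splits].extend(by_group[label])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test

groups = [0, 0, 1, 1, 2, 2, 3, 3]
for train, test in group_kfold_split(groups, n_splits=4):
    train_groups = {groups[i] for i in train}
    test_groups = {groups[i] for i in test}
    assert not train_groups & test_groups  # groups never straddle a split
```

This is exactly the property `MapieRegressor.fit(X, y, groups=...)` can now exploit by forwarding `groups` to `cv.split`.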
4 changes: 2 additions & 2 deletions Makefile
@@ -1,7 +1,7 @@
.PHONY: tests doc build

lint:
flake8 . --exclude=doc
lint:
flake8 . --exclude=doc,.venv
Collaborator:
Have you encountered a problem that justifies this addition to the file?

Contributor (author):
Hello @thibaultcordier. By default, the lint command in your Makefile scans every folder in the current workspace. .venv is the default location for virtual environments created through VS Code, so the lint command was also scanning every dependency inside the virtual environment.

Collaborator:
Hey @pidefrem, could you please remove the environment? You can also save the env somewhere else than in the MAPIE folder. Thank you!


type-check:
mypy mapie
10 changes: 9 additions & 1 deletion mapie/classification.py
@@ -1047,6 +1047,7 @@ def fit(
y: ArrayLike,
sample_weight: Optional[ArrayLike] = None,
size_raps: Optional[float] = .2,
groups: Optional[ArrayLike] = None,
) -> MapieClassifier:
"""
Fit the base estimator or use the fitted base estimator.
@@ -1074,6 +1075,11 @@ def fit(

By default ``.2``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.

By default ``None``.

Returns
-------
@@ -1163,7 +1169,9 @@ def fit(
k,
sample_weight,
)
for k, (train_index, val_index) in enumerate(cv.split(X))
for k, (train_index, val_index) in enumerate(
cv.split(X, y_enc, groups)
)
)
(
self.estimators_,
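Passing `y_enc` to `cv.split` in the hunk above is what makes stratified splitters usable here: a stratified splitter needs the labels to balance class proportions across folds. A minimal, illustrative sketch of that idea (not scikit-learn's `StratifiedKFold`; the round-robin assignment is an assumption for illustration):

```python
from collections import defaultdict

def stratified_kfold_split(y, n_splits):
    """Yield (train_idx, test_idx) pairs where each class's indices
    are dealt round-robin across folds, roughly preserving class
    proportions in every fold."""
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for k, idx in enumerate(indices):
            folds[k % n_splits].append(idx)
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, f in enumerate(folds) if j != k for i in f)
        yield train, test

y = [0, 0, 0, 0, 1, 1, 1, 1]
for train, test in stratified_kfold_split(y, n_splits=2):
    # Each fold holds two samples of each class.
    assert sorted(y[i] for i in test) == [0, 0, 1, 1]
```

Without `y` in the `split` call, a splitter like this could not compute the per-class assignment at all.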
35 changes: 29 additions & 6 deletions mapie/estimator/estimator.py
@@ -322,7 +322,11 @@ def _pred_multi(self, X: ArrayLike) -> NDArray:
y_pred_multi = self._aggregate_with_mask(y_pred_multi, self.k_)
return y_pred_multi

def predict_calib(self, X: ArrayLike) -> NDArray:
def predict_calib(
self,
X: ArrayLike,
y: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None) -> NDArray:
"""
Perform predictions on X : the calibration set.

@@ -331,6 +335,17 @@ def predict_calib(self, X: ArrayLike) -> NDArray:
X: ArrayLike of shape (n_samples_test, n_features)
Input data

y: Optional[ArrayLike] of shape (n_samples,)
Input labels.

By default ``None``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.

By default ``None``.

Returns
-------
NDArray of shape (n_samples_test, 1)
@@ -349,15 +364,17 @@ def predict_calib(self, X: ArrayLike) -> NDArray:
delayed(self._predict_oof_estimator)(
estimator, X, calib_index,
)
for (_, calib_index), estimator in zip(cv.split(X),
self.estimators_)
for (_, calib_index), estimator in zip(
cv.split(X, y, groups),
self.estimators_
)
)
predictions, indices = map(
list, zip(*outputs)
)
n_samples = _num_samples(X)
pred_matrix = np.full(
shape=(n_samples, cv.get_n_splits(X)),
shape=(n_samples, cv.get_n_splits(X, y, groups)),
fill_value=np.nan,
dtype=float,
)
@@ -377,6 +394,7 @@ def fit(
X: ArrayLike,
y: ArrayLike,
sample_weight: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None
) -> EnsembleRegressor:
"""
Fit the base estimator under the ``single_estimator_`` attribute.
@@ -397,6 +415,11 @@
Sample weights. If None, then samples are equally weighted.
By default ``None``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.
By default ``None``.

Returns
-------
EnsembleRegressor
@@ -423,7 +446,7 @@
)
cv = cast(BaseCrossValidator, cv)
self.k_ = np.full(
shape=(n_samples, cv.get_n_splits(X, y)),
shape=(n_samples, cv.get_n_splits(X, y, groups)),
fill_value=np.nan,
dtype=float,
)
@@ -434,7 +457,7 @@
delayed(self._fit_oof_estimator)(
clone(estimator), X, y, train_index, sample_weight
)
for train_index, _ in cv.split(X)
for train_index, _ in cv.split(X, y, groups)
)
# In split-CP, we keep only the model fitted on train dataset
if self.use_split_method_:
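The `np.full` / NaN pattern used by `predict_calib` in the hunk above is worth seeing in isolation: one column per split, NaN wherever a sample was not in that split's calibration fold. A small self-contained sketch (the fold indices and prediction values below are made up for illustration):

```python
import numpy as np

n_samples, n_splits = 6, 3
pred_matrix = np.full((n_samples, n_splits), np.nan)

# Hypothetical calibration folds and their out-of-fold predictions.
fold_indices = [[0, 1], [2, 3], [4, 5]]
fold_preds = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
for k, (idx, preds) in enumerate(zip(fold_indices, fold_preds)):
    pred_matrix[idx, k] = preds

# With disjoint folds each row has exactly one non-NaN entry,
# so a NaN-ignoring aggregation recovers each sample's single
# calibration prediction.
y_pred_calib = np.nanmean(pred_matrix, axis=1)
```

Feeding `y` and `groups` into `cv.get_n_splits(X, y, groups)` matters here because the number of columns in this matrix must match the number of splits the (possibly group-aware) cross-validator will actually produce.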
6 changes: 6 additions & 0 deletions mapie/estimator/interface.py
@@ -21,6 +21,7 @@ def fit(
X: ArrayLike,
y: ArrayLike,
sample_weight: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None,
) -> EnsembleEstimator:
"""
Fit the base estimator under the ``single_estimator_`` attribute.
@@ -41,6 +42,11 @@
Sample weights. If None, then samples are equally weighted.
By default ``None``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.
By default ``None``.

Returns
-------
EnsembleRegressor
29 changes: 25 additions & 4 deletions mapie/regression/regression.py
@@ -392,6 +392,7 @@ def _check_fit_parameters(
X: ArrayLike,
y: ArrayLike,
sample_weight: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None
):
"""
Perform several checks on class parameters.
@@ -407,6 +408,11 @@
sample_weight: Optional[NDArray] of shape (n_samples,)
Non-null sample weights.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.
By default ``None``.

Raises
------
ValueError
@@ -449,14 +455,21 @@
X = cast(NDArray, X)
y = cast(NDArray, y)
sample_weight = cast(Optional[NDArray], sample_weight)
groups = cast(Optional[NDArray], groups)

return estimator, cs_estimator, agg_function, cv, X, y, sample_weight
return (
estimator, cs_estimator,
agg_function, cv,
X, y,
sample_weight, groups
)

def fit(
self,
X: ArrayLike,
y: ArrayLike,
sample_weight: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None
) -> MapieRegressor:
"""
Fit estimator and compute conformity scores used for
@@ -484,6 +497,11 @@

By default ``None``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.
By default ``None``.

Returns
-------
MapieRegressor
@@ -496,7 +514,8 @@
cv,
X,
y,
sample_weight) = self._check_fit_parameters(X, y, sample_weight)
sample_weight,
groups) = self._check_fit_parameters(X, y, sample_weight, groups)

self.estimator_ = EnsembleRegressor(
estimator,
@@ -509,10 +528,12 @@
self.verbose
)
# Fit the prediction function
self.estimator_ = self.estimator_.fit(X, y, sample_weight)
self.estimator_ = self.estimator_.fit(
X, y, sample_weight=sample_weight, groups=groups
)

# Predict on calibration data
y_pred = self.estimator_.predict_calib(X)
y_pred = self.estimator_.predict_calib(X, y=y, groups=groups)

# Compute the conformity scores (manage jk-ab case)
self.conformity_scores_ = \
7 changes: 4 additions & 3 deletions mapie/subsample.py
@@ -56,7 +56,7 @@ def __init__(
self.random_state = random_state

def split(
self, X: NDArray
self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
"""
Generate indices to split data into training and test sets.
@@ -89,7 +89,8 @@ def split(
test_index = np.setdiff1d(indices, train_index)
yield train_index, test_index

def get_n_splits(self, *args: Any, **kargs: Any) -> int:
def get_n_splits(
self, *args: Any, **kargs: Any) -> int:
"""
Returns the number of splitting iterations in the cross-validator.

@@ -154,7 +155,7 @@ def __init__(
self.random_state = random_state

def split(
self, X: NDArray
self, X: NDArray, *args: Any, **kargs: Any
) -> Generator[Tuple[NDArray, NDArray], None, None]:
"""
Generate indices to split data into training and test sets.
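The `*args, **kargs` widening of `split` and `get_n_splits` shown above exists so that resamplers stay call-compatible now that MAPIE invokes `cv.split(X, y, groups)` uniformly: a resampler that ignores `y` and `groups` must still accept them. A minimal sketch of the pattern (a hypothetical bootstrap resampler, not MAPIE's `Subsample` class):

```python
import random

class BootstrapResampler:
    """Hypothetical resampler that ignores y and groups but must
    still accept them, since callers now pass split(X, y, groups)."""

    def __init__(self, n_resamplings, seed=0):
        self.n_resamplings = n_resamplings
        self.rng = random.Random(seed)

    def split(self, X, *args, **kwargs):
        # y and groups arrive via *args/**kwargs and are ignored.
        n = len(X)
        for _ in range(self.n_resamplings):
            train = [self.rng.randrange(n) for _ in range(n)]
            test = sorted(set(range(n)) - set(train))
            yield train, test

    def get_n_splits(self, *args, **kwargs):
        return self.n_resamplings

cv = BootstrapResampler(n_resamplings=4)
X = list(range(10))
splits = list(cv.split(X, None, None))  # extra arguments accepted
assert len(splits) == cv.get_n_splits(X, None, None) == 4
```

This keeps one call site in the estimator code instead of branching on the splitter type.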
22 changes: 22 additions & 0 deletions mapie/tests/test_regression.py
@@ -361,6 +361,28 @@ def test_results_with_constant_sample_weights(strategy: str) -> None:
np.testing.assert_allclose(y_pis1, y_pis2)


@pytest.mark.parametrize("strategy", [*STRATEGIES])
def test_results_with_constant_groups(strategy: str) -> None:
"""
    Test that predictions are identical when groups is None
    or constant, across different constant values.
"""
n_samples = len(X)
mapie0 = MapieRegressor(**STRATEGIES[strategy])
mapie1 = MapieRegressor(**STRATEGIES[strategy])
mapie2 = MapieRegressor(**STRATEGIES[strategy])
mapie0.fit(X, y, groups=None)
mapie1.fit(X, y, groups=np.ones(shape=n_samples))
mapie2.fit(X, y, groups=np.ones(shape=n_samples) * 5)
y_pred0, y_pis0 = mapie0.predict(X, alpha=0.05)
y_pred1, y_pis1 = mapie1.predict(X, alpha=0.05)
y_pred2, y_pis2 = mapie2.predict(X, alpha=0.05)
np.testing.assert_allclose(y_pred0, y_pred1)
np.testing.assert_allclose(y_pred1, y_pred2)
np.testing.assert_allclose(y_pis0, y_pis1)
np.testing.assert_allclose(y_pis1, y_pis2)


@pytest.mark.parametrize("strategy", [*STRATEGIES])
def test_prediction_between_low_up(strategy: str) -> None:
"""Test that prediction lies between low and up prediction intervals."""
17 changes: 15 additions & 2 deletions mapie/utils.py
@@ -214,6 +214,8 @@ def check_no_agg_cv(
X: ArrayLike,
cv: Union[int, str, BaseCrossValidator, BaseShuffleSplit],
no_agg_cv_array: list,
y: Optional[ArrayLike] = None,
groups: Optional[ArrayLike] = None
) -> bool:
"""
Check if cross-validator is ``"prefit"``, ``"split"`` or any split
@@ -230,6 +232,17 @@
no_agg_cv_array: list
List of all non-aggregated cv methods.

    y: Optional[ArrayLike] of shape (n_samples,)
Input labels.

By default ``None``.

groups: Optional[ArrayLike] of shape (n_samples,)
Group labels for the samples used while splitting the dataset into
train/test set.

By default ``None``.

Returns
-------
bool
@@ -240,7 +253,7 @@
elif isinstance(cv, int):
return cv == 1
elif hasattr(cv, "get_n_splits"):
return cv.get_n_splits(X) == 1
return cv.get_n_splits(X, y, groups) == 1
else:
raise ValueError(
"Invalid cv argument. "
@@ -598,7 +611,7 @@ def check_lower_upper_bounds(
if any_final_inversion:
warnings.warn(
"WARNING: The predictions have issues.\n"
+ "The upper predictions are lower than"
+ "The upper predictions are lower than "
+ "the lower predictions at some points."
)

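The dispatch in `check_no_agg_cv` now forwards `y` and `groups` to `get_n_splits`. Its logic, simplified into a standalone sketch (abbreviated from the diff above; not the verbatim MAPIE source):

```python
def check_no_agg_cv(X, cv, no_agg_cv_array, y=None, groups=None):
    """Return True when cv yields a single, non-aggregated split."""
    if isinstance(cv, str):
        return cv in no_agg_cv_array
    if isinstance(cv, int):
        return cv == 1
    if hasattr(cv, "get_n_splits"):
        # y and groups are forwarded so that group-aware or
        # stratified splitters can compute their split count.
        return cv.get_n_splits(X, y, groups) == 1
    raise ValueError("Invalid cv argument.")

assert check_no_agg_cv(None, "split", ["prefit", "split"]) is True
assert check_no_agg_cv(None, 5, ["prefit", "split"]) is False
```

The only behavioral change in the PR is the extra `y, groups` in the `get_n_splits` call; the string and integer branches are untouched.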