Support for Multi-Target Output #26

Merged: 12 commits, Feb 11, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -76,3 +76,4 @@ target/

paper/paper.pdf
paper/paper.jats
docs/sg_execution_times.rst
6 changes: 5 additions & 1 deletion docs/install.rst
@@ -46,7 +46,11 @@ Test and Coverage

To test the code::

$ pytest quantile_forest -v
$ python -m pytest quantile_forest -v

To test the documentation::

$ python -m pytest docs/*rst

Documentation
=============
37 changes: 36 additions & 1 deletion docs/user_guide.rst
@@ -37,7 +37,10 @@ This approach was first proposed by :cite:t:`2006:meinshausen`.
Fitting and Predicting
----------------------

Quantile forests can be fit and used to predict like standard scikit-learn estimators.
Quantile forests can be fit and used to predict like standard scikit-learn estimators. In this package, the quantile forests extend standard scikit-learn forest regressors and inherit their model parameters, while offering additional parameters related to quantile regression. We'll discuss many of the important model parameters below.

Fitting a Model
~~~~~~~~~~~~~~~

Let's fit a quantile forest on a simple regression dataset::

@@ -52,6 +55,9 @@ Let's fit a quantile forest on a simple regression dataset::
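
The fitting snippet itself is collapsed in this diff; a minimal sketch of an equivalent fit, using a synthetic dataset as a stand-in for whatever the example actually loads::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.make_regression(n_samples=100, n_features=4, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> reg = RandomForestQuantileRegressor()
>>> reg.fit(X_train, y_train)
RandomForestQuantileRegressor()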

During model initialization, the parameter `max_samples_leaf` can be specified, which determines the maximum number of samples per leaf node to retain. If `max_samples_leaf` is smaller than the number of samples in a given leaf node, then a subset of values is randomly selected. By default, the model retains one randomly selected sample per leaf node (`max_samples_leaf = 1`), which enables the use of optimizations at prediction time that are not available when a variable number of samples may be retained per leaf. All samples can be retained by specifying `max_samples_leaf = None`. Note that the number of retained samples can materially impact the size of the model object.
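
For instance, all samples per leaf can be retained at fit time; a short sketch using the `max_samples_leaf` parameter described above::

>>> reg_all = RandomForestQuantileRegressor(max_samples_leaf=None)
>>> reg_all.fit(X_train, y_train)
RandomForestQuantileRegressor(max_samples_leaf=None)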

Making Predictions
~~~~~~~~~~~~~~~~~~

A notable advantage of quantile forests is that they can be fit once, while arbitrary quantiles can be estimated at prediction time. Accordingly, the `predict` method accepts an optional `quantiles` parameter, which can be a float or list of floats specifying the empirical quantiles to return::

>>> y_pred = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75]) # returns three columns per row
@@ -80,6 +86,29 @@ The output of the `predict` method is an array with one column for each specified quantile...
>>> (y_pred[:, 0] >= y_pred[:, 1]).all()
True

Multi-target quantile regression is also supported. If the target values are multi-dimensional, then the output array will have a final dimension that corresponds to the number of targets::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.make_regression(n_samples=10, n_features=5, n_targets=2, random_state=0)
>>> reg_multi = RandomForestQuantileRegressor()
>>> reg_multi.fit(X, y)
RandomForestQuantileRegressor()
>>> quantiles = [0.25, 0.5, 0.75]
>>> y_pred = reg_multi.predict(X, quantiles=quantiles)
>>> y_pred.ndim == 3
True
>>> y_pred.shape[0] == len(X)
True
>>> y_pred.shape[1] == len(quantiles)
True
>>> y_pred.shape[-1] == y.shape[1]
True

Quantile Weighting
~~~~~~~~~~~~~~~~~~

By default, the predict method calculates quantiles by weighting each sample inversely according to the size of its leaf node (`weighted_leaves = True`). If `weighted_leaves = False`, each sample in a leaf (including repeated bootstrap samples) will be given equal weight. Note that this leaf-based weighting can only be used with weighted quantiles.
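
A short sketch of disabling this weighting at prediction time, assuming `weighted_leaves` is accepted by `predict` as the description above suggests::

>>> y_pred_eq = reg.predict(X_test, quantiles=0.5, weighted_leaves=False)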

By default, the predict method calculates quantiles using a weighted quantile method (`weighted_quantile = True`), which assigns a weight to each sample in the training set based on the number of times that it co-occurs in the same leaves as the test sample. When the number of samples in the training set is much larger than the expected size of the list of leaf values aggregated across trees (i.e., :math:`n_{train} \gg n_{trees} \cdot n_{leaves} \cdot n_{leafsamples}`), it can be more efficient to calculate an unweighted quantile (`weighted_quantile = False`), which aggregates the list of training `y` values for each leaf node to which the test sample belongs across all trees. For a given input, both methods can return the same output values::
@@ -91,6 +120,9 @@ By default, the predict method calculates quantiles using a weighted quantile method...
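>>> # The folded setup is elided in this diff; a minimal sketch of the two
>>> # calls being compared, assuming the `weighted_quantile` keyword above:
>>> import numpy as np
>>> y_pred_weighted = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75])
>>> y_pred_unweighted = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75], weighted_quantile=False)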
>>> np.allclose(y_pred_weighted, y_pred_unweighted)
True

Out-of-Bag Estimation
~~~~~~~~~~~~~~~~~~~~~

Out-of-bag (OOB) predictions can be returned by specifying `oob_score = True`::

>>> y_pred_oob = reg.predict(X_train, quantiles=[0.25, 0.5, 0.75], oob_score=True)
@@ -106,6 +138,9 @@ By default, when the `predict` method is called with the OOB flag set to True, it...

This allows all samples, from both the training and test sets, to be scored with a single call to `predict`, whereby OOB predictions are returned for the training samples and IB (i.e., non-OOB) predictions are returned for the test samples.
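
One possible shape of such a combined call follows; the `indices` argument sketched here (each row's index into the training set, with -1 marking non-training rows) is an assumption, since the relevant lines are collapsed in this diff::

>>> import numpy as np
>>> X_mixed = np.concatenate([X_train, X_test])
>>> train_indices = np.concatenate([np.arange(len(X_train)), np.full(len(X_test), -1)])  # assumed convention
>>> y_pred_mixed = reg.predict(X_mixed, quantiles=0.5, oob_score=True, indices=train_indices)  # `indices` is hypothetical here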

Random Forest Predictions
~~~~~~~~~~~~~~~~~~~~~~~~~

The predictions of a standard random forest can also be recovered from a quantile forest without retraining by passing `quantiles = "mean"` and `aggregate_leaves_first = False`; the latter is a Boolean flag that, when False, averages the leaf values within each tree before aggregating across trees. This configuration essentially replicates the prediction process used by a standard random forest regressor, which averages the mean leaf values across trees::

>>> import numpy as np
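
The rest of this example is folded above; a minimal sketch of the comparison it implies, assuming scikit-learn's `RandomForestRegressor` as the baseline and a quantile forest that retains all leaf samples (`max_samples_leaf=None`) so that leaf means can be recovered::

>>> from sklearn.ensemble import RandomForestRegressor
>>> rf = RandomForestRegressor(random_state=0)
>>> rf.fit(X_train, y_train)
RandomForestRegressor(random_state=0)
>>> qrf = RandomForestQuantileRegressor(max_samples_leaf=None, random_state=0)
>>> qrf.fit(X_train, y_train)
RandomForestQuantileRegressor(max_samples_leaf=None, random_state=0)
>>> y_pred_rf = rf.predict(X_test)
>>> y_pred_qrf = qrf.predict(X_test, quantiles="mean", aggregate_leaves_first=False)
>>> np.allclose(y_pred_rf, y_pred_qrf)
True
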
66 changes: 66 additions & 0 deletions examples/plot_quantile_multioutput.py
@@ -0,0 +1,66 @@
"""
================================
Multi-target Quantile Regression
================================

An example on a toy dataset that demonstrates fitting a single quantile
regressor for multiple target variables. For each target, multiple quantiles
can be estimated simultaneously.

"""

print(__doc__)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

from quantile_forest import RandomForestQuantileRegressor

np.random.seed(0)


n_samples = 10000
bounds = [0, 25]
funcs = [
lambda x: np.sin(x) + np.sqrt(x),
lambda x: np.cos(x),
lambda x: np.sin(x) - np.sqrt(x),
]


def make_Xy(funcs, bounds, n_samples):
    """Generate noisy targets over a 1D input grid, one column per function."""
    x = np.linspace(bounds[0], bounds[1], n_samples)
    y = np.empty((len(x), len(funcs)))
    for i, func in enumerate(funcs):
        # Noise scales with the magnitude of the input.
        y[:, i] = func(x) + np.random.normal(scale=0.01 * np.abs(x))
    return x, y


X, y = make_Xy(funcs, bounds, n_samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

qrf = RandomForestQuantileRegressor(random_state=0)
qrf.fit(X_train.reshape(-1, 1), y_train)

y_pred = qrf.predict(X.reshape(-1, 1), quantiles=[0.025, 0.5, 0.975])


def plot_multioutputs(colors, funcs, X, y_pred):
    """Plot the estimated prediction interval and true function for each target."""
    for i in range(y_pred.shape[-1]):
        y1 = y_pred[:, 0, i]  # lower bound (quantile 0.025)
        y2 = y_pred[:, 2, i]  # upper bound (quantile 0.975)
        plt.fill_between(X, y1, y2, color=colors[i], label=f"Target {i}")
        plt.plot(X, funcs[i](X), c="black")
    plt.xlim(bounds)
    plt.ylim([-8, 8])
    plt.xlabel("$x$")
    plt.ylabel("$y$")
    plt.legend(loc="upper left")
    plt.title("Multi-target Prediction Intervals")
    plt.show()


colors = ["#f2a619", "#006aff", "#001751"]
plot_multioutputs(colors, funcs, X, y_pred)