Support for Multi-Target Output #26

Merged: 12 commits, Feb 11, 2024
1 change: 1 addition & 0 deletions .gitignore
@@ -76,3 +76,4 @@ target/

paper/paper.pdf
paper/paper.jats
docs/sg_execution_times.rst
6 changes: 5 additions & 1 deletion docs/install.rst
@@ -46,7 +46,11 @@ Test and Coverage

To test the code::

$ pytest quantile_forest -v
$ python -m pytest quantile_forest -v

To test the documentation::

$ python -m pytest docs/*rst

Documentation
=============
37 changes: 36 additions & 1 deletion docs/user_guide.rst
@@ -37,7 +37,10 @@ This approach was first proposed by :cite:t:`2006:meinshausen`.
Fitting and Predicting
----------------------

Quantile forests can be fit and used to predict like standard scikit-learn estimators.
Quantile forests can be fit and used to predict like standard scikit-learn estimators. In this package, the quantile forests extend standard scikit-learn forest regressors and inherit their model parameters, while offering additional parameters related to quantile regression. We'll discuss many of the important model parameters below.

Fitting a Model
~~~~~~~~~~~~~~~

Let's fit a quantile forest on a simple regression dataset::

@@ -52,6 +55,9 @@ Let's fit a quantile forest on a simple regression dataset::
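
The fitting snippet itself is collapsed in this diff; a minimal sketch of an equivalent fit, using a synthetic dataset as a stand-in for whatever the example actually loads::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.make_regression(n_samples=100, n_features=4, random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> reg = RandomForestQuantileRegressor()
>>> reg.fit(X_train, y_train)
RandomForestQuantileRegressor()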

During model initialization, the parameter `max_samples_leaf` can be specified, which determines the maximum number of samples per leaf node to retain. If `max_samples_leaf` is smaller than the number of samples in a given leaf node, then a subset of values is randomly selected. By default, the model retains one randomly selected sample per leaf node (`max_samples_leaf = 1`), which enables the use of optimizations at prediction time that are not available when a variable number of samples may be retained per leaf. All samples can be retained by specifying `max_samples_leaf = None`. Note that the number of retained samples can materially impact the size of the model object.
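
For instance, all samples per leaf can be retained at fit time; a short sketch using the `max_samples_leaf` parameter described above::

>>> reg_all = RandomForestQuantileRegressor(max_samples_leaf=None)
>>> reg_all.fit(X_train, y_train)
RandomForestQuantileRegressor(max_samples_leaf=None)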

Making Predictions
~~~~~~~~~~~~~~~~~~

A notable advantage of quantile forests is that they can be fit once, while arbitrary quantiles can be estimated at prediction time. Accordingly, the `predict` method accepts an optional `quantiles` parameter, which can be a float or list of floats specifying the empirical quantiles to return::

>>> y_pred = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75]) # returns three columns per row
@@ -80,6 +86,29 @@ The output of the `predict` method is an array with one column for each specified quantile...
>>> (y_pred[:, 0] >= y_pred[:, 1]).all()
True

Multi-target quantile regression is also supported. If the target values are multi-dimensional, then the output array will have a final dimension that corresponds to the number of targets::

>>> from sklearn import datasets
>>> from sklearn.model_selection import train_test_split
>>> from quantile_forest import RandomForestQuantileRegressor
>>> X, y = datasets.make_regression(n_samples=10, n_features=5, n_targets=2, random_state=0)
>>> reg_multi = RandomForestQuantileRegressor()
>>> reg_multi.fit(X, y)
RandomForestQuantileRegressor()
>>> quantiles = [0.25, 0.5, 0.75]
>>> y_pred = reg_multi.predict(X, quantiles=quantiles)
>>> y_pred.ndim == 3
True
>>> y_pred.shape[0] == len(X)
True
>>> y_pred.shape[1] == len(quantiles)
True
>>> y_pred.shape[-1] == y.shape[1]
True

Quantile Weighting
~~~~~~~~~~~~~~~~~~

By default, the predict method calculates quantiles by weighting each sample inversely according to the size of its leaf node (`weighted_leaves = True`). If `weighted_leaves = False`, each sample in a leaf (including repeated bootstrap samples) will be given equal weight. Note that this leaf-based weighting can only be used with weighted quantiles.
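
A short sketch of disabling this weighting at prediction time, assuming `weighted_leaves` is accepted by `predict` as the description above suggests::

>>> y_pred_eq = reg.predict(X_test, quantiles=0.5, weighted_leaves=False)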

By default, the predict method calculates quantiles using a weighted quantile method (`weighted_quantile = True`), which assigns a weight to each sample in the training set based on the number of times that it co-occurs in the same leaves as the test sample. When the number of samples in the training set is much larger than the expected size of the list of leaf values aggregated across trees (i.e., :math:`n_{train} \gg n_{trees} \cdot n_{leaves} \cdot n_{leafsamples}`), it can be more efficient to calculate an unweighted quantile (`weighted_quantile = False`), which aggregates the list of training `y` values for each leaf node to which the test sample belongs across all trees. For a given input, both methods can return the same output values::
@@ -91,6 +120,9 @@ By default, the predict method calculates quantiles using a weighted quantile method...
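>>> # The folded setup is elided in this diff; a minimal sketch of the two
>>> # calls being compared, assuming the `weighted_quantile` keyword above:
>>> import numpy as np
>>> y_pred_weighted = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75])
>>> y_pred_unweighted = reg.predict(X_test, quantiles=[0.25, 0.5, 0.75], weighted_quantile=False)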
>>> np.allclose(y_pred_weighted, y_pred_unweighted)
True

Out-of-Bag Estimation
~~~~~~~~~~~~~~~~~~~~~

Out-of-bag (OOB) predictions can be returned by specifying `oob_score = True`::

>>> y_pred_oob = reg.predict(X_train, quantiles=[0.25, 0.5, 0.75], oob_score=True)
@@ -106,6 +138,9 @@ By default, when the `predict` method is called with the OOB flag set to True, it...

This allows all samples, from both the training and test sets, to be scored with a single call to `predict`, whereby OOB predictions are returned for the training samples and IB (i.e., non-OOB) predictions are returned for the test samples.
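
One possible shape of such a combined call follows; the `indices` argument sketched here (each row's index into the training set, with -1 marking non-training rows) is an assumption, since the relevant lines are collapsed in this diff::

>>> import numpy as np
>>> X_mixed = np.concatenate([X_train, X_test])
>>> train_indices = np.concatenate([np.arange(len(X_train)), np.full(len(X_test), -1)])  # assumed convention
>>> y_pred_mixed = reg.predict(X_mixed, quantiles=0.5, oob_score=True, indices=train_indices)  # `indices` is hypothetical here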

Random Forest Predictions
~~~~~~~~~~~~~~~~~~~~~~~~~

The predictions of a standard random forest can also be recovered from a quantile forest without retraining by passing `quantiles = "mean"` and `aggregate_leaves_first = False`; the latter is a Boolean flag that, when False, averages the leaf values within each tree before aggregating across trees. This configuration essentially replicates the prediction process used by a standard random forest regressor, which averages the mean leaf values across trees::

>>> import numpy as np
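
The rest of this example is folded above; a minimal sketch of the comparison it implies, assuming scikit-learn's `RandomForestRegressor` as the baseline and a quantile forest that retains all leaf samples (`max_samples_leaf=None`) so that leaf means can be recovered::

>>> from sklearn.ensemble import RandomForestRegressor
>>> rf = RandomForestRegressor(random_state=0)
>>> rf.fit(X_train, y_train)
RandomForestRegressor(random_state=0)
>>> qrf = RandomForestQuantileRegressor(max_samples_leaf=None, random_state=0)
>>> qrf.fit(X_train, y_train)
RandomForestQuantileRegressor(max_samples_leaf=None, random_state=0)
>>> y_pred_rf = rf.predict(X_test)
>>> y_pred_qrf = qrf.predict(X_test, quantiles="mean", aggregate_leaves_first=False)
>>> np.allclose(y_pred_rf, y_pred_qrf)
True
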
66 changes: 66 additions & 0 deletions examples/plot_quantile_multioutput.py
@@ -0,0 +1,66 @@
"""
================================
Multi-target Quantile Regression
================================

An example on a toy dataset that demonstrates fitting a single quantile
regressor for multiple target variables. For each target, multiple quantiles
can be estimated simultaneously.

"""

print(__doc__)

import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

from quantile_forest import RandomForestQuantileRegressor

np.random.seed(0)


n_samples = 10000
bounds = [0, 25]
funcs = [
lambda x: np.sin(x) + np.sqrt(x),
lambda x: np.cos(x),
lambda x: np.sin(x) - np.sqrt(x),
]


def make_Xy(funcs, bounds, n_samples):
    """Generate noisy targets over a 1D input grid, one column per function."""
    x = np.linspace(bounds[0], bounds[1], n_samples)
    y = np.empty((len(x), len(funcs)))
    for i, func in enumerate(funcs):
        # Noise scales with the magnitude of the input.
        y[:, i] = func(x) + np.random.normal(scale=0.01 * np.abs(x))
    return x, y


X, y = make_Xy(funcs, bounds, n_samples)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

qrf = RandomForestQuantileRegressor(random_state=0)
qrf.fit(X_train.reshape(-1, 1), y_train)

y_pred = qrf.predict(X.reshape(-1, 1), quantiles=[0.025, 0.5, 0.975])


def plot_multioutputs(colors, funcs, X, y_pred):
    """Plot the estimated prediction interval and true function for each target."""
    for i in range(y_pred.shape[-1]):
        y1 = y_pred[:, 0, i]  # lower bound (quantile 0.025)
        y2 = y_pred[:, 2, i]  # upper bound (quantile 0.975)
        plt.fill_between(X, y1, y2, color=colors[i], label=f"Target {i}")
        plt.plot(X, funcs[i](X), c="black")
    plt.xlim(bounds)
    plt.ylim([-8, 8])
    plt.xlabel("$x$")
    plt.ylabel("$y$")
    plt.legend(loc="upper left")
    plt.title("Multi-target Prediction Intervals")
    plt.show()


colors = ["#f2a619", "#006aff", "#001751"]
plot_multioutputs(colors, funcs, X, y_pred)