---
title: "Cubist Regression Models"
vignette: >
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteIndexEntry{Cubist Regression Models}
output:
  knitr:::html_vignette:
    toc: yes
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(Cubist)
library(dplyr)
library(rlang)
library(rules)
options(digits = 3)
```
`Cubist` is an `R` port of the Cubist GPL `C` code released by RuleQuest at [`http://rulequest.com/cubist-info.html`](http://rulequest.com/cubist-info.html). See the last section of this document for information on the porting. The other sections describe the functionality of the `R` package.
## Model Trees
Cubist is a rule-based model that is an extension of Quinlan's M5 model tree. A tree is grown where the terminal leaves contain linear regression models. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. A prediction is made using the linear regression model at the terminal node of the tree, but is "smoothed" by taking into account the prediction from the linear model in the previous node of the tree (which also occurs recursively up the tree). The tree is reduced to a set of rules, which initially are paths from the top of the tree to the bottom. Rules are eliminated via pruning and/or combined for simplification.
This is explained better in Quinlan (1992). Wang and Witten (1997) attempted to recreate this model using a "rational reconstruction" of Quinlan (1992) that is the basis for the `M5P` model in `Weka` (and the R package `RWeka`).
An example of a model tree can be illustrated using the Ames housing data in the `modeldata` package.
```{r bh1}
library(Cubist)
data(ames, package = "modeldata")
# model the data on the log10 scale
ames$Sale_Price <- log10(ames$Sale_Price)
set.seed(11)
in_train_set <- sample(1:nrow(ames), floor(.8*nrow(ames)))
predictors <-
c("Lot_Area", "Alley", "Lot_Shape", "Neighborhood", "Bldg_Type",
"Year_Built", "Total_Bsmt_SF", "Central_Air", "Gr_Liv_Area",
"Bsmt_Full_Bath", "Bsmt_Half_Bath", "Full_Bath", "Half_Bath",
"TotRms_AbvGrd", "Year_Sold", "Longitude", "Latitude")
train_pred <- ames[ in_train_set, predictors]
test_pred <- ames[-in_train_set, predictors]
train_resp <- ames$Sale_Price[ in_train_set]
test_resp <- ames$Sale_Price[-in_train_set]
model_tree <- cubist(x = train_pred, y = train_resp)
model_tree
```
```{r bh2}
summary(model_tree)
```
There is no formula method for `cubist()`; the predictors are specified as a matrix or data frame, and the outcome is a numeric vector.
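For users starting from a formula, the `x`/`y` inputs can be assembled by hand. As a minimal sketch (the formula below is hypothetical; `model.frame()` places the outcome in the first column):

```{r formula-sketch, eval = FALSE}
# Sketch: going from a formula to the x/y interface by hand
form <- Sale_Price ~ Lot_Area + Year_Built + Gr_Liv_Area
mf   <- model.frame(form, data = ames[in_train_set, ])
fit  <- cubist(x = mf[, -1], y = mf[[1]])
```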
There is a predict method for the model:
```{r bh3}
model_tree_pred <- predict(model_tree, test_pred)
## Test set RMSE
sqrt(mean((model_tree_pred - test_resp)^2))
## Test set R^2
cor(model_tree_pred, test_resp)^2
```
## Ensembles By Committees
The Cubist model can also use a boosting-like scheme called _committees_ where iterative model trees are created in sequence. The first tree follows the procedure described in the last section. Subsequent trees are created using adjusted versions of the training set outcome: if the model over-predicted a value, the response is adjusted downward for the next model (and so on, see [this blog post](https://rviews.rstudio.com/2020/05/21/modern-rule-based-models)). Unlike traditional boosting, stage weights for each committee are not used to average the predictions from each model tree; the final prediction is a simple average of the predictions from each model tree.
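Conceptually, the outcome adjustment between committee members looks something like the sketch below (the exact rule in the `C` sources differs in its details):

```{r committee-sketch, eval = FALSE}
# Conceptual sketch of the committee adjustment: an over-prediction by
# the previous tree shifts the training outcome downward for the next
# tree, and an under-prediction shifts it upward
prev_tree_pred   <- predict(model_tree, train_pred)  # previous member's fit
adjusted_outcome <- train_resp - (prev_tree_pred - train_resp)
```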
The `committees` argument can be used to control the number of model trees:
```{r bh4}
set.seed(1)
com_model <- cubist(x = train_pred, y = train_resp, committees = 3)
summary(com_model)
```
For this model:
```{r bh5}
com_pred <- predict(com_model, test_pred)
## RMSE
sqrt(mean((com_pred - test_resp)^2))
## R^2
cor(com_pred, test_resp)^2
```
## Instance-Based Corrections
Another innovation in Cubist is the use of nearest neighbors to adjust the predictions from the rule-based model. First, a model tree (with or without committees) is created. Once a sample is predicted by this model, Cubist can find its nearest neighbors and determine the average of these training set points. See Quinlan (1993a) for the details of the adjustment as well as [this blog post](https://rviews.rstudio.com/2020/05/21/modern-rule-based-models).
The development of rules and committees is independent of the choice of using instances. The original `C` code allowed the user to use instances, to not use them, or to let the program decide. Our approach is to build a model with the `cubist()` function that is ignorant of the decision about instances. When samples are predicted, the argument `neighbors` can be used to adjust the rule-based model predictions (or not).
We can add instances to the previously fit committee model:
```{r bh6}
inst_pred <- predict(com_model, test_pred, neighbors = 5)
## RMSE
sqrt(mean((inst_pred - test_resp)^2))
## R^2
cor(inst_pred, test_resp)^2
```
Note that the previous models used the implicit default of `neighbors = 0` for their predictions.
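As a sketch, the effect of this argument on the test set can be profiled by predicting with several values (`neighbors` can range from 0 to 9):

```{r neighbor-sweep, eval = FALSE}
# Test set RMSE for several neighbor values; 0 means no instance-based
# adjustment and 9 is the maximum allowed
sapply(c(0, 1, 5, 9), function(k) {
  pred <- predict(com_model, test_pred, neighbors = k)
  sqrt(mean((pred - test_resp)^2))
})
```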
It may also be useful to see how the different models fit a single predictor. Here is the test set data for a model with one predictor (`Gr_Liv_Area`), 100 committees, and various values of `neighbors`:
```{r echo = FALSE, fig.align='center'}
knitr::include_graphics("neighbors.gif")
```
After the initial use of the instance-based correction, there is very little change in the mainstream of the data.
## Model Tuning
R modeling packages such as `caret`, `tidymodels`, and `mlr3` can be used to tune the model. See the [examples here](https://topepo.github.io/Cubist/articles/extras/tuning.html) for more details.
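As a minimal sketch with `caret` (assuming it is installed), both `committees` and `neighbors` can be tuned via cross-validation:

```{r tune-sketch, eval = FALSE}
# Sketch: cross-validated grid search over the two Cubist tuning
# parameters using caret
library(caret)
grid <- expand.grid(committees = c(1, 10, 50), neighbors = c(0, 5, 9))
set.seed(2)
tuned <- train(
  x = train_pred, y = train_resp,
  method = "cubist",
  tuneGrid = grid,
  trControl = trainControl(method = "cv", number = 5)
)
```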
## Extracting Rules
Rules from a Cubist model can be viewed using `summary` as follows:
```{r summary-tree}
summary(model_tree)
```
The `tidy()` function in the [`rules` package](https://rules.tidymodels.org) returns rules in a tibble (an extension of data frames) with one row per rule. The tibble provides information about the rule and can be used to programmatically extract data from the model. For example:
```{r}
#| label: rule-extraction
library(rules)
rule_df <- tidy(model_tree)
rule_df
rule_df$estimate[[1]]
rule_df$statistic[[1]]
```
The `rule` column can be converted to an R expression that can be used to pull data used by that rule. For example, for the seventh rule:
```{r}
#| label: rule-data
# Text
rule_7 <- rule_df$rule[7]
# Convert to an expression
rule_7 <- rlang::parse_expr(rule_7)
rule_7
# Use in a dplyr filter:
nrow(train_pred)
library(dplyr)
train_pred %>% filter(!!rule_7) %>% nrow()
```
## Variable Importance
The `summary()` method for Cubist shows the usage of each variable in either the rule conditions or the (terminal) linear model. In actuality, many more linear models are used in prediction than are shown in the output. Because of this, the variable usage statistics shown at the end of the output of the `summary()` function will probably be inconsistent with the rules also shown in the output. At each split of the tree, Cubist saves a linear model (after feature selection) that is allowed to have terms for each variable used in the current split or any split above it. Quinlan (1992) discusses a smoothing algorithm where each model prediction is a linear combination of the parent and child model along the tree. As such, the final prediction is a function of all the linear models from the initial node to the terminal node. The percentages shown in the Cubist output reflect all the models involved in prediction (as opposed to the terminal models shown in the output).
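In the M5 description (Quinlan 1992), the smoothed prediction at a node has a form like the one below, where $n$ is the number of training cases reaching the child node and $k$ is a smoothing constant; the Cubist `C` sources weight the parent and child models differently (using their estimated errors), so this is only illustrative:

$$\hat{y}_{\text{smoothed}} = \frac{n \, \hat{y}_{\text{child}} + k \, \hat{y}_{\text{parent}}}{n + k}$$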
The raw usage statistics are contained in a data frame called `usage` in the `cubist` object.
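For example:

```{r usage, eval = FALSE}
# Per-variable usage in rule conditions and in the linear models
model_tree$usage
```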
The `caret` and `vip` packages have general variable importance functions `caret::varImp()` and `vip::vi()`. When using these functions on a `cubist` object, the variable importance is a linear combination of the usage in the rule conditions and the model. It should be noted that this variable importance measure does not capture the influence of the predictors when using the instance-based correction.
For example, to compute the scores:
```{r vimp, eval = FALSE}
caret::varImp(model_tree)
# or
vip::vi(model_tree)
```
## Exporting the Model Using the RuleQuest File Format
As previously mentioned, this code is a port of the command-line `C` code. To run the `C` code, the training set data must be converted to a specific file format, as detailed on the RuleQuest website. Two files are created: the `file.data` file is a header-less, comma-delimited version of the data (the `file` part is a name given by the user), and the `file.names` file provides information about the columns (e.g., the levels for categorical data and so on). After running the `C` program, another text file called `file.model` is created, which contains the information needed for prediction.
Once a model has been built with the `R` `Cubist` package, the `exportCubistFiles()` function can be used to create the `.data`, `.names`, and `.model` files so that the same model can be run at the command line.
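For example, a sketch of exporting the model fit earlier (the `"ames"` prefix is arbitrary):

```{r export, eval = FALSE}
# Write the .data, .names, and .model files for the model tree into a
# temporary directory
exportCubistFiles(model_tree, prefix = "ames", path = tempdir())
```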
## Current Limitations
There are a few features in the `C` code that are not yet operational in the `R` package:
* only continuous and categorical predictors can be used (the original source code allows for other data types)
* there is an option to let the `C` code decide on using instances or not. The choice is more explicit in this package
* non-standard predictor names are not currently checked/fixed
* the `C` code supports binning of predictors