AutoTuner $model is set to some bogus value after resample() call. #1091

Closed
JHarrisonEcoEvo opened this issue Aug 17, 2021 · 4 comments

JHarrisonEcoEvo commented Aug 17, 2021

Hi there,

Thanks for the package. I recently encountered difficulty extracting variable importance metrics from a ranger learner after building a pipeline that did some feature engineering, etc. Working example below with possible points of discussion included as comments.

In short, the difficulty of extracting expected model outputs will stymie new users very easily... it certainly did me. If possible, it seems that the base learner should always be attached to mlr3 objects at the top level as the model object, e.g., learner$model should return the innermost base model. I get that this might be hard when one is wrapping learners with other learners, as I do here.

Perhaps a solution, if multiple learners are being wrapped/combined, is to have users specify which learner they want to "track" through the modeling process. That way, after byzantine tuning, resampling, etc., they could ask for the model they marked, and by default that model would fill the top slot of learner$model. Perhaps this suggestion is naive, since I know very little about the mlr3 coding architecture, but if at all possible it should be made easier to get at the innermost base learner.

Thanks!

#MWE------------------
library("mlr3")
library("mlr3learners")   # provides the regr.ranger learner
library("mlr3pipelines")  # po(), %>>%, GraphLearner, selector_type()
library("mlr3tuning")     # AutoTuner, tnr()
library("mlr3hyperband")  # the "hyperband" tuner
library("paradox")        # ps(), p_int(), p_fct()

boston_task = tsk("boston_housing")
paste(
  "This data set contains", boston_task$nrow, "observations and",
  boston_task$ncol - 1, "features."
)

#remove the factor features
boston_task$col_roles$feature <-  boston_task$feature_names[!boston_task$feature_names %in% c("chas", "town")]
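#(Sketch of an alternative: assuming Task$select() keeps only the listed features,
# boston_task$select(setdiff(boston_task$feature_names, c("chas", "town")))
# would be an equivalent, arguably tidier, way to drop the two factor columns.)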

#Build pipeline for scaling and imputation

imp_missind <- po("missind")

#Note that imputation of numeric will NOT work with integer class features.
#I found this confusing at first as well since R typically treats integers
#and numbers the same

imp_num <- po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))
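#(Sketch, untested here: if the integer columns should be imputed as well, the
# selector could presumably be widened to cover both types.)
imp_num_int <- po("imputehist",
  param_vals = list(affect_columns = selector_type(c("integer", "numeric"))))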

# build learner with pipeline/graph #

graph <-  po("imputehist", param_vals = list(affect_columns = selector_type("numeric"))) %>>% 
  po("scale") %>>%
  po(lrn("regr.ranger", importance = "permutation"))

g1 <- GraphLearner$new(graph)

# tuning #

params <- ps(
  regr.ranger.num.trees = p_int(lower = 10, upper = 200, tags = "budget"),
  regr.ranger.splitrule = p_fct(levels = c("extratrees", "variance"))
)

at <- AutoTuner$new(
  learner = g1,
  resampling = rsmp("cv", folds = 2), 
  measure = msr("regr.rsq"),
  terminator = trm("none"),
  tuner = tnr("hyperband", eta = 3),
  search_space = params, 
  store_models = TRUE)

#pass back to resampling for nested resampling

outer_resampling <- rsmp("cv", folds = 2)
rr <- resample(task = boston_task,
               learner = at,
               resampling = outer_resampling,
               store_models = TRUE)
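#(Sketch: to inspect the nested resampling itself, the following should work;
# extract_inner_tuning_results() comes from mlr3tuning.)
rr$aggregate(msr("regr.rsq"))
extract_inner_tuning_results(rr)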

#I expected this to have our ranger model, but it is NULL. I now understand that this is
#because the AutoTuner wraps a model and so it is not clear what to return. See previous comments
#regarding a possible solution for this.

at$model

#I also expected this method to return the ranger model, but it doesn't

basemodel <- at$base_learner()
basemodel$model

#It took me a long time to figure out that I needed to train the AutoTuner learner
#and then dig DEEP into it to find variable importance. Since str() doesn't work on
#these objects, this was very time consuming. I spent several hours googling and
#messing around before I figured this out.

at$train(boston_task) 

#This is pretty unintuitive because we are calling several model objects...

at$learner$model$regr.ranger$model$variable.importance

#I thought that all the machinations that happened within resampling and tuning
#would generate variable importance metrics, particularly because importance is
#listed as a parameter of the AutoTuner object. Perhaps if this parameter were
#removed from the stdout of the AutoTuner object, it would be more obvious that
#variable importance metrics don't exist because this is a wrapper of other learners.

Thanks for your time!

sebffischer commented

@mb706

be-marc transferred this issue from mlr-org/mlr3 on Aug 17, 2024

mb706 commented Aug 17, 2024

Thank you @JHarrisonEcoEvo for your thoughtful feedback. (And sorry that you are getting a reply three years later...). There are a few points here that basically boil down to design decisions, but there is also an error in your code.

If possible, it seems that the base learner should always be attached to mlr3 objects at the top level as the model object...e.g., learner$model should return the most base model.

We basically have that, and you almost found it: <AutoTunerObject>$base_learner()$model. We use the $model field to contain the totality of the information obtained by the $train() process.

We could argue whether a $base_model field or something similar should exist (pointing to $base_learner()$model), but it would not make life that much easier.

It would probably also make sense to expose certain meta-information about the base model, e.g. $importance() (mlr-org/mlr3pipelines#291), $selected_features(), etc.
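In the meantime there is a manual route (a sketch, assuming the PipeOpLearner inside the graph keeps the id regr.ranger and the GraphLearner has been trained): after g1$train(boston_task), the trained ranger learner can be reached through the graph, e.g.

g1$graph_model$pipeops$regr.ranger$learner_model$importance()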


Why does at$base_learner()$model not work at that point in your code? The reason is that you are accessing at after it was used for resampling; it was not really "trained" at all. All the information that you do get out of the at object is there by accident, to some degree, I think (@be-marc ?). You can get a "properly" trained AutoTuner object by extracting it out of the rr:

rr$learners[[1]]
#> <AutoTuner:imputehist.scale.regr.ranger.tuned>
#> * Model: auto_tuner_model
#> * Parameters: list()
#> * Validate: NULL
#> * Packages: mlr3, mlr3tuning, mlr3pipelines, graphics, mlr3learners,
#>   ranger
#> * Predict Types:  [response], se, distr
#> * Feature Types: logical, integer, numeric, character, factor, ordered,
#>   POSIXct
#> * Properties: featureless, hotstart_backward, hotstart_forward,
#>   importance, loglik, marshal, missings, oob_error, selected_features,
#>   weights
#> * Search Space:
#>                       id    class lower upper nlevels
#>                   <char>   <char> <num> <num>   <num>
#> 1: regr.ranger.num.trees ParamInt    10   200     191
#> 2: regr.ranger.splitrule ParamFct    NA    NA       2

# using the `$importance()` interface:
rr$learners[[1]]$base_learner()$importance()
#>     lstat        rm     tract       nox       dis     indus      crim       lon
#> 48.148068 31.680772 11.407258  8.476442  7.139589  6.704006  5.855580  4.117204
#>       age   ptratio       rad       tax        zn         b       lat
#>  3.422048  3.398435  2.558387  2.257436  1.950319  1.790536  1.782093

# alternatively, going into the model
rr$learners[[1]]$base_learner()$model$variable.importance
#>       age         b      crim       dis     indus       lat       lon     lstat
#>  3.422048  1.790536  5.855580  7.139589  6.704006  1.782093  4.117204 48.148068
#>       nox   ptratio       rad        rm       tax     tract        zn
#>  8.476442  3.398435  2.558387 31.680772  2.257436 11.407258  1.950319

If you train the at object in the normal way, by using $train(), you get a functioning object that possibly works more in the way you expect:

at$train(boston_task)
at$base_learner()$model
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      case.weights = task$weights$weight, importance = "permutation",      num.threads = 1L, num.trees = 22L, splitrule = "variance") 
#> 
#> Type:                             Regression 
#> Number of trees:                  22 
#> Sample size:                      506 
#> Number of independent variables:  15 
#> Mtry:                             3 
#> Target node size:                 5 
#> Variable importance mode:         permutation 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       12.33812 
#> R squared (OOB):                  0.8536618 

To the degree that GraphLearner does not have an $importance() yet, I consider this to be a duplicate of mlr-org/mlr3pipelines#291. However, I am wondering whether the fact that at retains some things from the resample() call is a bug in the AutoTuner. After running the resample() call in OP's code, I get

at$model
#> $learner
#> $learner$predict_type
#> [1] "response"

This may be confusing. I think at should ideally be entirely untrained at this point, with no model at all.
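As an interim workaround (a sketch, assuming Learner$reset() clears the stored state as documented), resetting the object should give back a clean, untrained AutoTuner:

at$reset()
at$model  # should now be NULL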

mb706 changed the title from "AutoTuner and pipelines obscure objects associated with the base learner" to "AutoTuner $model is set to some bogus value after resample() call." on Aug 17, 2024

mb706 commented Aug 17, 2024

Bumping this back to mlr3. The remaining issue -- at$model is not NULL after resample() -- is a problem either in resample() or in AutoTuner.

mb706 transferred this issue from mlr-org/mlr3pipelines on Aug 17, 2024
berndbischl self-assigned this on Aug 17, 2024
be-marc self-assigned this and unassigned berndbischl on Nov 20, 2024

be-marc commented Dec 20, 2024

Fixed by mlr-org/mlr3tuning#484

be-marc closed this as completed on Dec 20, 2024