AutoTuner $model is set to some bogus value after resample() call. #1091

Closed
JHarrisonEcoEvo opened this issue Aug 17, 2021 · 4 comments

JHarrisonEcoEvo commented Aug 17, 2021

Hi there,

Thanks for the package. I recently encountered difficulty extracting variable importance metrics from a ranger learner after building a pipeline that did some feature engineering, etc. Working example below with possible points of discussion included as comments.

In short, the difficulty of extracting expected model outputs will stymie new users very easily... it certainly did me. If possible, it seems that the base learner should always be attached to mlr3 objects at the top level as the model object, e.g., learner$model should return the innermost base model. I get that this might be hard when one is wrapping learners with other learners, as I do here.

Perhaps a solution, if multiple learners are being wrapped/combined, is to have users specify which learner they want to "track" through the modeling process. That way, after byzantine tuning, resampling, etc., they could ask for the model they marked, and by default that model would fill the top slot of learner$model. Perhaps this suggestion is naive, since I know very little about the mlr3 coding architecture, but if at all possible it should be made easier to get at the innermost base learner.

Thanks!

#MWE------------------
library("mlr3")
library("mlr3learners")   # provides the regr.ranger learner
library("mlr3pipelines")  # po(), %>>%, GraphLearner, selector_type()
library("mlr3tuning")     # AutoTuner, tnr()
library("mlr3hyperband")  # the "hyperband" tuner
library("paradox")        # ps(), p_int(), p_fct()

boston_task = tsk("boston_housing")
paste(
  "This data set contains", boston_task$nrow, "observations and",
  boston_task$ncol - 1, "features."
)

#remove the factor features
boston_task$col_roles$feature <-  boston_task$feature_names[!boston_task$feature_names %in% c("chas", "town")]
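#(Sketch of an alternative: assuming Task$select() keeps only the listed features,
# boston_task$select(setdiff(boston_task$feature_names, c("chas", "town")))
# would be an equivalent, arguably tidier, way to drop the two factor columns.)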

#Build pipeline for scaling and imputation

imp_missind <- po("missind")

#Note that imputation of numeric will NOT work with integer class features.
#I found this confusing at first as well since R typically treats integers
#and numbers the same

imp_num <- po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))
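#(Sketch, untested here: if the integer columns should be imputed as well, the
# selector could presumably be widened to cover both types.)
imp_num_int <- po("imputehist",
  param_vals = list(affect_columns = selector_type(c("integer", "numeric"))))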

# build learner with pipeline/graph #

graph <-  po("imputehist", param_vals = list(affect_columns = selector_type("numeric"))) %>>% 
  po("scale") %>>%
  po(lrn("regr.ranger", importance = "permutation"))

g1 <- GraphLearner$new(graph)

# tuning #

params <- ps(
  regr.ranger.num.trees = p_int(lower = 10, upper = 200, tags = "budget"),
  regr.ranger.splitrule = p_fct(levels = c("extratrees", "variance"))
)

at <- AutoTuner$new(
  learner = g1,
  resampling = rsmp("cv", folds = 2), 
  measure = msr("regr.rsq"),
  terminator = trm("none"),
  tuner = tnr("hyperband", eta = 3),
  search_space = params, 
  store_models = TRUE)

#pass back to resampling for nested resampling

outer_resampling <- rsmp("cv", folds = 2)
rr <- resample(task = boston_task,
               learner = at,
               resampling = outer_resampling,
               store_models = TRUE)
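#(Sketch: to inspect the nested resampling itself, the following should work;
# extract_inner_tuning_results() comes from mlr3tuning.)
rr$aggregate(msr("regr.rsq"))
extract_inner_tuning_results(rr)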

#I expected this to have our ranger model, but it is NULL. I now understand that this is
#because the AutoTuner wraps a model and so it is not clear what to return. See previous comments
#regarding a possible solution for this.

at$model

#I also expected this method to return the ranger model, but it doesn't

basemodel <- at$base_learner()
basemodel$model

#It took me a long time to figure out that I needed to train the AutoTuner learner
#and then dig DEEP into it to find variable importance. Since str() doesn't work on
#these objects, this was very time consuming. I spent several hours googling and
#messing around before I figured this out.

at$train(boston_task) 

#This is pretty unintuitive because we are calling several model objects...

at$learner$model$regr.ranger$model$variable.importance

#I thought that all the machinations that happened within resampling and tuning
#would generate variable importance metrics, particularly because importance is
#listed as a parameter of the AutoTuner object. Perhaps if this parameter were
#removed from the stdout of the AutoTuner object, it would be more obvious that
#variable importance metrics don't exist because this is a wrapper of other learners.

Thanks for your time!

sebffischer commented

@mb706

be-marc transferred this issue from mlr-org/mlr3 on Aug 17, 2024

mb706 commented Aug 17, 2024

Thank you @JHarrisonEcoEvo for your thoughtful feedback. (And sorry that you are getting a reply three years later...). There are a few points here that basically boil down to design decisions, but there is also an error in your code.

If possible, it seems that the base learner should always be attached to mlr3 objects at the top level as the model object...e.g., learner$model should return the most base model.

We basically have that, and you almost found it: <AutoTunerObject>$base_learner()$model. We use the $model field to contain the totality of the information obtained by the $train() process.

We could argue whether a $base_model field or something similar should exist (pointing to $base_learner()$model), but it would not make life that much easier.

It would probably also make sense to expose certain meta-information about the base model, e.g. $importance() (mlr-org/mlr3pipelines#291), $selected_features(), etc.
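In the meantime there is a manual route (a sketch, assuming the PipeOpLearner inside the graph keeps the id regr.ranger and the GraphLearner has been trained): after g1$train(boston_task), the trained ranger learner can be reached through the graph, e.g.

g1$graph_model$pipeops$regr.ranger$learner_model$importance()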


Why does at$base_learner()$model not work at that point in your code? The reason is that you are accessing at after it was used for resampling; it was not really "trained" at all. All the information that you do get out of the at object is there by accident, to some degree, I think (@be-marc ?). You can get a "properly" trained AutoTuner object by extracting it out of the rr:

rr$learners[[1]]
#> <AutoTuner:imputehist.scale.regr.ranger.tuned>
#> * Model: auto_tuner_model
#> * Parameters: list()
#> * Validate: NULL
#> * Packages: mlr3, mlr3tuning, mlr3pipelines, graphics, mlr3learners,
#>   ranger
#> * Predict Types:  [response], se, distr
#> * Feature Types: logical, integer, numeric, character, factor, ordered,
#>   POSIXct
#> * Properties: featureless, hotstart_backward, hotstart_forward,
#>   importance, loglik, marshal, missings, oob_error, selected_features,
#>   weights
#> * Search Space:
#>                       id    class lower upper nlevels
#>                   <char>   <char> <num> <num>   <num>
#> 1: regr.ranger.num.trees ParamInt    10   200     191
#> 2: regr.ranger.splitrule ParamFct    NA    NA       2

# using the `$importance()` interface:
rr$learners[[1]]$base_learner()$importance()
#>     lstat        rm     tract       nox       dis     indus      crim       lon
#> 48.148068 31.680772 11.407258  8.476442  7.139589  6.704006  5.855580  4.117204
#>       age   ptratio       rad       tax        zn         b       lat
#>  3.422048  3.398435  2.558387  2.257436  1.950319  1.790536  1.782093

# alternatively, going into the model
rr$learners[[1]]$base_learner()$model$variable.importance
#>       age         b      crim       dis     indus       lat       lon     lstat
#>  3.422048  1.790536  5.855580  7.139589  6.704006  1.782093  4.117204 48.148068
#>       nox   ptratio       rad        rm       tax     tract        zn
#>  8.476442  3.398435  2.558387 31.680772  2.257436 11.407258  1.950319

If you train the at object in the normal way, by using $train(), you get a functioning object that possibly works more in the way you expect:

at$train(boston_task)
at$base_learner()$model
#> Ranger result
#> 
#> Call:
#>  ranger::ranger(dependent.variable.name = task$target_names, data = task$data(),      case.weights = task$weights$weight, importance = "permutation",      num.threads = 1L, num.trees = 22L, splitrule = "variance") 
#> 
#> Type:                             Regression 
#> Number of trees:                  22 
#> Sample size:                      506 
#> Number of independent variables:  15 
#> Mtry:                             3 
#> Target node size:                 5 
#> Variable importance mode:         permutation 
#> Splitrule:                        variance 
#> OOB prediction error (MSE):       12.33812 
#> R squared (OOB):                  0.8536618 

To the degree that GraphLearner does not have an $importance() yet, I consider this to be a duplicate of mlr-org/mlr3pipelines#291. However, I am wondering whether the fact that at retains some things from the resample() call is a bug in the AutoTuner. After running the resample() call in OP's code, I get

at$model
#> $learner
#> $learner$predict_type
#> [1] "response"

This may be confusing. I think at should ideally be entirely untrained at this point, with no model at all.
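As an interim workaround (a sketch, assuming Learner$reset() clears the stored state as documented), resetting the object should give back a clean, untrained AutoTuner:

at$reset()
at$model  # should now be NULL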

mb706 changed the title from "AutoTuner and pipelines obscure objects associated with the base learner" to "AutoTuner $model is set to some bogus value after resample() call." on Aug 17, 2024

mb706 commented Aug 17, 2024

Bumping this back to mlr3. The remaining issue -- at$model is not NULL after resample() -- is a problem either in resample() or in AutoTuner.

mb706 transferred this issue from mlr-org/mlr3pipelines on Aug 17, 2024
berndbischl self-assigned this on Aug 17, 2024
be-marc self-assigned this and unassigned berndbischl on Nov 20, 2024

be-marc commented Dec 20, 2024

Fixed by mlr-org/mlr3tuning#484

be-marc closed this as completed on Dec 20, 2024