
Result of tune_cluster() depends on the name of the split? #193

Open
@trevorcampbell

Description


When I try to use tune_cluster() with an apparent() split (k-means isn't often used with resampling splits, so apparent() seems to make the most sense to me), the result is full of NAs. After a lot of digging I eventually traced it to something really weird: the result seems to depend on the name of the split (!?).

You can reproduce this in the docker image ubcdsci/r-dsci-100-grading:cafad0999c16.
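
For reference, the versions of the packages involved can be checked inside that image with something like the snippet below (a sketch; the exact versions are not listed in this report):

# sketch: record the versions of the packages used for the reprex
packageVersion("tidyclust")
packageVersion("tune")
packageVersion("rsample")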

Reprex:

library(tidyverse)
library(tidymodels)
library(tidyclust)

# start by reducing the size of mtcars just to make things cleaner (this is not important for the bug)
mt <- mtcars |> rep_sample_n(size = 10, replace = TRUE, reps = 1) |> ungroup() |> select(mpg, disp)

# specification and recipe
kmeans_spec <- k_means(num_clusters = tune()) |>
    set_engine("stats")

kmeans_recipe <- recipe(~ ., data = mt) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# tuning 1-4 clusters
ks <- tibble(num_clusters = 1:4)

# Now we create two rsets. One using apparent, one manually. They're identical except for the split name.

# RSET 1: a manually created single split that just tunes on the whole data set.
# The split can be named anything you want EXCEPT "Apparent"; I named it "banana".
# Note: if you name it "Apparent", you get the same buggy result as with apparent().
indices <- list(list(analysis = 1:nrow(mt), assessment = 1:nrow(mt)))
splits <- lapply(indices, make_splits, data = mt)
split_good <- manual_rset(splits, c("banana"))

# RSET 2: using apparent. 
split_bad <- apparent(mt)

# if you inspect split_good and split_bad, they're identical aside from the split name.
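# (sketch, not part of the bug) one way to verify they hold the same rows:
all.equal(analysis(split_good$splits[[1]]), analysis(split_bad$splits[[1]]))
all.equal(assessment(split_good$splits[[1]]), assessment(split_bad$splits[[1]]))
split_good$id  # "banana"
split_bad$id   # "Apparent"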

# Now we tune the number of clusters with each rset
results_good <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_good, grid = ks) |>
    collect_metrics()

results_bad <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_bad, grid = ks) |>
    collect_metrics()

The outputs look like:

[screenshot of the two collect_metrics() tibbles: results_good has metric estimates for every value of num_clusters, while results_bad is largely NA]
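
To make the NAs concrete, the affected rows can be pulled out with something like this (a sketch, assuming the usual mean column returned by collect_metrics()):

# sketch: show the summarised metric rows that come back as NA in the bad result
results_bad |> filter(is.na(mean))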

Labels: bug (an unexpected problem or unintended behavior)
