Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Result of tune_cluster() depends on the name of the split? #193

Open
trevorcampbell opened this issue Aug 27, 2024 · 2 comments
Open

Result of tune_cluster() depends on the name of the split? #193

trevorcampbell opened this issue Aug 27, 2024 · 2 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@trevorcampbell
Copy link

trevorcampbell commented Aug 27, 2024

When I try to use tune_cluster() with an apparent() split (because kmeans isn't often used with splits, so apparent() seems to make the most sense to me), the result has a lot of NAs. After a lot of work I eventually traced it down to something really weird: the result seems to depend on the name of the split (!?).

You can reproduce this in the docker image ubcdsci/r-dsci-100-grading:cafad0999c16.

Reprex:

library(tidyverse)
library(tidymodels)
library(tidyclust)

# start by reducing the size of mtcars just to make things cleaner (this is not important for the bug)
mt <- mtcars |> rep_sample_n(size = 10, replace = TRUE, reps = 1) |> ungroup() |> select(mpg, disp)

# specification and recipe
kmeans_spec <- k_means(num_clusters = tune()) |>
    set_engine("stats")

kmeans_recipe <- recipe(~ ., data=mt) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# tuning 1-4 clusters
ks <- tibble(num_clusters = 1:4)

# Now we create two rsets. One using apparent, one manually. They're identical except for the split name.

# RSET 1: manually created single split that just does tuning on the whole data set. 
# The split can be named anything you want EXCEPT "Apparent". I named it "banana".
# Note: if you name this "Apparent", you'll see a buggy result just like if you used apparent().
indices <- list(list(analysis = 1:nrow(mt), assessment = 1:nrow(mt)))
splits <- lapply(indices, make_splits, data = mt)
split_good <- manual_rset(splits, c("banana"))

# RSET 2: using apparent. 
split_bad <- apparent(mt)

# if you inspect split_good and split_bad, they're identical aside from the split name.

# Now we tune the number of clusters with each rset
results_good <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_good, grid = ks) |>
    collect_metrics()

results_bad <- workflow() |>
    add_recipe(kmeans_recipe) |>
    add_model(kmeans_spec) |>
    tune_cluster(resamples = split_bad, grid = ks) |>
    collect_metrics()

The outputs look like:

image

@EmilHvitfeldt EmilHvitfeldt added the bug an unexpected problem or unintended behavior label Aug 27, 2024
@trevorcampbell
Copy link
Author

An minor update: if we downgrade tune to version 1.1.2, apparent seems to work again. So perhaps there was a change in tune that broke tune_cluster() from 1.1.2 -> 1.2.0?

@briank-git
Copy link

briank-git commented Dec 25, 2024

I think this commit may be the culprit since it adds a test for the exact string "Apparent" and seems to exclude it tidymodels/tune@637e923

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants