[tglkmeans] - Better finding cluster centers #161

Open
coforfe opened this issue Aug 29, 2023 · 1 comment
Labels: K-Means (K Means method), New Engine 🚗 (Add new engine)

coforfe commented Aug 29, 2023

Hi Emil,

Thanks for your detailed description of the lower speed of tglkmeans ( #62 ).

The speed issue can be mitigated by running tglkmeans in parallel.
The more relevant point, though, is that tglkmeans finds better cluster centers than kmeans: it is initialized differently from kmeans, and it recovers the right centers more reliably.
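
For readers unfamiliar with the initialization difference mentioned above: the tglkmeans documentation describes the package as an implementation of k-means++, where each new seed is sampled with probability proportional to its squared distance from the seeds already chosen. Base kmeans(), by default, picks random rows as starting centers. Below is a minimal base R sketch of that seeding idea. kmeanspp_seeds is a hypothetical helper written for illustration; it is not tglkmeans's actual code.

```r
# Hypothetical helper: k-means++ style seeding in base R (illustration only).
kmeanspp_seeds <- function(x, k) {
  x <- as.matrix(x)
  n <- nrow(x)
  chosen <- integer(k)
  # First seed: a uniformly random row.
  chosen[1] <- sample.int(n, 1)
  # Squared distance from every row to its nearest chosen seed.
  d2 <- rowSums(sweep(x, 2, x[chosen[1], ])^2)
  for (j in seq_len(k)[-1]) {
    # Next seed: sampled with probability proportional to d2,
    # so rows far from the existing seeds are favoured.
    chosen[j] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(x, 2, x[chosen[j], ])^2))
  }
  x[chosen, , drop = FALSE]
}

set.seed(1234)
toy <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)

# The seeds can be handed to base kmeans() as explicit starting centers.
seeds <- kmeanspp_seeds(toy, 2)
fit   <- kmeans(toy, centers = seeds)
```
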

Please consider the following code:

library(tidymodels)
library(tidyclust)
library(tglkmeans)
library(recipes)
library(tibble)

set.seed(1234)
data <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 2, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
colnames(data) <- c("x", "y")

data <- data %>% as.data.frame()


#------------------ SMALL --------------------
km          <- TGL_kmeans_tidy(data, 5)
kmstd       <- kmeans(data, 5)
kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)

d <- left_join(km$cluster, kmstd$clust) %>% 
  mutate( compa = ifelse(clust == clustkmstd, 1, 0))

right_val <- sum(d$compa) * 100 / nrow(d)
error_val <- 100 - right_val
error_val



#------------------ MEDIUM --------------------
rec <- recipe(~., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_zv(all_predictors()) |>
  step_normalize(all_predictors())

ames_num <- prep(rec) |> 
  bake(new_data = NULL)

data <- ames_num

km          <- TGL_kmeans_tidy(data, 4)
kmstd       <- kmeans(data, 4)
kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)

d <- left_join(km$cluster, kmstd$clust) %>% 
  mutate( compa = ifelse(clust == clustkmstd, 1, 0))

right_val <- sum(d$compa) * 100 / nrow(d)
error_val <- 100 - right_val
error_val



#------------------ LARGE --------------------
ames_num_big <- ames_num |>
  slice_sample(n = 1000000)

data <- ames_num_big

km          <- TGL_kmeans_tidy(data, 4)
kmstd       <- kmeans(data, 4)
kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)

d <- left_join(km$cluster, kmstd$clust) %>% 
  mutate( compa = ifelse(clust == clustkmstd, 1, 0))

right_val <- sum(d$compa) * 100 / nrow(d)
error_val <- 100 - right_val
error_val
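
A note on the LARGE step: with the default replace = FALSE, dplyr's slice_sample() silently truncates n to the number of rows available, so asking for 1,000,000 rows returns at most every row of the input in shuffled order. A minimal sketch of the difference, using the built-in mtcars data (32 rows) as a stand-in for ames_num:

```r
library(dplyr)

set.seed(1234)

# Default (replace = FALSE): the result is capped at nrow(mtcars).
capped <- slice_sample(mtcars, n = 1000)
nrow(capped)     # 32

# With replace = TRUE the requested size is honoured.
upsampled <- slice_sample(mtcars, n = 1000, replace = TRUE)
nrow(upsampled)  # 1000
```

If a genuinely larger data set is intended, adding replace = TRUE to the slice_sample() call would be one way to get it.
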

This produces the following results:

> #------------------ SMALL --------------------
> km          <- TGL_kmeans_tidy(data, 5)
Warning message:
In TGL_kmeans_tidy(data, 5) :
  Input doesn't have a column named "id". Using rownames instead.
> kmstd       <- kmeans(data, 5)
> kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)
> 
> d <- left_join(km$cluster, kmstd$clust) %>% 
+   mutate( compa = ifelse(clust == clustkmstd, 1, 0))
Joining with `by = join_by(id)`
> 
> right_val <- sum(d$compa) * 100 / nrow(d)
> error_val <- 100 - right_val
> error_val
[1] 67.84983
> 
> 
> 
> #------------------ MEDIUM --------------------
> rec <- recipe(~., data = ames) |>
+   step_dummy(all_nominal_predictors()) |>
+   step_zv(all_predictors()) |>
+   step_normalize(all_predictors())
> 
> ames_num <- prep(rec) |> 
+   bake(new_data = NULL)
> 
> data <- ames_num
> 
> km          <- TGL_kmeans_tidy(data, 4)
Warning message:
In TGL_kmeans_tidy(data, 4) :
  Input doesn't have a column named "id". Using rownames instead.
> kmstd       <- kmeans(data, 4)
> kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)
> 
> d <- left_join(km$cluster, kmstd$clust) %>% 
+   mutate( compa = ifelse(clust == clustkmstd, 1, 0))
Joining with `by = join_by(id)`
> 
> right_val <- sum(d$compa) * 100 / nrow(d)
> error_val <- 100 - right_val
> error_val
[1] 24.57338
> 
> 
> 
> #------------------ LARGE --------------------
> ames_num_big <- ames_num |>
+   slice_sample(n = 1000000)
> 
> data <- ames_num_big
> 
> km          <- TGL_kmeans_tidy(data, 4)
Warning message:
In TGL_kmeans_tidy(data, 4) :
  Input doesn't have a column named "id". Using rownames instead.
> kmstd       <- kmeans(data, 4)
> kmstd$clust <- tibble(id = as.character(1:nrow(data)), clustkmstd = kmstd$cluster)
> 
> d <- left_join(km$cluster, kmstd$clust) %>% 
+   mutate( compa = ifelse(clust == clustkmstd, 1, 0))
Joining with `by = join_by(id)`
> 
> right_val <- sum(d$compa) * 100 / nrow(d)
> error_val <- 100 - right_val
> error_val
[1] 95.93857
> 
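
One caveat when reading the percentages above: cluster numbers are arbitrary labels, so two runs can produce the same partition and still disagree on every row when the labels are compared directly with clust == clustkmstd. Here is a minimal base R sketch of two label-independent checks. Both helper names (matched_agreement, total_withinss) are hypothetical and are not part of tglkmeans or tidyclust. The label matching is greedy, so it is an approximation rather than an optimal assignment.

```r
# Hypothetical helper: share of rows on which two clusterings agree after
# greedily pairing each label of `a` with a distinct label of `b`.
matched_agreement <- function(a, b) {
  tab <- table(a, b)
  matched <- 0
  while (length(tab) > 0 && max(tab) > 0) {
    # Pair the two labels with the largest remaining overlap ...
    idx <- which(tab == max(tab), arr.ind = TRUE)[1, ]
    matched <- matched + as.numeric(tab[idx[1], idx[2]])
    # ... then drop both labels so neither is paired twice.
    tab <- tab[-idx[1], -idx[2], drop = FALSE]
  }
  matched / length(a)
}

# Hypothetical helper: total within-cluster sum of squares of an assignment.
# For a fixed number of clusters, a lower value means tighter clusters.
total_withinss <- function(x, cluster) {
  x <- as.matrix(x)
  sum(sapply(split(seq_len(nrow(x)), cluster), function(rows) {
    block <- x[rows, , drop = FALSE]
    sum(sweep(block, 2, colMeans(block))^2)
  }))
}

a <- c(1, 1, 2, 2, 3, 3)
b <- c(3, 3, 1, 1, 2, 2)   # same partition, labels permuted
mean(a == b)               # 0: raw label comparison reports total disagreement
matched_agreement(a, b)    # 1: the partitions are identical

set.seed(1234)
toy <- rbind(
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2)
)
fit <- kmeans(toy, 2)
total_withinss(toy, fit$cluster)
```

In the comparison above, d$clust and d$clustkmstd could be passed to matched_agreement(), and each method's assignment to total_withinss(), so that the two methods are compared independently of how their clusters happen to be numbered.
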

Thanks again,
Carlos.

@EmilHvitfeldt changed the title from "[tglkmeans] - Better finding cluster centers...." to "[tglkmeans] - Better finding cluster centers" on Aug 29, 2023
@EmilHvitfeldt added the New Engine 🚗 (Add new engine) and K-Means (K Means method) labels on Aug 29, 2023
EmilHvitfeldt (Member) commented

Thanks for letting me know!
