diff --git a/2022-04-27_Intro_to_Healthyverse/healthyverse.png b/2022-04-27_Intro_to_Healthyverse/healthyverse.png new file mode 100644 index 0000000..f48d2f9 Binary files /dev/null and b/2022-04-27_Intro_to_Healthyverse/healthyverse.png differ diff --git a/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.Rmd b/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.Rmd new file mode 100644 index 0000000..e4d3dad --- /dev/null +++ b/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.Rmd @@ -0,0 +1,866 @@ +--- +title: "The {healthyverse}" +subtitle: "A Presentation to the SoCal RUG" +author: "Steven P Sanderson II, MPH" +date: "`r Sys.Date()`" +output: + ioslides_presentation: + logo: healthyverse.png +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = FALSE) +library(tidyverse) +library(tidyquant) +library(timetk) +library(healthyverse) +library(tidymodels) +library(modeltime) +``` + +# About Me + +## Education + +- BA Economics: SUNY Stony Brook University +- MPH: SUNY Stony Brook University + +## Employment + +- Data Scientist/IT Manager: NYU Langone Health at Long Island Community Hospital +- Head Data Scientist: Manchu Technologies Corp. (Consulting in `SQL`, `R`) +- Developer: `healthyverse` packages. +- Researcher: FMTangerman LLC https://www.fmtangerman.com/ + +## Sites + +GitHub: https://github.com/spsanderson + +LinkedIn: https://www.linkedin.com/in/spsanderson/ + +Personal: https://www.spsanderson.com + +## Interests + +- Data Analysis +- R (probably obvious) +- Plex Server (Music is life) - Is 88K mp3's to much? NO! + +## How I got Started + +I got started with R in my first job out of undergrad. + +Employer: "Do you know SQL and any other analysis programming?" +Me: Yes, not a problem at all. + +Reality...a little different, maybe an alternative fact. + +SO...Google Searching...saved the day? YES! + +## What are we going to talk about + +So what are we going to discuss today? + +The `healthyverse` suite of packages. + +## Healthyverse Packages + +There are currently five packages in the `healthyverse` + + +| Package | r-universe badge | +|:-------:|:----------------:| +| `healthyR` | [data:image/s3,"s3://crabby-images/92cec/92cec5c757ef9f91ea790efd9b150acef2fd5b1f" alt="healthyR status badge"](https://spsanderson.r-universe.dev) | +| `healthyR.data` | [data:image/s3,"s3://crabby-images/e5508/e5508285171530a0f768b2fc1e1274f3f216cc8a" alt="healthyR.data status badge"](https://spsanderson.r-universe.dev) | +| `healthyR.ai` | [data:image/s3,"s3://crabby-images/66eb6/66eb69895beafe74dbdd9709f7b59800f944f85c" alt="healthyR.ai status badge"](https://spsanderson.r-universe.dev) | +| `healthyR.ts` | [data:image/s3,"s3://crabby-images/7d427/7d427f381672b6b8d2f92ea0a078ee88c4a9acb8" alt="healthyR.ts status badge"](https://spsanderson.r-universe.dev) | +| `TidyDensity` | [data:image/s3,"s3://crabby-images/057d9/057d9a5cd32e4f8edea99ffaea8a1dcd114c3400" alt="TidyDensity status badge"](https://spsanderson.r-universe.dev) | + +The `healthyverse` package is really just a meta package like tidyverse. + +## `healthyR` + +First came `healthyR` + +The desire of this package is to be... + +It is experimental + +Site: https://www.spsanderson.com/healthyR/ + +## LOS RA Index Plot +This plot takes place at the end of a work flow. Let's take a look at an example. First we need some data. + +```{r warning=FALSE, message=FALSE} +set.seed(123) +data_tbl <- tibble( + "alos" = rnorm(186, 6.5, 3) + , "elos" = rnorm(186, 6, 3) + , "readmit_rate" = rnorm(186, .1, .05) + , "readmit_rate_bench" = rnorm(186, .075, .05) +) + +data_tbl +``` + +## LOS RA Index Plot (summary) +Now that we have our data, let's take a look at the summary table. + +```{r warning=FALSE, message=FALSE} +summary_tbl <- los_ra_index_summary_tbl( + .data = data_tbl + , .max_los = 12 + , .alos_col = alos + , .elos_col = elos + , .readmit_rate = readmit_rate + , .readmit_bench = readmit_rate_bench +) + +head(summary_tbl, 5) +``` + +## LOS RA Index Plot (plot) + +Now lets take a look at the resulting plot and discuss what it tells us. + +```{r warning=FALSE, message=FALSE} +los_ra_index_plt(summary_tbl) +``` + +## K-Means + +What is it? In it's simplest form, K-Means will partition data into K non-overlapping clusters. + +Example? Think a bank wanting to give people credit cards, this is a clustering +problem, high income, average income, low income, etc. + +It's easy to understand how many applications this has. + +Healthcare is no different. + +## Scree Plot + +We are only going to focus on the scree plot. This helps us to decide how many +clusters are in our data. + +This plot is also many times referred to as the elbow plot. + +## Example + +```{r warning=FALSE, message=FALSE} +data_tbl <- healthyR_data%>% + filter(ip_op_flag == "I") %>% + filter(payer_grouping != "Medicare B") %>% + filter(payer_grouping != "?") %>% + select(service_line, payer_grouping) %>% + mutate(record = 1) %>% + as_tibble() + +ui_tbl <- kmeans_user_item_tbl( + .data = data_tbl + , .row_input = service_line + , .col_input = payer_grouping + , .record_input = record + ) + +kmm_tbl <- kmeans_mapped_tbl(ui_tbl) + +kmeans_scree_plt(.data = kmm_tbl) +``` + +## What Just Happened? + +What just happened? Magic? Maybe? + +First the data. Using the `healthyR_data` from my package `healthyR.data` we get a tibble of records with a service line and a payer grouping. + +A patients service line is the row record and the payer grouping is the item, this gives us the user item matrix we need. + +Next we get a tibble with multiple centers via the function `kmeans_mapped_tbl()` + +Lastly, we plot that mapped tibble. + +## Magic Chart + +Have you seen the Gartner Magic Quadrant Chart? + +If so, then you know, if not, then I'll show. + +First let's take a look at the function call. + +## Function Call + +```{r, eval=FALSE, echo=TRUE} +gartner_magic_chart_plt( + .data = tibble::tibble( + x = rnorm(100, 0, 1) + , y = rnorm(100,0,1) + ) + , .x_col = x + , .y_col = y + , .x_lab = "LOS" + , .y_lab = "Readmit" + , .plt_title = "Test Plot" + , .tr_lbl = "High RA-LOS" + , .tl_lbl = "High RA" + , .bl_lbl = "Leader" + , .br_lbl = "High LOS" +) +``` + +## Plot Output + +```{r} +gartner_magic_chart_plt( + .data = tibble::tibble( + x = rnorm(100, 0, 1) + , y = rnorm(100,0,1) + ) + , .x_col = x + , .y_col = y + , .x_lab = "LOS" + , .y_lab = "Readmit" + , .plt_title = "Test Plot" + , .tr_lbl = "High RA-LOS" + , .tl_lbl = "High RA" + , .bl_lbl = "Leader" + , .br_lbl = "High LOS" +) +``` + +## `healthyR.ts` + +What is `healthyR.ts` + +Site: https://www.spsanderson.com/healthyR.ts/ + +The intent is that it be a easy to use verb framework for quick modeling of +time-series data. + +It is experimental. + +In this section we are going to go over the processes of model tuning. + +## Time Series Data + +For this we use the internal `AirPassengers` data set. + +```{r, echo=TRUE} +data_ts <- AirPassengers + +head(data_ts) +``` + +## Transform to `tibble` + +Most of the work done inside of `healthyR.ts` requires the use of a `tibble`, and where a `ts` object can be used, it is typically converted to a `tibble` internally to the function. This is done on purpose so that it works well with the rest of the `tidyverse` + +```{r echo=TRUE} +data_tbl <- ts_to_tbl(data_ts) %>% + select(-index) + +head(data_tbl, 5) +``` + +## `ts` to `tibble` + +With the function `ts_to_tbl()` we can take a time-series object and make it into a `tibble` object. + +## Splits Oject + +Now that we have our data, we need to make our time series split object. We use `timetk::time_series_split()` + +```{r, echo=TRUE} +splits <- time_series_split( + data_tbl + , date_col + , assess = 12 + , skip = 3 + , cumulative = TRUE +) + +head(training(splits), 1) +``` + +## Recipe + +Now that we have our splits object we can go ahead and make the recipe. For this +we will use `ts_auto_recipe()` which creates four different recipes and outputs them into a list object. + +```{r, echo=TRUE} +rec_objs <- ts_auto_recipe( + .data = data_tbl + , .date_col = date_col + , .pred_col = value +) + +``` + +## Recipe (con't) + +Lets take a look at it + +```{r, echo=TRUE} +rec_objs[[1]] +``` + +We are not going to look at all of them. + +## Workflow Sets and `healthyR.ts` +Now that we have our data and our recipe, we can make our automatic `workflowsets` object. As mentioned at the top we are going to use the MARS algorithm with the earth engine. This function from `healthyR.ts` will take in a list of recipes and a `.model_type parameter`, and create a `workflowsets` `tibble`, where `cross` is set to `TRUE` + +We use `ts_wfs_mars()` in this example. + +```{r eval=FALSE, echo=TRUE} +ts_wfs_mars( + .model_type = "earth", + .recipe_list, + .num_terms = 200, + .prod_degree = 1, + .prune_method = "backward" +) +``` + +## (con't) + +Lets make the workflowset objec. + +```{r echo=TRUE} +wfsets <- ts_wfs_mars( + .model_type = "earth" + , .recipe_list = rec_objs +) + +wfsets +``` + +## (con't) + +I tried to give reasonable defaults for all of the functions used in the `healthyverse` set of packages. + +Many times hospitals, especially smaller non-academic or even small academic ones do not have a department of analysts. It could be that there are none or just one. + +## Modeltime workflow + +This workflow is outside of the scope of the `healthyverse` but necessary as it is an intermediate step before we get to the tuning process. + +```{r echo=TRUE} +wf_fits <- wfsets %>% + modeltime_fit_workflowset( + data = training(splits) + , control = control_fit_workflowset( + allow_par = FALSE + , verbose = FALSE + ) + ) + +models_tbl <- wf_fits %>% + filter(.model != "NULL") +``` + +## (con't) + +You will see that we did `.model != "NULL"` this is because if a certain model tries to run, lets say an `ets` model where alpha < beta then it will fail and the `.model` output will be equal to `"NULL"` + +Lets see the models_tbl + +```{r echo=TRUE} +models_tbl +``` + +## Calibration + +This is another part of the `modeltime` workflow that is used. We _calibrate_ the data with the testing split using `testing(splits)` + +```{r echo=TRUE} +calibration_tbl <- models_tbl %>% + modeltime_calibrate(new_data = testing(splits)) + +calibration_tbl +``` + +## Automatic Tuning + +```{r eval=FALSE, echo=TRUE} +output <- healthyR.ts::ts_model_auto_tune( + .modeltime_model_id = 1, + .calibration_tbl = calibration_tbl, + .splits_obj = splits, + .drop_training_na = TRUE, + .date_col = date_col, + .value_col = value, + .tscv_assess = "12 months", + .tscv_skip = "3 months", + .num_cores = 1 +) +``` + +## What happened? + +We just automatically tuned the parameters of a model. + +## Outputs + +```{r echo=FALSE, message=FALSE, warning=FALSE} +output <- healthyR.ts::ts_model_auto_tune( + .modeltime_model_id = 1, + .calibration_tbl = calibration_tbl, + .splits_obj = splits, + .drop_training_na = TRUE, + .date_col = date_col, + .value_col = value, + .tscv_assess = "12 months", + .tscv_skip = "3 months", + .num_cores = 5 +) +``` + +There are three sections of output from the function. + +1. Data +2. Model Info +3. Plots + +## Data + +The function returns the following: + +1. Calibration tibble from modeltime workflow +2. Calibration tuned tibble from tuned model +3. Time Series Cross Validation tibble +4. Tuned Model Results +5. Best tuned results +6. Time Series Cross Validation object (rsample object) + +## Model Info + +The following model information is available: + +1. Model Specification +2. Parsnip Engine used +3. Plucked Model (The original model before tuning) +4. Tuned Workflow specification +5. Tuning Grid used +6. Tuned Time Series Cross Validation Workflow + +## Plots + +The following plots are returned; + +1. Tuning Grid Plot +2. Cross Validation Plan + +https://www.spsanderson.com/healthyR.ts/articles/auto-model-tune.html#data + + +## `healthyR.ai` + +What is `healthyR.ai`? + +The desire of this package is to be... + +It is experimental + +Site: https://www.spsanderson.com/healthyR.ai/ + +## `healthyR.ai` Topics + +We are going to talk about two functions and one workflow. + +Functions: + +1. `hai_histogram_facet_plot()` +2. `pca_your_recipe()` + +Workflow: + +1. Distribution Comparison of X + +## Histogram Facet Plot + +Function Call + +```{r eval=FALSE, echo=TRUE} +hai_histogram_facet_plot( + .data, + .bins = 10, + .scale_data = FALSE, + .ncol = 5, + .fct_reorder = FALSE, + .fct_rev = FALSE, + .fill = "steelblue", + .color = "white", + .scale = "free", + .interactive = FALSE +) +``` + +## Example Iris Data Not Scaled + +```{r echo=TRUE} +hai_histogram_facet_plot(.data = iris) +``` + +## Example Iris Data Scaled + +```{r echo=TRUE} +hai_histogram_facet_plot(.data = iris, .scale_data = TRUE) +``` + +## Function Reference + +[Histogram Facet Plot](https://www.spsanderson.com/healthyR.ai/reference/hai_histogram_facet_plot.html) + +## PCA Your Recipe + +What is PCA? + +Principal component analysis (PCA) is a transformation of a group of variables that produces a new set of artificial features or components. These components are designed to capture the maximum amount of information (i.e. variance) in the original variables. + +The function will automatically center and scale data along with dropping nzv columns before PCA is performed. + +## The output + +The function `pca_your_data()` returns invisibly a list object with a lot of information. + +[Reference pca_your_data()](https://www.spsanderson.com/healthyR.ai/reference/pca_your_recipe.html) + +## Lets test it out + +```{r initialize_pca_data, echo=TRUE} +data_tbl <- healthyR_data %>% + select(visit_end_date_time) %>% + summarise_by_time( + .date_var = visit_end_date_time, + .by = "month", + value = n() + ) %>% + set_names("date_col","value") %>% + filter_by_time( + .date_var = date_col, + .start_date = "2013", + .end_date = "2020" + ) + +splits <- initial_split(data = data_tbl, prop = 0.8) + +rec_obj <- recipe(value ~ ., training(splits)) %>% + step_timeseries_signature(date_col) %>% + step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) +``` + +## Now lets PCA the Recipe + +```{r echo=TRUE} +output_list <- pca_your_recipe(rec_obj, .data = data_tbl) + +names(output_list) +``` + +## Lets Look at the Plots + +```{r, echo=TRUE} +output_list[["pca_variance_scree_plt"]] +``` + +## PCA Variance Scree Plot + +This shows us how much variability is captured by each principal component + +## PCA Loadings Plot + +```{r, echo=TRUE} +output_list[["pca_loadings_plt"]] +``` + +## PCA Top N Loadings Plot + +```{r echo=TRUE} +output_list$pca_top_n_loadings_plt +``` + +## PC Loadings + +Remember loadings are coefficients in linear combination predicting a varible by the standardized components, `pca_your_data()` will standardize your numberic components only for now. + +[PCA On StatQuest](https://statquest.org/pca-clearly-explained/) + +## Distribution Comparison + +This is a workflow, meaning we pipe certain commands together to get the data. We will do this step by step. Here is an example of the full workflow: + +```{r echo=TRUE} +df <- hai_scale_zero_one_vec(.x = mtcars$mpg) + +dist_comp_tbl <- hai_distribution_comparison_tbl( + .x = df, + .distributions = c("beta"), + .normalize = FALSE +) + +dist_tbl <- hai_get_density_data_tbl(.data = dist_comp_tbl) + +dens_plt <- hai_density_plot(dist_tbl, distribution, x, y, + .interactive = FALSE) +``` + +## Scaling Data + +So first we used `hai_scale_zero_one_vec()` which is a vectorized function that returns data on a vector from [0, 1] inclusive. + +Did it here to showcase. + +## Distribution Comparison Table + +[Function Reference](https://www.spsanderson.com/healthyR.ai/reference/hai_distribution_comparison_tbl.html) + +The end user can choose from several different distributions to make the comparison to the empirical data provided. + +Might drop this in favor of moving the functionality to TidyDensity + +## (con't) + +Lets see some output + +```{r echo=TRUE} +dist_comp_tbl +``` + +## (con't) + +This function as the library is experimental. There are some minor issues to flush out but the gist is that you can provide some vector x and choose from the list of supported distributions and get back certain information on them. + +The vector x is passed through the resulting base `r_` distribution function and that creates the dist_data column. + +The dist_data column is passed through to the `density()` function to get the density object. Right now the default n of 512 is used. May change this to the actual `n` or give user the option. + +## Distribution Data + +We used the function `hai_get_density_data_tbl()` which extracts the information we want from the dis_comp_tbl that was created. + +```{r echo=TRUE} +head(dist_tbl, 5) +``` + +## (con't) + +Simply what the function does is act as a sort of 'getter'. It grabs the density data from the `dist_comp_tbl` and puts it into a `tibble` + +[Function Reference](https://www.spsanderson.com/healthyR.ai/reference/hai_get_density_data_tbl.html) + +## Density Comparison Plot + +```{r echo=TRUE} +dens_plt +``` + +## (con't) + +So what we see now is the densities plotted against each other. The work to get to this point was minimal, only three functions. + +[Function Reference](https://www.spsanderson.com/healthyR.ai/reference/hai_density_plot.html) + +## `healthyR.data` + +The desire of this package is to be... + +It is stable + +Site: https://www.spsanderson.com/healthyR.data/ + +## `TidyDensity` + +The desire of this package is to be... + +It is stable + +Site: https://www.spsanderson.com/TidyDensity/ + +## Underlying Packages + +Heavy use of base R +`actuar` for other distributions + +Simple piped workflow, tries to be consistent with `tidyverse` + +## Simple functions + +`rnorm()` = `tidy_normal()` + +Try to use sensible defaults. + +Referred to as `tidy_` distribution functions. + +Syntax that is in all `tidy_` distribution functions `.n` and `.num_sims` where `.n` is the number of randomly generated points and `.num_sims` is how many simulations should be run. + +# Examples + +## What to Generate + +Generate five simulations of 100 points for normal and poisson distributions. + +```{r echo=TRUE} +n <- 100 +ns <- 5 +``` + +## Tidy Normal + +```{r echo=TRUE} +tn <- tidy_normal(.n = n, .num_sims = ns) +head(tn, 1) +tn %>% group_by(sim_number) %>% summarise(mean_val = mean(y)) +``` + +## Tidy Poisson + +```{r echo=TRUE} +tp <- tidy_poisson(.n = n, .num_sims = ns) +head(tp, 1) +tp %>% group_by(sim_number) %>% summarise(mean_val = mean(y)) +``` + +## What do we notice + +We see that there is some random variation in the data...so what does it look like? + +## Ploting + +Hello `tidy_autoplot()` + +Let's take a look at both `tn` and `tp` + +## Plot Tidy Normal + +```{r echo=TRUE} +tn %>% tidy_autoplot() +``` + +## Plot Tidy Poisson + +```{r echo=TRUE} +tp %>% tidy_autoplot() +``` + +## What other plots are there? + +[Function Reference](https://www.spsanderson.com/TidyDensity/reference/tidy_autoplot.html) + +- "density" +- "probability" +- "quantile" +- "qq" + +Plots can also be interactive. Tried to make them as automatic as possible. + +## Combined Distributions + +Can we combine distributions and visualize them? Yes! + +We can do this with `tidy_combine_distributions()` and `tidy_combined_autoplot()` + +How does it work? + +## (con't) + +```{r echo=TRUE} +combined_tbl <- tidy_combine_distributions( + tidy_normal(), + tidy_beta() +) +head(combined_tbl, 5) +``` + +## (con't) + +```{r echo=TRUE} +combined_tbl %>% + group_by(dist_type) %>% + summarise(mean_val = mean(y)) %>% + ungroup() +``` + +## Visualize the Distributions Density + +```{r echo=TRUE} +combined_tbl %>% tidy_combined_autoplot() +``` + +## Visualize the Distributions QQ + +```{r echo=TRUE} +combined_tbl %>% tidy_combined_autoplot(.plot_type = "qq") +``` + +## Multiple Versions of One distribution? + +What if you want to see different parameters for a single distribution? + +`tidy_multi_single_dist()` to the rescue! + +## Function Call + +```{r echo=TRUE, eval=FALSE} +tidy_multi_single_dist( + .tidy_dist = "tidy_normal", + .param_list = list( + .n = n, + .mean = c(-1, 0, 1), + .sd = 1, + .num_sims = ns + ) +) +``` + +## Inspect + +```{r echo=FALSE} +tmn <- tidy_multi_single_dist( + .tidy_dist = "tidy_normal", + .param_list = list( + .n = n, + .mean = c(-1, 0, 1), + .sd = 1, + .num_sims = ns + ) +) + +``` + +```{r echo=TRUE} +tmn %>% with_groups(dist_name, summarise, mean(y)) +``` + +## Visualize + +```{r echo=TRUE} +tmn %>% tidy_multi_dist_autoplot() +``` + +## Parameter Estimation + +Lets pick on the `beta` distribution. + +Lets define a vector x + +```{r echo=TRUE} +x <- tidy_beta(.n = n, .shape1 = 0.1, .shape2 = 0.4)$y +``` + +## Beta Estimates + +Outputs a list object, here we want to see the estimates tibble. + +```{r echo=TRUE} +util_beta_param_estimate(x)$parameter_tbl %>% + select(dist_type, shape1, shape2, method) +``` + +## Visualize the Empirical vs Estimated + +```{r echo=TRUE, message=FALSE, warning=FALSE} +util_beta_param_estimate(x)$combined_data_tbl %>% + tidy_combined_autoplot() +``` + +# Thank You!! \ No newline at end of file diff --git a/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.html b/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.html new file mode 100644 index 0000000..a66d7b1 --- /dev/null +++ b/2022-04-27_Intro_to_Healthyverse/socal_user_april_2022_the_healthyverse.html @@ -0,0 +1,4159 @@ + + +
+2022-04-27
+ +SQL
, R
)healthyverse
packages.GitHub: https://github.com/spsanderson
+ +LinkedIn: https://www.linkedin.com/in/spsanderson/
+ +Personal: https://www.spsanderson.com
+ +I got started with R in my first job out of undergrad.
+ +Employer: “Do you know SQL and any other analysis programming?” Me: Yes, not a problem at all.
+ +Reality…a little different, maybe an alternative fact.
+ +SO…Google Searching…saved the day? YES!
+ +So what are we going to discuss today?
+ +The healthyverse
suite of packages.
There are currently five packages in the healthyverse
Package | +r-universe badge | +
---|---|
healthyR |
+|
healthyR.data |
+|
healthyR.ai |
+|
healthyR.ts |
+|
TidyDensity |
+
The healthyverse
package is really just a meta package like tidyverse.
healthyR
First came healthyR
The desire of this package is to be…
+ +It is experimental
+ + + +This plot takes place at the end of a work flow. Let’s take a look at an example. First we need some data.
+ +## # A tibble: 186 x 4 +## alos elos readmit_rate readmit_rate_bench +## <dbl> <dbl> <dbl> <dbl> +## 1 4.82 9.33 0.0768 0.0651 +## 2 5.81 6.25 0.141 0.0435 +## 3 11.2 8.26 0.126 0.0333 +## 4 6.71 4.50 0.0705 0.104 +## 5 6.89 6.64 0.0502 0.0206 +## 6 11.6 5.03 0.107 0.149 +## 7 7.88 6.28 0.0993 0.0157 +## 8 2.70 3.31 0.0105 0.0801 +## 9 4.44 2.07 0.102 0.102 +## 10 5.16 12.0 0.110 0.104 +## # ... with 176 more rows+ +
Now that we have our data, let’s take a look at the summary table.
+ +## # A tibble: 5 x 4 +## los_group los_index rar_index los_ra_var +## <dbl> <dbl> <dbl> <dbl> +## 1 0 0 0.6 1.4 +## 2 1 0.133 1.83 1.70 +## 3 2 0.414 2 1.59 +## 4 3 0.488 1.57 1.08 +## 5 4 0.662 2.2 1.54+ +
Now lets take a look at the resulting plot and discuss what it tells us.
+ +What is it? In it’s simplest form, K-Means will partition data into K non-overlapping clusters.
+ +Example? Think a bank wanting to give people credit cards, this is a clustering problem, high income, average income, low income, etc.
+ +It’s easy to understand how many applications this has.
+ +Healthcare is no different.
+ +We are only going to focus on the scree plot. This helps us to decide how many clusters are in our data.
+ +This plot is also many times referred to as the elbow plot.
+ +What just happened? Magic? Maybe?
+ +First the data. Using the healthyR_data
from my package healthyR.data
we get a tibble of records with a service line and a payer grouping.
A patients service line is the row record and the payer grouping is the item, this gives us the user item matrix we need.
+ +Next we get a tibble with multiple centers via the function kmeans_mapped_tbl()
Lastly, we plot that mapped tibble.
+ +Have you seen the Gartner Magic Quadrant Chart?
+ +If so, then you know, if not, then I’ll show.
+ +First let’s take a look at the function call.
+ +gartner_magic_chart_plt( + .data = tibble::tibble( + x = rnorm(100, 0, 1) + , y = rnorm(100,0,1) + ) + , .x_col = x + , .y_col = y + , .x_lab = "LOS" + , .y_lab = "Readmit" + , .plt_title = "Test Plot" + , .tr_lbl = "High RA-LOS" + , .tl_lbl = "High RA" + , .bl_lbl = "Leader" + , .br_lbl = "High LOS" +)+ +
healthyR.ts
What is healthyR.ts
Site: https://www.spsanderson.com/healthyR.ts/
+ +The intent is that it be a easy to use verb framework for quick modeling of time-series data.
+ +It is experimental.
+ +In this section we are going to go over the processes of model tuning.
+ +For this we use the internal AirPassengers
data set.
data_ts <- AirPassengers + +head(data_ts)+ +
## [1] 112 118 132 129 121 135+ +
tibble
Most of the work done inside of healthyR.ts
requires the use of a tibble
, and where a ts
object can be used, it is typically converted to a tibble
internally to the function. This is done on purpose so that it works well with the rest of the tidyverse
data_tbl <- ts_to_tbl(data_ts) %>% + select(-index) + +head(data_tbl, 5)+ +
## # A tibble: 5 x 2 +## date_col value +## <date> <dbl> +## 1 1949-01-01 112 +## 2 1949-02-01 118 +## 3 1949-03-01 132 +## 4 1949-04-01 129 +## 5 1949-05-01 121+ +
ts
to tibble
With the function ts_to_tbl()
we can take a time-series object and make it into a tibble
object.
Now that we have our data, we need to make our time series split object. We use timetk::time_series_split()
splits <- time_series_split( + data_tbl + , date_col + , assess = 12 + , skip = 3 + , cumulative = TRUE +) + +head(training(splits), 1)+ +
## # A tibble: 1 x 2 +## date_col value +## <date> <dbl> +## 1 1949-01-01 112+ +
Now that we have our splits object we can go ahead and make the recipe. For this we will use ts_auto_recipe()
which creates four different recipes and outputs them into a list object.
rec_objs <- ts_auto_recipe( + .data = data_tbl + , .date_col = date_col + , .pred_col = value +)+ +
Lets take a look at it
+ +rec_objs[[1]]+ +
## Recipe +## +## Inputs: +## +## role #variables +## outcome 1 +## predictor 1+ +
We are not going to look at all of them.
+ +healthyR.ts
Now that we have our data and our recipe, we can make our automatic workflowsets
object. As mentioned at the top we are going to use the MARS algorithm with the earth engine. This function from healthyR.ts
will take in a list of recipes and a .model_type parameter
, and create a workflowsets
tibble
, where cross
is set to TRUE
We use ts_wfs_mars()
in this example.
ts_wfs_mars( + .model_type = "earth", + .recipe_list, + .num_terms = 200, + .prod_degree = 1, + .prune_method = "backward" +)+ +
Lets make the workflowset objec.
+ +wfsets <- ts_wfs_mars( + .model_type = "earth" + , .recipe_list = rec_objs +) + +wfsets+ +
## # A workflow set/tibble: 4 x 4 +## wflow_id info option result +## <chr> <list> <list> <list> +## 1 rec_base_mars <tibble [1 x 4]> <opts[0]> <list [0]> +## 2 rec_date_mars <tibble [1 x 4]> <opts[0]> <list [0]> +## 3 rec_date_fourier_mars <tibble [1 x 4]> <opts[0]> <list [0]> +## 4 rec_date_fourier_nzv_mars <tibble [1 x 4]> <opts[0]> <list [0]>+ +
I tried to give reasonable defaults for all of the functions used in the healthyverse
set of packages.
Many times hospitals, especially smaller non-academic or even small academic ones do not have a department of analysts. It could be that there are none or just one.
+ +This workflow is outside of the scope of the healthyverse
but necessary as it is an intermediate step before we get to the tuning process.
wf_fits <- wfsets %>% + modeltime_fit_workflowset( + data = training(splits) + , control = control_fit_workflowset( + allow_par = FALSE + , verbose = FALSE + ) + ) + +models_tbl <- wf_fits %>% + filter(.model != "NULL")+ +
You will see that we did .model != "NULL"
this is because if a certain model tries to run, lets say an ets
model where alpha < beta then it will fail and the .model
output will be equal to "NULL"
Lets see the models_tbl
+ +models_tbl+ +
## # Modeltime Table +## # A tibble: 4 x 3 +## .model_id .model .model_desc +## <int> <list> <chr> +## 1 1 <workflow> REC_BASE_MARS +## 2 2 <workflow> REC_DATE_MARS +## 3 3 <workflow> REC_DATE_FOURIER_MARS +## 4 4 <workflow> REC_DATE_FOURIER_NZV_MARS+ +
This is another part of the modeltime
workflow that is used. We calibrate the data with the testing split using testing(splits)
calibration_tbl <- models_tbl %>% + modeltime_calibrate(new_data = testing(splits)) + +calibration_tbl+ +
## # Modeltime Table +## # A tibble: 4 x 5 +## .model_id .model .model_desc .type .calibration_data +## <int> <list> <chr> <chr> <list> +## 1 1 <workflow> REC_BASE_MARS Test <tibble [12 x 4]> +## 2 2 <workflow> REC_DATE_MARS Test <tibble [12 x 4]> +## 3 3 <workflow> REC_DATE_FOURIER_MARS Test <tibble [12 x 4]> +## 4 4 <workflow> REC_DATE_FOURIER_NZV_MARS Test <tibble [12 x 4]>+ +
output <- healthyR.ts::ts_model_auto_tune( + .modeltime_model_id = 1, + .calibration_tbl = calibration_tbl, + .splits_obj = splits, + .drop_training_na = TRUE, + .date_col = date_col, + .value_col = value, + .tscv_assess = "12 months", + .tscv_skip = "3 months", + .num_cores = 1 +)+ +
We just automatically tuned the parameters of a model.
+ +There are three sections of output from the function.
+ +The function returns the following:
+ +The following model information is available:
+ +The following plots are returned;
+ +https://www.spsanderson.com/healthyR.ts/articles/auto-model-tune.html#data
+ +healthyR.ai
What is healthyR.ai
?
The desire of this package is to be…
+ +It is experimental
+ + + +healthyR.ai
TopicsWe are going to talk about two functions and one workflow.
+ +Functions:
+ +hai_histogram_facet_plot()
pca_your_recipe()
Workflow:
+ +Function Call
+ +hai_histogram_facet_plot( + .data, + .bins = 10, + .scale_data = FALSE, + .ncol = 5, + .fct_reorder = FALSE, + .fct_rev = FALSE, + .fill = "steelblue", + .color = "white", + .scale = "free", + .interactive = FALSE +)+ +
hai_histogram_facet_plot(.data = iris)+ +
hai_histogram_facet_plot(.data = iris, .scale_data = TRUE)+ +
What is PCA?
+ +Principal component analysis (PCA) is a transformation of a group of variables that produces a new set of artificial features or components. These components are designed to capture the maximum amount of information (i.e. variance) in the original variables.
+ +The function will automatically center and scale data along with dropping nzv columns before PCA is performed.
+ +The function pca_your_data()
returns invisibly a list object with a lot of information.
data_tbl <- healthyR_data %>% + select(visit_end_date_time) %>% + summarise_by_time( + .date_var = visit_end_date_time, + .by = "month", + value = n() + ) %>% + set_names("date_col","value") %>% + filter_by_time( + .date_var = date_col, + .start_date = "2013", + .end_date = "2020" + ) + +splits <- initial_split(data = data_tbl, prop = 0.8) + +rec_obj <- recipe(value ~ ., training(splits)) %>% + step_timeseries_signature(date_col) %>% + step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)"))+ +
output_list <- pca_your_recipe(rec_obj, .data = data_tbl) + +names(output_list)+ +
## [1] "pca_transform" "variable_loadings" "variable_variance" +## [4] "pca_estimates" "pca_juiced_estimates" "pca_baked_data" +## [7] "pca_variance_df" "pca_rotation_df" "pca_variance_scree_plt" +## [10] "pca_loadings_plt" "pca_loadings_plotly" "pca_top_n_loadings_plt" +## [13] "pca_top_n_plotly"+ +
output_list[["pca_variance_scree_plt"]]+ +
This shows us how much variability is captured by each principal component
+ +output_list[["pca_loadings_plt"]]+ +
output_list$pca_top_n_loadings_plt+ +
Remember loadings are coefficients in linear combination predicting a varible by the standardized components, pca_your_data()
will standardize your numberic components only for now.
This is a workflow, meaning we pipe certain commands together to get the data. We will do this step by step. Here is an example of the full workflow:
+ +df <- hai_scale_zero_one_vec(.x = mtcars$mpg) + +dist_comp_tbl <- hai_distribution_comparison_tbl( + .x = df, + .distributions = c("beta"), + .normalize = FALSE +) + +dist_tbl <- hai_get_density_data_tbl(.data = dist_comp_tbl) + +dens_plt <- hai_density_plot(dist_tbl, distribution, x, y, + .interactive = FALSE)+ +
So first we used hai_scale_zero_one_vec()
which is a vectorized function that returns data on a vector from [0, 1] inclusive.
Did it here to showcase.
+ +The end user can choose from several different distributions to make the comparison to the empirical data provided.
+ +Might drop this in favor of moving the functionality to TidyDensity
+ +Lets see some output
+ +dist_comp_tbl+ +
## # A tibble: 2 x 3 +## distribution dist_data density_data +## <chr> <list> <list> +## 1 beta <dbl [32]> <density> +## 2 empirical <dbl [32]> <density>+ +
This function as the library is experimental. There are some minor issues to flush out but the gist is that you can provide some vector x and choose from the list of supported distributions and get back certain information on them.
+ +The vector x is passed through the resulting base r_
distribution function and that creates the dist_data column.
The dist_data column is passed through to the density()
function to get the density object. Right now the default n of 512 is used. May change this to the actual n
or give user the option.
We used the function hai_get_density_data_tbl()
which extracts the information we want from the dis_comp_tbl that was created.
head(dist_tbl, 5)+ +
## # A tibble: 5 x 3 +## # Groups: distribution [1] +## distribution x y +## <chr> <dbl> <dbl> +## 1 beta -0.296 0.00915 +## 2 beta -0.293 0.0100 +## 3 beta -0.290 0.0109 +## 4 beta -0.287 0.0119 +## 5 beta -0.284 0.0130+ +
Simply what the function does is act as a sort of ‘getter’. It grabs the density data from the dist_comp_tbl
and puts it into a tibble
dens_plt+ +
So what we see now is the densities plotted against each other. The work to get to this point was minimal, only three functions.
+ + + +healthyR.data
The desire of this package is to be…
+ +It is stable
+ + + +TidyDensity
The desire of this package is to be…
+ +It is stable
+ + + +Heavy use of base R actuar
for other distributions
Simple piped workflow, tries to be consistent with tidyverse
rnorm()
= tidy_normal()
Try to use sensible defaults.
+ +Referred to as tidy_
distribution functions.
Syntax that is in all tidy_
distribution functions .n
and .num_sims
where .n
is the number of randomly generated points and .num_sims
is how many simulations should be run.
Generate five simulations of 100 points for normal and poisson distributions.
+ +n <- 100 +ns <- 5+ +
tn <- tidy_normal(.n = n, .num_sims = ns) +head(tn, 1)+ +
## # A tibble: 1 x 7 +## sim_number x y dx dy p q +## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> +## 1 1 1 -1.84 -3.55 0.000132 0.5 -Inf+ +
tn %>% group_by(sim_number) %>% summarise(mean_val = mean(y))+ +
## # A tibble: 5 x 2 +## sim_number mean_val +## <fct> <dbl> +## 1 1 0.117 +## 2 2 0.0242 +## 3 3 0.0564 +## 4 4 -0.0592 +## 5 5 -0.0910+ +
tp <- tidy_poisson(.n = n, .num_sims = ns) +head(tp, 1)+ +
## # A tibble: 1 x 7 +## sim_number x y dx dy p q +## <fct> <int> <int> <dbl> <dbl> <dbl> <dbl> +## 1 1 1 1 -1.11 0.00423 0.368 0+ +
tp %>% group_by(sim_number) %>% summarise(mean_val = mean(y))+ +
## # A tibble: 5 x 2 +## sim_number mean_val +## <fct> <dbl> +## 1 1 1.07 +## 2 2 0.88 +## 3 3 0.94 +## 4 4 0.98 +## 5 5 1.05+ +
We see that there is some random variation in the data…so what does it look like?
+ +Hello tidy_autoplot()
Let’s take a look at both tn
and tp
tn %>% tidy_autoplot()+ +
tp %>% tidy_autoplot()+ +
Plots can also be interactive. Tried to make them as automatic as possible.
+ +Can we combine distributions and visualize them? Yes!
+ +We can do this with tidy_combine_distributions()
and tidy_combined_autoplot()
How does it work?
+ +combined_tbl <- tidy_combine_distributions( + tidy_normal(), + tidy_beta() +) +head(combined_tbl, 5)+ +
## # A tibble: 5 x 8 +## sim_number x y dx dy p q dist_type +## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> +## 1 1 1 -0.739 -2.65 0.000518 0.5 -Inf Gaussian c(0, 1) +## 2 1 2 2.01 -2.52 0.00149 0.508 -2.05 Gaussian c(0, 1) +## 3 1 3 0.148 -2.39 0.00383 0.516 -1.74 Gaussian c(0, 1) +## 4 1 4 0.341 -2.26 0.00879 0.524 -1.54 Gaussian c(0, 1) +## 5 1 5 -0.152 -2.13 0.0180 0.533 -1.39 Gaussian c(0, 1)+ +
combined_tbl %>% + group_by(dist_type) %>% + summarise(mean_val = mean(y)) %>% + ungroup()+ +
## # A tibble: 2 x 2 +## dist_type mean_val +## <fct> <dbl> +## 1 Gaussian c(0, 1) -0.00243 +## 2 Beta c(1, 1, 0) 0.538+ +
combined_tbl %>% tidy_combined_autoplot()+ +
combined_tbl %>% tidy_combined_autoplot(.plot_type = "qq")+ +
What if you want to see different parameters for a single distribution?
+ +tidy_multi_single_dist()
to the rescue!
tidy_multi_single_dist( + .tidy_dist = "tidy_normal", + .param_list = list( + .n = n, + .mean = c(-1, 0, 1), + .sd = 1, + .num_sims = ns + ) +)+ +
tmn %>% with_groups(dist_name, summarise, mean(y))+ +
## # A tibble: 3 x 2 +## dist_name `mean(y)` +## <fct> <dbl> +## 1 Gaussian c(-1, 1) -0.976 +## 2 Gaussian c(0, 1) -0.0289 +## 3 Gaussian c(1, 1) 0.934+ +
tmn %>% tidy_multi_dist_autoplot()+ +
Lets pick on the beta
distribution.
Lets define a vector x
+ +x <- tidy_beta(.n = n, .shape1 = 0.1, .shape2 = 0.4)$y+ +
Outputs a list object, here we want to see the estimates tibble.
+ +util_beta_param_estimate(x)$parameter_tbl %>% + select(dist_type, shape1, shape2, method)+ +
## There was no need to scale the data.+ +
## # A tibble: 3 x 4 +## dist_type shape1 shape2 method +## <chr> <dbl> <dbl> <chr> +## 1 Beta 16.2 83.8 Bayes +## 2 Beta 0.0967 0.500 NIST_MME +## 3 Beta 0.0993 0.513 EnvStats_MME+ +
util_beta_param_estimate(x)$combined_data_tbl %>% + tidy_combined_autoplot()+ +