---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# scimo <a href="https://abichat.github.io/scimo/"><img src="man/figures/logo.png" align="right" height="138" alt="scimo website" /></a>

```{r, echo = FALSE}
version <- as.vector(read.dcf('DESCRIPTION')[, 'Version'])
version <- gsub('-', '.', version)
```

<!-- badges: start -->
![packageversion](https://img.shields.io/badge/version-`r version`-orange.svg)
[![R-CMD-check](https://github.com/abichat/scimo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/abichat/scimo/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/scimo)](https://CRAN.R-project.org/package=scimo)
<!-- badges: end -->

**scimo** provides extra recipes steps for dealing with omics data, while also being adaptable to other data types.


## Installation

You can install **scimo** from GitHub with:

``` r
# install.packages("remotes")
remotes::install_github("abichat/scimo")
```

## Example

The `cheese_abundance` dataset describes the fungal community abundance of 74 Amplicon Sequence Variants (ASVs) sampled from the surface of three different French cheeses.

```{r data, message=FALSE}
library(scimo)
data("cheese_abundance", "cheese_taxonomy")

cheese_abundance

glimpse(cheese_taxonomy)
```

```{r}
list_family <- split(cheese_taxonomy$asv, cheese_taxonomy$family)
head(list_family, 2)
```

```{r seed, echo=FALSE}
set.seed(42)
```

The following recipe will:

1. aggregate the ASV variables at the family level, as defined by `list_family`;
2. transform counts into proportions;
3. discard variables whose p-values from a Kruskal-Wallis test against `cheese` are above 0.05.


```{r recipe}
rec <-
  recipe(cheese ~ ., data = cheese_abundance) %>% 
  step_aggregate_list(all_numeric_predictors(),
                      list_agg = list_family, fun_agg = sum) %>%
  step_rownormalize_tss(all_numeric_predictors()) %>% 
  step_select_kruskal(all_numeric_predictors(), 
                      outcome = "cheese", cutoff = 0.05) %>%
  prep()

rec

bake(rec, new_data = NULL)
```
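
To make the first two steps concrete, here is a minimal base R sketch on a toy count matrix. It only illustrates the idea and is not how **scimo** implements these steps: `counts` and `groups` are invented names and values, and the sketch assumes that total sum scaling means dividing each row by its total.

```r
# Toy counts: 2 samples (rows) x 3 ASVs (columns); values are invented
counts <- matrix(c(10, 30, 60,
                    5,  5, 10),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("s1", "s2"), c("asv1", "asv2", "asv3")))

# Hypothetical grouping of ASVs into families, shaped like `list_family`
groups <- list(famA = c("asv1", "asv2"), famB = "asv3")

# Step 1: aggregate the columns of each group by summing them
agg <- sapply(groups, function(cols) rowSums(counts[, cols, drop = FALSE]))

# Step 2: divide each row by its total so that every row sums to 1
prop <- agg / rowSums(agg)
prop
```

After the second step, each sample's values sum to 1, so samples with different sequencing depths become comparable.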

To see which variables are kept and the associated p-values, you can use the `tidy` method on the third step:

```{r tidy}
tidy(rec, 3)
```


## Notes

### `protection stack overflow` error

If you have a very large dataset, you may encounter this error: 

```{r error, error=TRUE}
data("pedcan_expression")
recipe(disease ~ ., data = pedcan_expression) %>% 
    step_select_cv(all_numeric_predictors(), prop_kept = 0.1) 
```

It is linked to [how **R** handles many variables in formulas](https://github.com/tidymodels/recipes/issues/467). To solve it, pass only the dataset to `recipe()` and manually update roles with `update_role()`, as in the example below:

```{r fix}
recipe(pedcan_expression) %>% 
  update_role(disease, new_role = "outcome") %>% 
  update_role(-disease, new_role = "predictor") %>% 
  step_select_cv(all_numeric_predictors(), prop_kept = 0.1) 
```


### Steps for variable selection

Like [**colino**](https://github.com/stevenpawley/colino), **scimo** provides three arguments for variable selection steps based on a statistic: `n_kept`, `prop_kept` and `cutoff`.

* `n_kept` and `prop_kept` set how many variables are kept in the preprocessed dataset, either as an exact count or as a proportion of the variables in the original dataset. They are mutually exclusive.

* `cutoff` removes variables whose statistic is below (or above, depending on the step) the given threshold. It can be used alone or in combination with either of the other two.
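
The way these arguments interact can be sketched in base R. The helper `keep_vars()` below is hypothetical and is not part of **scimo**. It assumes that a larger statistic marks a more relevant variable, that `prop_kept` is rounded up to a whole number of variables, and that `cutoff` is applied on top of the count-based filter. The actual steps may differ on each of these points.

```r
# Hypothetical helper, not part of scimo: returns the names of kept variables.
# Assumes that a larger statistic means a more relevant variable.
keep_vars <- function(stat, n_kept = NULL, prop_kept = NULL, cutoff = NULL) {
  if (!is.null(n_kept) && !is.null(prop_kept)) {
    stop("`n_kept` and `prop_kept` are mutually exclusive.")
  }
  # Convert a proportion into a number of variables (rounding up is assumed)
  if (!is.null(prop_kept)) {
    n_kept <- ceiling(prop_kept * length(stat))
  }
  kept <- rep(TRUE, length(stat))
  # Keep only the `n_kept` variables with the largest statistic
  if (!is.null(n_kept)) {
    kept <- kept & rank(-stat, ties.method = "min") <= n_kept
  }
  # Additionally drop variables whose statistic is below the cutoff
  if (!is.null(cutoff)) {
    kept <- kept & stat >= cutoff
  }
  names(stat)[kept]
}

cv <- c(g1 = 0.9, g2 = 0.2, g3 = 0.5, g4 = 0.7)  # invented statistics
keep_vars(cv, n_kept = 2)                      # count-based selection
keep_vars(cv, prop_kept = 0.75, cutoff = 0.6)  # proportion, then cutoff
```

In this sketch, combining `cutoff` with `n_kept` or `prop_kept` can only shrink the selection further, which is one reasonable reading of "in combination".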

### Dependencies

**scimo** doesn't introduce any additional dependencies compared to **recipes**.