---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```
# scimo <a href="https://abichat.github.io/scimo/"><img src="man/figures/logo.png" align="right" height="138" alt="scimo website" /></a>
```{r, echo = FALSE}
version <- as.vector(read.dcf('DESCRIPTION')[, 'Version'])
version <- gsub('-', '.', version)
```
<!-- badges: start -->
![packageversion](https://img.shields.io/badge/version-`r version`-orange.svg)
[![R-CMD-check](https://github.com/abichat/scimo/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/abichat/scimo/actions/workflows/R-CMD-check.yaml)
[![CRAN status](https://www.r-pkg.org/badges/version/scimo)](https://CRAN.R-project.org/package=scimo)
<!-- badges: end -->
**scimo** provides extra **recipes** steps designed for omics data that can also be applied to other data types.
## Installation
You can install **scimo** from GitHub with:
``` r
# install.packages("remotes")
remotes::install_github("abichat/scimo")
```
## Example
The `cheese_abundance` dataset describes the fungal community abundance of 74 Amplicon Sequence Variants (ASVs) sampled from the surface of three different French cheeses.
```{r data, message=FALSE}
library(scimo)
data("cheese_abundance", "cheese_taxonomy")
cheese_abundance
glimpse(cheese_taxonomy)
```
```{r}
list_family <- split(cheese_taxonomy$asv, cheese_taxonomy$family)
head(list_family, 2)
```
```{r seed, echo=FALSE}
set.seed(42)
```
The following recipe will:

1. aggregate the ASV variables at the family level, as defined by `list_family`;
2. transform counts into proportions;
3. discard variables whose p-values are above 0.05 in a Kruskal-Wallis test against `cheese`.

```{r recipe}
rec <-
  recipe(cheese ~ ., data = cheese_abundance) %>%
  step_aggregate_list(all_numeric_predictors(),
                      list_agg = list_family, fun_agg = sum) %>%
  step_rownormalize_tss(all_numeric_predictors()) %>%
  step_select_kruskal(all_numeric_predictors(),
                      outcome = "cheese", cutoff = 0.05) %>%
  prep()

rec

bake(rec, new_data = NULL)
```
To see which variables are kept and the associated p-values, you can use the `tidy` method on the third step:
```{r tidy}
tidy(rec, 3)
```
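
For intuition, the three steps of the recipe above can be approximated in base R. This is only a sketch on a made-up count table: the `asv_*` columns, the `families` list and the `group` factor are hypothetical, and the code is not how **scimo** implements its steps internally.

``` r
# Toy count table: 6 samples, 4 hypothetical ASVs (not the cheese data).
counts <- data.frame(
  asv_1 = c(6, 9, 7, 30, 28, 35),
  asv_2 = c(4, 6, 5, 20, 20, 20),
  asv_3 = c(30, 20, 25, 22, 28, 26),
  asv_4 = c(60, 65, 63, 28, 24, 19)
)
group <- factor(rep(c("A", "B"), each = 3))

# Hypothetical family membership, in the same shape as `list_family`.
families <- list(
  fam_x = c("asv_1", "asv_2"),
  fam_y = "asv_3",
  fam_z = "asv_4"
)

# 1. Aggregate: sum the ASV columns that belong to each family.
agg <- sapply(families, function(cols) rowSums(counts[, cols, drop = FALSE]))

# 2. Total sum scaling: divide each row by its total so that rows sum to 1.
props <- agg / rowSums(agg)

# 3. Kruskal-Wallis p-value per family, then keep families at or below 0.05.
pvals <- apply(props, 2, function(x) kruskal.test(x, group)$p.value)
kept <- props[, pvals <= 0.05, drop = FALSE]

colnames(kept)
```

In this toy table, `fam_x` and `fam_z` separate the two groups completely and are kept, while `fam_y` is interleaved between groups and is dropped.
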
## Notes
### `protection stack overflow` error
If your dataset has a very large number of variables, you may encounter this error:
```{r error, error=TRUE}
data("pedcan_expression")
recipe(disease ~ ., data = pedcan_expression) %>%
  step_select_cv(all_numeric_predictors(), prop_kept = 0.1)
```
It is linked to [how **R** handles many variables in formulas](https://github.com/tidymodels/recipes/issues/467). To work around it, pass only the dataset to `recipe()` and set the roles manually with `update_role()`, as in the example below:
```{r fix}
recipe(pedcan_expression) %>%
  update_role(disease, new_role = "outcome") %>%
  update_role(-disease, new_role = "predictor") %>%
  step_select_cv(all_numeric_predictors(), prop_kept = 0.1)
```
### Steps for variable selection
Like [**colino**](https://github.com/stevenpawley/colino), **scimo** offers three arguments for variable selection steps based on a statistic: `n_kept`, `prop_kept` and `cutoff`.

* `n_kept` and `prop_kept` control how many variables are kept in the preprocessed dataset, either as an exact count of variables or as a proportion of the variables in the original dataset. They are mutually exclusive.
* `cutoff` removes variables whose statistic is below it (or above it, depending on the step). It can be used alone or in combination with either of the other two.

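
As a rough illustration of how these three arguments can interact, here is a minimal base R sketch. `select_by_stat()` is a hypothetical helper, not part of **scimo** or **colino**. It assumes that larger statistics are better and that `prop_kept` is rounded up, and neither assumption is guaranteed to match a given step.

``` r
# Hypothetical helper, for illustration only: returns the names of the
# variables to keep, given a named vector of per-variable statistics.
# Assumptions: larger statistics are better, and `prop_kept` is rounded up.
select_by_stat <- function(stat, n_kept = NULL, prop_kept = NULL,
                           cutoff = NULL) {
  if (!is.null(n_kept) && !is.null(prop_kept)) {
    stop("`n_kept` and `prop_kept` are mutually exclusive.")
  }
  if (!is.null(prop_kept)) {
    # Convert the proportion into a count of variables.
    n_kept <- ceiling(prop_kept * length(stat))
  }
  keep <- rep(TRUE, length(stat))
  if (!is.null(n_kept)) {
    # Keep the `n_kept` variables with the largest statistics.
    keep <- keep & rank(-stat, ties.method = "first") <= n_kept
  }
  if (!is.null(cutoff)) {
    # Drop variables whose statistic falls below the threshold.
    keep <- keep & stat >= cutoff
  }
  names(stat)[keep]
}

stat <- c(a = 0.9, b = 0.2, c = 0.6, d = 0.4)

select_by_stat(stat, n_kept = 3)                # three largest statistics
select_by_stat(stat, prop_kept = 0.5)           # top half of the variables
select_by_stat(stat, cutoff = 0.3)              # threshold only
select_by_stat(stat, n_kept = 3, cutoff = 0.5)  # count, then threshold
```

The last call shows why combining `cutoff` with a count can return fewer variables than `n_kept`: the threshold is applied on top of the count rather than instead of it.
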
### Dependencies
**scimo** doesn't introduce any additional dependencies compared to **recipes**.