-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
266 lines (208 loc) · 10 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
---
title: "Microbiology Practical: analysis of ß-galactosidase assays"
author: L. Paoli
date: 2019-03-07
output:
github_document:
toc: true
toc_depth: 2
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
This document presents different options to analyse the data from the beta-galactosidase
assays conducted during the Microbiology Practicals (Exp. 6.6).
After a quick introduction, a first part describes the code suggested in the practical to
perform the analysis, a second part explores more complex analyses and visualisations.
This can, but does not have to, be used for writing the report.
## Introduction : R Basics
A line starting with a hastag (#) is a comment :
```#This is a comment, it is not doing anything```
You can store objects, *i.e. numbers, strings, vectors or data.frames (= tables)* in a
**variable**. Subsequently calling the variable will return the object.
```{r}
a = 2
b = 3
a
c = a + b
c
str = 'Hello world'
str
```
A very convenient type of object are the functions. A function is an easy way to call a
chunk of prewritten code. As example you can compute a sum, calculate a mean or bind
objects into a vector.
```{r}
vect = c(a, b, c)
vect
sum(vect)
```
Whenever you are note sure of what a function does, you can look at its documentation :
```help(sum)```
## Suggested scripting for the report
### Loading data
The data is already available in a spreadsheet and the file can be loaded directly.
```{r}
# Read the into a data.frame
df <- read.table("FS_experiment_6_6_OD summary round1.tsv",header=TRUE)
# Select the results of our group (called group 4 in this case)
df <- df[df$biological_replicate=='4',-1]
# Derive activity according to the formula
v <- 0.2 # The value of the volume (0.2 ml) of the culture used for the experiment
df$activity <- (df$OD420*1000)/(v*df$dt*df$OD600)
# Print the table
df
```
### Visualise results
A good way to visualise this type of data is to make a box-and-whisker plot to show
activity by condition.
```{r}
boxplot(df$activity ~ df$condition, ylab="Miller Units", xlab="Condition")
```
### Statistical analysis
For each pair of reactions you can perform a Student’s t-test to determine if they are
statistically different from each other.
Exemple comparing condition *aerobic with 200µL of glutamin* and with the condition
*anaerobic with 200µL of glutamin*:
```{r}
t.test(df$activity[df$condition=='aerobic200'],df$activity[df$condition=='anaerobic200'])
```
In this case, the p-value indicates that the results for the two conditions are
significantly different from one another (p < 0.05).
**To do:**
1. Test each pair of reactions. Have a look at ```help(pairwise.t.test)```.
2. The t-test assumes that the data is normally distributed, is this reasonable in this case?
### Application to a larger dataset (from all groups)
Note that there is an additional column to show which group the measurement came from.
**To do:**
1. Repeat the calculation of activity.
2. Visualise the data by group, reaction, and by both group and reaction separately.
3. Test for statistical difference between pairs of groups in each reaction.
4. Do the same reactions from different groups agree with each other or is there a strong
‘batch effect’?
5. If you can combine the data from some or all of the different groups, which pairs of
reactions now show significant difference from each other?
## More advanced version of the analysis
### Loading and processing data
```{r tidyverse readxl}
## import libraries to use additional functions
suppressMessages(library(tidyverse)) # if it failes, use : 'install.packages('tidyverse')' first
library(readxl) # if it failes, use : 'install.packages('readxl')' first
## Load the data directly form the excel file
# Check the current directory:
getwd()
# Check if the file is there:
list.files()
# If you need to change your directory, use 'setwd('/where/you/need/to/go/')'
# define the name of the file. You can directly use '/where/you/need/to/go/file_name.xlsx'
# if you want to access the file without changing directory.
file_path = 'FS_experiment_6_6_OD summary round1.xlsx'
# Load the file into a data.frame using the column names:
betagal_results = read_excel(file_path, col_names = T)
# Look at the data.frame:
betagal_results
## Format and process the data
# Create a variable for Reactions, combining condition and gln :
betagal_results = mutate(betagal_results,
Condition = paste(gsub('[0-9]', '', condition),
', ',
gsub('[a-z]', '', condition),
'µl of gln',
sep = ''))
# Transform as a factor to keep the same order as the script
betagal_results$Condition = factor(betagal_results$Condition,
levels = unique(betagal_results$Condition))
# Add group to biological replicates and format as factor for compatibility.
betagal_results$biological_replicate =
factor(paste('Group', betagal_results$biological_replicate))
# Compute and add activity into the data.frame using the equation of the script:
v = 0.2
# Note that you can directly call the column by names inside the function, using `` when
# the variable name has spaces. In base R you would use betagal_results$column
betagal_results = mutate(betagal_results, Activity = (OD420*1000)/(v*dt_min*OD600))
```
### Visualisation of the data
``` {r}
# Here, I suggest we use a wdely used visualistion package called ggplot2.
# Although the syntax is quite different from usual R, it provides nice plotting options.
# First we can look at the activity between conditions, i.e. combining biological
# and technical replicates.
# Here I define a set of custom colors
custom_cols = c('#edf8b1', '#7fcdbb', '#1d91c0', '#0c2c84')
# Note @Till : if you use fill, color or another aes parameter with a characters or
# factors (i.e. not a continuous variable), group will inherit the values so you don't
# need to specify it.
ggplot(data = betagal_results) +
geom_boxplot(aes(y = Activity, fill = Condition, col = Condition), alpha = .7) +
theme_bw() + theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ylab('Enzymatic activity (Miller unit)') +
ggtitle('Distribution of the activity across reactions') +
scale_fill_manual(values = custom_cols) +
scale_color_manual(values = custom_cols)
```
```{r}
# We can also observe the variability of the results among groups.
ggplot(data = betagal_results) +
geom_boxplot(aes(y = Activity, fill = Condition, col = Condition), alpha = .7) +
facet_wrap(~biological_replicate) +
theme_bw() + theme(axis.text.x = element_blank(),
axis.ticks.x = element_blank()) +
ylab('Enzymatic activity (Miller unit)') +
ggtitle('Distribution of the activity across reactions and biological replicates') +
scale_fill_manual(values = custom_cols) +
scale_color_manual(values = custom_cols)
# Note that with this notation, we are group 4.
```
### Statistical analysis
The usual approach would be to use t-test pairwise. Note that when doing so, you need to
correct for multiple testing, indeed, if you do a 100 random pairwise comparisons with a
confidence interval of 95%, chances are you will have 5 false positives results.
``` {r}
# All the pairwise tests between conditions using the default correction (Holmes):
pairwise.t.test(betagal_results$Activity, betagal_results$Condition)
# Here with the more stringent Bonferroni correction
pairwise.t.test(betagal_results$Activity, betagal_results$Condition,
p.adjust.method = 'bonferroni')
# The tests can also be perfomed between groups :
pairwise.t.test(betagal_results$Activity, betagal_results$biological_replicate)
```
However, if you want to test two factors at once, it's better to use an anova.
Anova stands for Analysis of Variance and there are several functions in R, e.g. aov(). It
builds upon a linear model and basically is a t-test for multiple explanatory factors.
It is more appropriate as you don't test each effect separately. This is a be a problem
when the two factors are autocorrelated, testing them independently means you end up
double counting some the variance explained.
``` {r}
# Compute the anova
summary(anova_betagal <- aov(data = betagal_results,
Activity ~ Condition + biological_replicate))
```
As you can see, this does not give us a pairwise p-value, but only one per factor.
To compute the statistical significance between pairs, a post-hoc test is needed.
This post-hoc test will be conducted below foor the case of interactions between
biological replicates (groups) and conditions. For convenience I use shorter names and
filter for significant comparisons only.
```{r}
# Do the pairwise testing using a post-hoc test:
TukeyHSD(anova_betagal, 'Condition', conf.level = .95)
# This can be tested for all the combinations of groups and reactions. Not all of them are
# interesting, but some might help you to adress specific questions.
betagal_results = mutate(betagal_results,
biological_replicate = gsub('roup ', '', biological_replicate),
Condition = rep(c('R1', 'R2', 'R3', 'R4'), each = 29))
post_hoc_res =
TukeyHSD(aov(data = betagal_results, Activity ~ biological_replicate*Condition),
'biological_replicate:Condition', conf.level = .95)
post_hoc_df = as_tibble(post_hoc_res$`biological_replicate:Condition`,
rownames = 'Comparison')
filter(post_hoc_df, `p adj` <= 0.05)
```
**The to do list is identical, you just have more tools to address the questions!**
1. Repeat the calculation of activity.
2. Visualise the data by group, reaction, and by both group and reaction separately.
3. Test for statistical difference between pairs of groups in each reaction.
4. Do the same reactions from different groups agree with each other or is there a strong
‘batch effect’?
5. If you can combine the data from some or all of the different groups, which pairs of
reactions now show significant difference from each other?