README.Rmd

---
title: "Microbiology Practical: analysis of ß-galactosidase assays"
author: L. Paoli
date: 2019-03-07
output:
  github_document:
    toc: true
    toc_depth: 2
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


This document presents different options to analyse the data from the beta-galactosidase 
assays conducted during the Microbiology Practicals (Exp. 6.6).

After a quick introduction, a first part describes the code suggested in the practical to 
perform the analysis, a second part explores more complex analyses and visualisations. 
This can, but does not have to, be used for writing the report.

## Introduction : R Basics

A line starting with a hastag (#) is a comment :

```#This is a comment, it is not doing anything```

You can store objects, *i.e. numbers, strings, vectors or data.frames (= tables)* in a 
**variable**. Subsequently calling the variable will return the object.
```{r}
a = 2
b = 3
a
c = a + b
c
str = 'Hello world'
str
```

A very convenient type of object are the functions. A function is an easy way to call a
chunk of prewritten code. As example you can compute a sum, calculate a mean or bind 
objects into a vector.

```{r}
vect = c(a, b, c)
vect
sum(vect)
```

Whenever you are note sure of what a function does, you can look at its documentation :
```help(sum)```

## Suggested scripting for the report

### Loading data

The data is already available in a spreadsheet and the file can be loaded directly.

```{r}
# Read the into a data.frame
df <- read.table("FS_experiment_6_6_OD summary round1.tsv",header=TRUE)
# Select the results of our group (called group 4 in this case)
df <- df[df$biological_replicate=='4',-1]

# Derive activity according to the formula
v <- 0.2 # The value of the volume (0.2 ml) of the culture used for the experiment
df$activity <- (df$OD420*1000)/(v*df$dt*df$OD600)
# Print the table
df
```

### Visualise results

A good way to visualise this type of data is to make a box-and-whisker plot to show 
activity by condition.

```{r}
boxplot(df$activity ~ df$condition, ylab="Miller Units", xlab="Condition")
```

### Statistical analysis

For each pair of reactions you can perform a Student’s t-test to determine if they are 
statistically different from each other.

Exemple comparing condition *aerobic with 200µL of glutamin* and with the condition 
*anaerobic with 200µL of glutamin*:

```{r}
t.test(df$activity[df$condition=='aerobic200'],df$activity[df$condition=='anaerobic200'])
```

In this case, the p-value indicates that the results for the two conditions are 
significantly different from one another (p < 0.05).

**To do:**

1. Test each pair of reactions. Have a look at ```help(pairwise.t.test)```.
2. The t-test assumes that the data is normally distributed, is this reasonable in this case?

### Application to a larger dataset (from all groups)

Note that there is an additional column to show which group the measurement came from.

**To do:**

1. Repeat the calculation of activity.
2. Visualise the data by group, reaction, and by both group and reaction separately.
3. Test for statistical difference between pairs of groups in each reaction.
4. Do the same reactions from different groups agree with each other or is there a strong 
‘batch effect’?
5. If you can combine the data from some or all of the different groups, which pairs of 
reactions now show significant difference from each other?

## More advanced version of the analysis

### Loading and processing data

```{r tidyverse readxl}
## import libraries to use additional functions 
suppressMessages(library(tidyverse)) # if it failes, use : 'install.packages('tidyverse')' first
library(readxl) # if it failes, use : 'install.packages('readxl')' first

## Load the data directly form the excel file 
# Check the current directory:
getwd()
# Check if the file is there:
list.files()
# If you need to change your directory, use 'setwd('/where/you/need/to/go/')'
# define the name of the file. You can directly use '/where/you/need/to/go/file_name.xlsx'
# if you want to access the file without changing directory.
file_path = 'FS_experiment_6_6_OD summary round1.xlsx' 
# Load the file into a data.frame using the column names:
betagal_results = read_excel(file_path, col_names = T)
# Look at the data.frame:
betagal_results

## Format and process the data 
# Create a variable for Reactions, combining condition and gln :
betagal_results = mutate(betagal_results,
                         Condition = paste(gsub('[0-9]', '', condition),
                                           ', ',
                                           gsub('[a-z]', '', condition),
                                           'µl of gln',
                                           sep = ''))
# Transform as a factor to keep the same order as the script
betagal_results$Condition = factor(betagal_results$Condition,
                                   levels = unique(betagal_results$Condition))
# Add group to biological replicates and format as factor for compatibility.
betagal_results$biological_replicate = 
  factor(paste('Group', betagal_results$biological_replicate))
# Compute and add activity into the data.frame using the equation of the script:
v = 0.2
# Note that you can directly call the column by names inside the function, using `` when
# the variable name has spaces. In base R you would use betagal_results$column
betagal_results = mutate(betagal_results, Activity = (OD420*1000)/(v*dt_min*OD600))
```


### Visualisation of the data

``` {r}
# Here, I suggest we use a wdely used visualistion package called ggplot2.
# Although the syntax is quite different from usual R, it provides nice plotting options.

# First we can look at the activity between conditions, i.e. combining biological 
# and technical replicates.

# Here I define a set of custom colors
custom_cols = c('#edf8b1', '#7fcdbb', '#1d91c0', '#0c2c84')

# Note @Till : if you use fill, color or another aes parameter with a characters or
# factors (i.e. not a continuous variable), group will inherit the values so you don't
# need to specify it.
ggplot(data = betagal_results) + 
  geom_boxplot(aes(y = Activity, fill = Condition, col = Condition), alpha = .7) +
  theme_bw() + theme(axis.text.x = element_blank(),
                     axis.ticks.x = element_blank()) +
  ylab('Enzymatic activity (Miller unit)') +
  ggtitle('Distribution of the activity across reactions') +
  scale_fill_manual(values = custom_cols) +
  scale_color_manual(values = custom_cols)
```

```{r}
# We can also observe the variability of the results among groups.
ggplot(data = betagal_results) + 
  geom_boxplot(aes(y = Activity, fill = Condition, col = Condition), alpha = .7) +
  facet_wrap(~biological_replicate) +
  theme_bw() + theme(axis.text.x = element_blank(),
                     axis.ticks.x = element_blank()) +
  ylab('Enzymatic activity (Miller unit)') +
  ggtitle('Distribution of the activity across reactions and biological replicates') +
  scale_fill_manual(values = custom_cols) +
  scale_color_manual(values = custom_cols)

# Note that with this notation, we are group 4.
```

### Statistical analysis

The usual approach would be to use t-test pairwise. Note that when doing so, you need to 
correct for multiple testing, indeed, if you do a 100 random pairwise comparisons with a
confidence interval of 95%, chances are you will have 5 false positives results.

``` {r}
# All the pairwise tests between conditions using the default correction (Holmes):
pairwise.t.test(betagal_results$Activity, betagal_results$Condition)
# Here with the more stringent Bonferroni correction
pairwise.t.test(betagal_results$Activity, betagal_results$Condition, 
                p.adjust.method = 'bonferroni')

# The tests can also be perfomed between groups :
pairwise.t.test(betagal_results$Activity, betagal_results$biological_replicate)
```

However, if you want to test two factors at once, it's better to use an anova.

Anova stands for Analysis of Variance and there are several functions in R, e.g. aov(). It
builds upon a linear model and basically is a t-test for multiple explanatory factors.

It is more appropriate as you don't test each effect separately. This is a be a problem
when the two factors are autocorrelated, testing them independently means you end up
double counting some the variance explained.

``` {r}
# Compute the anova
summary(anova_betagal <- aov(data = betagal_results,
                             Activity ~ Condition + biological_replicate))
```

As you can see, this does not give us a pairwise p-value, but only one per factor.
To compute the statistical significance between pairs, a post-hoc test is needed.

This post-hoc test will be conducted below foor the case of interactions between 
biological replicates (groups) and conditions. For convenience I use shorter names and
filter for significant comparisons only.

```{r}
# Do the pairwise testing using a post-hoc test:
TukeyHSD(anova_betagal, 'Condition', conf.level = .95)
# This can be tested for all the combinations of groups and reactions. Not all of them are
# interesting, but some might help you to adress specific questions.
betagal_results = mutate(betagal_results,
                         biological_replicate = gsub('roup ', '', biological_replicate),
                         Condition = rep(c('R1', 'R2', 'R3', 'R4'), each = 29))
post_hoc_res = 
  TukeyHSD(aov(data = betagal_results, Activity ~ biological_replicate*Condition),
           'biological_replicate:Condition', conf.level = .95)
post_hoc_df = as_tibble(post_hoc_res$`biological_replicate:Condition`,
                        rownames = 'Comparison')
filter(post_hoc_df, `p adj` <= 0.05)
```

**The to do list is identical, you just have more tools to address the questions!**

1. Repeat the calculation of activity.
2. Visualise the data by group, reaction, and by both group and reaction separately.
3. Test for statistical difference between pairs of groups in each reaction.
4. Do the same reactions from different groups agree with each other or is there a strong 
‘batch effect’?
5. If you can combine the data from some or all of the different groups, which pairs of 
reactions now show significant difference from each other?