---
title: "Normalization and Statistical Analysis"
author: "Christian Ayala"
output:
  html_document:
    df_print: paged
  html_notebook: default
  pdf_document: default
editor_options:
  chunk_output_type: console
---

This notebook normalizes the area under the curve (AUC) of the peaks detected by *Compound Discoverer* and runs the statistical analysis on the normalized data.

# 1. Importing Libraries

```{r libraries, message=FALSE, warning=FALSE}
library(tidyverse)
library(readxl)
library(ggpubr)
library(ggsci)
library(gridExtra)
library(vegan)
library(factoextra)
library(rstatix)
source('functions_cdis_norm_stats.R')
```

# 2. Import data

Set whether the input data are labeled or unlabeled.

```{r}
# Flag for labeled / unlabeled data: set to TRUE (labeled) or FALSE (unlabeled)

label <- TRUE
```

The input data is the **compounds table** generated by the previous scripts. For unlabeled samples the gap-filled version of this table is used, because some tests, such as PCA, do not tolerate many zeroes or missing values.

```{r set_path, message=FALSE}
# set path variables
project_dir <- getwd()
project_name <- 'Bog_labeled_all'
figures_dir <- file.path(project_dir, paste0(project_name, '_output_figures'))
tables_dir <- file.path(project_dir,  paste0(project_name, '_output_tables'))

# For unlabeled samples, use gap_filled_compounds_table.csv; for labeled samples, use compounds_table.csv
if (label) {
  compounds_table_file <- file.path(tables_dir, 'compounds_table.csv')
} else {
  compounds_table_file <- file.path(tables_dir, 'gap_filled_compounds_table.csv')
}

# Load compounds_table

compounds_table <- read_csv(compounds_table_file)
labeled_compounds_table <- read_csv(file.path(tables_dir, 'labeled_compounds_table_in_all.csv'))


# Import metadata and fix names
metadata_file <- file.path(tables_dir, 'fixed_metadata.csv')
metadata <- read_csv(metadata_file)
label_metadata <- read_csv(file.path(tables_dir, 'fixed_labeled_metadata.csv'))

```

# 3. Data Manipulation and Transformation

```{r Data_manipulation}
# Create a new tibble with the AUC of each feature in each sample
auc_table <- labeled_compounds_table %>% 
  select(FeatureID, SampleID, AUC)

# Reshape into a wide, matrix-like table: one row per feature, one column per sample
# (pivot_wider() replaces the superseded spread())

auc_table <- pivot_wider(auc_table, names_from = SampleID, values_from = AUC)
auc_table$FeatureID <- factor(auc_table$FeatureID, levels = str_sort(auc_table$FeatureID, numeric = TRUE))
auc_table <- auc_table %>% 
  arrange(FeatureID)

# Keep only the labeled samples and the BNC column, then save the untransformed data

auc_table <- auc_table %>% 
  select(FeatureID, all_of(label_metadata$SampleID), BNC) %>% 
  column_to_rownames(var = 'FeatureID')
  

table_file <- file.path(tables_dir, 'labeled_raw_auc_table.csv')
write.csv(auc_table, table_file, row.names = TRUE)

```
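
The reshaping step above can be illustrated on a toy table. This is only a sketch: the feature and sample names are made up, and base R `tapply()` stands in for the tidyverse call so that the example has no dependencies.

```r
# Toy long-format table (hypothetical feature and sample names)
toy_long <- data.frame(
  FeatureID = c("F1", "F1", "F2", "F10", "F10"),
  SampleID  = c("S1", "S2", "S1", "S1", "S2"),
  AUC       = c(100, 200, 50, 10, 30)
)

# One row per feature, one column per sample; combinations that were
# never observed become NA
toy_wide <- tapply(toy_long$AUC,
                   list(toy_long$FeatureID, toy_long$SampleID),
                   sum)

# Natural ordering of the feature names: F1, F2, F10 instead of F1, F10, F2
feature_number <- as.numeric(sub("^F", "", rownames(toy_wide)))
toy_wide <- toy_wide[order(feature_number), , drop = FALSE]
toy_wide
```

Feature `F2` was not detected in sample `S2`, so that cell stays `NA` until the missing values are replaced later in the workflow.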

# 4. Data Normalization by multiple methods

The data are normalized with several methods so that the results can be compared and the most suitable method chosen.

```{r Data_normalization, warning=FALSE}

normalization_plot <- normalize_by_all(auc_table)

figure_file <- file.path(figures_dir, 'all_normalized.boxplot.png')
ggsave(figure_file, normalization_plot, dpi = 300)

```

Based on the plot, select the best normalization method for the samples.
In this case the best normalization method was **median normalization**.
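
As a rough illustration of what median normalization does, the sketch below divides each sample by its own median and optionally log-transforms the result. `toy_median_norm()` is a hypothetical stand-in: the actual `median.norm()` lives in `functions_cdis_norm_stats.R`, which is not shown here, and it may differ in its details (for example in the transformation it applies).

```r
# Hypothetical stand-in for median.norm(); the real function may differ
toy_median_norm <- function(mat, transform_data = TRUE) {
  # Median of each sample (column), ignoring missing values
  sample_medians <- apply(mat, 2, median, na.rm = TRUE)
  # Divide every column by its own median
  normalized <- sweep(mat, 2, sample_medians, "/")
  # Optional log2 transformation (an assumption; the base is a choice)
  if (transform_data) normalized <- log2(normalized)
  normalized
}

# Two toy samples whose intensities differ by a constant factor of 10
toy_auc <- matrix(c(10, 20, 30,
                    100, 200, 300),
                  nrow = 3,
                  dimnames = list(c("F1", "F2", "F3"), c("S1", "S2")))

toy_norm <- toy_median_norm(toy_auc, transform_data = FALSE)
toy_norm
```

After normalization both toy samples have a median of 1, so the tenfold difference in overall intensity no longer separates them.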


```{r Best_normalization}

# Obtain normalized, non-transformed data for differential analysis

norm.matrix <- median.norm(auc_table, transform_data = FALSE)

# Replace missing values with zeroes

norm.matrix[is.na(norm.matrix)] <- 0

norm.matrix_nt <- norm.matrix

# Save the normalized, non-transformed data for differential analysis

table_file <- file.path(tables_dir, 'normalized_untransformed_auc_table.csv')
write.csv(norm.matrix, table_file, row.names = TRUE)

```

For the rest of the analysis in this notebook, the transformed values will be used.

```{r Transformed_norm}
# Obtain normalized, transformed data for multivariate statistical analysis

norm.matrix <- median.norm(auc_table)


# Replace missing values with zeroes

norm.matrix[is.na(norm.matrix)] <- 0

# Save normalized data

table_file <- file.path(tables_dir, 'normalized_transformed_auc_table.csv')
write.csv(norm.matrix, table_file, row.names = TRUE)

```

# 5. Statistical Analysis

## 5.1 NMDS 

Choose whether the analysis is based on relative abundance or on presence/absence.

```{r typeofanalysis}

# This portion of the script is adapted from the statistical analysis of the MetaboTandem and MetaboDirect pipelines

type <- 'ra'

if(type == 'ra'){
  nmds.matrix <- t(norm.matrix)
  dm.method <- 'bray'
  # Bray-Curtis distance matrix because relative abundance mode was selected
  dm <- vegdist(nmds.matrix, method = dm.method)
  print('Relative abundance method selected')
}else if(type == 'pa'){
  nmds.matrix <- decostand(t(norm.matrix), 'pa')
  dm.method <- 'euclidean'
  # Euclidean distance matrix because presence/absence mode was selected
  dm <- vegdist(nmds.matrix, method = dm.method)
  print('Presence/absence method selected')
} else{
  # Stop here: without a valid type no distance matrix is created
  stop('Select analysis method: "pa" for presence absence or "ra" for relative abundance')
}

```
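
To make the two modes concrete, the sketch below computes a Bray-Curtis dissimilarity by hand and shows the 0/1 recoding used for presence/absence. The abundance values and the helper `toy_bray()` are invented for illustration; base R `dist()` is used in place of `vegdist()` so the example runs without extra packages.

```r
# Bray-Curtis dissimilarity between two samples, written out explicitly
toy_bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Toy abundance matrix: samples in rows, features in columns
toy_abund <- rbind(S1 = c(4, 0, 6),
                   S2 = c(2, 2, 6),
                   S3 = c(0, 10, 0))

# Relative-abundance mode: the magnitudes matter
toy_bray(toy_abund["S1", ], toy_abund["S2", ])

# Presence/absence mode: recode to 0/1 first, then use Euclidean distance
toy_pa <- (toy_abund > 0) * 1
dist(toy_pa, method = "euclidean")
```

In presence/absence mode `S1` and `S2` differ only in the second feature, so their distance depends on which features were detected and not on how abundant they were.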

Perform the NMDS analysis.

**A good rule of thumb for interpreting the stress value**:

- < 0.05 provides an excellent representation in reduced dimensions,
- < 0.1 is great,
- < 0.2 is good/ok,
- < 0.3 provides a poor representation.
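
Assuming the thresholds above refer to the NMDS stress value, they can be turned into a small lookup helper. `toy_stress_label()` is a hypothetical function, not part of vegan. The list does not say how to read values of 0.3 and above, so this sketch lumps them into "poor" as well.

```r
# Map an NMDS stress value to the rule-of-thumb categories listed above
toy_stress_label <- function(stress) {
  cut(stress,
      breaks = c(-Inf, 0.05, 0.1, 0.2, Inf),
      labels = c("excellent", "great", "good/ok", "poor"),
      right = FALSE)  # intervals are [lower, upper), matching the "<" rules
}

toy_labels <- as.character(toy_stress_label(c(0.03, 0.08, 0.15, 0.25)))
toy_labels
```

Once the model has been fitted, the same helper could be applied to the stress stored in the `metaMDS()` result, for example `toy_stress_label(nmds$stress)`.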


```{r nmds}

set.seed(123)
nmds <- metaMDS(dm,
                k = 2,
                maxit = 999,
                trymax = 500,
                wascores = TRUE)
stressplot(nmds)

# Extract nmds scores for plotting

# Request the site (sample) scores explicitly so the result is a single matrix
nmds.scores <- as.data.frame(scores(nmds, display = 'sites'))
nmds.scores <- rownames_to_column(nmds.scores, var = 'SampleID')
nmds.scores <- left_join(nmds.scores, metadata, by = 'SampleID')

nmds_plot <- plot_nmds(nmds.scores, SampleID, Time) +
  labs(title = 'NMDS plot by relative abundance') +
  scale_shape_manual(values = c(15, 16, 17, 18)) +
  scale_color_manual(values = get_palette('lancet', 8))
nmds_plot

figure_file <- file.path(figures_dir, 'nmds_relative_abundance.png')
ggsave(figure_file, nmds_plot, dpi = 300, height = 4, width = 5.5)

```

## 5.2 PCA

```{r}

# Calculate PCA with prcomp
pca <- prcomp(t(norm.matrix))

# Get eigenvalues
eigen <- get_eigenvalue(pca)


# Plot screeplot using the functions from factoextra

scree_plot <- fviz_eig(pca, addlabels = TRUE) +
  theme_bw() +
  theme(plot.title = element_text(face = 'bold', hjust = 0.5))
scree_plot 

figure_file <- file.path(figures_dir, 'screeplot.png')
ggsave(figure_file, scree_plot, dpi = 300)

# Plot cumulative variance plot

cumvar_plot <- plot_cumvar(eigen)
cumvar_plot

figure_file <- file.path(figures_dir, 'cumulative_variance.png')
ggsave(figure_file, cumvar_plot, dpi = 300)
```
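
The scree and cumulative variance plots are both derived from the PCA eigenvalues. As a sketch on random toy data (not the notebook's data), the percentages can be reproduced directly from a `prcomp()` result, since the eigenvalues are the squared standard deviations of the components.

```r
set.seed(1)
toy_data <- matrix(rnorm(40), nrow = 8)   # 8 toy samples x 5 toy features
toy_pca <- prcomp(toy_data)

# Eigenvalues are the squared standard deviations of the components
toy_eigen <- toy_pca$sdev^2

# Percentage of variance explained by each component, and its running total
toy_var_percent <- 100 * toy_eigen / sum(toy_eigen)
toy_cum_percent <- cumsum(toy_var_percent)

round(toy_var_percent, 1)
round(toy_cum_percent, 1)
```

The first vector corresponds to what a scree plot shows per component; the second is the cumulative curve, which always ends at 100%.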


```{r}

# Extract sample coordinates for PC1 and PC2
pca_coordinates <- as_tibble(pca$x)
pca_coordinates$SampleID <- rownames(pca$x)

# Merge with metadata
pca_coordinates <- left_join(pca_coordinates, metadata, by ='SampleID')

# Prepare axis labels for PCA

pc1 <- paste0('PC1 (', round(eigen$variance.percent[1], digits = 1), '%)')
pc2 <- paste0('PC2 (', round(eigen$variance.percent[2], digits = 1), '%)')

# Plot Individuals PCA

pca_plot <- plot_dotplot(pca_coordinates, PC1, PC2, SampleID, Time) +
  labs(title = 'PCA plot',
       x = pc1,
       y = pc2)  +
  scale_shape_manual(values = c(15, 16, 17, 18)) +
  scale_color_manual(values = get_palette('lancet', 8))

pca_plot

figure_file <- file.path(figures_dir, 'PCA-plot.png')
ggsave(figure_file, pca_plot, dpi = 300, height = 4, width = 5.5)

```

*Labeled data* obtained from Compound Discoverer is not gap-filled and contains many *zeroes*. For that reason, the **NMDS plot** is more informative than the PCA plot.


## 5.3 PERMANOVA

PERMANOVA (permutational multivariate analysis of variance using distance matrices) tests whether groups of samples differ, based on the distance matrix calculated above.

```{r Permanova}

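metadata_fix <- metadata %>% 
  filter(SampleID %in% c(label_metadata$SampleID, 'BNC')) %>% 
  column_to_rownames(var = 'SampleID')

# Put the metadata rows in the same sample order as the distance matrix;
# the rows are matched by position, not by name
metadata_fix <- metadata_fix[labels(dm), , drop = FALSE]

# adonis() is deprecated in recent versions of vegan; adonis2() replaces it.
# The distance method is already stored in dm, so it is not passed again
set.seed(456)
permanova <- adonis2(dm ~ Comp, 
                     data = metadata_fix, 
                     permutations = 999)
permanova

# adonis2() returns the results table directly; keep the term names as row names
table_file <- file.path(tables_dir, 'permanova.csv')
write.csv(as.data.frame(permanova), table_file, row.names = TRUE)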


```
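
For intuition about what the permutation test does, here is a minimal base R sketch of a one-factor PERMANOVA on simulated points. It is not vegan's implementation: `toy_pseudo_f()` and all the data are made up for illustration, and it only covers a single grouping factor.

```r
# Pseudo-F statistic for a single grouping factor, from a distance matrix
toy_pseudo_f <- function(d, groups) {
  d2 <- as.matrix(d)^2
  n <- nrow(d2)
  a <- length(unique(groups))
  # Total sum of squares: squared distances over all pairs, divided by n
  ss_total <- sum(d2[upper.tri(d2)]) / n
  # Within-group sum of squares, accumulated group by group
  ss_within <- sum(sapply(unique(groups), function(g) {
    idx <- which(groups == g)
    sub <- d2[idx, idx, drop = FALSE]
    sum(sub[upper.tri(sub)]) / length(idx)
  }))
  ss_among <- ss_total - ss_within
  (ss_among / (a - 1)) / (ss_within / (n - a))
}

# Two simulated groups of five samples with different means
set.seed(456)
toy_points <- rbind(matrix(rnorm(10, mean = 0), ncol = 2),
                    matrix(rnorm(10, mean = 3), ncol = 2))
toy_groups <- rep(c("A", "B"), each = 5)
toy_d <- dist(toy_points)

f_observed <- toy_pseudo_f(toy_d, toy_groups)

# Null distribution: recompute the statistic after shuffling the group labels
f_permuted <- replicate(999, toy_pseudo_f(toy_d, sample(toy_groups)))
p_value <- (sum(f_permuted >= f_observed) + 1) / (999 + 1)
p_value
```

The p-value is the share of permutations whose pseudo-F is at least as large as the observed one, counting the observed arrangement itself as one of the permutations.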