update

yufree · Jan 5, 2020 · 3dbd366 · 3dbd366
1 parent 9d96008
commit 3dbd366
Showing 18 changed files with 239 additions and 117 deletions.
diff --git a/01-introduction.Rmd b/01-introduction.Rmd
@@ -134,7 +134,7 @@ res <- gtrends(c("metabolomics", "metabolomics"), geo = c("CA","US"))
 plot(res)
 ```
 
-```{r rentrez,eval=F}
+```{r rentrez}
 library(rentrez)
 papers_by_year <- function(years, search_term){
     return(sapply(years, function(y) entrez_search(db="pubmed",term=search_term, mindate=y, maxdate=y, retmax=0)$count))

diff --git a/05-rawdata.Rmd b/05-rawdata.Rmd
@@ -87,6 +87,14 @@ According to Ronald Hites's simulation[@hites2019], measurements below the LOD (
 
 **CorrectOverloadedPeaks** could be used to correct the Peaks Exceeding the Detection Limit issue[@lisec2016].
 
+## RSD/fold change Filter
+
+Some peaks need to be ruled out due to high RSD% and small fold changes compared with blank samples.
+
+## Power Analysis Filter
+
+As shown in [Exprimental design(DoE)], power analysis in metabolomics is ad hoc because little is known before the experiment is performed. However, we can perform a power analysis after the experiment is done. That is, we simply rule out the peaks with low power under the existing experimental design.
+
 ## Software
 
 ### Peak picking
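
A minimal sketch of the RSD/fold change filter described in the hunk above, in base R. The peak tables `qc` and `blank`, the 30% RSD cutoff and the 3-fold cutoff are hypothetical values chosen for illustration, and dropping a peak that fails either check is only one possible reading of the text.

```r
# Hypothetical peak tables: rows are peaks, columns are injections
qc <- rbind(
  c(100, 102,  98, 101,  99),   # stable peak
  c(200, 210, 190, 205, 195),   # stable peak
  c( 50, 150,  20, 120,  90),   # unstable peak (high RSD%)
  c(300, 305, 295, 310, 290)    # stable, but also high in the blanks
)
blank <- rbind(
  c(  5,   6,   4),
  c( 10,  12,   8),
  c(  2,   3,   1),
  c(250, 260, 240)
)

# RSD% of each peak across the pooled QC injections
rsd <- apply(qc, 1, sd) / rowMeans(qc) * 100
# Fold change of the mean QC intensity over the mean blank intensity
fold <- rowMeans(qc) / rowMeans(blank)

# Assumed cutoffs: keep peaks with RSD% below 30 and fold change above 3
keep <- rsd < 30 & fold > 3
filtered <- qc[keep, , drop = FALSE]
```

With these made-up numbers the third peak fails the RSD check and the fourth fails the fold change check, so only the first two peaks remain in `filtered`.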

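The power analysis filter from the same hunk could be sketched as below, assuming a two-group design and a two-sample t-test. The helper `peak_power()`, the example intensities and the 0.8 power cutoff are hypothetical; `power.t.test()` comes from the base `stats` package.

```r
# Post hoc power of a two-sample t-test for one peak,
# using the observed difference and the pooled standard deviation
peak_power <- function(x, group, sig.level = 0.05) {
  group <- factor(group)
  n <- min(table(group))                       # per-group size (smallest group)
  means <- as.numeric(tapply(x, group, mean))
  vars <- as.numeric(tapply(x, group, var))
  delta <- abs(means[2] - means[1])            # observed mean difference
  sd_pooled <- sqrt(mean(vars))                # pooled SD, assuming equal group sizes
  power.t.test(n = n, delta = delta, sd = sd_pooled,
               sig.level = sig.level)$power
}

group <- rep(c("A", "B"), each = 5)
peaks <- rbind(
  strong = c(5.5, 4.2, 5.9, 4.6, 5.0, 8.6, 7.4, 8.3, 7.1, 8.0),
  weak   = c(5.5, 4.2, 5.9, 4.6, 5.0, 5.6, 4.4, 5.7, 4.9, 5.2)
)

power <- apply(peaks, 1, peak_power, group = group)
# Assumed cutoff: keep peaks whose post hoc power is at least 0.8
keep <- power >= 0.8
```

Here the `strong` peak passes the cutoff and the `weak` peak is ruled out.
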
diff --git a/06-normalization.Rmd b/06-normalization.Rmd
@@ -1,28 +1,117 @@
 # Peaks normalization
 
-## Peak misidentification
+## Batch effects classification
 
-- Isomer
+Variances among the samples across all the extracted peaks might be affected by factors other than the experimental design. There are three types of such batch effects: monotone, block and mixed.
 
-Use seperation methods such as chromatography, ion mobility MS, MS/MS. Reversed-phase ion-pairing chromatography and HILIC is useful and chemical derivatization is another options.
+- Monotone effects increase or decrease with the injection order or batches.
 
-- Interfering compounds
+```{r echo=FALSE,out.width='61.8%'}
+library(ggplot2)
+# increasing batch
+batch <- seq(0,9,length.out = 10)
+group <- factor(c(rep(1,5),rep(2,5)))
+ind <- factor(c(1:10))
+data <- cbind.data.frame(group,batch,ind)
+ggplot(data,aes(ind,batch,fill = group)) + geom_col()
+```
+
+- Block effects are systematic shifts among different batches.
+
+```{r echo=FALSE,out.width='61.8%'}
+# Block batch
+batch <- c(2,2,7,7,7,7,7,3,3,3)
+group <- factor(c(rep(1,5),rep(2,5)))
+ind <- factor(c(1:10))
+data <- cbind.data.frame(group,batch,ind)
+ggplot(data,aes(ind,batch,fill = group)) + geom_col()
+```
+
+- Mixed would be the combination of monotone and block batch effects.
+
+```{r echo=FALSE,out.width='61.8%'}
+# Mixed batch
+batch <- c(2,2,7,7,7,7,7,3,3,3) + seq(0,9,length.out = 10)
+group <- factor(c(rep(1,5),rep(2,5)))
+ind <- factor(c(1:10))
+data <- cbind.data.frame(group,batch,ind)
+ggplot(data,aes(ind,batch,fill = group)) + geom_col() 
+```
+
+Meanwhile, different compounds would suffer from different types of batch effects. In this case, the normalization or batch correction should be done peak by peak.
+
+```{r bem,echo=FALSE,out.width='70%'}
+getsample <- function(n){
+        batch <- NULL
+        for (i in 1:n){
+        batchin <- seq(1,10,length.out = 10) * rnorm(1)
+        batchde <- seq(10,1,length.out = 10) * rnorm(1)
+        batchblock <- c(rep(rnorm(1),2),rep(rnorm(1),5),rep(rnorm(1),3))
+        
+        batchtemp <- batchin*sample(c(0,1),1) + batchde*sample(c(0,1),1) + batchblock*sample(c(0,1),1)
+        
+        batch <- rbind(batch,batchtemp)
+        }
+        return(batch)
+}
+
+d <- getsample(30)
+df <- expand.grid(Compound = as.character(c(1:30)),index = as.character(c(1:10)))
+df$intensity <- c(d)
+ggplot(data = df, aes(x = index, y = Compound)) +
+  geom_tile(aes(fill = intensity)) 
+```
+
+## Batch effects visualization
 
-20ppm is the least resolution and accuracy
+Any correction might introduce bias. Before correcting, we need to make sure there are patterns that differ from our experimental design. Pooled QC samples should cluster together on the PCA score plot.
 
-- In-source degradation products
+## Source of batch effects
 
-## RSD Filter
+- Different Operators & Dates & Sequences
 
-Some peaks need to be rule out due to high RSD%. See [Exprimental design(DoE)]
+- Different instrumental conditions, such as different instrumental parameters, poor quality control, sample contamination during the analysis, column (pooled QC) and sample matrix effects (ion suppression and/or enhancement)
 
-## Power Analysis Filter
+- Unknown Unknowns
 
-As shown in [Exprimental design(DoE)], the power analysis in metabolomics is ad-hoc since you don't know too much before you perform the experiment. However, we could perform power analysis after the experiment done. That is, we just rule out the peaks with a lower power in exsit Exprimental design.
+## Avoid batch effects by DoE
 
-## Normalization
+You could avoid batch effects through experimental design. Cap the sequence with pooled QC samples and randomize the sample sequence. Some internal standards or instrumental QC might help to find the source of batch effects, although this is not practical for every compound in non-targeted analysis.
 
-Variances among the samples across all the extracted peaks might be affected by factors other than the experiment design. To make the samples comparable, normailization across the samples are always needed. There are more than 20 methods to make normalization. We could devided those methods into two category: unsupervised and supervised.
+Batch effects might not change the conclusion when the effect size is relatively small. Here is a simulation:
+
+```{r}
+set.seed(30)
+# real peaks without batch effects
+group <- factor(c(rep(1,5),rep(2,5)))
+con <- c(rnorm(5,5),rnorm(5,8))
+re <- t.test(con~group)
+# real peaks with a monotone batch effect along the injection order
+group <- factor(c(rep(1,5),rep(2,5)))
+con <- c(rnorm(5,5),rnorm(5,8))
+batch <- seq(0,5,length.out = 10)
+ins <- batch+con
+re <- t.test(ins~group)
+
+# same batch effect with a randomized injection order
+index <- sample(10)
+ins <- batch+con[index]
+re <- t.test(ins~group[index])
+```
+
+Randomization cannot guarantee the results. Here is a simulation:
+
+```{r}
+# real peaks with a batch effect that decreases along the injection order
+group <- factor(c(rep(1,5),rep(2,5)))
+con <- c(rnorm(5,5),rnorm(5,8))
+batch <- seq(5,0,length.out = 10)
+ins <- batch+con
+re <- t.test(ins~group)
+```
+
+## Post hoc data normalization
+
+To make the samples comparable, normalization across the samples is always needed once the experimental part is done. Batch effects should show patterns; otherwise they are just noise. Correction is possible through data analysis or a randomized experimental design. There are more than 20 normalization methods. We could divide those methods into two categories: unsupervised and supervised.
 
 Unsupervised methods only consider the peak intensity distribution across the samples. For example, quantile calibration tries to make the intensity distributions among the samples similar. Such methods are preferred for exploring the inner structure of the samples. Internal standards or pooled QC samples also belong to this category. However, it's hard to let a few peaks stand for all the extracted peaks.
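
The idea of making the intensity distributions similar across samples can be sketched with a plain quantile normalization in base R. The function `quantile_normalize()` and the small example matrix are hypothetical, and this is a generic textbook version rather than the method of any particular package.

```r
# Quantile normalization: peaks in rows, samples in columns.
# Every sample ends up with the same set of intensity values,
# namely the mean of each quantile across all samples.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(x, 2, sort)                        # sort each sample
  target <- rowMeans(sorted)                         # mean intensity per quantile
  out <- apply(ranks, 2, function(r) target[r])      # map ranks back to targets
  dimnames(out) <- dimnames(x)
  out
}

x <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8), ncol = 3)
qn <- quantile_normalize(x)

# After normalization the sorted values of every sample are identical
sorted_qn <- apply(qn, 2, sort)
```

Ties are broken by order of appearance here to keep the sketch short; averaging tied ranks is a common alternative.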
 
@@ -253,6 +342,8 @@ This method's performance is similar to SVA. Instead find surrogate variable fro
 
 RRmix also uses a latent factor model to correct the data[@jr2017]. This method could be treated as a linear mixed model version of SVA. No control samples are required and the unwanted variances could be removed by factor analysis. This method might be the best choice to remove the unwanted variables with a common experimental design.
 
+## Method to validate the normalization
+
 ## Software
 
 - [BatchCorrMetabolomics](https://github.com/rwehrens/BatchCorrMetabolomics) is for improved batch correction in untargeted MS-based metabolomics
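
The note in this file that pooled QC samples should cluster on the PCA score plot suggests one simple check, sketched here with `prcomp()` on simulated data. The sample counts, noise levels and the `spread()` helper are assumptions made for illustration only.

```r
set.seed(7)
n_peaks <- 50
# Simulated, hypothetical data: 8 study samples with biological variation
# and 4 pooled QC injections that should be nearly identical
bio <- matrix(rnorm(8 * n_peaks, sd = 2), nrow = 8)
qc <- matrix(rnorm(4 * n_peaks, sd = 0.2), nrow = 4)
x <- rbind(bio, qc)
type <- c(rep("sample", 8), rep("QC", 4))

# PCA on centered and scaled peak intensities; keep the first two scores
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Mean pairwise distance on the score plot as a rough measure of clustering
spread <- function(s) mean(dist(s))
qc_spread <- spread(scores[type == "QC", ])
sample_spread <- spread(scores[type == "sample", ])

plot(scores, col = ifelse(type == "QC", "red", "black"), pch = 19)
```

A `qc_spread` much smaller than `sample_spread` means the pooled QC injections sit close together on the score plot. A QC spread comparable to the sample spread would hint at batch effects.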

diff --git a/07-annotation.Rmd b/07-annotation.Rmd
@@ -20,6 +20,18 @@ The major issue in annotation is the redundancy peaks from same metabolite. Unli
 
 Another issue is about the MS/MS database. Only 10% of known metabolites in databases have experimental spectral data. Thus **in silico** predictions are required. Some works try to bridge the gap between experimental data, theoretical values (from chemical databases like ChemSpider) and predictions. Here is a nice review about MS/MS prediction[@hufsky2014].
 
+## Peak misidentification
+
+- Isomer
+
+Use separation methods such as chromatography, ion mobility MS, or MS/MS. Reversed-phase ion-pairing chromatography and HILIC are useful. Chemical derivatization is another option.
+
+- Interfering compounds
+
+20 ppm is the minimum resolution and accuracy for HRMS.
+
+- In-source degradation products
+
 ## Annotation v.s. identification
 
 According to the definition from the Chemical Analysis Working Group of the Metabolomics Standards Initiative[@sumner2007;@viant2017], four levels of confidence could be assigned to identification:
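
The 20 ppm figure in the peak misidentification section above refers to relative mass error, which can be computed as (observed - theoretical) / theoretical × 10^6. A minimal sketch follows. The helper names and the m/z values are hypothetical, and the 20 ppm default simply mirrors the number quoted in the text.

```r
# Relative mass error in parts per million
ppm_error <- function(observed, theoretical) {
  (observed - theoretical) / theoretical * 1e6
}

# TRUE when the observed m/z lies within the tolerance (in ppm)
within_tol <- function(observed, theoretical, tol = 20) {
  abs(ppm_error(observed, theoretical)) <= tol
}

theoretical <- 200.0000              # hypothetical theoretical m/z
observed <- c(200.0030, 200.0060)    # hypothetical measured m/z values

errors <- ppm_error(observed, theoretical)   # about 15 and 30 ppm
matches <- within_tol(observed, theoretical)
```

With these values the first measurement falls inside the 20 ppm window and the second falls outside it, so an interfering compound 30 ppm away would not be matched to the same candidate.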

diff --git a/09-statistics.Rmd b/09-statistics.Rmd
@@ -119,12 +119,6 @@ Sparse PLS discriminant analysis(sPLS-DA) make a L1 penal on the variable select
 
 For o-PLS-DA, s-plot could be used to find features.[@wiklund2008]
 
-## Self-organizing map
-
-## Canonical correlation analysis 
-
-Find the correlationship between two datasets.
-
 ## Software
 
 - [MetaboAnalystR](https://github.com/xia-lab/MetaboAnalystR)