update
yufree committed Jan 5, 2020
1 parent 9d96008 commit 3dbd366
Showing 18 changed files with 239 additions and 117 deletions.
2 changes: 1 addition & 1 deletion 01-introduction.Rmd
@@ -134,7 +134,7 @@ res <- gtrends(c("metabolomics", "metabolomics"), geo = c("CA","US"))
plot(res)
```

```{r rentrez,eval=F}
```{r rentrez}
library(rentrez)
papers_by_year <- function(years, search_term){
return(sapply(years, function(y) entrez_search(db="pubmed",term=search_term, mindate=y, maxdate=y, retmax=0)$count))
8 changes: 8 additions & 0 deletions 05-rawdata.Rmd
@@ -87,6 +87,14 @@ According to Ronald Hites's simulation[@hites2019], measurements below the LOD (

**CorrectOverloadedPeaks** could be used to correct the Peaks Exceeding the Detection Limit issue[@lisec2016].

## RSD/fold change Filter

Some peaks need to be ruled out because of a high RSD% or a small fold change relative to blank samples.
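
A minimal sketch of such a filter in base R. The peak table, the column naming (`qc`, `blank`, `s`) and both cutoffs (RSD% at most 30 in pooled QC injections, at least 3-fold over blanks) are illustrative assumptions, not values taken from this book:

```r
# Hypothetical peak table: 4 peaks (rows) x 9 injections (columns)
peaks <- rbind(
  stable   = c(100, 102,  98,  5,  6,  4, 110,  95, 105),
  noisy    = c(100,  40, 160,  5,  6,  4, 120,  60, 140),
  blank_bg = c(100, 101,  99, 90, 95, 92, 100, 102,  98),
  both_bad = c( 50, 120,  10, 60, 70, 65,  55, 100,  20)
)
colnames(peaks) <- c(paste0("qc", 1:3), paste0("blank", 1:3), paste0("s", 1:3))

# split the table by injection type (naming scheme is an assumption)
qc    <- peaks[, grep("^qc", colnames(peaks)), drop = FALSE]
blank <- peaks[, grep("^blank", colnames(peaks)), drop = FALSE]
samp  <- peaks[, grep("^s", colnames(peaks)), drop = FALSE]

# RSD% of each peak across the pooled QC injections
rsd <- apply(qc, 1, sd) / rowMeans(qc) * 100
# fold change of the mean sample intensity over the mean blank intensity
fc <- rowMeans(samp) / rowMeans(blank)

# assumed cutoffs: keep peaks with RSD% <= 30 and fold change >= 3
keep <- rsd <= 30 & fc >= 3
filtered <- peaks[keep, , drop = FALSE]
```

Both cutoffs are study dependent; stricter or looser values only change the `keep` line.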

## Power Analysis Filter

As shown in [Exprimental design(DoE)], power analysis in metabolomics is ad hoc because little is known before the experiment is performed. However, power analysis can be performed after the experiment is done. That is, we simply rule out the peaks with low power under the existing experimental design.
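
One way to sketch this filter uses `power.t.test()` from base R on a two-group design. The simulated peaks, the helper name `peak_power` and the 0.8 power cutoff are assumptions made for illustration:

```r
set.seed(1)
group <- factor(rep(c("a", "b"), each = 5))
# simulated peak table: peak1 has a large group difference, peak2 has none
peaks <- rbind(
  peak1 = c(rnorm(5, 5), rnorm(5, 8)),
  peak2 = rnorm(10, 5)
)

# hypothetical helper: observed power of a two-sample t-test for one peak
peak_power <- function(x, group, sig.level = 0.05) {
  m <- tapply(x, group, mean)
  s <- tapply(x, group, sd)
  n <- min(table(group))            # samples per group
  pooled_sd <- sqrt(mean(s^2))      # pooled SD for equal group sizes
  power.t.test(n = n, delta = abs(diff(m)), sd = pooled_sd,
               sig.level = sig.level)$power
}

pw <- apply(peaks, 1, peak_power, group = group)
# assumed cutoff: keep peaks with power of at least 0.8
keep <- pw >= 0.8
```

The helper treats the observed difference and standard deviation as if they were the true values, so the result is only a rough ranking of peaks under the existing design.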

## Software

### Peak picking
115 changes: 103 additions & 12 deletions 06-normalization.Rmd
@@ -1,28 +1,117 @@
# Peaks normalization

## Peak misidentification
## Batch effects classification

- Isomer
Variances among the samples across all the extracted peaks might be affected by factors other than the experimental design. There are three types of such batch effects: monotone, block and mixed.

Use separation methods such as chromatography, ion mobility MS or MS/MS. Reversed-phase ion-pairing chromatography and HILIC are useful, and chemical derivatization is another option.
- Monotone effects increase or decrease with the injection order or batch.

- Interfering compounds
```{r echo=FALSE,out.width='61.8%'}
library(ggplot2)
# increasing batch
batch <- seq(0,9,length.out = 10)
group <- factor(c(rep(1,5),rep(2,5)))
ind <- factor(c(1:10))
data <- cbind.data.frame(group,batch,ind)
ggplot(data,aes(ind,batch,fill = group)) + geom_col()
```

- Block effects are systematic shifts among different batches.

```{r echo=FALSE,out.width='61.8%'}
# Block batch
batch <- c(2,2,7,7,7,7,7,3,3,3)
group <- factor(c(rep(1,5),rep(2,5)))
ind <- factor(c(1:10))
data <- cbind.data.frame(group,batch,ind)
ggplot(data,aes(ind,batch,fill = group)) + geom_col()
```

- Mixed effects are a combination of monotone and block batch effects.

```{r echo=FALSE,out.width='61.8%'}
# Mixed batch
batch <- c(2,2,7,7,7,7,7,3,3,3) + seq(0,9,length.out = 10)
group <- factor(c(rep(1,5),rep(2,5)))
ind <- factor(c(1:10))
data <- cbind.data.frame(group,batch,ind)
ggplot(data,aes(ind,batch,fill = group)) + geom_col()
```

Meanwhile, different compounds can suffer from different types of batch effects. In this case, the normalization or batch correction should be done peak by peak.

```{r bem,echo=FALSE,out.width='70%'}
getsample <- function(n){
  batch <- NULL
  for (i in 1:n){
    batchin <- seq(1,10,length.out = 10) * rnorm(1)
    batchde <- seq(10,1,length.out = 10) * rnorm(1)
    batchblock <- c(rep(rnorm(1),2),rep(rnorm(1),5),rep(rnorm(1),3))
    batchtemp <- batchin*sample(c(0,1),1) + batchde*sample(c(0,1),1) + batchblock*sample(c(0,1),1)
    batch <- rbind(batch,batchtemp)
  }
  return(batch)
}
d <- getsample(30)
df <- expand.grid(Compound = as.character(c(1:30)),index = as.character(c(1:10)))
df$intensity <- c(d)
ggplot(data = df, aes(x = index, y = Compound)) +
  geom_tile(aes(fill = intensity))
```

## Batch effects visualization

20ppm is the least resolution and accuracy
Any correction might introduce bias, so we first need to make sure that the data show patterns that differ from our experimental design. Pooled QC samples should cluster together on the PCA score plot.
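
A minimal sketch of this check with `prcomp()` on simulated data. The sample numbers, the noise levels and the way the pooled QC is simulated (the mean of all samples plus small analytical noise) are assumptions for illustration:

```r
set.seed(42)
# simulated data: 30 peaks (columns) for 10 samples and 4 pooled QC injections
n_peak <- 30
samples <- matrix(rnorm(10 * n_peak, mean = 10, sd = 2), nrow = 10)
# pooled QC: the average of all samples plus small analytical noise
qc <- matrix(rep(colMeans(samples), each = 4), nrow = 4) +
  matrix(rnorm(4 * n_peak, sd = 0.1), nrow = 4)
x <- rbind(samples, qc)
type <- factor(c(rep("sample", 10), rep("QC", 4)))

# PCA on centered and scaled peak intensities
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- as.data.frame(pca$x[, 1:2])
scores$type <- type

# score plot: pooled QC injections should form a tight cluster
plot(scores$PC1, scores$PC2, col = as.integer(scores$type), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(scores$type),
       col = seq_along(levels(scores$type)), pch = 19)
```

If the QC points drift along a component or split into groups, that pattern is unrelated to the design and points to a batch effect.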

- In-source degradation products
## Source of batch effects

## RSD Filter
- Different Operators & Dates & Sequences

Some peaks need to be ruled out due to high RSD%. See [Exprimental design(DoE)]
- Different instrumental conditions, such as different instrumental parameters, poor quality control, sample contamination during the analysis, column (pooled QC) and sample matrix effects (ion suppression and/or enhancement)

## Power Analysis Filter
- Unknown Unknowns

As shown in [Exprimental design(DoE)], power analysis in metabolomics is ad hoc because little is known before the experiment is performed. However, power analysis can be performed after the experiment is done. That is, we simply rule out the peaks with low power under the existing experimental design.
## Avoid batch effects by DoE

## Normalization
You can avoid batch effects through experimental design. Cap the sequence with pooled QC samples and randomize the sample sequence. Some internal standards or instrumental QC might help to find the source of batch effects, although this is not practical for every compound in non-targeted analysis.
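
A minimal sketch of such a run order in base R. The study layout (five control and five treated samples), the sample names and the QC interval `qc_every` are hypothetical:

```r
set.seed(2020)
# hypothetical study: 5 control and 5 treated samples
samples <- data.frame(
  id = paste0("s", 1:10),
  group = rep(c("control", "treated"), each = 5)
)

# randomize the injection order so that group is not confounded with run order
run <- samples[sample(nrow(samples)), ]

# cap the sequence with pooled QC: one injection at the start and one
# after every qc_every samples (assumed interval)
qc_every <- 5
qc_row <- data.frame(id = "pooledQC", group = "QC")
blocks <- split(run, ceiling(seq_len(nrow(run)) / qc_every))
run_order <- do.call(rbind, c(
  list(qc_row),
  unname(lapply(blocks, function(b) rbind(b, qc_row)))
))
rownames(run_order) <- NULL
run_order$order <- seq_len(nrow(run_order))
```

The resulting `run_order` starts and ends with a pooled QC injection, so drift across the sequence can be followed on the QC injections alone.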

Variances among the samples across all the extracted peaks might be affected by factors other than the experimental design. To make the samples comparable, normalization across the samples is always needed. There are more than 20 normalization methods. We can divide those methods into two categories: unsupervised and supervised.
Batch effects might not change the conclusion when the effect size is relatively small. Here is a simulation:

```{r}
set.seed(30)
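# real peaks: two groups of five samples with true means 5 and 8
group <- factor(c(rep(1,5),rep(2,5)))
con <- c(rnorm(5,5),rnorm(5,8))
re <- t.test(con~group)
# real peaks plus a monotone batch effect that increases with injection order
group <- factor(c(rep(1,5),rep(2,5)))
con <- c(rnorm(5,5),rnorm(5,8))
batch <- seq(0,5,length.out = 10)
ins <- batch+con
re <- t.test(ins~group)
# same peaks and batch effect, but with a randomized injection order
index <- sample(10)
ins <- batch+con[index]
re <- t.test(ins~group[index])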
```

Randomization cannot guarantee the results. Here is a simulation:

```{r}
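# real peaks plus a monotone batch effect that decreases with injection order
# and therefore opposes the group difference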
group <- factor(c(rep(1,5),rep(2,5)))
con <- c(rnorm(5,5),rnorm(5,8))
batch <- seq(5,0,length.out = 10)
ins <- batch+con
re <- t.test(ins~group)
```

## Post hoc data normalization

To make the samples comparable, normalization across the samples is always needed once the experimental part is done. A batch effect should show a pattern; otherwise it is just noise. Correction is possible through data analysis or a randomized experimental design. There are more than 20 normalization methods. We can divide those methods into two categories: unsupervised and supervised.

Unsupervised methods only consider the distribution of peak intensities across the samples. For example, quantile calibration tries to make the intensity distributions of the samples similar. Such methods are preferred for exploring the inner structure of the samples. Internal standards and pooled QC samples also belong to this category. However, it's hard to take a few peaks as representative of all the extracted peaks.
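
As an illustration of the quantile idea, here is a minimal base R sketch of standard quantile normalization, which may differ in detail from the method the text has in mind. The function name `quantile_normalize` and the simulated log-normal intensities are assumptions:

```r
# quantile normalization of a peak table (peaks in rows, samples in columns)
quantile_normalize <- function(x) {
  # rank of each peak within its own sample
  ranks <- apply(x, 2, rank, ties.method = "first")
  # reference distribution: mean of the sorted intensities across samples
  ref <- rowMeans(apply(x, 2, sort))
  # replace each intensity by the reference value with the same rank
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(3)
# three simulated samples with different overall intensity levels
x <- cbind(s1 = rlnorm(50, 5), s2 = rlnorm(50, 5.5), s3 = rlnorm(50, 4.5))
xn <- quantile_normalize(x)
```

After normalization every sample shares the same intensity distribution, while the rank of each peak within a sample is unchanged.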

@@ -253,6 +342,8 @@ This method's performance is similar to SVA. Instead find surrogate variable fro

RRmix also uses a latent factor model to correct the data[@jr2017]. This method can be treated as a linear mixed model version of SVA. No control samples are required, and the unwanted variance can be removed by factor analysis. This method might be the best choice for removing unwanted variables under a common experimental design.

## Method to validate the normalization

## Software

- [BatchCorrMetabolomics](https://github.com/rwehrens/BatchCorrMetabolomics) is for improved batch correction in untargeted MS-based metabolomics
12 changes: 12 additions & 0 deletions 07-annotation.Rmd
@@ -20,6 +20,18 @@ The major issue in annotation is the redundancy peaks from same metabolite. Unli

Another issue concerns the MS/MS databases. Only 10% of known metabolites in databases have experimental spectral data, so **in silico** prediction is required. Some works try to bridge the gap between experimental data, theoretical values (from chemical databases like ChemSpider) and predictions. Here is a nice review about MS/MS prediction[@hufsky2014].

## Peak misidentification

- Isomer

Use separation methods such as chromatography, ion mobility MS or MS/MS. Reversed-phase ion-pairing chromatography and HILIC are useful. Chemical derivatization is another option.

- Interfering compounds

20 ppm is the minimum resolution and accuracy for HRMS.

- In-source degradation products

## Annotation vs. identification

According to the definition from the Chemical Analysis Working Group of the Metabolomics Standards Initiative[@sumner2007;@viant2017], four levels of confidence can be assigned to identification:
6 changes: 0 additions & 6 deletions 09-statistics.Rmd
@@ -119,12 +119,6 @@ Sparse PLS discriminant analysis(sPLS-DA) make a L1 penal on the variable select

For o-PLS-DA, s-plot could be used to find features.[@wiklund2008]

## Self-organizing map

## Canonical correlation analysis

Find the correlation between two datasets.

## Software

- [MetaboAnalystR](https://github.com/xia-lab/MetaboAnalystR)