diff --git a/README.md b/README.md index 5d1b189..6f108ec 100644 --- a/README.md +++ b/README.md @@ -1,22 +1,41 @@ ## Introduction -The Feature-Based Molecular Networking (FBMN) workflow is a computational method that bridges popular mass spectrometry data processing tools for LC-MS/MS and molecular networking analysis on [GNPS](http://gnps.ucsd.edu). The tools supported are: MZmine2, OpenMS, MS-DIAL, MetaboScape, XCMS, and the mzTab-M format. - -The main documentation for Feature-Based Molecular Networking with XCMS can be accessed [here](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-xcms3/). See our preprint Nothias, L.F. et al [bioRxiv 812404 (2019)](https://www.biorxiv.org/content/10.1101/812404v1). - -This repository contains example scripts in Python and R showing how `XCMS` version >= 3 (XCMS3) can be used for the -FBMN workflow in GNPS using a subset of samples of the American Gut Project ([MSV000082678](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=de2d18fd91804785bce8c225cc94a444)) and soil bacteria ([MSV000079204](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=d74ca92d9dec4e2883f28506c670e3ca)). +The Feature-Based Molecular Networking (FBMN) workflow is a computational method +that bridges popular mass spectrometry data processing tools for LC-MS/MS and +molecular networking analysis on [GNPS](http://gnps.ucsd.edu). The tools +supported are: MZmine2, OpenMS, MS-DIAL, MetaboScape, XCMS, and the mzTab-M +format. + +The main documentation for Feature-Based Molecular Networking with XCMS can be +accessed +[here](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking-with-xcms3/). See +also the publication [Nothias, L.F. et al Nat Methods 2020 +Sep;17(9):905-908](https://doi.org/10.1038/s41592-020-0933-6). + +This repository contains example scripts in Python and R showing how +[*xcms*](https://bioconductor.org/packages/xcms) (version >= 4) can be used for +the FBMN workflow in GNPS using a subset of samples of the American Gut Project +([MSV000082678](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=de2d18fd91804785bce8c225cc94a444)) +and soil bacteria +([MSV000079204](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=d74ca92d9dec4e2883f28506c670e3ca)). + +Note: this is an updated version of the workflow to use the newer result objects +and more efficient mass spectrometry (MS) data infrastructure introduced with +*xcms* version 4. For the version with *xcms* < 4 see +[here](https://github.com/DorresteinLaboratory/XCMS3_FeatureBasedMN/tree/1e6f0f36cd9a24eb2ad2b126ad45083fcbde0a72). ### Installation of the XCMS-GNPS workflow for FBMN -Install the latest version of XCMS3 (version >= 3.4) from Bioconductor in R -with: +Install R version >= 4.5. Install the latest version of *xcms* from Bioconductor +in R with: ``` install("BiocManager") BiocManager::install("xcms") ``` -For more information, also refer to the [xcms Bioconductor package](https://www.bioconductor.org/packages/release/bioc/html/xcms.html). + +For more information, also refer to the [xcms Bioconductor +package](https://www.bioconductor.org/packages/release/bioc/html/xcms.html). For utility functions specific to this workflow refer to the Github repository: [https://github.com/jorainer/xcms-gnps-tools](https://github.com/jorainer/xcms-gnps-tools). @@ -25,7 +44,8 @@ For utility functions specific to this workflow refer to the Github repository: This work builds on the efforts of our many colleagues, please cite their work: -Nothias, L.F. et al [Feature-based Molecular Networking in the GNPS Analysis Environment](https://www.biorxiv.org/content/10.1101/812404v1) bioRxiv 812404 (2019). +[Nothias, L.F. et al Nat Methods 2020 +Sep;17(9):905-908](https://doi.org/10.1038/s41592-020-0933-6) [https://github.com/sneumann/xcms](https://github.com/sneumann/xcms) @@ -38,9 +58,12 @@ mass spectrometry data for metabolite profiling using nonlinear peak alignment, ### Running Feature Based Molecular Networking on GNPS -The main documentation for Feature-Based Molecular Networking with GNPS can be accessed [here](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/). +The main documentation for Feature-Based Molecular Networking with GNPS can be +accessed +[here](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/). ### Contributions -The XCMS-GNPS integration was developed by Johannes Rainer and Michael Witting, in coordination with Louis-Félix Nothias and Daniel Petras. -This tutorial was prepared by Madeleine Ernst and Ricardo da Silva. +The XCMS-GNPS integration was developed by Johannes Rainer and Michael Witting, +in coordination with Louis-Félix Nothias and Daniel Petras. This tutorial was +prepared by Madeleine Ernst and Ricardo da Silva. diff --git a/XCMS3_Preprocessing.Rmd b/XCMS3_Preprocessing.Rmd index fc340c6..6404199 100644 --- a/XCMS3_Preprocessing.Rmd +++ b/XCMS3_Preprocessing.Rmd @@ -1,31 +1,37 @@ ## Install required packages -Note: This script is optimized for the latest R version (>3.6). Be sure to get -the most recent version of R before running. The code below installs the -required packages and all needed dependencies. Note: You don't have to run this -chunk if the packages are already installed. +Note: This script is optimized for the latest R version (>=4.5) and *xcms* +version (>= 4.6). Be sure to get the most recent version of R before +running. The code below installs the required packages and all needed +dependencies. Note: You don't have to run this chunk if the packages are already +installed. ```{r, eval = FALSE, message = FALSE} if (!requireNamespace("BiocManager")) install.packages("BiocManager") -BiocManager::install(c("xcms", "CAMERA")) +BiocManager::install(c("xcms", "CAMERA", "Spectra", + "MsExperiment", "MsBackendMgf")) ``` -## Preprocess your data using XCMS3 and export data files for feature-based molecular networking through GNPS +## Preprocess your data using *xcms* and export data files for feature-based molecular networking through GNPS To follow this example tutorial, download the folder named *peak/AMG_Plant_subset* from: [here](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=de2d18fd91804785bce8c225cc94a444) -Note that the settings for `xcms` used in this tutorial were not optimized, +Note that the settings for *xcms* used in this tutorial were not optimized, specifically the alignment based on the default *obiwarp* parameters might -perform a little to strong retention time adjustment. -For more information on optimization of the parameters see the [xcms vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html) -or the [LC-MS data pre-processing with xcms](https://github.com/jorainer/metabolomics2018) workshop. +perform a little to strong retention time adjustment. For more information on +optimization of the parameters see the [xcms +vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html) +or the [Metabonaut +tutorials](https://rformassspectrometry.github.io/Metabonaut/). Load required libraries and utility functions for GNPS export. ```{r, message = FALSE} +library(MsExperiment) +library(Spectra) library(xcms) source("https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R") ``` @@ -53,16 +59,18 @@ we assign all samples to the same group. This should be changed according to the experimental setup. ```{r} -mzMLfiles <- paste0('AMG_Plant_subset/', - list.files(path = 'AMG_Plant_subset/', - pattern = ".mzXML$", recursive = TRUE)) -s_groups <- rep("sample", length(mzMLfiles)) +mzxml_files <- dir("AMG_Plant_subset/", pattern = ".mzXML", + full.names = TRUE, recursive = TRUE) pheno <- data.frame( - sample_name = sub(basename(mzMLfiles), - pattern = ".mzML",replacement = "", fixed = TRUE), - sample_group = s_groups, stringsAsFactors = FALSE) + sample_name = sub(basename(mzxml_files), pattern = ".mzXML", + replacement = "", fixed = TRUE), + sample_group = rep("sample", length(mzxml_files))) + ``` +The first 6 lines of this `data.frame` describing the experiment, respectively +files is shown below. + ```{r} head(pheno) ``` @@ -70,21 +78,22 @@ head(pheno) Read all raw data (which includes MS1 and MS2 spectra). ```{r} -rawData <- readMSData(mzMLfiles, centroided. = TRUE, mode = "onDisk", - pdata = new("NAnnotatedDataFrame", pheno)) +rawData <- readMsExperiment(mzxml_files, sampleData = pheno) + ``` Create a base peak chromatogram (BPC) for visual inspection. -```{r, fig.width = 12, fig.height = 6, fig.cap = "Base peak chromatogram."} +```{r, fig.width = 12, fig.height = 6, fig.cap = "Base peak chromatogram.", message = FALSE} bpis <- chromatogram(rawData, aggregationFun = "max") plot(bpis) ``` -### Peak picking +### Chromatographic peak detection -Define settings for the centWave peak detection. As mentioned in the -introduction, these settings should always be adapted to the analyzed data set. +Define settings for the *centWave* chromatographic peak detection algorithm. As +mentioned in the introduction, these settings should always be adapted to the +analyzed data set. ```{r} cwp <- CentWaveParam(snthresh = 5, noise = 1000, peakwidth = c(3, 30), ppm = 20) @@ -118,14 +127,13 @@ Plot the difference between adjusted and raw retention times. plotAdjustedRtime(processedData) ``` - -### Peak grouping +### Peak grouping (correspondence analysis) Define the parameters for the *peak density*-based peak grouping (correspondence analysis). ```{r, message = FALSE, warning = FALSE} -pdp <- PeakDensityParam(sampleGroups = processedData$sample_group, +pdp <- PeakDensityParam(sampleGroups = sampleData(processedData)$sample_group, minFraction = 0.10) processedData <- groupChromPeaks(processedData, param = pdp) ``` @@ -133,8 +141,8 @@ processedData <- groupChromPeaks(processedData, param = pdp) ### Gap filling Fill-in missing peaks. Peak detection might have failed for some features in -some samples. The `fillChromPeaks` function allows to integrate for such cases -all signal in the respective m/z - retention time range. `xcms` supports two +some samples. The `fillChromPeaks()` function allows to integrate for such cases +all signal in the respective m/z - retention time range. *xcms* supports two different approaches to define the region from which signal should be retrieved for gap filling: the original version (`FillChromPeaksParam`) integrates signal from the apex position of detected chromatographic peaks, allowing to extend @@ -152,22 +160,23 @@ processed_Data <- fillChromPeaks(processedData, param = ChromPeakAreaParam()) ### Export data -#### export MS1 and MS2 features +#### Export MS1 feature abundances and related MS2 spectra -Below we use the `featureSpectra` function to extract all MS2 spectra with their -precursor m/z being within the m/z range of a feature/peak and their retention -time within the rt range of the same feature/peak. Note that for older `xcms` -versions (i.e. before version 3.12) `return.type = "Spectra"` has to be used -instead of `return.type = "MSpectra"` as in the example below. Zero-intensity -values are removed from each spectrum with the `clean` function, and -subsequently processed into the expected format using the `formatSpectraForGNPS` -function. +Below we use the `featureSpectra()` function to extract all MS2 spectra with +their precursor m/z being within the m/z range of a feature/peak and their +retention time within the rt range of the same feature/peak. Zero intensity +peaks are then removed from each spectrum using the `filterIntensity()` function +and empty spectra (i.e., MS2 spectra without fragment peaks) with the +`filterEmptySpectra()` function. At last, the `formatSpectraForGNPS()` function +from the [xcms-gnps-tools](https://github.com/jorainer/xcms-gnps-tools) +repository is called to replace the acquisition number of the spectra with an +index representing the feature ID. ```{r} -## export the individual spectra into a .mgf file -filteredMs2Spectra <- featureSpectra(processedData, return.type = "MSpectra") -filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE) -filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra) +feature_ms2 <- featureSpectra(processedData, return.type = "Spectra") +feature_ms2 <- filterIntensity(feature_ms2, intensity = 0) +feature_ms2 <- filterEmptySpectra(feature_ms2) +feature_ms2 <- formatSpectraForGNPS(feature_ms2) ``` The extracted MS2 spectra are saved as *ms2spectra_all.mgf* file. This file can @@ -175,16 +184,18 @@ for example be used to do *in silico* structure prediction through [SIRIUS+CSI:FingerID](https://bio.informatik.uni-jena.de/software/sirius/). ```{r} -writeMgfData(filteredMs2Spectra, "ms2spectra_all.mgf") +library(MsBackendMgf) +export(backend = MsBackendMgf(), + feature_ms2, file = "ms2spectra_all.mgf") ``` Export peak area quantification table. To this end we first extract the *feature definitions* (i.e. the m/z and retention time ranges and other metadata for all defined features in the data set) and then the integrated peak areas (with the -`featureValues` function). This peak area quantification table contains features -and respective per sample peak areas in columns. The combined data is then saved -to the file *xcms_all.txt*. Note that it is now also possible to use the entire -feature table in the FBMN workflow. +`featureValues()` function). This peak area quantification table contains +features and respective per sample peak areas in columns. The combined data is +then saved to the file *xcms_all.txt*. Note that it is now also possible to use +the entire feature table in the FBMN workflow. ```{r} ## get feature definitions and intensities @@ -194,6 +205,7 @@ featuresIntensities <- featureValues(processedData, value = "into") ## generate data table dataTable <- merge(featuresDef, featuresIntensities, by = 0, all = TRUE) dataTable <- dataTable[, !(colnames(dataTable) %in% c("peakidx"))] +colnames(dataTable)[1L] <- "feature_id" ``` ```{r} @@ -205,36 +217,43 @@ write.table(dataTable, "xcms_all.txt", sep = "\t", quote = FALSE, row.names = FALSE) ``` +#### Export data for features with available MS2 spectra -#### Export MS2 features only - -The `filteredMs2Spectra` contains all MS2 spectra with their precursor m/z +The `feature_ms2` contains all MS2 spectra with their precursor m/z within the feature's m/z range and a retention time that is within the retention time of the chromatographic peak/feature. We thus have multiple MS2 spectra for -each feature (also from each sample). Metadata column `"feature_id"` indicates +each feature (also from each sample). Metadata column `"FEATURE_ID"` indicates to which feature a MS2 spectrum belongs: ```{r} -filteredMs2Spectra +head(feature_ms2$FEATURE_ID) ``` We next select a single MS2 spectrum for each feature and export this reduced -set also as an .mgf file. We use the `combineSpectra` function on the list of -spectra and specify with `fcol = "feature_id"` how the spectra are grouped -(i.e. all spectra with the same feature id are processed together). On the set -of spectra of the same feature we apply the `maxTic` function that simply -returns the spectrum with the largest sum of intensities. We thus select with -the code below the spectrum with the largest total signal as the -*representative* MS2 spectrum for each feature. We're specifically calling the -`combineSpectra` method from the `MSnbase` package because also the `Spectra` R -package would provide one, which has however different parameters. +set also as an .mgf file. We use the `Spectra::combineSpectra()` function to +*combine* spectra with the same value for parameter `f` into one. In the example +below we use the `maxTicPeaksData()` function from the +[xcms-gnps-tools](https://github.com/jorainer/xcms-gnps-tools) repository to +report for each feature only the fragment peaks matrix with the largest TIC. As +an alternative, the `combineSpectra()` method would also allow to create a +*representative* spectrum combining all fragment peaks or reporting fragment +peaks present in e.g. 80% of related fragment spectra (see next section for an +example). The sets of fragment spectra assigned to a feature have to be declared +with parameter `f` (which in our example is set to `f = +feature_ms2$FEATURE_ID`, defining for each spectrum the feature it is +related to). In addition, we have to change parameter `p` (which defines how to +split and process the `Spectra` object in parallel) from the default `p = +dataStorage(object)` to `p = rep(1, length(feature_ms2))` hence switching +from a per-mzML-file-based processing to a processing that allows to select a +single representative fragment spectrum for feature across all data files. -```{r} +```{r, message = FALSE} ## Select for each feature the Spectrum2 with the largest TIC. -filteredMs2Spectra_maxTic <- MSnbase::combineSpectra( - filteredMs2Spectra, - fcol = "feature_id", - method = maxTic) +feature_ms2_maxtic <- combineSpectra( + feature_ms2, f = feature_ms2$FEATURE_ID, + p = rep(1, length(feature_ms2)), FUN = maxTicPeaksData) +spectraNames(feature_ms2_maxtic) <- paste0( + feature_ms2_maxtic$FEATURE_ID, " maxTic") ``` Next we export the data to a file which can then be submitted to GNPS [feature-based @@ -242,7 +261,8 @@ molecular networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/). ```{r} -writeMgfData(filteredMs2Spectra_maxTic, "ms2spectra_maxTic.mgf") +export(backend = MsBackendMgf(), + feature_ms2_maxtic, file = "ms2spectra_maxTic.mgf") ``` At last we subset the peak area quantification table to features for which we @@ -253,7 +273,7 @@ networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularn ```{r} ## filter data table to contain only peaks with MSMS DF[ , !(names(DF) %in% drops)] filteredDataTable <- dataTable[which( - dataTable$Row.names %in% filteredMs2Spectra@elementMetadata$feature_id),] + dataTable$feature_id %in% feature_ms2_maxtic$FEATURE_ID),] ``` ```{r} @@ -271,30 +291,23 @@ Alternatively, instead of selecting the spectrum with the largest total signal as representative MS2 spectrum for each feature, we can create a *consensus MS2 spectrum*. A consensus MS2 spectrum can for example be created by combining all MS2 spectra for a feature into a single spectrum that contains peaks present in -the majority of spectra. Note however that this feature is experimental at -present. - -To this end we can use the `consensusSpectrum` function in combination with the -`combineSpectra` function. The parameter `minProp` defines the mimimal -proportion of spectra in which a peak has to be present in order for it to be -added to the consensus spectrum (0.8 -> 80% of spectra). The parameters `mzd` -and `ppm` allow to define how to group peaks between spectra with `mzd` being a -fixed, constant value and all peaks between spectra with a difference in their -m/z < `mzd` are combined into the final mass peak in the consensus -spectrum. Finally, the parameter `ppm` allows to perform an m/z dependent -grouping of mass peaks, i.e. mass peaks with a difference in their m/z smaller -than `ppm` are combined. - -For more details see the documentation of the -[consensusSpectrum](https://rdrr.io/bioc/MSnbase/man/consensusSpectrum.html) -function in the MSnbase R package. +the majority of spectra. For this we set parameters `ppm = 10`, `peaks = +"intersect"` and `minProp = 0.5` in the `Spectra::combineSpectra()` function: +all fragment peaks in all MS2 spectra of a feature with a similar *m/z* (based +on parameter `ppm`) that are present in >= `minProp` of the related MS2 spectra +are reported in the final *consensus* spectrum. ```{r, message = FALSE, warning = FALSE} -filteredMs2Spectra_consensus <- MSnbase::combineSpectra( - filteredMs2Spectra, fcol = "feature_id", method = consensusSpectrum, - mzd = 0, minProp = 0.8, ppm = 10) +feature_ms2_consensus <- combineSpectra( + feature_ms2, f = feature_ms2$FEATURE_ID, p = rep(1, length(feature_ms2)), + ppm = 10, peaks = "intersect", minProp = 0.5) +spectraNames(feature_ms2_consensus) <- paste0( + feature_ms2_maxtic$FEATURE_ID, " consensus") +#' get rid of empty spectra +feature_ms2_consensus <- filterEmptySpectra(feature_ms2_consensus) -writeMgfData(filteredMs2Spectra_consensus, "ms2spectra_consensus.mgf") +export(backend = MsBackendMgf(), + feature_ms2_consensus, file = "ms2spectra_consensus.mgf") ``` Analogously we subset the peak area quantification table to features for which @@ -303,8 +316,8 @@ file. This file can be submitted to GNPS [feature-based molecular networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/): ```{r} -consensusDataTable <- dataTable[which(dataTable$Row.names %in% - filteredMs2Spectra_consensus@elementMetadata$feature_id),] +consensusDataTable <- dataTable[which(dataTable$feature_id %in% + feature_ms2_consensus$FEATURE_ID),] head(consensusDataTable) ``` @@ -316,17 +329,17 @@ write.table(consensusDataTable, "xcms_consensusMS2.txt", ### CAMERA annotation of adducts and isotopes The code in this section describes how the data can be processed to enable the -ion identify networking (IIN) in FBMN. In brief, we are using the `CAMERA` +ion identify networking (IIN) in FBMN. In brief, we are using the *CAMERA* package to determine which features might be adducts or isotopes of the same compound. This information is exported as an additional *edges* file and is added to the feature annotation.. Note: the CAMERA package supports objects of class `xcmsSet`, which were the -outputs of the *old* version of xcms. The newer `XCMSnExp` object can however be -converted to an `xcmsSet` object with the `as(object, "xcmsSet")`, which does -however not support conversion of objects with MS level > 1 data. Thus we use -the `filterMsLevel` function on the result object to restrict the data in the -object to MS level 1 prior to the conversion. +outputs of the *old* version of xcms. The newer *xcms* result objects can +however be converted to an `xcmsSet` object with the `as(object, "xcmsSet")`, +which does however not support conversion of objects with MS level > 1 +data. Thus we use the `filterMsLevel()` function on the result object to +restrict the data in the object to MS level 1 prior to the conversion. ```{r, message = FALSE} library(CAMERA) @@ -337,7 +350,7 @@ With the conversion we lost also the sample group assignment which we have to manually add. ```{r, message = FALSE} -sampclass(xset) <- s_groups +sampclass(xset) <- sampleData(processedData)$sample_group ``` Create and `xsAnnotate` object named `xsa` extracting the peak table from the @@ -383,8 +396,8 @@ xsaC <- groupCorr(xsaF, cor_eic_th = 0.6, pval = 0.05, graphMethod = "hcs", This step has seperated our 283 pseudospectra into 2132. Now we are going to deal with the second big step of CAMERA workflow: annotation -of isotopes and adducts within pseudospectra-groups. The `findIsotopes` -annotates isotopes according to the relation between features' +of isotopes and adducts within pseudospectra-groups. The `findIsotopes()` +function annotates isotopes according to the relation between features' C12/C13. Parameter `intval` allows to specify which feature value should be uses (`"into"` for the maximum peak intensity, `"into"` for the integrated peak intensity and `"intb"` for the baseline corrected integrated peak intensity, @@ -400,7 +413,7 @@ xsaFI <- findIsotopes(xsaC, maxcharge = 2, maxiso = 3, minfrac = 0.5, Next adducts are identified and annotated based on the m/z differences between grouped chromatographic peaks. Setting the correct polarity with parameter `polarity` (either `"positive"` or `"negative"` is key). For potential adducts, -`CAMERA` calculates by default all possible combinations from the standard ions +*CAMERA* calculates by default all possible combinations from the standard ions depending on the ionization mode. Alternatively it is possible to limit to a predefined set of adducts with the `rules` parameter. @@ -428,8 +441,8 @@ edgelist <- getEdgelist(xsaFA) The resulting `data.frame` contains an edge between nodes (i.e. pairwise associations between potential adducts/isotopes from the same *correlation -group* defined by `CAMERA`) in each row. All edges fulfill the criteria from -`CAMERA` (i.e. representing signal from co-eluting ions with a similar peak +group* defined by *CAMERA*) in each row. All edges fulfill the criteria from +*CAMERA* (i.e. representing signal from co-eluting ions with a similar peak shape), but only for few the actual adduct annotation could be determined (see below). @@ -447,8 +460,8 @@ annotation. head(edgelist[edgelist$EdgeType == "MS1 annotation", ]) ``` -In addition we extract per-feature annotations from the `CAMERA` result object -with the `getFeatureAnnotations` function (also defined in +In addition we extract per-feature annotations from the *CAMERA* result object +with the `getFeatureAnnotations()` function (also defined in [xcms-gnps-tools](https://github.com/jorainer/xcms-gnps-tools)). These are appended to the feature table `dataTable` generated in the previous section. diff --git a/XCMS3_Preprocessing_bacterial_samples.Rmd b/XCMS3_Preprocessing_bacterial_samples.Rmd index ce590a5..865c487 100644 --- a/XCMS3_Preprocessing_bacterial_samples.Rmd +++ b/XCMS3_Preprocessing_bacterial_samples.Rmd @@ -3,15 +3,18 @@ To follow this example tutorial, download the *MSV000079204* data set from:
https://gnps.ucsd.edu/ProteoSAFe/result.jsp?task=d74ca92d9dec4e2883f28506c670e3ca&view=advanced_view -Note that the settings for `xcms` used in this tutorial were not optimized, +Note that the settings for *xcms* used in this tutorial were not optimized, specifically the alignment based on the default *obiwarp* parameters might perform a little to strong retention time adjustment. For more information on optimization of the parameters see the [xcms vignette](https://bioconductor.org/packages/release/bioc/vignettes/xcms/inst/doc/xcms.html) -or the [preprocessing-untargeted-metabolomics](https://github.com/Bioconductor/CSAMA/tree/2019/lab/4-thursday/lab-05-metabolomics) workshop. +or the [Metabonaut +tutorials](https://rformassspectrometry.github.io/Metabonaut/). Load required libraries and utility functions for GNPS export. ```{r, message = FALSE} +library(MsExperiment) +library(Spectra) library(xcms) source("https://raw.githubusercontent.com/jorainer/xcms-gnps-tools/master/customFunctions.R") ``` @@ -40,12 +43,11 @@ adjusted to the actual experimental setup. For the present example analysis we put all files into the same sample group. ```{r} -mzXMLfiles <- paste0('MSV000079204/', - list.files('MSV000079204/', pattern = ".mzXML$", - recursive = TRUE)) -s_groups <- rep("sample", length(mzXMLfiles)) -pheno <- data.frame(sample_name = basename(mzXMLfiles), - sample_group = s_groups, stringsAsFactors = FALSE) +mzxml_files <- dir("MSV000079204", pattern = ".mzXML$", + recursive = TRUE, full.names = TRUE) +s_groups <- rep("sample", length(mzxml_files)) +pheno <- data.frame(sample_name = basename(mzxml_files), + sample_group = s_groups) ``` ```{r} @@ -55,8 +57,7 @@ head(pheno) Read all raw data, including MS2 level. ```{r} -rawData <- readMSData(mzXMLfiles, centroided. = TRUE, mode = "onDisk", - pdata = new("NAnnotatedDataFrame", pheno)) +rawData <- readMsExperiment(mzxml_files, sampleData = pheno) ``` Create a base peak chromatogram (BPC) of your data for visual inspection. @@ -94,14 +95,13 @@ We skip the retention time adjustment, because the different files have considerable differences in retention time ranges (ranging from 300 to 5000 seconds). - ### Peak grouping Define the parameters for the *peak density*-based peak grouping (correspondence analysis). ```{r, message = FALSE, warning = FALSE} -pdp <- PeakDensityParam(sampleGroups = processedData$sample_group, +pdp <- PeakDensityParam(sampleGroups = sampleData(processedData)$sample_group, minFraction = 0.10) processedData <- groupChromPeaks(processedData, param = pdp) ``` @@ -109,53 +109,49 @@ processedData <- groupChromPeaks(processedData, param = pdp) ### Gap filling Fill-in missing peaks. Peak detection might have failed for some features in -some samples. The `fillChromPeaks` function allows to integrate for such cases -all signal in the respective m/z - retention time range. Below we first define -the median width of identified chromatographic peaks in retention time dimension -and use this as parameter `fixedRt` for the `fillChromPeaks`. +some samples. The `fillChromPeaks()` function allows to integrate for such cases +all signal in the respective m/z - retention time range. ```{r, message = FALSE, warning = FALSE} -medWidth <- median(chromPeaks(processedData)[, "rtmax"] - - chromPeaks(processedData)[, "rtmin"]) -processed_Data <- fillChromPeaks(processedData, - param = FillChromPeaksParam(fixedRt = medWidth)) +processedData <- fillChromPeaks(processedData, param = ChromPeakAreaParam()) ``` ### Export data -#### export MS1 and MS2 features +#### Export MS1 feature abundances and related MS2 spectra -Below we use the `featureSpectra` function to extract all MS2 spectra with their -precursor m/z being within the m/z range of a feature/peak and their retention -time within the rt range of the same feature/peak. Note that for older `xcms` -versions (i.e. before version 3.12) `return.type = "Spectra"` has to be used -instead of `return.type = "MSpectra"` as in the example below. Zero-intensity -values are removed from each spectrum with the `clean` function, and -subsequently processed into the expected format using the `formatSpectraForGNPS` -function. +Below we use the `featureSpectra()` function to extract all MS2 spectra with +their precursor m/z being within the m/z range of a feature/peak and their +retention time within the rt range of the same feature/peak. See the +[XCMS3_Preprocessing.Rmd](XCMS3_Preprocessing.Rmd) file for details and +descriptions of the various functions and parameters used for extracting, +combining and exporting the features' MS2 spectra. ```{r} ## export the individual spectra into a .mgf file -filteredMs2Spectra <- featureSpectra(processedData, return.type = "MSpectra") -filteredMs2Spectra <- clean(filteredMs2Spectra, all = TRUE) -filteredMs2Spectra <- formatSpectraForGNPS(filteredMs2Spectra) +feature_ms2 <- featureSpectra(processedData, return.type = "Spectra") +feature_ms2 <- filterIntensity(feature_ms2, intensity = 0) +feature_ms2 <- filterEmptySpectra(feature_ms2) +feature_ms2 <- formatSpectraForGNPS(feature_ms2) ``` -The extracted MS2 spectra are saved as *ms2spectra_all.mgf* file. This file can -for example be used to do *in silico* structure prediction through +The extracted MS2 spectra are saved as *ms2spectra_all_bacterial.mgf* file. This +file can for example be used to do *in silico* structure prediction through [SIRIUS+CSI:FingerID](https://bio.informatik.uni-jena.de/software/sirius/). ```{r} -writeMgfData(filteredMs2Spectra, "ms2spectra_all.mgf") +library(MsBackendMgf) +export(backend = MsBackendMgf(), + feature_ms2, file = "ms2spectra_all_bacterial.mgf") ``` Export peak area quantification table. To this end we first extract the *feature definitions* (i.e. the m/z and retention time ranges and other metadata for all defined features in the data set) and then the integrated peak areas (with the -`featureValues` function). This peak area quantification table contains features -and respective per sample peak areas in columns. The combined data is then saved -to the file *xcms_all.txt*. Note that it is now also possible to use the entire -feature table in the FBMN workflow. +`featureValues()` function). This peak area quantification table contains +features and respective per sample peak areas in columns. The combined data is +then saved to the file *xcms_all_bacterial.txt*. Note that it is now also +possible to use the entire feature table in the FBMN workflow. ```{r} ## get data @@ -163,8 +159,9 @@ featuresDef <- featureDefinitions(processedData) featuresIntensities <- featureValues(processedData, value = "into") ## generate data table -dataTable <- merge(featuresDef, featuresIntensities, by=0, all=TRUE) +dataTable <- merge(featuresDef, featuresIntensities, by=0, all = TRUE) dataTable <- dataTable[, !(names(dataTable) %in% c("peakidx"))] +colnames(dataTable)[1L] <- "feature_id" ``` ```{r} @@ -172,36 +169,35 @@ head(dataTable) ``` ```{r} -write.table(dataTable, "xcms_all.txt", sep = "\t", quote = FALSE, row.names = FALSE) +write.table(dataTable, "xcms_all_bacterial.txt", sep = "\t", + quote = FALSE, row.names = FALSE) ``` -#### export MS2 features only +#### Export features with available MS2 spectra only -The `filteredMs2Spectra` contains all MS2 spectra with their precursor m/z +The `feature_ms2` contains all MS2 spectra with their precursor m/z within the feature's m/z range and a retention time that is within the retention time of the chromatographic peak/feature. We thus have multiple MS2 spectra for -each feature (also from each sample). Metadata column `"feature_id"` indicates +each feature (also from each sample). Metadata column `"FEATURE_ID"` indicates to which feature a MS2 spectrum belongs: ```{r} -filteredMs2Spectra +head(feature_ms2$FEATURE_ID) ``` We next select a single MS2 spectrum for each feature and export this reduced -set also as an .mgf file. We use the `combineSpectra` function on the list of -spectra and specify with `fcol = "feature_id"` how the spectra are grouped -(i.e. all spectra with the same feature id are processed together). On the set -of spectra of the same feature we apply the `maxTic` function that simply -returns the spectrum with the largest sum of intensities. We thus select with -the code below the spectrum with the largest total signal as the -*representative* MS2 spectrum for each feature. - +set also as an .mgf file. In this first example we report the fragment peaks of +the MS2 spectrum with the highest TIC (sum of fragment intensities). See the +[XCMS3_Preprocessing.Rmd](XCMS3_Preprocessing.Rmd) file for more information and +description of the settings. ```{r} ## Select for each feature the Spectrum2 with the largest TIC. -filteredMs2Spectra_maxTic <- combineSpectra(filteredMs2Spectra, - fcol = "feature_id", - method = maxTic) +feature_ms2_maxtic <- combineSpectra( + feature_ms2, f = feature_ms2$FEATURE_ID, + p = rep(1, length(feature_ms2)), FUN = maxTicPeaksData) +spectraNames(feature_ms2_maxtic) <- paste0( + feature_ms2_maxtic$FEATURE_ID, " maxTic") ``` Next we export the data to a file which can then be submitted to GNPS [feature-based @@ -209,18 +205,19 @@ molecular networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/). ```{r} -writeMgfData(filteredMs2Spectra_maxTic, "ms2spectra_maxTic.mgf") +export(backend = MsBackendMgf(), + feature_ms2_maxtic, file = "ms2spectra_maxTic_bacterial.mgf") ``` At last we subset the peak area quantification table to features for which we -have also an MS2 spectrum and export this to the *xcms_onlyMS2.txt* file. This -file can be submitted to GNPS [feature-based molecular +have also an MS2 spectrum and export this to the *xcms_onlyMS2_bacterial.txt* +file. This file can be submitted to GNPS [feature-based molecular networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/): ```{r} ## filter data table to contain only peaks with MSMS DF[ , !(names(DF) %in% drops)] filteredDataTable <- dataTable[which( - dataTable$Row.names %in% filteredMs2Spectra@elementMetadata$feature_id),] + dataTable$feature_id %in% feature_ms2_maxtic$FEATURE_ID),] ``` ```{r} @@ -228,7 +225,8 @@ head(filteredDataTable) ``` ```{r} -write.table(filteredDataTable, "xcms_onlyMS2.txt", sep = "\t", quote = FALSE, row.names = FALSE) +write.table(filteredDataTable, "xcms_onlyMS2_bacterial.txt", + sep = "\t", quote = FALSE, row.names = FALSE) ``` #### Export MS2 consensus spectra @@ -238,43 +236,32 @@ as representative MS2 spectrum for each feature, we can create a *consensus MS2 spectrum*. A consensus MS2 spectrum can for example be created by combining all MS2 spectra for a feature into a single spectrum that contains peaks present in the majority of spectra. Note however that this feature is experimental at -present. - -To this end we can use the `consensusSpectrum` function in combination with the -`combineSpectra` function. The parameter `minProp` defines the mimimal -proportion of spectra in which a peak has to be present in order for it to be -added to the consensus spectrum (0.8 -> 80% of spectra). The parameters `mzd` -and `ppm` allow to define how to group peaks between spectra with `mzd` being a -fixed, constant value and all peaks between spectra with a difference in their -m/z < `mzd` are combined into the final mass peak in the consensus -spectrum. Finally, the parameter `ppm` allows to perform an m/z dependent -grouping of mass peaks, i.e. mass peaks with a difference in their m/z smaller -than `ppm` are combined. - -For more details see the documentation of the -[consensusSpectrum](https://rdrr.io/bioc/MSnbase/man/consensusSpectrum.html) -function in the MSnbase R package. +present. Again, see the [XCMS3_Preprocessing.Rmd](XCMS3_Preprocessing.Rmd) file +for more information. ```{r, message = FALSE, warning = FALSE} -filteredMs2Spectra_consensus <- combineSpectra( - filteredMs2Spectra, fcol = "feature_id", method = consensusSpectrum, - mzd = 0, minProp = 0.8, ppm = 10) - -writeMgfData(filteredMs2Spectra_consensus, "ms2spectra_consensus_bacterial.mgf") +feature_ms2_consensus <- combineSpectra( + feature_ms2, f = feature_ms2$FEATURE_ID, p = rep(1, length(feature_ms2)), + ppm = 10, peaks = "intersect", minProp = 0.5) +spectraNames(feature_ms2_consensus) <- paste0( + feature_ms2_maxtic$FEATURE_ID, " consensus") +#' get rid of empty spectra +feature_ms2_consensus <- filterEmptySpectra(feature_ms2_consensus) + +export(backend = MsBackendMgf(), + feature_ms2_consensus, file = "ms2spectra_consensus_bacterial.mgf") ``` Analogously we subset the peak area quantification table to features for which -we have an MS2 consensus spectrum and export this to the *xcms_consensusMS2.txt* -file. This file can be submitted to GNPS [feature-based molecular +we have an MS2 consensus spectrum and export this to the +*xcms_consensusMS2_bacterial.txt* file. This file can be submitted to GNPS +[feature-based molecular networking](https://ccms-ucsd.github.io/GNPSDocumentation/featurebasedmolecularnetworking/): ```{r} -consensusDataTable <- dataTable[which(dataTable$Row.names %in% - filteredMs2Spectra_consensus@elementMetadata$feature_id),] +consensusDataTable <- dataTable[which(dataTable$feature_id %in% + feature_ms2_consensus$FEATURE_ID),] head(consensusDataTable) -``` - -```{r} write.table(consensusDataTable, "xcms_consensusMS2_bacterial.txt", sep = "\t", quote = FALSE, row.names = FALSE) ```