Explain the resulting profile
The result that you obtain from motus profile or motus calc_motu is a profile with three header lines that start with #. After these three lines come the taxon id/name and read count values.
Example of a motus profile:
# git tag version 2.0.0 | motus version 2.0.0 | map_tax 2.0.0 | gene database: nr2.0.0 | calc_mgc 2.0.0 -y insert.scaled_counts -l 75 | calc_motu 2.0.0 -k mOTU -g 3 | taxonomy: ref_mOTU_2.0.0 meta_mOTU_2.0.0
# call: python mOTUs_v2/motus profile -s test1_single.fastq -n test1
#consensus_taxonomy test1
Kandleria vitulina [ref_mOTU_v2_0001] 0.0688211617
Methyloversatilis universalis [ref_mOTU_v2_0002] 0.0000000000
Megasphaera genomosp. [ref_mOTU_v2_0003] 0.0234955832
...
Thermoproteus uzoniensis [ref_mOTU_v2_5304] 0.0000000000
Paenibacillus sp. [ref_mOTU_v2_5305] 0.0030541740
unknown Bdellovibrio [meta_mOTU_v2_5307] 0.0000000000
unknown Alphaproteobacteria [meta_mOTU_v2_5308] 0.0000031719
...
unknown Clostridiales [meta_mOTU_v2_7800] 0.0000000000
unassigned 0.2307163722
You can easily remove the first two rows with:
tail -n+3 taxonomic_profiling.txt
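If you prefer to keep the header lines instead of discarding them, the profile can also be read programmatically. The following is a minimal sketch, not part of the mOTUs tool: the function name parse_profile is hypothetical, the first header in the sample is shortened for brevity, and the code assumes that the value is the last whitespace-separated field on each row.

```python
def parse_profile(text):
    """Split a profile into its '#' header lines and a {taxon label: value} dict."""
    headers, values = [], {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            headers.append(line)
        else:
            # Assumption: the value is the last whitespace-separated field,
            # everything before it is the taxon label (which may contain spaces).
            taxon, value = line.rsplit(None, 1)
            values[taxon] = float(value)
    return headers, values


# A few rows taken from the example profile above (first header shortened).
sample = """\
# git tag version 2.0.0 | motus version 2.0.0
# call: python mOTUs_v2/motus profile -s test1_single.fastq -n test1
#consensus_taxonomy\ttest1
Kandleria vitulina [ref_mOTU_v2_0001]\t0.0688211617
unknown Alphaproteobacteria [meta_mOTU_v2_5308]\t0.0000031719
unassigned\t0.2307163722
"""

headers, values = parse_profile(sample)
sample_name = headers[2].rsplit(None, 1)[1]  # last field of the third header
print(sample_name, len(values))
```

Splitting on the last run of whitespace keeps the sketch working whether the columns are separated by tabs or by spaces.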
Let's analyse the single parts:
The first header describes the versions of the scripts and database that were used for profiling, as well as the parameters used for the computation. With this information it is possible to reproduce the same profile, and it is useful for checking the parameters used for this specific profile.
The second header contains the call that produced the profile, with the information of the fastq files that were used.
The third header contains the information on what the rows represent and the name of the sample(s).
There are 5,232 ref-mOTUs, which represent species with a reference genome in NCBI (or other databases). The name of a ref-mOTU is resolved at the species level. Note that the mOTUs represent species based on genetic distances, which in some cases differ from the historical phenotype-based classification of prokaryotic species (check Mende et al. Nature Methods 2013 and Parks et al. Nature Biotechnology 2018 for more information).
There are 2,494 meta-mOTUs, which represent species without a reference genome. These mOTUs are extracted from metagenomes from human-associated biomes (oral cavity, vagina, skin and gut) and global oceans. The annotation is done through the lowest common ancestor (LCA) and in most cases is not resolved at the species level. For example, unknown Alphaproteobacteria [meta_mOTU_v2_5308] is a species that belongs to the class Alphaproteobacteria, for which the genome sequence is not available in NCBI.
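The two kinds of mOTUs can be told apart from the identifier in square brackets. The following is a minimal sketch, assuming that every identifier follows the ref_mOTU_v2_NNNN / meta_mOTU_v2_NNNN pattern seen in the example profile; the helper name motu_kind is hypothetical and not part of the mOTUs tool.

```python
import re

# Assumption: identifiers look like [ref_mOTU_v2_0001] or [meta_mOTU_v2_5308],
# as in the example profile.
MOTU_ID = re.compile(r"\[(ref|meta)_mOTU_v2_(\d+)\]")


def motu_kind(label):
    """Return 'ref' or 'meta' for a taxon label, or None if it has no mOTU id."""
    match = MOTU_ID.search(label)
    if match is None:
        return None  # e.g. the 'unassigned' row
    return match.group(1)


labels = [
    "Kandleria vitulina [ref_mOTU_v2_0001]",
    "unknown Alphaproteobacteria [meta_mOTU_v2_5308]",
    "unassigned",
]
for label in labels:
    print(label, "->", motu_kind(label))
```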
The unassigned value at the end of the profile file represents the fraction of unmapped reads. It stands for species that we know to be present in the sample but are not able to quantify. For almost all analyses it is better to remove this value, since it does not represent a single species/clade. The unassigned value becomes useful when we need to calculate relative abundances. See the following example:
| Species | True rel. ab. | mOTUs read counts | mOTUs rel. ab. |
|---|---|---|---|
| species1 | 20% | 200 | 20% |
| species2 | 10% | - | - |
| species3 | 30% | 300 | 30% |
| species4 | 10% | 100 | 10% |
| species5 | 30% | - | - |
| unassigned | - | 400 | 40% |
In the example the sample (True rel. ab.) contains 5 species, of which only 3 are represented in the mOTUs profiler. Despite this, the relative abundance of these species is correct, since we are able to measure the unassigned fraction (the unmapped reads). If you calculated the relative abundance without taking the unassigned fraction into account, you would overestimate the profiled species:
| Species | True rel. ab. | mOTUs read counts | mOTUs rel. ab. |
|---|---|---|---|
| species1 | 20% | 200 | 33.3% |
| species2 | 10% | - | - |
| species3 | 30% | 300 | 50% |
| species4 | 10% | 100 | 16.7% |
| species5 | 30% | - | - |
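The arithmetic behind the two examples can be reproduced with a few lines of Python. This is only an illustrative sketch using the read counts from the example; the function name relative_abundance and its include_unassigned parameter are hypothetical and not part of the mOTUs tool.

```python
def relative_abundance(counts, include_unassigned=True):
    """Divide each count by the total, optionally dropping 'unassigned' first."""
    if not include_unassigned:
        counts = {k: v for k, v in counts.items() if k != "unassigned"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot normalise an empty profile")
    return {k: v / total for k, v in counts.items()}


# Read counts from the example above.
counts = {"species1": 200, "species3": 300, "species4": 100, "unassigned": 400}

with_unassigned = relative_abundance(counts)
without_unassigned = relative_abundance(counts, include_unassigned=False)

for name in ("species1", "species3", "species4"):
    print(f"{name}: {with_unassigned[name]:.1%} with unassigned, "
          f"{without_unassigned[name]:.1%} without")
```

Keeping the unassigned count in the denominator is what makes the profiled species match their true relative abundances; dropping it redistributes the missing 40% across the three profiled species.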