A Snakemake workflow for Genome Resolved Metagenomics
- Preprocessing:
  - FASTQ processing with `fastp`.
  - Mapping of preprocessed reads against the host(s) and possible contaminants with `bowtie2`. Skip if no host is provided.
    - Useful for environmental genomics (there is no host in soil, duh!), or
    - environments that can have multiple genomes, like in mycorrhiza, where plant, fungus and even insects can be present.
  - Assembly-free statistics with `kraken2`, `nonpareil` and `singlem`.
- Assembly of non-host reads with `megahit`.
  - Coassembly strategies denoted in the `samples.tsv` file. See below.
  - Taxonomic annotation of contigs with `kraken2`.
  - Assembly quantification with `bowtie2` and `coverm`.
- Bacterial metagenomics:
  - Binning with `CONCOCT`, `Maxbin2`, `MetaBAT2`, and aggregated with `MAGScoT`.
  - Annotation with `quast` (MAG and contig lengths), `gtdbtk` (taxonomy), `dram` (functions) and `checkm2` (completeness and contamination).
  - Dereplication with `dRep`, using multiple secondary ANIs, in case you need one for read mapping (e.g. 95%) and a different one for something like pangenomics (98 and 99%).
  - Quantification with `bowtie2` and `coverm`. One per secondary ANI.
- Viral metagenomics:
  - Identification and clustering with `genomad`, `bbmap` and `mmseqs`.
  - Quantification with `bowtie2` and `coverm`.
  - Annotation with `dram`, `virsorter2`, `checkv` and `quast`.
- Module reporting with `multiqc`, assisted with `samtools` and `fastqc`.
- Make sure you have `conda`, `mamba` and `snakemake` installed.

      conda --version
      snakemake --version
      mamba --version

- Clone the git repository in your terminal and enter it:

      git clone git@github.com:3d-omics/mg_assembly.git
      cd mg_assembly

- Test your installation by running the test data. It will download all the necessary software through conda / mamba. It should take less than 5 minutes.

      snakemake --use-conda --cores 8 test

- Run it with your own data:
  - Edit `config/samples.tsv` and add your sample names, a library identifier to differentiate them, where they are located, the adapters used, and the coassemblies each sample will belong to.

        sample_id	library_id	forward_filename	reverse_filename	forward_adapter	reverse_adapter	assembly_ids
        sample1	lib1	resources/reads/sample1_1.fq.gz	resources/reads/sample1_2.fq.gz	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	sample, all
        sample2	lib1	resources/reads/sample2_1.fq.gz	resources/reads/sample2_2.fq.gz	AGATCGGAAGAGCACACGTCTGAACTCCAGTCA	AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT	all

    In the `assembly_ids` column you can name all the coassemblies each library will belong to. If you don't want to use a sample (a blank or a failed sample), leave the field empty.

  - Edit `config/features.yml` with reference databases:

        hosts:  # Add more in case of multi-host, remove entries in case of environmental sample
          chicken: resources/reference/chicken_39_sub.fa.gz
          # pig: resources/reference/pig.fa.gz

        magscot:
          pfam_hmm: workflow/scripts/MAGScoT/hmm/gtdbtk_rel207_Pfam-A.hmm.gz
          tigr_hmm: workflow/scripts/MAGScoT/hmm/gtdbtk_rel207_tigrfam.hmm.gz

        databases:  # The pipeline does not provide or generate them. There are scripts tho.
          checkm2: resources/databases/checkm2/20210323/uniref100.KO.1.dmnd
          checkv: resources/databases/checkv/20230320/checkv-db-v1.5/
          dram: resources/databases/dram/20230811
          genomad: resources/databases/genomad/genomad_db_v1.7
          gtdbtk: resources/databases/gtdbtk/release214
          kraken2:  # add entries as necessary
            refseq500: resources/databases/kraken2/kraken2_RefSeqV205_Complete_500GB/20220505/
          singlem: resources/databases/singlem/S3.2.1.GTDB_r214.metapackage_20231006.smpkg.zb
          virsorter2: resources/databases/virsorter2/20200511/

  - Edit `config/params.yml` with execution parameters. The defaults are reasonable.

- Run the pipeline

      # make sure beforehand that you have all the databases above properly installed
      snakemake --use-conda --cores 8  # locally
      snakemake --use-conda --cores 24 --jobs 100 --executor slurm  # in slurm

- Output. The main outputs are:
  - `results/prokaryotes/annotate/`: MAG annotations.
  - `results/prokaryotes/quantify/`: MAG and contig-wise quantifications.
  - There is an experimental pipeline for viral identification with a similar structure. See below. The results are in `results/viruses/`.
  - MultiQC HTML reports and tables next to the main modules: `preprocess`, `assemble`, `prokaryotes` and `viruses`.
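
Before launching a run, it can help to check which libraries end up in which coassembly. The following is a minimal sketch, not the workflow's own code: it assumes that `assembly_ids` in `config/samples.tsv` is a comma-separated list whose surrounding spaces are ignored (as the `sample, all` example suggests) and that an empty field excludes the library. The function name `group_by_assembly`, the shortened file names, the `ACGT` adapters and the `blank1` row are hypothetical.

```python
import csv
import io
from collections import defaultdict


def group_by_assembly(tsv_text):
    """Return {assembly_id: [(sample_id, library_id), ...]} from samples.tsv text."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        # Assumption: assembly_ids is comma-separated and surrounding spaces are ignored.
        raw = row.get("assembly_ids") or ""
        for assembly_id in raw.split(","):
            assembly_id = assembly_id.strip()
            if assembly_id:  # an empty field leaves the library out of every assembly
                groups[assembly_id].append((row["sample_id"], row["library_id"]))
    return dict(groups)


# Hypothetical table with the same columns as config/samples.tsv
HEADER = ["sample_id", "library_id", "forward_filename", "reverse_filename",
          "forward_adapter", "reverse_adapter", "assembly_ids"]
ROWS = [
    ["sample1", "lib1", "s1_1.fq.gz", "s1_2.fq.gz", "ACGT", "ACGT", "sample, all"],
    ["sample2", "lib1", "s2_1.fq.gz", "s2_2.fq.gz", "ACGT", "ACGT", "all"],
    ["blank1", "lib1", "b1_1.fq.gz", "b1_2.fq.gz", "ACGT", "ACGT", ""],  # hypothetical blank
]
EXAMPLE_TSV = "\n".join("\t".join(fields) for fields in [HEADER, *ROWS])

if __name__ == "__main__":
    for assembly_id, libraries in group_by_assembly(EXAMPLE_TSV).items():
        print(assembly_id, libraries)
```

To run it on a real table, read the file and pass its contents, for example `group_by_assembly(open("config/samples.tsv").read())`. An assembly that lists a single library is an individual assembly, and one that lists several is a coassembly.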