Since VEBA functionality benefits from structure, it's good to have a list of identifiers that you can use in for-loops. In the examples, it will be the following: `identifiers.list`
However, for datasets with both metagenomics and metatranscriptomics it's often useful to have a master list `identifiers.list` and separate `identifiers.dna.list` and `identifiers.rna.list` for metagenomic and metatranscriptomic samples, respectively.
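For example, a minimal sketch of looping over the identifiers (this assumes `identifiers.list` holds one identifier per line; the two SRA accessions below are just placeholders):

```shell
# Hypothetical identifiers.list with one sample identifier per line.
cat > identifiers.list <<EOF
SRR17458603
SRR17458606
EOF

# Loop over the identifiers to build per-sample commands.
while read -r ID; do
    echo "Processing sample: ${ID}"
done < identifiers.list
```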
Our VEBA project directory is going to be `veba_output` and each step will be in a subdirectory.
- e.g., `veba_output/preprocess` will have all of the preprocessed reads.
In the workflows that operate on specific samples, there will be sample subdirectories.
- e.g., `veba_output/preprocess/SRR17458603/output`, `veba_output/preprocess/SRR17458606/output`, ...
- e.g., `veba_output/assembly/SRR17458603/output`, `veba_output/assembly/SRR17458606/output`, ...
Many of these jobs should be run using a job scheduler like SunGridEngine or SLURM. This resource is useful for converting commands between SunGridEngine and SLURM. I've used both and these are adaptations of the submission commands you can use as a template:
```bash
# Let's create an informative name. Remember, we are going to create a lot of
# jobs and log files for the different workflows if you have multiple samples.
N=preprocessing__${ID}
CMD="some command we want to run"

# SunGridEngine:
qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}"

# SLURM:
sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap="${CMD}"
```
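Putting the identifier list and the scheduler template together, submitting one job per sample can be sketched like this (a dry-run sketch: the sample IDs and the preprocessing command are placeholders, and `echo` in front of `sbatch` only prints the submission; remove it on a real cluster):

```shell
# Dry-run sketch of per-sample SLURM submission.
mkdir -p logs
N_JOBS=4

for ID in SRR17458603 SRR17458606; do
    N="preprocessing__${ID}"
    CMD="some preprocessing command for ${ID}"  # placeholder command
    # 'echo' makes this a dry run; drop it to actually submit.
    echo sbatch -J "${N}" -N 1 -c "${N_JOBS}" --ntasks-per-node=1 \
        -o "logs/${N}.o" -e "logs/${N}.e" --export=ALL \
        -t 12:00:00 --mem=20G --wrap="${CMD}"
done
```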
- Downloading and preprocessing fastq files - Explains how to download reads from NCBI and run VEBA's `preprocess.py` module to decontaminate metagenomic and/or metatranscriptomic reads.
- Complete end-to-end metagenomics analysis - Goes through assembling metagenomic reads, binning, clustering, classification, and annotation. We also show how to use the unbinned contigs in a pseudo-coassembly with guidelines on when it's a good idea to go this route.
- Recovering viruses from metatranscriptomics - Goes through assembling metatranscriptomic reads, viral binning, clustering, and classification.
- Read mapping and counts tables - Read mapping and generating counts tables at the contig, MAG, SLC, ORF, and SSO levels.
- Phylogenetic inference - Phylogenetic inference of eukaryotic diatoms.
- Setting up bona fide coassemblies for metagenomics or metatranscriptomics - In the case where all samples are of low depth, it may be useful to use coassembly instead of sample-specific approaches. This walkthrough goes through concatenating reads, creating a reads table, coassembly of concatenated reads, aligning sample-specific reads to the coassembly for multiple sorted BAM files, and mapping reads for scaffold/transcript-level counts. Please note that a coassembly differs from the pseudo-coassembly concept introduced in the VEBA publication. For more information regarding the differences between bona fide coassembly and pseudo-coassembly, please refer to 23. What's the difference between a coassembly and a pseudo-coassembly?.
- Bioprospecting for biosynthetic gene clusters - Detecting biosynthetic gene clusters (BGCs) and scoring their novelty.
- Converting counts tables - Convert your counts table (with or without metadata) to anndata or biom format. Also supports Pandas pickle format.
- Adapting commands for Docker - Explains how to download and use Docker for running VEBA.
- Adapting commands for AWS - Explains how to download and use Docker for running VEBA specifically on AWS.
- Metabolic profiling of de novo genomes - Explains how to build and align reads to custom `HUMAnN` databases from de novo genomes and annotations.
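As a preview of the first step of the coassembly walkthrough above, concatenating preprocessed reads across samples can be sketched as follows (the directory layout and `cleaned_*.fastq.gz` file names are assumptions for illustration; check the actual names in your preprocess output):

```shell
# Hypothetical preprocessed-read layout; real VEBA output names may differ.
mkdir -p veba_output/misc
for ID in SRR17458603 SRR17458606; do
    mkdir -p "veba_output/preprocess/${ID}/output"
    printf '' > "veba_output/preprocess/${ID}/output/cleaned_1.fastq.gz"
    printf '' > "veba_output/preprocess/${ID}/output/cleaned_2.fastq.gz"
done

# gzip files can be concatenated directly, so pooling reads for a
# coassembly is a simple cat per read direction.
cat veba_output/preprocess/*/output/cleaned_1.fastq.gz > veba_output/misc/concatenated_1.fastq.gz
cat veba_output/preprocess/*/output/cleaned_2.fastq.gz > veba_output/misc/concatenated_2.fastq.gz
```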
Coming Soon:
- Workflow for low-depth samples with no bins
- Workflow for ASV detection from short-read amplicons
- Workflows for integrating 3rd party software with VEBA:
- Using EukHeist for eukaryotic binning followed by VEBA for mapping and annotation.
- Using EukMetaSanity for modeling genes for eukaryotic genomes recovered with VEBA.
- The final output files are in the `output` subdirectory and, to avoid redundant files, many of these are symlinked from the `intermediate` directory. This can cause issues if you are using a "scratch" directory where files are deleted after a certain amount of time. If you have a crontab set up, make sure it also touches symlinks and not just files.
- You'll need to adjust the memory and time for different jobs. Assembly will take much longer than preprocessing. Annotation will require more memory than mapping.
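If your scratch policy purges files by modification time, one way to keep the symlinks alive is to touch the links themselves: the `-h` flag of `touch` (GNU coreutils and BSD) operates on the link rather than its target. The layout below is a hypothetical illustration:

```shell
# Hypothetical layout: an output file symlinked from intermediate/.
mkdir -p demo/intermediate demo/output
touch demo/intermediate/result.tsv
ln -sf ../intermediate/result.tsv demo/output/result.tsv

# Refresh modification times of the symlinks themselves (-h), not their
# targets, so a time-based purge does not remove the links.
find demo -type l -exec touch -h {} +
```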