Getting started with VEBA

Basics:

Because VEBA workflows benefit from a consistent structure, it helps to keep a list of sample identifiers that you can iterate over in for-loops. The examples use the following file: identifiers.list

However, for datasets with metagenomics and metatranscriptomics it's often useful to have a master list identifiers.list and separate identifiers.dna.list and identifiers.rna.list for metagenomic and metatranscriptomic samples, respectively.
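As a minimal sketch, the lists could be built and iterated over as shown below. The two SRR accessions are the ones used in the examples in this guide; the RNA identifier is a hypothetical placeholder, and the split between DNA and RNA samples is an assumption for illustration only.

```shell
# Sketch: build the identifier lists and loop over them.
# Run this in a scratch directory, since it writes the list files to the current directory.
printf '%s\n' SRR17458603 SRR17458606 > identifiers.dna.list
printf '%s\n' SAMPLE_RNA_1 > identifiers.rna.list   # hypothetical identifier

# The master list is the concatenation of the DNA and RNA lists
cat identifiers.dna.list identifiers.rna.list > identifiers.list

# Iterate over every sample identifier
for ID in $(cat identifiers.list); do
    echo "Sample: ${ID}"
done
```

Looping over identifiers.dna.list or identifiers.rna.list instead restricts a step to one data type.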

Our VEBA project directory is going to be veba_output and each step will be in a subdirectory.

  • e.g., veba_output/preprocess will have all of the preprocessed reads.

In the workflows that work on specific samples, there will be sample subdirectories.

  • e.g., veba_output/preprocess/SRR17458603/output, veba_output/preprocess/SRR17458606/output, ...
  • e.g., veba_output/assembly/SRR17458603/output, veba_output/assembly/SRR17458606/output, ...
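The layout above can be expressed as a small helper that derives the output directory for a given step and sample. This is only a sketch: the function name sample_output_dir and the OUT_DIR variable are hypothetical and are not part of VEBA.

```shell
# Sketch: derive per-sample output directories from the layout described above.
# OUT_DIR and sample_output_dir are illustrative names, not VEBA settings.
OUT_DIR=veba_output

sample_output_dir() {
    # $1 = workflow step (e.g., preprocess), $2 = sample identifier
    echo "${OUT_DIR}/$1/$2/output"
}

for ID in SRR17458603 SRR17458606; do
    for STEP in preprocess assembly; do
        sample_output_dir "${STEP}" "${ID}"
    done
done
```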

Many of these jobs should be run using a job scheduler such as SunGridEngine or SLURM. This resource is useful for converting commands between SunGridEngine and SLURM. I've used both, and the following adaptations of my submission commands can serve as a template:

# Example values (assumptions; set these for your own run)
ID=SRR17458603
N_JOBS=4

# Create an informative job name. With multiple samples, each workflow produces many jobs and log files.
N=preprocessing__${ID}

CMD="some command we want to run"

# SunGridEngine (note: -j y merges stderr into the stdout log):
qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} "${CMD}"

# SLURM:
sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap="${CMD}"
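With several samples, the template can be wrapped in a loop. The sketch below only prints the submission command for each sample (a dry run), so it can be inspected before anything is submitted. The SCHEDULER variable, the N_JOBS value, and the build_submit_command function are hypothetical names introduced for illustration.

```shell
# Sketch: print one submission command per sample (dry run; nothing is submitted).
# SCHEDULER, N_JOBS, and build_submit_command are illustrative assumptions.
SCHEDULER=slurm   # set to "sge" for SunGridEngine
N_JOBS=4

build_submit_command() {
    # $1 = sample identifier
    ID="$1"
    N="preprocessing__${ID}"
    CMD="some command we want to run on ${ID}"
    if [ "${SCHEDULER}" = "sge" ]; then
        echo "qsub -o logs/${N}.o -e logs/${N}.e -cwd -N ${N} -j y -pe threaded ${N_JOBS} \"${CMD}\""
    else
        echo "sbatch -J ${N} -N 1 -c ${N_JOBS} --ntasks-per-node=1 -o logs/${N}.o -e logs/${N}.e --export=ALL -t 12:00:00 --mem=20G --wrap=\"${CMD}\""
    fi
}

# The log directory should exist before jobs try to write into it
mkdir -p logs

for ID in SRR17458603 SRR17458606; do
    build_submit_command "${ID}"
done
```

Once the printed commands look right, they can be run directly or piped to a shell.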

Available walkthroughs:


Coming Soon:

  • Workflow for low-depth samples with no bins
  • Workflow for ASV detection from short-read amplicons
  • Workflows for integrating 3rd party software with VEBA:
    • Using EukHeist for eukaryotic binning followed by VEBA for mapping and annotation.
    • Using EukMetaSanity for modeling genes for eukaryotic genomes recovered with VEBA.

Notes:
  • The final output files are in the output subdirectory and, to avoid redundant files, many of these are symlinked from the intermediate directory. This can cause issues if you are using a "scratch" directory where files are deleted after a certain amount of time. If you have a crontab set up to keep files from expiring, make sure it also touches symlinks and not just regular files.
  • You'll need to adjust the memory and time for different jobs. Assembly will take much longer than preprocessing. Annotation will require more memory than mapping.
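For the scratch-directory note, the following is a minimal sketch of a refresh step that updates timestamps on both regular files and symlinks. It assumes a touch implementation that supports -h (GNU coreutils and BSD touch do) and a filesystem that stores symlink timestamps; the VEBA_DIR variable and the refresh_timestamps function are hypothetical names.

```shell
# Sketch: refresh timestamps of regular files AND symlinks under the project directory.
# VEBA_DIR and refresh_timestamps are illustrative names, not VEBA settings.
VEBA_DIR=${VEBA_DIR:-veba_output}

refresh_timestamps() {
    # touch -h updates the symlink itself instead of the file it points to
    find "$1" \( -type f -o -type l \) -exec touch -h {} +
}

if [ -d "${VEBA_DIR}" ]; then
    refresh_timestamps "${VEBA_DIR}"
fi
```

The same find command could be placed in a crontab entry. Whether touching files is permitted on a given scratch filesystem depends on the site's policy.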