A modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes

JCVenterInstitute/veba

 
 

 _    _ _______ ______  _______
  \  /  |______ |_____] |_____|
   \/   |______ |_____] |     |

What is VEBA?

VEBA (Viral Eukaryotic Bacterial Archaeal) is an open-source software suite designed from the start to support all domains of microorganisms (prokaryotic, eukaryotic, and viral) rather than adding them as post hoc adjustments. VEBA is an end-to-end metagenomics and bioprospecting software suite that can directly recover and analyze eukaryotic and viral genomes in addition to prokaryotic genomes, with native support for candidate phyla radiation (CPR). VEBA implements a novel iterative binning procedure and an optional hybrid sample-specific/multi-sample framework that recovers more genomes than non-iterative methods. To optimize microeukaryotic gene calling and taxonomic classification, VEBA includes a consensus microeukaryotic database containing protists and fungi compiled from several existing databases. VEBA also provides a unique clustering-based dereplication strategy that allows sample-specific genomes and proteins to be compared directly across non-overlapping biological samples. Finally, VEBA automates biosynthetic gene cluster identification and novelty scoring for bioprospecting.

VEBA's mission is to make robust (meta-)genomics/transcriptomics analysis effortless. The philosophy of VEBA is that workflows should be modular, generalizable, and easy to use with minimal intermediate steps. The approach implemented in VEBA is to (try to) think two steps ahead of what you may need to do and automate the task for you.

^__^


Citation

Espinoza JL, Dupont CL. VEBA: a modular end-to-end suite for in silico recovery, clustering, and analysis of prokaryotic, microeukaryotic, and viral genomes from metagenomes. BMC Bioinformatics. 2022 Oct 12;23(1):419. doi: 10.1186/s12859-022-04973-8. PMID: 36224545.

Please cite the software dependencies described under the Dependency Citation Table.

^__^


Announcements

  • What's new in VEBA v1.3.0?

  • VEBA Modules:

    • Added profile-pathway.py module and associated scripts for building HUMAnN databases from de novo genomes and annotations. Essentially, this is a reads-based functional profiling method via HUMAnN that uses binned genomes as the database.
    • Added marker_gene_clustering.py script, which identifies core marker proteins that are present in all genomes within a genome cluster (i.e., pangenome) and unique to that genome cluster. It clusters in either protein or nucleotide space.
    • Added module_completion_ratios.py script, which calculates KEGG module completion ratios for genomes and pangenomes. It is run automatically in the backend of annotate.py.
    • Updated annotate.py and merge_annotations.py to provide better annotations for clustered proteins.
    • Added merge_genome_quality.py and merge_taxonomy_classifications.py, which compile genome quality and taxonomy, respectively, for all organisms.
    • Added BGC clustering in protein and nucleotide space to biosynthetic.py. It also produces prevalence tables that can be used for further clustering of BGCs.
    • Added pangenome_core_sequences in cluster.py, which writes both protein and CDS sequences for each genome cluster.
    • Added PDF visualization of Newick trees in phylogeny.py.
  • VEBA Database (VDB_v5.2):

    • Added CAZy
    • Added MicrobeAnnotator-KEGG

Check out the VEBA Change Log for insight into what is being implemented in the upcoming version.
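
To make the "core and unique" criterion for marker proteins concrete, here is a minimal sketch of the set logic. It is not the implementation in marker_gene_clustering.py. The function name, the input dictionaries, and the identifiers (`SLC-1`, `PC-1`, etc.) are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def core_marker_clusters(genome_to_genome_cluster, protein_cluster_to_genomes):
    """Return {genome_cluster: set of protein clusters that are core and unique}.

    Hypothetical helper: both input mappings are assumed data structures,
    not VEBA file formats.
    """
    # Group genomes by the genome cluster (pangenome) they belong to
    members = defaultdict(set)
    for genome, genome_cluster in genome_to_genome_cluster.items():
        members[genome_cluster].add(genome)

    markers = {genome_cluster: set() for genome_cluster in members}
    for protein_cluster, genomes in protein_cluster_to_genomes.items():
        genomes = set(genomes)
        for genome_cluster, cluster_genomes in members.items():
            # Equality means the protein cluster is found in every genome of
            # this genome cluster (core) and in no genome outside it (unique)
            if genomes == cluster_genomes:
                markers[genome_cluster].add(protein_cluster)
    return markers


# Toy data (hypothetical identifiers)
genome_to_genome_cluster = {"g1": "SLC-1", "g2": "SLC-1", "g3": "SLC-2"}
protein_cluster_to_genomes = {
    "PC-1": {"g1", "g2"},        # core and unique to SLC-1
    "PC-2": {"g1"},              # not present in all SLC-1 genomes
    "PC-3": {"g1", "g2", "g3"},  # shared across genome clusters
    "PC-4": {"g3"},              # core and unique to SLC-2
}
markers = core_marker_clusters(genome_to_genome_cluster, protein_cluster_to_genomes)
```

In this toy example only `PC-1` and `PC-4` qualify as markers, because every other protein cluster is either missing from a member genome or also found outside the genome cluster.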

^__^


Installation and databases

Current Stable Version: v1.3.0

Current Database Version: VDB_v5.2

Please refer to the Installation and Database Configuration Guide for software installation and database configuration.

Docker containers are available for all modules via DockerHub, starting with v1.1.2.

^__^


Getting started with VEBA

See the Usage and Resource Requirements Guide for parameters and module descriptions.

See the Walkthrough Guides for tutorials and workflows on how to get started.

^__^


What does VEBA do?

Please refer to Modules for a description of all VEBA modules and their functionality.

If you wish VEBA did something that isn't implemented, please submit a [Feature Request Issue].

Schematic

^__^


Output structure

VEBA is built on the GenoPype architecture, which creates a reproducible and easy-to-navigate directory structure. GenoPype's philosophy is to use the same names for all files but to have sample names as subdirectories. This makes it easier to glob files for grepping, concatenating, etc. NextFlow support is in the works...

Example of GenoPype's layout:

# Project directory
project_directory/

# Temporary directory
project_directory/tmp/

# Log directory
project_directory/logs/
project_directory/logs/[step]__[program-name].e
project_directory/logs/[step]__[program-name].o
project_directory/logs/[step]__[program-name].returncode

# Checkpoint directory
project_directory/checkpoints/

# Intermediate directories for each step
project_directory/intermediate/
project_directory/intermediate/[step]__[program-name]/

# Output directory
project_directory/output/

# Commands
project_directory/commands.sh
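
Because every step writes its logs under the same naming pattern, a run can be checked programmatically. The sketch below scans `logs/` for `.returncode` files and reports steps with a non-zero code. It assumes each `.returncode` file contains a single integer exit code, which the layout above does not state. The function name `failed_steps` and the step names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def failed_steps(project_directory):
    """Return {"[step]__[program-name]": return code} for non-zero return codes."""
    failures = {}
    log_directory = Path(project_directory) / "logs"
    for path in sorted(log_directory.glob("*.returncode")):
        text = path.read_text().strip()
        # Assumption: the file holds a single integer exit code
        try:
            code = int(text)
        except ValueError:
            continue  # skip files that do not match the assumed format
        if code != 0:
            failures[path.stem] = code
    return failures


# Demo on a throwaway directory that mimics the layout above
with tempfile.TemporaryDirectory() as tmp:
    logs = Path(tmp) / "logs"
    logs.mkdir()
    (logs / "1__assembly.returncode").write_text("0\n")
    (logs / "2__alignment.returncode").write_text("1\n")
    (logs / "2__alignment.e").write_text("error text\n")
    result = failed_steps(tmp)

print(result)  # {'2__alignment': 1}
```

The matching `.e` and `.o` files for a failed step share the same `[step]__[program-name]` prefix, so the keys returned here can be used to locate them.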

VEBA creates all of the GenoPype directories shown above, but is built to hold multiple samples under the same project.

Example of VEBA's default directory layout:

ID="sample_1"

# Main output directory
veba_output/

# Assembly directory
veba_output/assembly/

# Assembly output for ${ID} sample
veba_output/assembly/${ID}/output/

# Prokaryotic binning for ${ID} sample
veba_output/binning/prokaryotic/${ID}/output/ 

# Eukaryotic binning
veba_output/binning/eukaryotic/${ID}/output/

# Viral binning
veba_output/binning/viral/${ID}/output/

The above are default output locations but they can be customized.
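
Since file names are shared across samples and only the sample subdirectory changes, per-sample outputs can be gathered with a single glob pattern. The following is a minimal sketch, assuming the default layout shown above. The function name `collect_sample_outputs` and the file name `results.tsv` are hypothetical placeholders, not actual VEBA output files.

```python
import tempfile
from pathlib import Path


def collect_sample_outputs(veba_output, module_subpath, filename):
    """Map each sample ID to the path of `filename` in that sample's output directory."""
    base = Path(veba_output) / module_subpath
    found = {}
    # Pattern mirrors the layout: <module>/<ID>/output/<filename>
    for path in sorted(base.glob(f"*/output/{filename}")):
        sample_id = path.parent.parent.name
        found[sample_id] = path
    return found


# Demo on a throwaway directory that mimics the default layout
with tempfile.TemporaryDirectory() as tmp:
    for sample_id in ("sample_1", "sample_2"):
        out = Path(tmp) / "veba_output" / "assembly" / sample_id / "output"
        out.mkdir(parents=True)
        (out / "results.tsv").write_text("demo\n")  # hypothetical file name
    outputs = collect_sample_outputs(Path(tmp) / "veba_output", "assembly", "results.tsv")
    sample_ids = sorted(outputs)

print(sample_ids)  # ['sample_1', 'sample_2']
```

The same call with `module_subpath="binning/prokaryotic"` would walk the prokaryotic binning outputs, provided the default locations have not been customized.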

^__^


Frequently Asked Questions

If the Frequently Asked Questions don't address your question, feel free to submit a [Question Issue].

^__^

