GitHub - davide-scognamiglio/MuSA: A Nextflow pipeline for deep, reproducible annotation and ranking of clinical genomic variants

Introduction

MuSA (Multi-Source variant Annotation) is an nf-core–compliant Nextflow pipeline for deep, reproducible annotation and clinical ranking of germline genomic variants. It automates the full workflow from database setup through clinical interpretation, and is designed for both routine diagnostic laboratories and research-oriented genomics settings.

MuSA orchestrates four parallel annotation branches — Ensembl VEP (with up to 22 curated plugins), a native dbNSFP distribution, the RENOVO machine-learning pathogenicity predictor, and vcf2maf — that are merged into a single richly annotated Mutation Annotation Format (MAF) file (up to ~920 columns per variant). An interactive HTML report is generated for each patient, tailored for clinical review with HPO-matched gene panels.

The pipeline supports two operating modes:

Basic mode — core clinical annotation (VEP, ANNOVAR, dbNSFP, ClinVar, gnomAD) with minimal storage footprint (~123 GB).
Extended mode — ultra-deep annotation additionally enabling 22 VEP plugins (AlphaMissense, CADD, SpliceAI, Enformer, and others) for comprehensive functional characterization (~224 GB total).

Prerequisites

Before running MuSA you need:

Nextflow ≥ 25.04.0 — curl -s https://get.nextflow.io | bash
Docker or Singularity/Apptainer — all tools run inside version-pinned containers; no manual software installation is needed.
ANNOVAR — MuSA requires ANNOVAR for RENOVO scoring. Users must independently obtain a license and download ANNOVAR from the official source. Set --annovar_software_dir to the directory containing table_annovar.pl.
Disk space — ~123 GB for basic mode, ~224 GB for extended mode (see The database manifest system for a full breakdown).

Important

All annotation data are downloaded by the setup workflow into the directory you specify with --data_dir. This same directory must be provided to the annotate workflow. It is mounted read-only at /data inside every container.

Note

dbNSFP licensing. The pipeline downloads the dbNSFP academic distribution. This resource is restricted to non-commercial use. Ensure your use case complies with the dbNSFP license terms before running the setup workflow.

Step 1 — Setup

The setup workflow downloads and configures all annotation databases and reference files. It must be run once before the first annotation. Two sub-modes are available depending on whether you plan to use VEP plugins.

Basic setup (core resources only)

Downloads VEP cache, dbNSFP, ANNOVAR databases, and the reference genome (~123 GB total):

nextflow run main.nf \
    -profile docker \
    --workflow setup \
    --data_dir /path/to/your/data_dir

Extended setup (core resources + VEP plugins)

Additionally downloads all 22 VEP plugin data files (~224 GB total):

nextflow run main.nf \
    -profile docker \
    --workflow setup \
    --download_vep_plugins true \
    --data_dir /path/to/your/data_dir

Replace -profile docker with -profile singularity if using Singularity/Apptainer.

The setup workflow produces an HTML summary report listing every downloaded resource, its version, source URL, and SHA-256 checksum (see The database manifest system for details).

Step 2 — Test run

A minimal test dataset (downsampled NA12878) is bundled with the pipeline. Running it confirms that your container engine, ANNOVAR installation, and data directory are correctly configured:

nextflow run main.nf \
    -profile test,docker \
    --data_dir /path/to/your/data_dir \
    --annovar_software_dir /path/to/annovar \
    --outdir output_test

The test profile sets --workflow annotate, --vcf_format multicaller, and --use_vep_plugins false. Expected runtime on a modern workstation: a few minutes.

Step 3 — Annotate your data

3a. Prepare your samplesheet

Create a CSV file with the following columns:

patient,sample_type,sample_file,hpo
PATIENT_01,blood,/path/to/PATIENT_01.vcf.gz,HP:0001250;HP:0002121
PATIENT_02,saliva,/path/to/PATIENT_02.vcf.gz,

Column	Required	Description
`patient`	✅	Unique patient identifier. Used as the prefix for all output files.
`sample_type`	✅	Tissue of origin (e.g. `blood`, `saliva`). Informational only.
`sample_file`	✅	Absolute path to the input VCF or VCF.gz file.
`hpo`	⬜	Semicolon-separated HPO codes (e.g. `HP:0001250;HP:0002121`). Used for HPO-based gene panel retrieval when `--offline false`. Leave empty if not available.
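
Before launching a run it can help to check the samplesheet against the column rules above. The following is a minimal sketch, not the pipeline's own validation: the function name `check_samplesheet` and the specific checks (absolute path, `.vcf`/`.vcf.gz` suffix, `HP:` followed by seven digits) are assumptions derived from the table.

```python
import csv
import io
import re

REQUIRED_COLUMNS = ["patient", "sample_type", "sample_file", "hpo"]
HPO_TERM = re.compile(r"^HP:\d{7}$")  # HPO identifiers: "HP:" plus seven digits


def check_samplesheet(handle):
    """Return a list of human-readable problems found in a samplesheet.

    Hypothetical helper: MuSA may validate differently.
    """
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing column(s): {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        patient = (row["patient"] or "").strip()
        if not patient:
            problems.append(f"line {line_no}: empty patient")
        elif patient in seen:
            problems.append(f"line {line_no}: duplicate patient {patient}")
        seen.add(patient)

        if not (row["sample_type"] or "").strip():
            problems.append(f"line {line_no}: empty sample_type")

        sample_file = (row["sample_file"] or "").strip()
        if not sample_file.startswith("/"):
            problems.append(f"line {line_no}: sample_file is not an absolute path")
        elif not sample_file.endswith((".vcf", ".vcf.gz")):
            problems.append(f"line {line_no}: sample_file is not .vcf or .vcf.gz")

        # hpo is optional; when present it is a semicolon-separated list
        for term in filter(None, (row["hpo"] or "").split(";")):
            if not HPO_TERM.match(term.strip()):
                problems.append(f"line {line_no}: malformed HPO code {term!r}")
    return problems


example = (
    "patient,sample_type,sample_file,hpo\n"
    "PATIENT_01,blood,/path/to/PATIENT_01.vcf.gz,HP:0001250;HP:0002121\n"
    "PATIENT_02,saliva,/path/to/PATIENT_02.vcf.gz,\n"
)
print(check_samplesheet(io.StringIO(example)) or "samplesheet looks fine")
```

The sketch only inspects the text of the samplesheet; it does not check that the VCF files exist.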

3b. Run the annotate workflow

Offline mode (default — no external API calls, uses local databases only):

nextflow run main.nf \
    -profile docker \
    --workflow annotate \
    --input samplesheet.csv \
    --outdir results \
    --data_dir /path/to/your/data_dir \
    --annovar_software_dir /path/to/annovar \
    --vcf_format multicaller \
    --use_vep_plugins false

Online mode (enables GeneBe ACMG/AMP scoring and HPO-based gene panel retrieval):

nextflow run main.nf \
    -profile docker \
    --workflow annotate \
    --input samplesheet.csv \
    --outdir results \
    --data_dir /path/to/your/data_dir \
    --annovar_software_dir /path/to/annovar \
    --vcf_format multicaller \
    --use_vep_plugins false \
    --offline false \
    --gb_user your_genebe_username \
    --gb_api_key your_genebe_api_key

A free GeneBe account can be created at genebe.net.

Extended annotation (all 22 VEP plugins — requires prior extended setup):

nextflow run main.nf \
    -profile docker \
    --workflow annotate \
    --input samplesheet.csv \
    --outdir results \
    --data_dir /path/to/your/data_dir \
    --annovar_software_dir /path/to/annovar \
    --vcf_format multicaller \
    --use_vep_plugins true \
    --n_core 16 \
    --offline false \
    --gb_user your_genebe_username \
    --gb_api_key your_genebe_api_key
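
When several cohorts are annotated with the same settings, the three invocations above can be generated from one place. The sketch below assembles the argument list in Python; the helper name `musa_annotate_command` and its keyword arguments are hypothetical, and only flags shown in the commands above are emitted.

```python
import shlex


def musa_annotate_command(samplesheet, outdir, data_dir, annovar_dir,
                          profile="docker", use_vep_plugins=False,
                          n_core=None, genebe=None):
    """Build the argv list for an annotate run (hypothetical helper).

    `genebe` is an optional (username, api_key) tuple; supplying it
    switches the run to online mode (--offline false).
    """
    argv = [
        "nextflow", "run", "main.nf",
        "-profile", profile,
        "--workflow", "annotate",
        "--input", samplesheet,
        "--outdir", outdir,
        "--data_dir", data_dir,
        "--annovar_software_dir", annovar_dir,
        "--vcf_format", "multicaller",
        "--use_vep_plugins", "true" if use_vep_plugins else "false",
    ]
    if n_core is not None:
        argv += ["--n_core", str(n_core)]
    if genebe is not None:
        user, api_key = genebe
        argv += ["--offline", "false", "--gb_user", user, "--gb_api_key", api_key]
    return argv


offline_cmd = musa_annotate_command(
    "samplesheet.csv", "results", "/path/to/your/data_dir", "/path/to/annovar"
)
# Print a copy-pasteable shell line; pass the list to subprocess.run to execute it.
print(shlex.join(offline_cmd))
```

Returning a list rather than a string keeps paths with spaces intact when the command is later handed to `subprocess.run`.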

Parameters reference

Core pipeline parameters

Parameter	Default	Description
`--workflow`	`annotate`	Workflow to run: `annotate` or `setup`.
`--build`	`hg38`	Reference genome build. Only `hg38` is currently supported.
`--input`	—	Path to the samplesheet CSV file. Required for `annotate`.
`--outdir`	—	Output directory. Required for `annotate`.
`--data_dir`	—	Path to the directory containing annotation databases (created by `setup`). Required for both workflows.
`--annovar_software_dir`	—	Path to the ANNOVAR software directory (`table_annovar.pl` must be present). Required for `annotate`.
`--vcf_format`	—	Input VCF format. Supported values: `multicaller` (any standard multi-sample VCF), `sarek` (applies GATK hard filtering).

Annotation control

Parameter	Default	Description
`--offline`	`true`	If `true`, all external API calls (GeneBe, HPO) are skipped. Fully local annotation.
`--use_vep_plugins`	`false`	Enable the full suite of 22 VEP plugins. Requires a prior `--download_vep_plugins true` setup.
`--n_core`	`8`	Number of CPU cores passed to VEP (`--fork`).
`--skip_bcftools`	`false`	Skip bcftools-based VCF normalization and filtering. Use only if your VCFs are already fully normalized.

Filtering parameters

Parameter	Default	Description
`--max_freq`	`null`	Maximum population allele frequency (e.g. `0.01`). Variants with `MAX_AF` above this threshold are excluded from the filtered MAF. If `null`, no frequency filter is applied.
`--drop_benign`	`false`	If `true`, variants classified as Benign or Likely Benign in ClinVar are excluded from the filtered MAF.
`--panel`	`null`	Name (without extension) of a gene panel CSV file located at `{data_dir}/panels/{panel}.csv`. The file must have gene symbols in its first column. When set, only variants in panel genes are retained in the filtered MAF. Can be combined with HPO-based filtering.
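
The combined effect of the three filters can be pictured as a per-variant predicate. This is an illustrative sketch, not MuSA's implementation: `MAX_AF` and `Hugo_Symbol` are column names from the MAF output, while the plain-text `clinvar_class` key is a hypothetical stand-in for the pipeline's encoded ClinVar column, and keeping variants without a frequency value is an assumption.

```python
def keep_variant(row, max_freq=None, drop_benign=False, panel_genes=None):
    """Return True if a variant (dict of column -> value) passes all active filters."""
    # --max_freq: drop variants whose MAX_AF exceeds the threshold
    if max_freq is not None:
        try:
            if float(row.get("MAX_AF")) > max_freq:
                return False
        except (TypeError, ValueError):
            pass  # assumption: variants with no frequency value are kept

    # --drop_benign: drop ClinVar Benign / Likely Benign
    # ("clinvar_class" is a hypothetical plain-text column used for illustration)
    if drop_benign:
        label = (row.get("clinvar_class") or "").strip().lower()
        if label in {"benign", "likely benign"}:
            return False

    # --panel: keep only genes listed in the panel
    if panel_genes is not None and row.get("Hugo_Symbol") not in panel_genes:
        return False
    return True


# Toy rows, invented for the example
variants = [
    {"Hugo_Symbol": "SCN1A", "MAX_AF": "0.0001", "clinvar_class": "Pathogenic"},
    {"Hugo_Symbol": "SCN1A", "MAX_AF": "0.2", "clinvar_class": ""},
    {"Hugo_Symbol": "TTN", "MAX_AF": "", "clinvar_class": "Likely benign"},
]
kept = [v for v in variants
        if keep_variant(v, max_freq=0.01, drop_benign=True, panel_genes={"SCN1A"})]
print(len(kept), "variant(s) pass")
```

With every argument left at its default, the predicate accepts all rows, which mirrors the behaviour of the default parameter values in the table.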

GeneBe / online mode parameters

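These parameters apply only when --offline false. The GeneBe credentials are needed for GeneBe scoring; the proxy settings and --skip_genebe are optional.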

Parameter	Default	Description
`--gb_user`	—	GeneBe account username.
`--gb_api_key`	—	GeneBe API key.
`--http_proxy`	`null`	HTTP proxy address, if required by your network.
`--https_proxy`	`null`	HTTPS proxy address, if required by your network.
`--skip_genebe`	`false`	Skip GeneBe annotation even when `--offline false`. Useful if the API is unavailable or rate-limited.

Setup workflow parameters

Parameter	Default	Description
`--download_vep_plugins`	`false`	Also download VEP plugin data files during setup (adds ~100 GB). Required before using `--use_vep_plugins true`.
`--dbs_manifest`	GitHub URL	URL or path to the database manifest YAML. Override to pin a specific version or use a local copy. See The database manifest system.

Miscellaneous

Parameter	Default	Description
`--center`	`null`	Sequencing centre identifier added to MAF output files.
`--email`	`null`	Email address to receive a completion notification.

The database manifest system

What it is

All annotation databases downloaded during setup are managed through a central YAML manifest file (dbs_manifest.yaml). By default, this file is fetched from the MuSA test-datasets GitHub repository. Each entry in the manifest describes a single downloadable resource:

grch38:
  vep_cache:
    dbname: "homo_sapiens_vep_cache"
    version: 115
    out: "homo_sapiens_vep_115_GRCh38.tar.gz"
    url: "https://ftp.ensembl.org/pub/release-115/variation/indexed_vep_cache/homo_sapiens_vep_115_GRCh38.tar.gz"
    method: "curl"
    expected_sha256: "5871a5e34527cce76f7ac75399c33e306bf3be8083c63fe7d3459c56338f98de"
    computed_sha256: ""

Fields:

dbname — human-readable name of the resource
version — version of the database
url — canonical download source
method — download tool (curl or wget)
expected_sha256 — cryptographic checksum of the expected file
computed_sha256 — filled in at download time by the pipeline

How it works

When a setup module runs, it:

Parses the manifest to extract the URL and method for its resource
Downloads the file using the specified tool
Computes the SHA-256 checksum of the downloaded file
Writes the computed checksum back into the manifest under computed_sha256
Emits the updated manifest as a Nextflow channel item

All per-module manifests are collected and merged by the MERGE_YAML module, producing a single final manifest that records the exact state of every downloaded resource. This merged manifest is then used to generate the HTML setup report.
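
The checksum and merge steps described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code of the setup modules or of MERGE_YAML: the manifest is represented as the nested dictionary a YAML parser would return, and the function names are hypothetical.

```python
import copy
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large downloads are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(manifest, build, resource, path):
    """Return a copy of the manifest with computed_sha256 filled in for one resource."""
    updated = copy.deepcopy(manifest)
    updated[build][resource]["computed_sha256"] = sha256_of(path)
    return updated


def merge_manifests(manifests):
    """Merge per-module manifests, keeping every non-empty computed_sha256."""
    merged = {}
    for manifest in manifests:
        for build, resources in manifest.items():
            for name, entry in resources.items():
                target = merged.setdefault(build, {}).setdefault(name, dict(entry))
                if entry.get("computed_sha256"):
                    target["computed_sha256"] = entry["computed_sha256"]
    return merged


# Toy manifest entry and a throwaway file standing in for a real download
template = {"grch38": {"vep_cache": {"version": 115, "expected_sha256": "",
                                     "computed_sha256": ""}}}
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_file = Path(tmp_dir) / "toy.tar.gz"
    toy_file.write_bytes(b"example payload")
    per_module = record_checksum(template, "grch38", "vep_cache", toy_file)

merged = merge_manifests([template, per_module])
print(merged["grch38"]["vep_cache"]["computed_sha256"])
```

`record_checksum` works on a copy so the input manifest stays unchanged, which keeps each module's output independent of the others until the merge.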

Why this matters for reproducibility and auditability

The manifest system provides several guarantees that ad-hoc download scripts cannot:

Version pinning. Every resource is identified by its exact URL and version number, not just a "latest" download. The versions used in any given run are fully captured.
Integrity verification. The expected SHA-256 checksum is stored alongside the URL. Because the pipeline computes and records the actual checksum at download time, any file corruption or unexpected upstream change is immediately detectable by comparing expected_sha256 and computed_sha256.
Audit trail. The final merged manifest (and the HTML setup report derived from it) constitutes a machine-readable record of exactly which database versions, from which sources, were used to produce a given set of annotations. This is directly relevant to clinical and regulatory auditability.
Customisation. Users can supply their own manifest via --dbs_manifest to pin alternative database versions, use institutional mirrors, or add new resources — without modifying any pipeline code.
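
The integrity comparison mentioned above is simple to automate once the merged manifest has been loaded. The sketch below assumes the manifest has already been parsed into a nested dictionary (for example with a YAML library); `checksum_issues` is a hypothetical helper, and every resource other than `vep_cache` is invented for the example.

```python
def checksum_issues(manifest):
    """List (build, resource, status) for entries whose checksums do not agree."""
    issues = []
    for build, resources in manifest.items():
        for name, entry in resources.items():
            expected = (entry.get("expected_sha256") or "").lower()
            computed = (entry.get("computed_sha256") or "").lower()
            if not computed:
                issues.append((build, name, "no computed checksum"))
            elif expected and computed != expected:
                issues.append((build, name, "mismatch"))
    return issues


vep_sha = "5871a5e34527cce76f7ac75399c33e306bf3be8083c63fe7d3459c56338f98de"
toy_manifest = {
    "grch38": {
        "vep_cache": {"expected_sha256": vep_sha, "computed_sha256": vep_sha},
        # The two entries below are made up to show the failure cases
        "toy_corrupted": {"expected_sha256": "aa" * 32, "computed_sha256": "bb" * 32},
        "toy_pending": {"expected_sha256": "cc" * 32, "computed_sha256": ""},
    }
}
for build, name, status in checksum_issues(toy_manifest):
    print(f"{build}/{name}: {status}")
```

An empty result means every downloaded file matches the checksum recorded in the manifest.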

Resource footprint

Mode	Resources included	Approximate size
Basic	VEP cache, ANNOVAR databases, dbNSFP, reference genome	~123 GB
Extended (adds)	AlphaMissense, AncestralAllele, CADD, ClinPred, dbscSNV, Enformer, EVE, GWAS, MaveDB, MaxEntScan, mutfunc, PhenotypeOrthologous, pLI, ReferenceQuality, SpliceVault, UTRAnnotator	+~100 GB
Total (extended)		~224 GB
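
A quick free-space check before starting the setup workflow avoids a failed multi-hour download. This sketch uses the approximate totals from the table and treats 1 GB as 10^9 bytes, which is an assumption; the helper name `has_room` is hypothetical.

```python
import shutil

# Approximate footprints quoted in the table above, in GB
FOOTPRINT_GB = {"basic": 123, "extended": 224}


def has_room(free_bytes, mode="basic"):
    """Return True if free_bytes covers the approximate footprint of the given mode."""
    return free_bytes >= FOOTPRINT_GB[mode] * 10**9  # assumption: 1 GB = 10^9 bytes


# Check the current directory; point this at your --data_dir location instead
free = shutil.disk_usage(".").free
for mode in FOOTPRINT_GB:
    verdict = "enough" if has_room(free, mode) else "NOT enough"
    print(f"{mode}: {verdict} space ({free / 10**9:.0f} GB free)")
```

The figures are approximate, so leaving some headroom beyond the listed totals is prudent.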

Pipeline output

Directory structure

For each patient in the samplesheet, MuSA writes results to {outdir}/{YYMMDD}/{patient}/:

results/
└── 260507/
    └── PATIENT_01/
        ├── PATIENT_01_maf_dashboard.html   # Interactive HTML report
        ├── PATIENT_01.filtered.maf          # Prioritized variants (panel/frequency/benign filters applied)
        └── PATIENT_01.raw.maf               # All annotated variants (no filters)
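
Downstream scripts can derive these paths from the output directory, the date folder, and the patient identifier. The sketch below is a convenience helper, not part of MuSA; the name `patient_outputs` is hypothetical, and which date the pipeline uses for the folder name (launch or completion) is not stated here, so the date is passed in explicitly.

```python
from datetime import date
from pathlib import PurePosixPath


def patient_outputs(outdir, patient, run_date):
    """Return the expected per-patient output paths for a given run date."""
    base = PurePosixPath(outdir) / run_date.strftime("%y%m%d") / patient  # YYMMDD
    return {
        "dashboard": base / f"{patient}_maf_dashboard.html",
        "filtered_maf": base / f"{patient}.filtered.maf",
        "raw_maf": base / f"{patient}.raw.maf",
    }


paths = patient_outputs("results", "PATIENT_01", date(2026, 5, 7))
for label, path in paths.items():
    print(f"{label}: {path}")
```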

Pipeline-level execution reports are written to {outdir}/:

results/
├── timeline.txt      # Nextflow execution timeline
├── trace.txt         # Per-process resource usage trace
├── report.html       # Nextflow HTML execution report
└── pipeline_dag.html # Pipeline directed acyclic graph

Annotated MAF files

MuSA uses MAF (Mutation Annotation Format) as its primary output rather than VCF. This deliberate design choice enforces a one-row-per-variant representation with canonical MANE transcript selection, reducing the transcript-level redundancy that characterises VCF-based VEP output and making downstream analysis directly tractable.

The raw MAF (*.raw.maf) contains all annotated variants and up to ~920 columns organised into thematic groups:

Category	Source(s)	Example columns
Variant coordinates & consequence	vcf2maf, VEP	`Hugo_Symbol`, `Chromosome`, `Start_Position`, `Consequence`, `HGVSc`, `HGVSp_VEP`, `genome_change`, `ref_context`
Population frequencies	dbNSFP, ANNOVAR	`MAX_AF`, `MAX_AF_POPS`, gnomAD exome/genome AF, TOPMed, All of Us, RegeneronME AF
Pathogenicity predictions	dbNSFP	REVEL, MetaSVM, MetaLR, MetaRNN, BayesDel, SIFT, PolyPhen-2, AlphaMissense score/class, VARITY
VEP plugin scores	VEP (extended)	AlphaMissense, CADD_PHRED, ClinPred_score, EVE_score, SpliceVault, MaxEntScan, Enformer, mutfunc
Splicing	dbNSFP, VEP	dbscSNV Ada/RF, SpliceRegion, MaxEntScan
Clinical evidence	ANNOVAR, VEP, GeneBe	`encoded_CLNSIG`, `encoded_CLNREVSTAT`, `clinvar_trait`, `clinvar_id`, `acmg_criteria`, `acmg_score`
Disease & gene ontology	dbNSFP	OMIM, Orphanet, GenCC, ClinGen dosage sensitivity, HPO, GWAS
RENOVO	RENOVO	`PL_score` (pathogenicity likelihood ∈ [0,1] for missense VUS)
MuSA classification	Pipeline	`renovo_adj_acmg_score` (ACMG score adjusted by RENOVO for missense variants)
Variant quality	VCF INFO/FORMAT	`bioinfo_params`, `FORMAT_FIELDS`, `FORMAT_VALUES`
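
Because the raw MAF can be very wide, it is often convenient to load only the columns of interest. The following sketch reads a tab-delimited MAF with the Python standard library. It assumes the file may start with `#` comment lines (vcf2maf typically writes a `#version` line), which may or may not hold for MuSA output; the function name `read_maf` and the toy record are invented for the example.

```python
import csv
import io


def read_maf(handle, columns=None):
    """Yield one dict per variant from a tab-delimited MAF, skipping '#' lines.

    If `columns` is given, only those columns are returned (missing ones as "").
    """
    data_lines = (line for line in handle if not line.startswith("#"))
    # QUOTE_NONE: treat quote characters inside annotation fields as plain text
    reader = csv.DictReader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if columns:
            yield {name: row.get(name, "") for name in columns}
        else:
            yield row


# Toy content standing in for PATIENT_01.raw.maf
toy_maf = (
    "#version 2.4\n"
    "Hugo_Symbol\tChromosome\tStart_Position\tConsequence\n"
    "SCN1A\tchr2\t166000000\tmissense_variant\n"
)
rows = list(read_maf(io.StringIO(toy_maf), columns=["Hugo_Symbol", "Consequence"]))
print(rows)
```

For a real file, replace the `io.StringIO` object with `open(path, newline="")`.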

The filtered MAF (*.filtered.maf) is a subset of the raw MAF after applying the configured gene panel, allele frequency, and benign-variant filters. When no filters are active, the filtered MAF equals the raw MAF.

Interactive HTML report

Each patient's *_maf_dashboard.html is a self-contained report that requires no server and opens directly in any web browser.

It includes:

Summary panel — patient identifier, total variant count, and counts of ClinVar pathogenic and VUS variants
Variant browser — a sortable, filterable, paginated table with expandable rows. Main columns include gene symbol, cDNA/protein change, ClinVar class, MuSA class (ACMG score adjusted by RENOVO), population allele frequency, and PubMed links. Expandable child rows show additional fields (gDNA change, reference context, variant quality, OMIM/ClinVar IDs, ACMG criteria, orthologous phenotype data)

Note

Screenshots of the annotation report and the setup report will be added here. Place them in docs/images/ and reference them below.

Credits

MuSA was written by D. Scognamiglio and E. Bonetti at IRCCS Istituto Ortopedico Rizzoli, Bologna, Italy.

We thank the nf-core community for providing the framework and best practices that guided MuSA's development.

Citations

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

The MuSA publication can be cited as:

Publication pending
