MuSA (Multi-Source variant Annotation) is an nf-core–compliant Nextflow pipeline for deep, reproducible annotation and clinical ranking of germline genomic variants. It automates the full workflow from database setup through clinical interpretation, and is designed for both routine diagnostic laboratories and research-oriented genomics settings.
MuSA orchestrates four parallel annotation branches — Ensembl VEP (with up to 22 curated plugins), a native dbNSFP distribution, the RENOVO machine-learning pathogenicity predictor, and vcf2maf — that are merged into a single richly annotated Mutation Annotation Format (MAF) file (up to ~920 columns per variant). An interactive HTML report is generated for each patient, tailored for clinical review with HPO-matched gene panels.
The pipeline supports two operating modes:
- Basic mode — core clinical annotation (VEP, ANNOVAR, dbNSFP, ClinVar, gnomAD) with minimal storage footprint (~123 GB).
- Extended mode — ultra-deep annotation additionally enabling 22 VEP plugins (AlphaMissense, CADD, SpliceAI, Enformer, and others) for comprehensive functional characterization (~224 GB total).

Contents:

- Prerequisites
- Step 1 — Setup
- Step 2 — Test run
- Step 3 — Annotate your data
- Parameters reference
- The database manifest system
- Pipeline output
- Credits
- Citations
Before running MuSA you need:
- Nextflow ≥ 25.04.0, which can be installed with `curl -s https://get.nextflow.io | bash`.
- Docker or Singularity/Apptainer: all tools run inside version-pinned containers; no manual software installation is needed.
- ANNOVAR: MuSA requires ANNOVAR for RENOVO scoring. Users must independently obtain a license and download ANNOVAR from the official source. Set `--annovar_software_dir` to the directory containing `table_annovar.pl`.
- Disk space: ~123 GB for basic mode, ~224 GB for extended mode (see The database manifest system for a full breakdown).
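
The ANNOVAR and disk-space prerequisites can be checked before anything is downloaded. The following is a minimal sketch and not part of MuSA: the function names `has_annovar`, `free_gb`, and `has_space` are hypothetical, and the 123 GB figure simply restates the approximate basic-mode footprint given above.

```bash
# Hypothetical pre-flight helpers; not part of MuSA itself.

# Succeeds if the given directory contains table_annovar.pl.
has_annovar() {
    [ -f "$1/table_annovar.pl" ]
}

# Prints the free space, in whole GB, of the filesystem holding the given path.
free_gb() {
    # df -Pk: POSIX output format in 1024-byte blocks; column 4 is "Available".
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeeds if the path has at least the requested number of GB free.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example: compare the current directory with the basic-mode figure (~123 GB).
if has_space . 123; then
    echo "enough free space for basic mode"
else
    echo "less than 123 GB free"
fi
```

In practice the two checks would be pointed at the directories later passed as `--annovar_software_dir` and `--data_dir`.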
Important: All annotation data are downloaded by the setup workflow into the directory you specify with --data_dir. This same directory must be provided to the annotate workflow. It is mounted read-only at /data inside every container.
Note: dbNSFP licensing. The pipeline downloads the dbNSFP academic distribution. This resource is restricted to non-commercial use. Ensure your use case complies with the dbNSFP license terms before running the setup workflow.
The setup workflow downloads and configures all annotation databases and reference files. It must be run once before the first annotation. Two sub-modes are available depending on whether you plan to use VEP plugins.
Basic setup downloads the VEP cache, dbNSFP, ANNOVAR databases, and the reference genome (~123 GB total):

    nextflow run main.nf \
      -profile docker \
      --workflow setup \
      --data_dir /path/to/your/data_dir

Extended setup additionally downloads all 22 VEP plugin data files (~224 GB total):

    nextflow run main.nf \
      -profile docker \
      --workflow setup \
      --download_vep_plugins true \
      --data_dir /path/to/your/data_dir

Replace -profile docker with -profile singularity if using Singularity/Apptainer.
The setup workflow produces an HTML summary report listing every downloaded resource, its version, source URL, and SHA-256 checksum (see The database manifest system for details).
A minimal test dataset (downsampled NA12878) is bundled with the pipeline. Running it confirms that your container engine, ANNOVAR installation, and data directory are correctly configured:

    nextflow run main.nf \
      -profile test,docker \
      --data_dir /path/to/your/data_dir \
      --annovar_software_dir /path/to/annovar \
      --outdir output_test

The test profile sets --workflow annotate, --vcf_format multicaller, and --use_vep_plugins false. Expected runtime on a modern workstation: a few minutes.
Create a CSV file with the following columns:

    patient,sample_type,sample_file,hpo
    PATIENT_01,blood,/path/to/PATIENT_01.vcf.gz,HP:0001250;HP:0002121
    PATIENT_02,saliva,/path/to/PATIENT_02.vcf.gz,

| Column | Required | Description |
|---|---|---|
| `patient` | ✅ | Unique patient identifier. Used as the prefix for all output files. |
| `sample_type` | ✅ | Tissue of origin (e.g. blood, saliva). Informational only. |
| `sample_file` | ✅ | Absolute path to the input VCF or VCF.gz file. |
| `hpo` | ⬜ | Semicolon-separated HPO codes (e.g. HP:0001250;HP:0002121). Used for HPO-based gene panel retrieval when --offline false. Leave empty if not available. |
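
A samplesheet can be sanity-checked against the column rules above before a run is launched. The sketch below is illustrative only: `check_samplesheet` is a hypothetical helper, not something MuSA provides, and it assumes a plain comma-separated file with no quoted fields.

```bash
# Hypothetical samplesheet check; the rules mirror the column table above.
# Prints one message per problem and returns non-zero if any were found.
# Usage: check_samplesheet samplesheet.csv
check_samplesheet() {
    awk -F',' '
        NR == 1 {
            if ($0 != "patient,sample_type,sample_file,hpo") {
                print "line 1: unexpected header"; bad = 1
            }
            next
        }
        $0 == "" { next }    # ignore blank lines
        NF != 4  { print "line " NR ": expected 4 fields, found " NF; bad = 1; next }
        $1 == "" || $2 == "" || $3 == "" {
            print "line " NR ": patient, sample_type and sample_file are required"; bad = 1
        }
        $1 in seen { print "line " NR ": duplicate patient " $1; bad = 1 }
        { seen[$1] = 1 }
        $3 != "" && $3 !~ /^\// {
            print "line " NR ": sample_file must be an absolute path"; bad = 1
        }
        # HPO codes are matched loosely as HP: followed by digits.
        $4 != "" && $4 !~ /^HP:[0-9]+(;HP:[0-9]+)*$/ {
            print "line " NR ": malformed HPO list"; bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}
```

A typical use would be `check_samplesheet samplesheet.csv && nextflow run main.nf ...`, so that the pipeline only starts when the file passes.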
Offline mode (default — no external API calls, uses local databases only):

    nextflow run main.nf \
      -profile docker \
      --workflow annotate \
      --input samplesheet.csv \
      --outdir results \
      --data_dir /path/to/your/data_dir \
      --annovar_software_dir /path/to/annovar \
      --vcf_format multicaller \
      --use_vep_plugins false

Online mode (enables GeneBe ACMG/AMP scoring and HPO-based gene panel retrieval):

    nextflow run main.nf \
      -profile docker \
      --workflow annotate \
      --input samplesheet.csv \
      --outdir results \
      --data_dir /path/to/your/data_dir \
      --annovar_software_dir /path/to/annovar \
      --vcf_format multicaller \
      --use_vep_plugins false \
      --offline false \
      --gb_user your_genebe_username \
      --gb_api_key your_genebe_api_key

A free GeneBe account can be created at genebe.net.
Extended annotation (all 22 VEP plugins — requires prior extended setup):

    nextflow run main.nf \
      -profile docker \
      --workflow annotate \
      --input samplesheet.csv \
      --outdir results \
      --data_dir /path/to/your/data_dir \
      --annovar_software_dir /path/to/annovar \
      --vcf_format multicaller \
      --use_vep_plugins true \
      --n_core 16 \
      --offline false \
      --gb_user your_genebe_username \
      --gb_api_key your_genebe_api_key

| Parameter | Default | Description |
|---|---|---|
| `--workflow` | `annotate` | Workflow to run: annotate or setup. |
| `--build` | `hg38` | Reference genome build. Only hg38 is currently supported. |
| `--input` | - | Path to the samplesheet CSV file. Required for annotate. |
| `--outdir` | - | Output directory. Required for annotate. |
| `--data_dir` | - | Path to the directory containing annotation databases (created by setup). Required for both workflows. |
| `--annovar_software_dir` | - | Path to the ANNOVAR software directory (table_annovar.pl must be present). Required for annotate. |
| `--vcf_format` | - | Input VCF format. Supported values: multicaller (any standard multi-sample VCF), sarek (applies GATK hard filtering). |

| Parameter | Default | Description |
|---|---|---|
| `--offline` | `true` | If true, all external API calls (GeneBe, HPO) are skipped. Fully local annotation. |
| `--use_vep_plugins` | `false` | Enable the full suite of 22 VEP plugins. Requires a prior --download_vep_plugins true setup. |
| `--n_core` | `8` | Number of CPU cores passed to VEP (--fork). |
| `--skip_bcftools` | `false` | Skip bcftools-based VCF normalization and filtering. Use only if your VCFs are already fully normalized. |

| Parameter | Default | Description |
|---|---|---|
| `--max_freq` | `null` | Maximum population allele frequency (e.g. 0.01). Variants with MAX_AF above this threshold are excluded from the filtered MAF. If null, no frequency filter is applied. |
| `--drop_benign` | `false` | If true, variants classified as Benign or Likely Benign in ClinVar are excluded from the filtered MAF. |
| `--panel` | `null` | Name (without extension) of a gene panel CSV file located at {data_dir}/panels/{panel}.csv. The file must have gene symbols in its first column. When set, only variants in panel genes are retained in the filtered MAF. Can be combined with HPO-based filtering. |
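
To make the frequency threshold concrete, the sketch below applies a MAX_AF cut-off to a tab-separated MAF outside the pipeline. It only illustrates the semantics described for --max_freq and is not MuSA's own filtering code. The helper name `filter_max_af` is hypothetical, and keeping variants whose MAX_AF is empty or non-numeric is an assumption rather than documented behaviour.

```bash
# Hypothetical illustration of a MAX_AF cut-off on a tab-separated MAF.
# Usage: filter_max_af FILE THRESHOLD
filter_max_af() {
    awk -F'\t' -v max="$2" '
        /^#/ { print; next }    # pass comment lines through unchanged
        !col {                  # first non-comment line is the header
            for (i = 1; i <= NF; i++) if ($i == "MAX_AF") col = i
            if (!col) { print "MAX_AF column not found" > "/dev/stderr"; exit 1 }
            print; next
        }
        # Assumption: variants with an empty MAX_AF are kept; non-numeric
        # placeholders evaluate to 0 and are therefore kept as well.
        $col == "" || $col + 0 <= max + 0 { print }
    ' "$1"
}
```

For example, `filter_max_af PATIENT_01.raw.maf 0.01` keeps the header and every variant whose MAX_AF does not exceed 0.01.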

The parameters below apply when --offline false; the GeneBe credentials are required in that mode.

| Parameter | Default | Description |
|---|---|---|
| `--gb_user` | - | GeneBe account username. |
| `--gb_api_key` | - | GeneBe API key. |
| `--http_proxy` | `null` | HTTP proxy address, if required by your network. |
| `--https_proxy` | `null` | HTTPS proxy address, if required by your network. |
| `--skip_genebe` | `false` | Skip GeneBe annotation even when --offline false. Useful if the API is unavailable or rate-limited. |

| Parameter | Default | Description |
|---|---|---|
| `--download_vep_plugins` | `false` | Also download VEP plugin data files during setup (adds ~100 GB). Required before using --use_vep_plugins true. |
| `--dbs_manifest` | GitHub URL | URL or path to the database manifest YAML. Override to pin a specific version or use a local copy. See The database manifest system. |

| Parameter | Default | Description |
|---|---|---|
| `--center` | `null` | Sequencing centre identifier added to MAF output files. |
| `--email` | `null` | Email address to receive a completion notification. |
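
Because the annotate workflow takes many parameters, it can be convenient to keep them in a file and pass it with Nextflow's `-params-file` option, which accepts YAML or JSON. The sketch below is one way to do this; the file name `musa_params.yaml` is arbitrary, the keys are the parameter names from the tables above without the leading dashes, and it has not been checked against MuSA's own parameter validation.

```bash
# Write the annotate parameters to a YAML file (the file name is arbitrary).
cat > musa_params.yaml <<'EOF'
workflow: annotate
input: samplesheet.csv
outdir: results
data_dir: /path/to/your/data_dir
annovar_software_dir: /path/to/annovar
vcf_format: multicaller
use_vep_plugins: false
offline: true
EOF

# Then launch with Nextflow's -params-file option (not executed here):
# nextflow run main.nf -profile docker -params-file musa_params.yaml
```

Keeping this file alongside the results gives a record of the settings used for a given run.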
All annotation databases downloaded during setup are managed through a central YAML manifest file (dbs_manifest.yaml). By default, this file is fetched from the MuSA test-datasets GitHub repository. Each entry in the manifest describes a single downloadable resource:

    grch38:
      vep_cache:
        dbname: "homo_sapiens_vep_cache"
        version: 115
        out: "homo_sapiens_vep_115_GRCh38.tar.gz"
        url: "https://ftp.ensembl.org/pub/release-115/variation/indexed_vep_cache/homo_sapiens_vep_115_GRCh38.tar.gz"
        method: "curl"
        expected_sha256: "5871a5e34527cce76f7ac75399c33e306bf3be8083c63fe7d3459c56338f98de"
        computed_sha256: ""

Fields:

- `dbname`: human-readable name of the resource
- `version`: version of the database
- `url`: canonical download source
- `method`: download tool (`curl` or `wget`)
- `expected_sha256`: cryptographic checksum of the expected file
- `computed_sha256`: filled in at download time by the pipeline
When a setup module runs, it:
- Parses the manifest to extract the URL and method for its resource
- Downloads the file using the specified tool
- Computes the SHA-256 checksum of the downloaded file
- Writes the computed checksum back into the manifest under `computed_sha256`
- Emits the updated manifest as a Nextflow channel item
All per-module manifests are collected and merged by the MERGE_YAML module, producing a single final manifest that records the exact state of every downloaded resource. This merged manifest is then used to generate the HTML setup report.
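
The compute-and-compare step at the heart of this process can be reproduced by hand for any downloaded file. The helper below is a hypothetical sketch rather than the code of MuSA's setup modules; it assumes GNU coreutils `sha256sum` is available (on macOS, `shasum -a 256` produces the same digest format).

```bash
# Hypothetical helper mirroring the compute-and-compare step described above.
# Usage: verify_sha256 FILE EXPECTED_HEX
verify_sha256() {
    # sha256sum prints "<hex digest>  <file name>"; keep only the digest.
    computed=$(sha256sum "$1" | awk '{ print $1 }')
    if [ "$computed" = "$2" ]; then
        echo "OK $1"
    else
        echo "MISMATCH $1: expected $2, computed $computed"
        return 1
    fi
}
```

FILE would be a downloaded resource and EXPECTED_HEX the value of its `expected_sha256` field in the manifest.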
The manifest system provides several guarantees that ad-hoc download scripts cannot:
- Version pinning. Every resource is identified by its exact URL and version number, not just a "latest" download. The versions used in any given run are fully captured.
- Integrity verification. The expected SHA-256 checksum is stored alongside the URL. Because the pipeline computes and records the actual checksum at download time, any file corruption or unexpected upstream change is immediately detectable by comparing `expected_sha256` and `computed_sha256`.
- Audit trail. The final merged manifest (and the HTML setup report derived from it) constitutes a machine-readable record of exactly which database versions, from which sources, were used to produce a given set of annotations. This is directly relevant to clinical and regulatory auditability.
- Customisation. Users can supply their own manifest via `--dbs_manifest` to pin alternative database versions, use institutional mirrors, or add new resources, all without modifying any pipeline code.
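
As an illustration of the integrity guarantee, a merged manifest can be audited with a few lines of awk. This is a rough sketch under stated assumptions, not a YAML parser: it assumes one `key: value` pair per line, double-quoted checksums, and `dbname` appearing first within each entry, as in the example entry above. The function name `audit_manifest` is hypothetical.

```bash
# Hypothetical audit of a merged manifest; assumes one "key: value" pair per
# line, double-quoted checksums, and dbname listed first in each entry.
# Usage: audit_manifest MANIFEST.yaml
audit_manifest() {
    awk '
        function unquote(s) { gsub(/"/, "", s); return s }
        function report() {
            if (name == "") return
            if (computed == "")            print name ": not downloaded"
            else if (computed == expected) print name ": ok"
            else { print name ": CHECKSUM MISMATCH"; bad = 1 }
        }
        $1 == "dbname:"          { report(); name = unquote($2); expected = computed = "" }
        $1 == "expected_sha256:" { expected = unquote($2) }
        $1 == "computed_sha256:" { computed = unquote($2) }
        END { report(); exit bad + 0 }
    ' "$1"
}
```

The function prints one status line per resource and returns a non-zero status if any checksum differs, which makes it usable as a gate in a wrapper script.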
| Mode | Resources included | Approximate size |
|---|---|---|
| Basic | VEP cache, ANNOVAR databases, dbNSFP, reference genome | ~123 GB |
| Extended (adds) | AlphaMissense, AncestralAllele, CADD, ClinPred, dbscSNV, Enformer, EVE, GWAS, MaveDB, MaxEntScan, mutfunc, PhenotypeOrthologous, pLI, ReferenceQuality, SpliceVault, UTRAnnotator | +~100 GB |
| Total (extended) | | ~224 GB |
For each patient in the samplesheet, MuSA writes results to {outdir}/{YYMMDD}/{patient}/:

    results/
    └── 260507/
        └── PATIENT_01/
            ├── PATIENT_01_maf_dashboard.html   # Interactive HTML report
            ├── PATIENT_01.filtered.maf         # Prioritized variants (panel/frequency/benign filters applied)
            └── PATIENT_01.raw.maf              # All annotated variants (no filters)
Pipeline-level execution reports are written to {outdir}/:

    results/
    ├── timeline.txt          # Nextflow execution timeline
    ├── trace.txt             # Per-process resource usage trace
    ├── report.html           # Nextflow HTML execution report
    └── pipeline_dag.html     # Pipeline directed acyclic graph
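
Given this layout, a short script can confirm that a patient's three output files were produced. The helper below is a hypothetical sketch (`missing_outputs` is not part of MuSA); it relies only on the directory and file naming shown above.

```bash
# Hypothetical helper: print the expected per-patient files that are missing.
# Usage: missing_outputs OUTDIR YYMMDD PATIENT
missing_outputs() {
    patient_dir="$1/$2/$3"
    for name in "$3_maf_dashboard.html" "$3.filtered.maf" "$3.raw.maf"; do
        [ -f "$patient_dir/$name" ] || echo "$patient_dir/$name"
    done
}
```

For a run completed today, `missing_outputs results "$(date +%y%m%d)" PATIENT_01` prints nothing when all three files are present.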
MuSA uses MAF (Mutation Annotation Format) as its primary output rather than VCF. This deliberate design choice enforces a one-row-per-variant representation with canonical MANE transcript selection, reducing the transcript-level redundancy that characterises VCF-based VEP output and making downstream analysis directly tractable.
The raw MAF (*.raw.maf) contains all annotated variants and up to ~920 columns organised into thematic groups:
| Category | Source(s) | Example columns |
|---|---|---|
| Variant coordinates & consequence | vcf2maf, VEP | Hugo_Symbol, Chromosome, Start_Position, Consequence, HGVSc, HGVSp_VEP, genome_change, ref_context |
| Population frequencies | dbNSFP, ANNOVAR | MAX_AF, MAX_AF_POPS, gnomAD exome/genome AF, TOPMed, All of Us, RegeneronME AF |
| Pathogenicity predictions | dbNSFP | REVEL, MetaSVM, MetaLR, MetaRNN, BayesDel, SIFT, PolyPhen-2, AlphaMissense score/class, VARITY |
| VEP plugin scores | VEP (extended) | AlphaMissense, CADD_PHRED, ClinPred_score, EVE_score, SpliceVault, MaxEntScan, Enformer, mutfunc |
| Splicing | dbNSFP, VEP | dbscSNV Ada/RF, SpliceRegion, MaxEntScan |
| Clinical evidence | ANNOVAR, VEP, GeneBe | encoded_CLNSIG, encoded_CLNREVSTAT, clinvar_trait, clinvar_id, acmg_criteria, acmg_score |
| Disease & gene ontology | dbNSFP | OMIM, Orphanet, GenCC, ClinGen dosage sensitivity, HPO, GWAS |
| RENOVO | RENOVO | PL_score (pathogenicity likelihood ∈ [0,1] for missense VUS) |
| MuSA classification | Pipeline | renovo_adj_acmg_score (ACMG score adjusted by RENOVO for missense variants) |
| Variant quality | VCF INFO/FORMAT | bioinfo_params, FORMAT_FIELDS, FORMAT_VALUES |
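
With several hundred columns per variant, it is often easier to pull out a handful of columns by name than to open the whole file. The sketch below shows one way to do that with awk; `maf_columns` is a hypothetical helper, and it assumes a tab-separated MAF whose first non-comment line is the header.

```bash
# Hypothetical helper: print selected columns of a tab-separated MAF by name.
# Usage: maf_columns FILE "Hugo_Symbol,HGVSc,MAX_AF"
maf_columns() {
    awk -F'\t' -v OFS='\t' -v want="$2" '
        /^#/ { next }    # skip comment lines
        !n {             # first non-comment line: map header names to positions
            n = split(want, names, ",")
            for (i = 1; i <= NF; i++) pos[$i] = i
            for (j = 1; j <= n; j++)
                if (!(names[j] in pos)) {
                    print "unknown column: " names[j] > "/dev/stderr"; exit 1
                }
        }
        {
            line = $(pos[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
            print line
        }
    ' "$1"
}
```

Columns are emitted in the order requested, so the same call can also be used to reorder fields for a downstream tool.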
The filtered MAF (*.filtered.maf) is a subset of the raw MAF after applying the configured gene panel, allele frequency, and benign-variant filters. When no filters are active, the filtered MAF equals the raw MAF.
Each patient's *_maf_dashboard.html is a self-contained report that requires no server and opens directly in any web browser.
It includes:
- Summary panel — patient identifier, total variant count, and counts of ClinVar pathogenic and VUS variants
- Variant browser — a sortable, filterable, paginated table with expandable rows. Main columns include gene symbol, cDNA/protein change, ClinVar class, MuSA class (ACMG score adjusted by RENOVO), population allele frequency, and PubMed links. Expandable child rows show additional fields (gDNA change, reference context, variant quality, OMIM/ClinVar IDs, ACMG criteria, orthologous phenotype data)
MuSA was written by D. Scognamiglio and E. Bonetti at IRCCS Istituto Ortopedico Rizzoli, Bologna, Italy.
We thank the nf-core community for providing the framework and best practices that guided MuSA's development.
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
The MuSA publication can be cited as:
Publication pending