A tool to analyze haplotype-specific chromosome-scale somatic copy number aberrations and aneuploidy using long reads (Oxford Nanopore, PacBio). Wakhan takes long-read alignment and phased heterozygous variants as input, and first uses extends the phased blocks, taking advantage of the CNA differences between the haplotypes. Wakhan then generates inetractive haplotype-specific coverage plots.
git clone https://github.com/KolmogorovLab/Wakhan.git
cd Wakhan/
conda env create -f environment.yml -n Wakhan
conda activate Wakhan
cd src/
python main.py --threads <4> --reference <ref.fa> --target-bam <data.tumor.bam> --normal-phased-vcf <data.normal_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
python main.py --threads <4> --reference <ref.fa> --target-bam <data.tumor_haplotagged.bam> --tumor-vcf <data.tumor_phased.vcf.gz> --genome-name <cellline/dataset name> --out-dir-plots <genome_abc_output> --breakpoints <severus-sv-VCF>
Wakhan accepts Severus or any other structural variant caller VCF as breakpoints with param --breakpoints
inputs to detect copy number changes and this option is highly recommended.
However, if --breakpoints
option is not used, --change-point-detection-for-cna
should be used instead to use change point detection algorithm ruptures alternatively.
User can input both --ploidy-range
[default: 1.5-5.5 -> [min-max]] and --purity-range
[default: 0.5-1.0 -> [min-max] to inform copy number model about normal contamination in tumor to estimate copy number states correctly.
By default, Wakhan uses COSMIC cancer genes to display corresponding copy number states in <genome_name>_<ploidy>_<purity>_<confidence>_genes_genome.html
file.
User can also input path through param --user-input-genes
to custom input genes/subset of genes [examples in src\annotations\user_input_genes_example_.bed] bed file these genes will be used in plots instead of default COSMIC cancer genes.
grch38 reference genes will be use as default, user can input alternate (i.e, chm13) --reference-name
to change to T2T-CHM13 instead.
Wakhan produces reads coverage coverage.csv
(bin-size based reads coverage) and phasesets reads coverage coverage_ps.csv
data, phase-corrected coverage phase_corrected_coverage.csv
(as well as tumor BAM pileup pileup_SNPs.csv
in case Tumor/normal mode) and stores in directory coverage_data
inside --out-dir-plots
location.
If this data has already been generated in a previous Wakhan run, user can rerun the Wakhan with additionally passing --quick-start
and --quick-start-coverage-path <path to coverage_data directory -> e.g., /home/rezkuh/data/1437/coverage_data>
in addition to required params in above example runs.
This will save runtime significantly by not invoking coverage and pileup methods.
Few cell lines arbitrary phase-switch correction and copy number estimation output with coverage profile is included in the examples directory.
-
--reference
Reference file path -
--target-bam
path to target bam files (must be indexed) -
--out-dir-plots
path to output coverage plots -
--genome-name
genome cellline/sample name to be displayed on plots -
--normal-phased-vcf
normal phased VCF file to generate het SNPs frequncies pileup for tumor BAM (if tumor-only mode, use phased--tumor-vcf
instead) -
--tumor-vcf
phased VCF is required in tumor-only mode
-
--breakpoints
For segmentation to use in CN estimation, structural variations/breakpoints VCF file is required -
--breakpoints-min-length
To adjust breakpoints min length to be included in copy number analysis [default: 10000] -
--cpd-internal-segments
For change point detection algo on internal segments after breakpoint/cpd algo for more precise segmentation. -
--copynumbers-subclonal-enable
Enabling subclonal/fractional copy number states in plots -
--loh-enable
Enabling LOH regions display in CN plots -
--phaseblock-flipping-disable
disabling phaseblock flipping if traget tumor BAM doesn't need phase-correction (default: enabled) -
--phaseblocks-enable
enabling phaseblocks display in coverage plots -
--contigs
List of contigs (chromosomes, default: chr1-22) to be included in the plots [e.g., chr1-22,X,Y] -
--cut-threshold
Maximum cut threshold for coverage (readdepth) plots [default: 100] -
--centromere
Path to centromere annotations BED file [default: annotations/grch38.cen_coord.curated.bed] -
--cancer-genes
Path to Cancer Genes in TSV format to display in CNA plots [default: annotations/CancerGenes.tsv] -
--pdf-enable
Enabling PDF output for plots
Wakhan can also be used in case phasing is not good in input tumor or analysis is being performed without considering phasing:
--without-phasing
enable it if CNA analysis is being performed without phasing in conjunction with--phaseblock-flipping-disable
with all other required parameters as mentioned in example command
Here is a sample copy number/breakpoints output plot without phasing.
Based on best confidence scores, tumor purity and ploidy values are calculated and copy number analysis is performed.
Each subfolder in output directory represents best <ploidy
><purity
><confidence
> values.
<genome-name>_genome_copynumber.html
Genome-wide copy number plots with coverage information on same axis<genome-name>_copynumber_breakpoints.html
Genome-wide copy number plots with coverage information on opposite axis, additionally breakpoints and genes annotations<genome-name>_copynumber_breakpoints_subclonal.html
Genome-wide subclonal/fractional copy number plots with coverage information on opposite axis, additionally breakpoints and genes annotations (--copynumbers-subclonal-enable
)bed_output
It contains copy numbers segments in bed formatvariation_plots
Copy number chromosomes-scale plots with segmentation, coverage and LOH
Following are coverage and SNPs/LOH plots and bed directories in output folder, independent of CNA analysis
snps_loh_plots
SNPs and SNPs ratios plots with LOH representation in chromosomes-scale and genome-wide<genome-name>_genome_loh.html
Genome-wide LOH plotbed_output
It contains LOH segments in bed formatcoverage_plots
Haplotype specific coverage plots for chromosomes with option for unphased coveragephasing_output
Phase-switch error correction plots and phase corrected VCF file (*rephased.vcf.gz)
This tool requires haplotagged tumor BAM and phased VCF in case tumor-only mode and normal phased VCF in case tumor-normal mode. This can be done through any phasing tools like Margin, Whatshap and Longphase. Following commands could be helpful for phasing VCFs and haplotagging BAMs.
# ClairS phase and haplotag both normal and tumor samples
singularity run clairs_latest.sif /opt/bin/run_clairs --threads 56 --phase_tumor True --use_whatshap_for_final_output_haplotagging --use_whatshap_for_final_output_phasing --tumor_bam_fn normal.bam --normal_bam_fn tumor.bam --ref ref.fasta --output_dir clairS --platform ont_r10
or
# Phase normal sample
pepper_margin_deepvariant call_variant -b normal.bam -f ref.fasta -o pepper/output -t 56 --ont_r9_guppy5_sup -p pepper --phased_output
# Haplotag tumor sample with normal phased VCF (phased.vcf.gz) output from previous step
whatshap haplotag --ignore-read-groups phased.vcf.gz tumor.bam --reference ref.fasta -o tumor_whatshap_haplotagged.bam
# Phase and haplotag tumor sample
singularity run clair3_latest.sif /opt/bin/run_clair3.sh --use_whatshap_for_final_output_haplotagging --use_whatshap_for_final_output_phasing --bam_fn=tumor.bam --ref_fn=ref.fasta --threads=56 --platform=ont --model_path=r941_prom_sup_g5014 --output=clair3 --enable_phasing
or
# Phase and haplotag tumor sample
pepper_margin_deepvariant call_variant -b tumor.bam -f ref.fasta -o pepper/output -t 56 --ont_r9_guppy5_sup -p pepper --phased_output