In a genome assembly, a contig represents one assembly among multiple possible assemblies of shorter contiguous sequences called unitigs. Information about the possible networks of assemblies across unitigs are generally contained in graphical fragment assembly files (GFAs).
gfa_parser is a tool to compute and extract all possible genome assemblies from GFA files using directed, acyclic networks. It is compatible with GFAs produced by hifiasm, Verkko, Shasta, and minigraph.
The script gw_paths.py allows users to measure the total number of directed, acyclic paths of unitigs present in whole-genome GFA files (a measure of the total number of contigs that can be extracted from a GFA). It also normalizes this value by the number of unitigs in a GFA, providing a measurement of assembly uncertainty.
Samuel N. Bogan | University of California, Santa Cruz
Owen W. Moosman | University of California, Santa Cruz
Joanna L. Kelley | University of California, Santa Cruz
Given a GFA file and set(s) of unitigs associated with a region or contig, gfa_parser identifies all possible contiguous paths through the flanking unitigs of each haplotype that are directed and acyclic using the networkx package. It also creates an index of unitig names and their associated sequences. The directed, acyclic paths ensure that (i) no assembly path loops through a unitig multiple times and (ii) that alternate unitigs within the same bubble are not included in the same path. The script then extracts the fasta sequence of each unitig and, for a given path, concatenates the fasta sequences in the order of the assembly path. This process is repeated for all possible assemblies of each haplotype. The script exports a directory of all possible contigs for both haplotypes in FASTA format. The FASTA header contains the length-weighted read depth of all unitigs included in the path and FASTA. A filtering parameter allows you to ignore unitigs and remove them from paths if they fall below a given read depth threshold.
The tool has two modes: unphased mode and phased mode.
Computes and exports all directed acyclic paths through a gfa as fasta files, not separating based on haplotype.
Computes and exports all directed acyclic paths through a gfa as fasta files for haplotypes 1 and 2 in a diploid genome.
This script counts the total number of directed, acyclic paths through unitigs in through a GFA file and normalizes this value by the number of unitigs in the GFA. This provides a genome-wide estimate of assembly uncertainty. Paths are enumerated on a log10 scale.
python unphased_mode.py \
--gfa/-g my.gfa \ # GFA file
--start/-s start_unitig \ # ID of 5’ unitig
--end/-e end_unitig \ # ID of 3’ unitig
--filter_rd/-r 0 \ # Skip unitigs below this read depth in paths
--package/-p \ # Enter one of "hifiasm", "verkko", "shasta", or "minigraph"
--out/-o output_prefix # Output
python phased_mode.py \
--gfa/-g my.gfa \ # GFA file
--hap1_start/-h1s start_unitig \ # ID of hap 1 5’ unitig
--hap1_end/-h1e end_unitig \ # ID of hap 1 3’ unitig
--hap1_unitigs/-h1u \ # List of hap 1 unitig IDs to consider
--hap2_start/-h2s start_unitig \ # ID of hap 1 5’ unitig
--hap2_end/-h2e end_unitig \ # ID of hap 2 3’ unitig
--hap2_unitigs/-h2u \ # List of hap 2 unitig IDs to consider
--filter_rd/-r 0 \ # Skip unitigs below this read depth in paths
--package/-p \ # Enter one of "hifiasm", "verkko", "shasta", or "minigraph"
--out/-o output_prefix # Output
python gw_paths.py -g my.gfa > output.txt
A manuscript reporting gfa_parser is in revision. For now, please cite this Github page and the listed authors if you use gfa_parser.