
jetrz/GAP

GAP

GAP = Genome Assembly Postprocessor :-)

How To Use

To run, set up config.yaml, then run main.py. Instructions for setting up config.yaml are given as comments in the file itself.
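The launch step can be wrapped in a small pre-flight check. This is only a sketch: it assumes main.py takes no command-line arguments and reads config.yaml from the repository root, which this README does not state explicitly. The helper names `build_run_command` and `run_gap` are hypothetical and not part of GAP.

```python
import subprocess
import sys
from pathlib import Path


def build_run_command(repo_dir="."):
    """Return the command that launches main.py, after checking the files exist.

    Assumption: config.yaml and main.py both sit in repo_dir.
    """
    repo = Path(repo_dir)
    for name in ("config.yaml", "main.py"):
        if not (repo / name).is_file():
            raise FileNotFoundError(f"{name} not found in {repo.resolve()}")
    return [sys.executable, "main.py"]


def run_gap(repo_dir="."):
    """Run main.py from inside repo_dir and return its exit code."""
    cmd = build_run_command(repo_dir)
    # check=True raises CalledProcessError if main.py exits with a non-zero status.
    return subprocess.run(cmd, cwd=repo_dir, check=True).returncode
```

Calling `run_gap("path/to/GAP")` fails early with a clear message when config.yaml has not been created yet.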

Architecture

The codebase is split into three main functions:

  1. Preprocessing: Converts the GFA, PAF and Error-Corrected Reads FASTA into the files needed by the other steps. For a Hifiasm run, the GFA file should be the primary contigs file (.p_ctg.gfa). For a GNNome run, the GFA file should be the raw file (.bp.raw.r_utg.gfa).
  2. Baseline Generation: Generates the baseline assembly as well as the walks, which are the graph nodes corresponding to the assembly's contigs.
  3. Postprocessing: Runs the postprocessing step as outlined in <paper name here>. This step can be run standalone if you wish to generate your own input files. See the File Formats section for the expected formats.
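Since the expected GFA file differs between Hifiasm and GNNome runs, a quick file-name check can catch a mismatched input before preprocessing starts. The sketch below is illustrative only: the run-type labels `"hifiasm"` and `"gnnome"` and the function `check_gfa_for_run` are hypothetical, and only the two suffixes come from the description above.

```python
from pathlib import Path

# Suffixes taken from the preprocessing description above.
# The dictionary keys are hypothetical labels chosen for this sketch.
EXPECTED_GFA_SUFFIX = {
    "hifiasm": ".p_ctg.gfa",
    "gnnome": ".bp.raw.r_utg.gfa",
}


def check_gfa_for_run(gfa_path, run_type):
    """Return True if the GFA file name ends with the suffix expected for run_type."""
    key = run_type.lower()
    if key not in EXPECTED_GFA_SUFFIX:
        raise ValueError(f"unknown run type: {run_type!r}")
    return Path(gfa_path).name.endswith(EXPECTED_GFA_SUFFIX[key])
```

For example, `check_gfa_for_run("asm.bp.p_ctg.gfa", "gnnome")` returns `False`, flagging that a primary-contigs file was supplied for a GNNome run.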

Figure: GAP Repo Architecture.

Input/Output Files

| Function | Input Files | Output Files |
| --- | --- | --- |
| Preprocessing | GFA file; PAF file; EC Reads FASTA | Graph; Node-to-sequence (n2s); Read-to-node (r2n); Read-to-sequence (r2s); Processed PAF |
| Baseline Generation | For GNNome: Graph; Node-to-sequence (n2s); Genome Reference. For Hifiasm: GFA file; Read-to-node (r2n); Read-to-sequence (r2s); Genome Reference | Baseline walks & assembly |
| Postprocessing | Contigs; Graph; Node-to-sequence (n2s); Read-to-node (r2n); Read-to-sequence (r2s); Processed PAF; Genome Reference | Final assembly |

File Formats

Graph: DGL Graph object. For each node with ID node_id, the node holding its reverse complement must have ID node_id+1. The graph also carries:
- Node features: ['N_ID', 'read_length']
- Edge features: ['E_ID', 'prefix_length', 'overlap_length', 'overlap_similarity']

n2s: {
    node_id (int) : sequence (str)
}
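The node_id / node_id+1 reverse-complement convention described for the graph can be sanity-checked directly on an n2s dictionary. The sketch below assumes that even IDs are the forward nodes and odd IDs are their reverse-complement partners. That pairing is an assumption of this example, and `find_unpaired_nodes` is a hypothetical helper, not part of GAP.

```python
# Translation table for complementing DNA bases (N is kept as N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_unpaired_nodes(n2s):
    """Return even node IDs whose node_id+1 entry is missing or is not
    the reverse complement of the node's own sequence."""
    bad = []
    for node_id, seq in n2s.items():
        if node_id % 2:  # assumed: odd IDs are the reverse-complement partners
            continue
        if n2s.get(node_id + 1) != reverse_complement(seq):
            bad.append(node_id)
    return sorted(bad)


# Toy data: node 1 is the reverse complement of node 0, node 3 is not that of node 2.
example_n2s = {0: "AACG", 1: "CGTT", 2: "GGA", 3: "TTC"}
unpaired = find_unpaired_nodes(example_n2s)
```

An empty result means every forward node has a correctly paired partner; here `unpaired` lists node 2, because the reverse complement of "GGA" is "TCC".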

r2n : {
    read_id (str) : (real_node_id, virtual_node_id) (tuple<int, int>)
}

r2s : {
    read_id (str) : (sequence, reverse comp) (tuple<str, str>)
}
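For anyone generating their own input files, an r2s dictionary in the format above can be built from FASTA text along these lines. This is a minimal sketch, not the logic in fasta_util.py: it assumes the read ID is the first whitespace-delimited token of each header line, and `build_r2s` is a hypothetical name.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_r2s(fasta_text):
    """Map read_id -> (sequence, reverse complement) from FASTA-formatted text."""
    records = {}
    read_id = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("FASTA header without a read ID")
            read_id = header[0]  # assumed: the ID is the first token of the header
            records[read_id] = []
        elif read_id is not None:
            records[read_id].append(line)  # sequences may span several lines
    r2s = {}
    for rid, chunks in records.items():
        seq = "".join(chunks)
        r2s[rid] = (seq, reverse_complement(seq))
    return r2s


example_fasta = ">read1 extra info\nACG\nTT\n>read2\nGGC\n"
example_r2s = build_r2s(example_fasta)
```

Storing both orientations up front mirrors the tuple layout of r2s, so downstream code never has to recompute the reverse complement.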

Processed PAF: {
    'ghost_edges' : {
        'valid_src' : [node_id_1, ...] source nodes (list<int>),
        'valid_dst' : [node_id_2, ...] destination nodes (list<int>),
        'ol_len' : [overlap_length_1, ...] respective overlap lengths (list<int>),
        'ol_similarity' : [overlap_similarity_1, ...] respective overlap similarities (list<int>),
        'prefix_len' : [prefix_length_1, ...] respective prefix lengths (list<int>),
        'edge_hops' : [hop_neighbourhood_1, ...] respective edge hops (list<int>)
    },
    'ghost_nodes' : {
        'hop_<n>' : {
            '+' : {
                read_id : {
                    'read_len' : read length of this read (int),
                    'outs' : [read_id, ...],
                    'ol_len_outs' : [ol_len, ...],
                    'ol_similarity_outs' : [ol_similarity, ...],
                    'prefix_len_outs' : [prefix_len, ...],
                    'ins' : [read_id, ...],
                    'ol_len_ins' : [ol_len, ...],
                    'ol_similarity_ins' : [ol_similarity, ...],
                    'prefix_len_ins' : [prefix_len, ...],
                }, 
                read_id_2 : { ... },
                ...
            },
            '-' : { ... }
        },
        'hop_<n+1>' : { ... }
    }
}
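The ghost_edges entry stores one edge per index across six parallel lists, so a consumer typically zips them back into per-edge records. The sketch below shows one way to do that and to summarize ghost_nodes. It assumes the top-level keys are the strings 'ghost_edges' and 'ghost_nodes'. The names `GhostEdge`, `iter_ghost_edges` and `count_ghost_nodes`, and all values in the toy data, are made up for illustration.

```python
from collections import namedtuple

# Keys of the parallel lists, in the order given in the format description.
EDGE_KEYS = ("valid_src", "valid_dst", "ol_len",
             "ol_similarity", "prefix_len", "edge_hops")

GhostEdge = namedtuple("GhostEdge", EDGE_KEYS)


def iter_ghost_edges(ghost_edges):
    """Yield one GhostEdge per index of the parallel lists."""
    lengths = {len(ghost_edges[key]) for key in EDGE_KEYS}
    if len(lengths) > 1:
        raise ValueError("ghost_edges lists have different lengths")
    yield from (GhostEdge(*row) for row in zip(*(ghost_edges[k] for k in EDGE_KEYS)))


def count_ghost_nodes(ghost_nodes):
    """Return {hop: {strand: number of reads}} for the ghost_nodes structure."""
    return {
        hop: {strand: len(reads) for strand, reads in strands.items()}
        for hop, strands in ghost_nodes.items()
    }


# Toy data with made-up values, shaped like the format above.
example_paf = {
    "ghost_edges": {
        "valid_src": [0, 2], "valid_dst": [4, 7],
        "ol_len": [1200, 950], "ol_similarity": [99, 97],
        "prefix_len": [300, 410], "edge_hops": [1, 2],
    },
    "ghost_nodes": {
        "hop_1": {
            "+": {"readA": {"read_len": 1500, "outs": [], "ol_len_outs": [],
                            "ol_similarity_outs": [], "prefix_len_outs": [],
                            "ins": [], "ol_len_ins": [],
                            "ol_similarity_ins": [], "prefix_len_ins": []}},
            "-": {},
        },
    },
}
example_edges = list(iter_ghost_edges(example_paf["ghost_edges"]))
example_counts = count_ghost_nodes(example_paf["ghost_nodes"])
```

Checking the list lengths before zipping matters because `zip` would otherwise silently truncate to the shortest list and drop edges.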

Directory

The codebase is split into the three main functions, each with its own directory.

- main.py                     Main script to run.
- config.yaml                 Configs to be set. Ensure that the genome you are running has its info in the specified format.
- preprocess/             
    - preprocess.py           Main script to run the various pre-processing steps.
    - gfa_util.py             Script to pre-process GFA file.
    - fasta_util.py           Script to pre-process error-corrected reads FASTA file.
    - paf_util.py             Script to pre-process PAF file.
- generate_baseline/
    - generate_baseline.py    Main script to generate the baseline walks and assembly.
    - hifiasm_decoding.py     Generates baseline from Hifiasm's final GFA.
    - gnnome_decoding.py      Generates baseline from GNNome's graph. Basic version of GNNome's decoding step.
    - SymGatedGCN.py          SymGatedGCN layer from GNNome.
- postprocess/
    - postprocess.py          Script for postprocessing pipeline.
