GAP

GAP = Genome Assembly Postprocessor :-)

How To Use

To run GAP, set up config.yaml, then run main.py. Setup instructions are given as comments in config.yaml itself.

Architecture

The codebase is split into three main functions:

- Preprocessing: Converts the GFA, PAF and error-corrected (EC) reads FASTA files into the files needed by the other steps. For a Hifiasm run, the GFA file should be the primary contigs (.p_ctg.gfa) file. For a GNNome run, the GFA file should be the raw (.bp.raw.r_utg.gfa) file.
- Baseline Generation: Generates the baseline assembly and its walks, i.e. the graph nodes that make up each of the assembly's contigs.
- Postprocessing: Runs the postprocessing step as outlined in <paper name here>. This step can be run standalone if you supply your own input files; see the File Formats section for the expected formats.

[Figure: GAP Repo Architecture]

Input/Output Files

| Function | Input Files | Output Files |
| --- | --- | --- |
| Preprocessing | GFA file; PAF file; EC Reads FASTA | Graph; Node-to-sequence (n2s); Read-to-node (r2n); Read-to-sequence (r2s); Processed PAF |
| Baseline Generation | For GNNome: Graph; Node-to-sequence (n2s); Genome Reference. For Hifiasm: GFA file; Read-to-node (r2n); Read-to-sequence (r2s); Genome Reference | Baseline walks & assembly |
| Postprocessing | Contigs; Graph; Node-to-sequence (n2s); Read-to-node (r2n); Read-to-sequence (r2s); Processed PAF; Genome Reference | Final assembly |
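Because postprocessing can be run standalone, it can help to check which inputs are still missing before starting a stage. The sketch below transcribes the table above into a Python dict; the dict layout, the stage keys and the helper name `missing_inputs` are illustrative assumptions and are not part of the GAP codebase.

```python
# Inputs and outputs per stage, transcribed from the Input/Output Files table.
# The stage keys and short artifact names are illustrative, not GAP identifiers.
STAGE_IO = {
    "preprocessing": {
        "inputs": {"GFA file", "PAF file", "EC Reads FASTA"},
        "outputs": {"Graph", "n2s", "r2n", "r2s", "Processed PAF"},
    },
    "baseline_gnnome": {
        "inputs": {"Graph", "n2s", "Genome Reference"},
        "outputs": {"Baseline walks & assembly"},
    },
    "baseline_hifiasm": {
        "inputs": {"GFA file", "r2n", "r2s", "Genome Reference"},
        "outputs": {"Baseline walks & assembly"},
    },
    "postprocessing": {
        "inputs": {"Contigs", "Graph", "n2s", "r2n", "r2s",
                   "Processed PAF", "Genome Reference"},
        "outputs": {"Final assembly"},
    },
}


def missing_inputs(stage, available):
    """Return the sorted list of inputs that `stage` still needs."""
    return sorted(STAGE_IO[stage]["inputs"] - set(available))


# Example: everything preprocessing produces, plus a reference genome.
available = STAGE_IO["preprocessing"]["outputs"] | {"Genome Reference"}
print(missing_inputs("baseline_gnnome", available))
print(missing_inputs("postprocessing", available))
```

With only the preprocessing outputs and a reference available, the GNNome baseline has everything it needs, while postprocessing still lacks the contigs.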

File Formats

Graph: DGL graph object. For each node with ID node_id, the node holding its reverse complement must have ID node_id+1. The graph also has:
- Node features: ['N_ID', 'read_length']
- Edge features: ['E_ID', 'prefix_length', 'overlap_length', 'overlap_similarity']
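If you build the graph yourself, a quick check that the expected features are present may save a failed run. This is a minimal sketch: the helper name `missing_features` is hypothetical, and it works on plain collections of feature names so that it does not depend on DGL.

```python
# Feature names listed in the Graph format description above.
REQUIRED_NODE_FEATURES = {"N_ID", "read_length"}
REQUIRED_EDGE_FEATURES = {"E_ID", "prefix_length", "overlap_length", "overlap_similarity"}


def missing_features(node_feature_names, edge_feature_names):
    """Return (missing node features, missing edge features), each sorted."""
    missing_nodes = sorted(REQUIRED_NODE_FEATURES - set(node_feature_names))
    missing_edges = sorted(REQUIRED_EDGE_FEATURES - set(edge_feature_names))
    return missing_nodes, missing_edges


# Example with a graph that lacks the 'read_length' node feature.
print(missing_features(["N_ID"], REQUIRED_EDGE_FEATURES))
```

For a homogeneous DGL graph `g`, the two arguments would typically be `g.ndata.keys()` and `g.edata.keys()`.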

n2s: {
    node_id (int) : sequence (str)
}
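The node_id / node_id+1 pairing can be checked directly against n2s. The sketch below assumes that forward nodes carry even IDs (0, 2, 4, ...), so that each pair is (2k, 2k+1); that numbering is an assumption, and the helper names are hypothetical.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mispaired_nodes(n2s):
    """Return even node IDs whose node_id + 1 partner is missing or is not
    the reverse complement. Assumes forward nodes have even IDs."""
    bad = []
    for node_id in sorted(n2s):
        if node_id % 2 != 0:
            continue
        if n2s.get(node_id + 1) != reverse_complement(n2s[node_id]):
            bad.append(node_id)
    return bad


# Illustrative n2s: nodes 0/1 are correctly paired, nodes 2/3 are not.
n2s = {0: "AACG", 1: "CGTT", 2: "GGA", 3: "GGA"}
print(mispaired_nodes(n2s))
```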

r2n : {
    read_id (str) : (real_node_id, virtual_node_id) (tuple<int, int>)
}

r2s : {
    read_id (str) : (sequence, reverse comp) (tuple<str, str>)
}
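As an illustration of the r2s layout, the following sketch builds the dict from FASTA text. It is not GAP's fasta_util.py; the function name `build_r2s` and the choice to take the read ID as the first whitespace-separated token of the header are assumptions.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_r2s(fasta_lines):
    """Build {read_id: (sequence, reverse_complement)} from an iterable of
    FASTA lines (for example an open file handle)."""
    chunks_by_read = {}
    read_id = None
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the read ID is the first token after '>'.
            read_id = line[1:].split()[0]
            chunks_by_read[read_id] = []
        elif read_id is not None:
            chunks_by_read[read_id].append(line)
    r2s = {}
    for rid, chunks in chunks_by_read.items():
        seq = "".join(chunks)
        r2s[rid] = (seq, reverse_complement(seq))
    return r2s


# Example with two short, made-up reads; the first spans two lines.
print(build_r2s([">read1 some description", "AAC", "G", ">read2", "TT"]))
```

For a file on disk, the same function can be called on the open handle, e.g. `with open(path) as fh: r2s = build_r2s(fh)`.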

Processed PAF: {
    ghost_edges = {
        'valid_src' : [node_id_1, ...] source nodes (list<int>),
        'valid_dst' : [node_id_2, ...] destination nodes (list<int>),
        'ol_len' : [overlap_length_1, ...] respective overlap lengths (list<int>),
        'ol_similarity' : [overlap_similarity_1, ...] respective overlap similarities (list<int>),
        'prefix_len' : [prefix_length_1, ...] respective prefix lengths (list<int>),
        'edge_hops' : [hop_neighbourhood_1, ...] respective edge hops (list<int>)
    },
    ghost_nodes = {
        'hop_<n>' : {
            '+' : {
                read_id : {
                    'read_len' : read_len,
                    'outs' : [read_id, ...],
                    'ol_len_outs' : [ol_len, ...],
                    'ol_similarity_outs' : [ol_similarity, ...],
                    'prefix_len_outs' : [prefix_len, ...],
                    'ins' : [read_id, ...],
                    'ol_len_ins' : [ol_len, ...],
                    'ol_similarity_ins' : [ol_similarity, ...],
                    'prefix_len_ins' : [prefix_len, ...],
                }, 
                read_id_2 : { ... },
                ...
            },
            '-' : { ... }
        },
        'hop_<n+1>' : { ... }
    }
}
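The ghost_edges entry stores one edge per index across six parallel lists. A minimal sketch of reading it, assuming the dict has already been loaded into memory; the helper name and the example values are illustrative and not taken from a real run.

```python
def ghost_edges_within_hops(ghost_edges, max_hop):
    """Return (src, dst, ol_len, ol_similarity, prefix_len, hop) tuples for
    every ghost edge whose hop value is at most max_hop."""
    rows = zip(
        ghost_edges["valid_src"],
        ghost_edges["valid_dst"],
        ghost_edges["ol_len"],
        ghost_edges["ol_similarity"],
        ghost_edges["prefix_len"],
        ghost_edges["edge_hops"],
    )
    return [row for row in rows if row[5] <= max_hop]


# Illustrative values only.
example = {
    "valid_src": [0, 4, 7],
    "valid_dst": [2, 9, 12],
    "ol_len": [1500, 800, 2300],
    "ol_similarity": [99, 97, 98],
    "prefix_len": [9000, 12000, 7000],
    "edge_hops": [1, 2, 1],
}

for edge in ghost_edges_within_hops(example, max_hop=1):
    print(edge)
```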

Directory

The codebase is split into the three main functions, each with its own directory.

- main.py                     Main script to run.
- config.yaml                 Configs to be set. Ensure that the genome you are running has its info in the specified format.
- preprocess/
    - preprocess.py           Main script to run the various pre-processing steps.
    - gfa_util.py             Script to pre-process GFA file.
    - fasta_util.py           Script to pre-process error-corrected reads FASTA file.
    - paf_util.py             Script to pre-process PAF file.
- generate_baseline/
    - generate_baseline.py    Main script to generate the baseline walks and assembly.
    - hifiasm_decoding.py     Generates baseline from Hifiasm's final GFA.
    - gnnome_decoding.py      Generates baseline from GNNome's graph. Basic version of GNNome's decoding step.
    - SymGatedGCN.py          SymGatedGCN layer from GNNome.
- postprocess/
    - postprocess.py          Script for postprocessing pipeline.
