Add abromics bacterial hybrid assembly and annotation workflow #324

Open
wants to merge 2 commits into
base: main
Original file line number Diff line number Diff line change
@@ -0,0 +1,15 @@
version: 1.2
workflows:
- name: main
  subclass: Galaxy
  publish: true
  primaryDescriptorPath: /abromics_genomic_hybrid_long_reads.ga
  authors:
  - name: Pierre
    alternateName: pimarin
    email: mailto:[email protected]
    familyName: MARIN
    orcid: 0000-0002-8304-138X
  - name: abromics-consortium
    email: mailto:[email protected]
    url: https://www.abromics.fr/
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
# Changelog

## [0.1] 2024-01-24

### Automatic update
- First release of the bacterial genomic workflow from the French ABRomics project ([www.abromics.fr](https://www.abromics.fr/))
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
# Bacterial genome hybrid assembly and annotation (short and long reads)

This workflow takes paired-end Illumina reads (fastq(.gz)) and long reads (fastq(.gz), tested on Nanopore data) as input, assembles the bacterial genome, and annotates it to extract antimicrobial resistance information.

The main steps are:
1. Quality control and trimming
2. Taxonomic assignment on trimmed data
3. Assembly and polishing of raw reads into a final contig fasta file
4. Quality control of the assembly
5. Genomic annotation:
   - Genomic annotation
   - Integron identification
   - Plasmid gene identification
   - Insertion Sequence (IS) detection
   - Antimicrobial resistance gene identification
   - Virulence gene identification

## Tools used
1. Quality control and trimming
   - fastp for QC and trimming of short reads
   - FastQC and Filtlong for QC and trimming of long reads
2. Taxonomic assignment on long read data
   - Kraken2 for taxonomic assignment
   - Bracken to re-estimate abundance at the species level
   - Recentrifuge to produce a Krona chart
3. Assembly of raw reads into a final contig fasta file
   - Flye for assembly and Polypolish for polishing with the Illumina reads
4. Quality control of the assembly
   - QUAST
   - Bandage to plot the assembly graph
   - RefSeqMasher to identify the closest reference genomes
5. Genomic annotation:
   - Genomic annotation: Bakta
   - Integron identification: IntegronFinder2
   - Plasmid gene identification: PlasmidFinder
   - Insertion Sequence (IS) detection: ISEScan
   - Antimicrobial resistance gene identification: staramr, to BLAST against the ResFinder and PlasmidFinder databases
   - Virulence gene identification: ABRicate with the VFDB_A database

MultiQC is used to aggregate all QC results into a single report.



## Inputs

1. Paired-end Illumina raw reads in fastq(.gz) format.
2. A single long-read file in fastq(.gz) format.

## Outputs
1. Quality control:
   - Quality report
   - Trimmed raw reads
2. Taxonomic assignment:
   - Tabular report of identified species
   - Tabular file of reads assigned to a taxonomic level
   - Krona chart illustrating the species diversity of the sample
3. Assembly:
   - Polished assembly with contigs in fasta format
   - Reads mapped on the assembly in bam format
   - Assembly graph in gfa format
4. Quality of the assembly:
   - Assembly report
   - Assembly graph
   - Tabular result of the closest reference genomes
5. Genomic annotation:
   - Genomic annotation:
     - Genome annotation in tabular, gff and several other formats
     - Annotation plot
     - Identified nucleotide and protein sequences
     - Summary of identified genomic elements
   - Integron identification:
     - Integron identification in tabular format and a summary
   - Plasmid gene identification:
     - Identified plasmid genes and associated BLAST hits
   - Insertion Sequence (IS) detection:
     - IS element list in tabular format
     - IS hits in fasta format
     - ORF hits in protein and nucleotide fasta format
     - IS annotation in gff format
   - Antimicrobial resistance gene identification:
     - AMR gene list
     - MLST typing
     - Plasmid gene identification
     - BLAST hits
   - Virulence gene identification:
     - Gene identification in tabular format

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
- doc: Test outline for abromics_genomic_hybrid_long_reads
  job:
    paired_r2:
      class: File
      path: test-data/paired_r2.fastqsanger.gz
      filetype: fastqsanger.gz
    paired_r1:
      class: File
      path: test-data/paired_r1.fastqsanger.gz
      filetype: fastqsanger.gz
    long_reads_input:
      class: File
      path: test-data/long_reads_input.fastqsanger.gz
      filetype: fastqsanger.gz
    select_plasmidfinder_database: plasmidfinder_314d85f_2023_03_17
    select_amrfinderplus_to_bakta_database: amrfinderplus_V3.11_2023-04-17.1
    select_bakta_database: V5.0light_2023-02-20
    select_ncbi_taxonomy_database: ncbi-2015-10-05
    select_kraken2_database: 2022-02-02T162959Z_silva_kmer-len_35_minimizer-len_31_minimizer-spaces_6_load-factor_0.7
    select_bracken_database: 2022-02-02T162959Z_silva_kmer-len_35_minimizer-len_31_minimizer-spaces_6_load-factor_0.7
  outputs:
    bandage_contig_graph_plot:
      path: test-data/bandage_contig_graph_plot.svg
      compare: sim_size
      delta: 30000
    bakta_aminoacid_sequence_faa:
      path: test-data/bakta_aminoacid_sequence_faa.fasta
      compare: sim_size
      delta: 30000
    bakta_annotation_log:
      path: test-data/bakta_annotation_log.txt
      compare: sim_size
      delta: 1000
    bakta_annotation_plot:
      path: test-data/bakta_annotation_plot.svg
      compare: sim_size
      delta: 30000
    bakta_annotation_json:
      path: test-data/bakta_annotation_json.json
      compare: sim_size
      delta: 30000
    bakta_summary_text:
      path: test-data/bakta_summary_text.txt
      compare: sim_size
      delta: 1000
    bakta_hypothetical_faa:
      path: test-data/bakta_hypothetical_faa.fasta
      compare: sim_size
      delta: 30000
    bakta_hypothetical_tabular:
      path: test-data/bakta_hypothetical_tabular.tabular
      compare: sim_size
      delta: 30000
    bakta_nucleotide_sequence_fasta:
      path: test-data/bakta_nucleotide_sequence_fasta.fasta
      compare: sim_size
      delta: 30000
    bakta_assembly_fasta:
      path: test-data/bakta_assembly_fasta.fasta
      compare: sim_size
      delta: 30000
    bakta_annotation_embl:
      path: test-data/bakta_annotation_embl.tabular
      compare: sim_size
      delta: 30000
    bakta_annotation_gbff:
      path: test-data/bakta_annotation_gbff.tabular
      compare: sim_size
      delta: 30000
    bakta_annotation_gff3:
      path: test-data/bakta_annotation_gff3.gff3
      compare: sim_size
      delta: 30000
    bakta_annotation_tabular:
      path: test-data/bakta_annotation_tabular.tabular
      compare: sim_size
      delta: 30000
    quast_report_tabular:
      path: test-data/quast_report_tabular.tabular
      compare: sim_size
      delta: 10000
    popypolish_fasta:
      path: test-data/popypolish_fasta.fasta
      compare: sim_size
      delta: 60000
    aligned_R2_bam:
      path: test-data/aligned_R2_bam.qname_input_sorted.bam
      compare: sim_size
      delta: 60000
    aligned_R1_bam:
      path: test-data/aligned_R1_bam.qname_input_sorted.bam
      compare: sim_size
      delta: 60000
    recentrifuge_report_html:
      path: test-data/recentrifuge_report_html.html
      compare: sim_size
      delta: 30000
    bracken_report_tsv:
      path: test-data/bracken_report_tsv.tabular
    flye_assembly_graph:
      path: test-data/flye_assembly_graph.graph_dot
      compare: sim_size
      delta: 30000
    flye_assembly_fasta:
      path: test-data/flye_assembly_fasta.fasta
      compare: sim_size
      delta: 30000
    kraken_report_reads:
      path: test-data/kraken_report_reads.tabular
    kraken_report_tabular:
      path: test-data/kraken_report_tabular.tabular
    quality_check_after_text:
      path: test-data/quality_check_after_text.txt
    fastp_report_json:
      path: test-data/fastp_report_json.json
      compare: sim_size
      delta: 30000
    fastp_trimmed_R2:
      path: test-data/fastp_trimmed_R2.fastqsanger.gz
    fastp_trimmed_R1:
      path: test-data/fastp_trimmed_R1.fastqsanger.gz
    quality_report_before_text:
      path: test-data/quality_report_before_text.txt
      compare: diff
      lines_diff: 7
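
Most outputs above are checked with `compare: sim_size` and a `delta`, which compares file sizes rather than file contents. The following is a minimal sketch of that kind of check, assuming the comparison is an absolute difference in bytes against `delta`. The names `sizes_are_similar` and `_write_temp` are hypothetical helpers written for this illustration; they are not Galaxy or Planemo functions.

```python
import os
import tempfile


def sizes_are_similar(produced_path, expected_path, delta):
    """Return True when the two files differ in size by at most `delta` bytes."""
    produced_size = os.path.getsize(produced_path)
    expected_size = os.path.getsize(expected_path)
    return abs(produced_size - expected_size) <= delta


def _write_temp(n_bytes):
    # Demo helper (hypothetical): create a throwaway file holding n_bytes bytes.
    handle = tempfile.NamedTemporaryFile(delete=False)
    handle.write(b"A" * n_bytes)
    handle.close()
    return handle.name


expected = _write_temp(100_000)
close_enough = _write_temp(120_000)   # 20 000 bytes larger than expected
too_different = _write_temp(140_000)  # 40 000 bytes larger than expected

# With delta=30000, as used for most outputs in the test file:
result_close = sizes_are_similar(close_enough, expected, delta=30000)
result_far = sizes_are_similar(too_different, expected, delta=30000)

# Clean up the throwaway files.
for path in (expected, close_enough, too_different):
    os.remove(path)
```

A size-only check tolerates run-to-run variation in tool output, but it would also accept a same-sized file with wrong content, so a wide `delta` such as 60000 is a weak assertion for small test outputs.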
Binary file not shown.
Binary file not shown.