Update kraken2 and add support for custom databases #7257
base: main
Changes from 2 commits
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -14,10 +14,17 @@ | |||||
| <version_command>kraken2 --version</version_command> | ||||||
| <command detect_errors="exit_code"> | ||||||
| <![CDATA[ | ||||||
| #if $kraken2_database.database_source == "history": | ||||||
| DB_DIR=\$(ls -d '${kraken2_database.custom_database.extra_files_path}/dataset_'*/) && | ||||||
| #end if | ||||||
| kraken2 | ||||||
| --threads \${GALAXY_SLOTS:-1} | ||||||
| --db '${kraken2_database.fields.path}' | ||||||
|
|
||||||
| #if $kraken2_database.database_source == "builtin" | ||||||
| --db '${kraken2_database.builtin_database.fields.path}' | ||||||
| #elif $kraken2_database.database_source == "history" | ||||||
| --db "\$DB_DIR" | ||||||
| #end if | ||||||
|
|
||||||
| $quick | ||||||
|
|
||||||
| #if $single_paired.single_paired_selector == "collection": | ||||||
|
|
@@ -37,9 +44,9 @@ | |||||
| #end if | ||||||
| #end if | ||||||
|
|
||||||
| --confidence '${confidence}' | ||||||
| --minimum-base-quality '${min_base_quality}' | ||||||
| --minimum-hit-groups '${minimum_hit_groups}' | ||||||
| --confidence ${confidence} | ||||||
| --minimum-base-quality ${min_base_quality} | ||||||
| --minimum-hit-groups ${minimum_hit_groups} | ||||||
|
|
||||||
| $use_names | ||||||
|
|
||||||
|
|
@@ -129,6 +136,7 @@ | |||||
| <!--<data format="tabular" label="${tool.name} on ${on_string}: Translated classification" name="translated" />--> | ||||||
| </outputs> | ||||||
| <tests> | ||||||
| <!-- test1 --> | ||||||
| <test expect_num_outputs="1"> | ||||||
| <conditional name="single_paired"> | ||||||
| <param name="single_paired_selector" value="no"/> | ||||||
|
|
@@ -137,10 +145,14 @@ | |||||
| <param name="split_reads" value="false"/> | ||||||
| <param name="quick" value="no"/> | ||||||
| <param name="confidence" value=".2"/> | ||||||
| <param name="kraken2_database" value="test_entry"/> | ||||||
| <conditional name="kraken2_database"> | ||||||
| <param name="database_source" value="builtin" /> | ||||||
| <param name="builtin_database" value="test_entry"/> | ||||||
| </conditional> | ||||||
| <output name="output" file="kraken_test1_output.tab" ftype="tabular"/> | ||||||
| </test> | ||||||
|
|
||||||
| <!-- test2 --> | ||||||
| <test expect_num_outputs="3"> | ||||||
| <conditional name="single_paired"> | ||||||
| <param name="single_paired_selector" value="no"/> | ||||||
|
|
@@ -149,12 +161,16 @@ | |||||
| <param name="split_reads" value="true"/> | ||||||
| <param name="quick" value="no"/> | ||||||
| <param name="confidence" value=".2"/> | ||||||
| <param name="kraken2_database" value="test_entry"/> | ||||||
| <conditional name="kraken2_database"> | ||||||
| <param name="database_source" value="builtin" /> | ||||||
| <param name="builtin_database" value="test_entry"/> | ||||||
| </conditional> | ||||||
| <output name="output" file="kraken_test1_output.tab" ftype="tabular"/> | ||||||
| <output name="classified_out_s" file="kraken_test1_cl.fas" ftype="fasta"/> | ||||||
| <output name="unclassified_out_s" file="kraken_test1_un.fas" ftype="fasta"/> | ||||||
| </test> | ||||||
|
|
||||||
| <!-- test3 --> | ||||||
| <test expect_num_outputs="7"> | ||||||
| <conditional name="single_paired"> | ||||||
| <param name="single_paired_selector" value="collection"/> | ||||||
|
|
@@ -169,7 +185,10 @@ | |||||
| <param name="split_reads" value="true"/> | ||||||
| <param name="quick" value="no"/> | ||||||
| <param name="confidence" value="0"/> | ||||||
| <param name="kraken2_database" value="test_entry"/> | ||||||
| <conditional name="kraken2_database"> | ||||||
| <param name="database_source" value="builtin" /> | ||||||
| <param name="builtin_database" value="test_entry"/> | ||||||
| </conditional> | ||||||
| <output_collection name="out_unclassified_paired" type="paired"> | ||||||
| <element name="forward" file="un_test2_output_1.fastq" ftype="fastqsanger.gz" decompress="true"/> | ||||||
| <element name="reverse" file="un_test2_output_2.fastq" ftype="fastqsanger.gz" decompress="true"/> | ||||||
|
|
@@ -184,6 +203,7 @@ | |||||
| </assert_command> | ||||||
| </test> | ||||||
|
|
||||||
| <!-- test4 --> | ||||||
| <test expect_num_outputs="2"> | ||||||
| <conditional name="single_paired"> | ||||||
| <param name="single_paired_selector" value="collection"/> | ||||||
|
|
@@ -199,9 +219,28 @@ | |||||
| <param name="create_report" value="true"/> | ||||||
| <param name="report_minimizer_data" value="true"/> | ||||||
| </section> | ||||||
| <param name="kraken2_database" value="test_entry"/> | ||||||
| <conditional name="kraken2_database"> | ||||||
| <param name="database_source" value="builtin" /> | ||||||
| <param name="builtin_database" value="test_entry"/> | ||||||
| </conditional> | ||||||
| <output name="report_output" file="kraken_test2_report.tab" ftype="tabular"/> | ||||||
| </test> | ||||||
|
|
||||||
| <!-- test5 --> | ||||||
| <test expect_num_outputs="1"> | ||||||
| <conditional name="single_paired"> | ||||||
| <param name="single_paired_selector" value="no"/> | ||||||
| <param name="input_sequences" value="kraken_test1.fa" ftype="fasta"/> | ||||||
| </conditional> | ||||||
| <param name="split_reads" value="false"/> | ||||||
| <param name="quick" value="no"/> | ||||||
| <param name="confidence" value=".2"/> | ||||||
| <conditional name="kraken2_database"> | ||||||
| <param name="database_source" value="history" /> | ||||||
| <param name="custom_database" ftype="zip" value="test_db.zip" /> | ||||||
|
Member
Can you unzip this, then use
Suggested change
?
Member
(kraken2_database or similar would subclass the directory datatype)
Contributor
Author
I have added a test for the Directory provided directly and provided via a zip. This revealed a bug in my previous version of the code so that is now updated.
||||||
| </conditional> | ||||||
| <output name="output" file="kraken_test1_output.tab" ftype="tabular"/> | ||||||
| </test> | ||||||
| </tests> | ||||||
| <help> | ||||||
| <![CDATA[ | ||||||
|
|
@@ -225,6 +264,23 @@ Each sequence classified by Kraken results in a single line of output. Output li | |||||
| c) the next 31 k-mers contained an ambiguous nucleotide | ||||||
| d) the next k-mer was not in the database | ||||||
| e) the last 3 k-mers mapped to taxonomy ID #562 | ||||||
|
|
||||||
|
|
||||||
| ----- | ||||||
|
|
||||||
| **Custom databases** | ||||||
|
|
||||||
| The database can either be selected from the built-in databases on the Galaxy server, which are installed by the Galaxy administrator (typically using a data manager), or be provided by the user. | ||||||
|
|
||||||
| The kraken2 tool can use custom databases supplied as directory datasets from the user's history. Each directory dataset must contain the files `hash.k2d`, `opts.k2d` and `taxo.k2d`. Since Galaxy does not support uploading directory datatypes directly, users can upload a zip file containing these three files and then extract it using the "Extract compressed file" tool in Galaxy. The resulting directory can then be used as a custom database for Kraken2. A zip archive can also be used directly as input to the tool, in which case Galaxy will automatically extract it before running the tool. | ||||||
|
|
||||||
| The uploaded zip file should have the following structure, with all three files at the top level:: | ||||||
|
|
||||||
| hash.k2d | ||||||
| opts.k2d | ||||||
| taxo.k2d | ||||||
|
|
||||||
| For more information on building custom databases for Kraken2, please refer to the `Kraken2 documentation <https://github.com/DerrickWood/kraken2/wiki/Manual#custom-databases>`_. | ||||||
| ]]> | ||||||
| </help> | ||||||
| <expand macro="citations" /> | ||||||
|
|
||||||
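The layout requirement in the help text above (the three `.k2d` files at the top level of a directory or zip archive) can be checked locally before uploading. The following is a minimal Python sketch and is not part of this PR; the function name `missing_kraken2_files` is hypothetical, and it only checks file names, not whether the files form a valid Kraken2 database.

```python
import os
import tempfile
import zipfile

# File names taken from the tool's help text.
REQUIRED_FILES = ("hash.k2d", "opts.k2d", "taxo.k2d")


def missing_kraken2_files(path):
    """Return the required Kraken2 files that are absent from `path`.

    `path` may be a directory or a zip archive. Only top-level entries
    count, so files nested in a subfolder are reported as missing.
    """
    if os.path.isdir(path):
        present = set(os.listdir(path))
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            # namelist() yields full member paths, e.g. "sub/hash.k2d",
            # so an exact match implies a top-level entry.
            present = set(archive.namelist())
    else:
        raise ValueError(f"{path!r} is neither a directory nor a zip archive")
    return [name for name in REQUIRED_FILES if name not in present]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create only the first two files to show how a gap is reported.
        for name in REQUIRED_FILES[:2]:
            open(os.path.join(tmp, name), "w").close()
        print(missing_kraken2_files(tmp))  # ['taxo.k2d']
```

An empty list means the directory or archive has the expected top-level layout.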
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,8 +1,8 @@ | ||
| >gi|145231|gb|M33724.1|ECOALPHOA Escherichia coli K-12 truncated PhoA (phoA) gene, partial cds; and transposon Mu dI, partial sequence | ||
| >gi|145231|gb|M33724.1|ECOALPHOA | ||
| CAAAGCTCCGGGCCTCACCCAGGCGCTAAATACCAAAGATGGCGCAGTGATGGTGATGAGTTACGGGAACTCCGAAGAGGATTCACAAGAACATACCGGCAGTCAGTTGCGTATTGCGGCGTATGGCCCGCATGCCGCCAATGAAGCGGCGCACGAAAAACGCGAAAGCGT | ||
| >gi|145232|gb|M33725.1|ECOALPHOB Escherichia coli K12 phoA pseudogene and transposon Mu dl-R, partial sequence | ||
| >gi|145232|gb|M33725.1|ECOALPHOB | ||
| CTGTCATAAAGTTGTCACGGCCGAGACTTATAGTCGCTTTGTTTTTATTTTTTAATGTATTTGTACATGGAGAAAATAAAGTGAAACAAAGCACTATTGCACTGGCACTCTTACCGTTACTGTTTACCCCTGTGACAAAAGCCCGGACACCAGTGAAGCGGCGCACGAAAAACGCGAAAGCGT | ||
| >gi|145234|gb|M33727.1|ECOALPHOE Escherichia coli K12 upstream sequence of psiA5::Mu dI. is identical to psiA30 upstream sequence; putative (phoA) pseudogene and transposon Mu dl-R, partial sequence | ||
| >gi|145234|gb|M33727.1|ECOALPHOE | ||
| TTGTTTTTATTTTTTAATGTATTTGTACATGGAGAAAATAAAGTGAAACAAAGCACTATTGCACTGGTGAAGCGGCGCACGAAAAACGCGAAAGCGT | ||
| >gi|146195|gb|J01619.1|ECOGLTA Eschericia coli gltA gene, sdhCDAB operon and sucABCD operons, complete sequence | ||
| >gi|146195|gb|J01619.1|ECOGLTA | ||
| GAATTCGACCGCCATTGCGCAAGGCATCGCCATGACCAGGCAGGATACAAAAGAGAGTCGATAAATATTCACGGTGTCCATACCTGATAAATATTTTATGAAAGGCGGCGATGATGCCGCAAAATAATACTTATTTATAATCCAGCACGTAGGTTGCGTTAGCGGTTACTTCACCTGCCGTGACATCGACTGCATTATCAATTTGTTCCATCCAGGCGAAAAAGTTCAGCGTCTGTTCTGATGAGCTTGCATCCAGGTCAAGATCTGGCGCGGCTGAACCTAATACGATGTTACCGTCATTTTTGTCCATCAGTCGTACACCGACCCCAGTTGCTTCGCCTGCACTGGTGTTGCTCAACAAAGGCGTAGCACCAGTTGTCTTAGCCGTGCTATCGAAGGTTACGCCAAACTTTGGATACCGGCATTCCGCTACCGTTGTCAGAAGCAGGCAGATCACAGTTGATCAAGCGAATGTCGACGGCCACTTTATTGCTATGATGCTCCCGGTTTATATGGGTTGTCGTGACTTGTCCAAGATCTATGTTTTTATCAATATCTTCTGGATGAATTTCACAAGGTGCTTCAATAACCTCCCCCTTAAAGTGAATTTCGCCAGAACCTTCATCAGCAGCATAAACAGGTGCAGTGAACAGCAGAGATACGGCCAGTGCGGCCAATGTTTTTTGTCCTTTAAACATAACAGAGTCCTTTAAGGATATAGAATAGGGGTATAGCTACGCCAGAATATCGTATTTGATTATTGCTAGTTTTTAGTTTTGCTTAAAAAATATTGTTAGTTTTATTAAATTGGAAAACTAAATTATTGGTATCATGAATTGTTGTATGATGATAAATATAGGGGGGATATGATAGACGTCATTTTCATAGGGTTATAAAATGCGACTACCATGAAGTTTTTAATTCAAAGTATTGGGTTGCTGATAATTTGAGCTGTTCTATTCTTTTTAAATATCTATATAGGTCTGTTAATGGATTTTATTTTTACAAGTTTTTTGTGTTTAGGCATATAAAAATCAAGCCCGCCATATGAACGGCGGGTTAAAATATTTACAACTTAGCAATCGAACCATTAACGCTTGATATCGCTTTTAAAGTCGCGTTTTTCATATCCTGTATACAGCTGACGCGGACGGGCAATCTTCATACCGTCACTGTGCATTTCGCTCCAGTGGGCGATCCAGCCAACGGTACGTGCCATTGCGAAAATGACGGTGAACATGGAAGACGGAATACCCATCGCTTTCAGGATGATACCAGAGTAGAAATCGACGTTCGGGTACAGTTTCTTCTCGATAAAGTACGGGTCGTTCAGCGCGATGTTTTCCAGCTCCATAGCCACTTCCAGCAGGTCATCCTTCGTGCCCAGCTCTTTCAGCACTTCATGGCAGGTTTCACGCATTACGGTGGCGCGCGGGTCGTAATTTTTGTACACGCGGTGACCGAAGCCCATCAGGCGGAAAGAATCATTTTTGTCTTTCGCACGACGAAAAAATTCCGGAATGTGTTTAACGGAGCTGATTTCTTCCAGCATTTTCAGCGCCGCTTCGTTAGCACCGCCGTGCGCAGGTCCCCACAGTGAAGCAATACCTGCTGCGATACAGGCAAACGGGTTCGCACCCGAAGAGCCAGCGGTACGCACGGTGGAGGTAGAGGCGTTCTGTTCATGGTCAGCGTGCAGGATCAGAATACGGTCCATAGCACGTTCCAGAATCGGATTAACTTCATACGGTTCGCACGGCGTGGAGAACATCATATTCAGGAAGTTACCGGCGTAGGAGAGATCGTTGCGCGGGTAAACAAATGGCTGACCAATGGAATACTTGTAACACATCGCGGCCATGGTCGGCATTTTCGACAGCAGGCGGAACGCGGCAATTTCACGGTGACGAGGATTGTTAACATCCAGCGAGTCGTGATAGAACGCCGCCAGCGCGCCGGTAATACCACACATGACTGCCATTGGATGCGAGTCGCGAC
GGAAAGCATGGAACAGACGGGTAATCTGCTCGTGGATCATGGTATGACGGGTCACCGTAGTTTTAAATTCGTCATACTGTTCCTGAGTCGGTTTTTCACCATTCAGCAGGATGTAACAAACTTCCAGGTAGTTAGAATCGGTCGCCAGCTGATCGATCGGGAAACCGCGGTGCAGCAAAATACCTTCATCACCATCAATAAAAGTAATTTTAGATTCGCAGGATGCGGTTGAAGTGAAGCCTGGGTCAAAGGTGAACACACCTTTTGAACCGAGAGTACGGATATCAATAACATCTTGACCCAGCGTGCCTTTCAGCACATCCAGTTCAACAGCTGTATCCCCGTTGAGGGTGAGTTTTGCTTTTGTATCAGCCATTTAAGGTCTCCTTAGCGCCTTATTGCGTAAGACTGCCGGAACTTAAATTTGCCTTCGCACATCAACCTGGCTTTACCCGTTTTTTATTTGGCTCGCCGCTCTGTGAAAGAGGGGAAAACCTGGGTACAGAGCTCTGGGCGCTTGCAGGTAAAGGATCCATTGATGACGAATAAATGGCGAATCAAGTACTTAGCAATCCGAATTATTAAACTTGTCTACCACTAATAACTGTCCCGAATGAATTGGTCAATACTCCACACTGTTACATAAGTTAATCTTAGGTGAAATACCGACTTCATAACTTTTACGCATTATATGCTTTTCCTGGTAATGTTTGTAACAACTTTGTTGAATGATTGTCAAATTAGATGATTAAAAATTAAATAAATGTTGTTATCGTGACCTGGATCACTGTTCAGGATAAAACCCGACAAACTATATGTAGGTTAATTGTAATGATTTTGTGAACAGCCTATACTGCCGCCAGTCTCCGGAACACCCTGCAATCCCGAGCCACCCAGCGTTGTAACGTGTCGTTTTCGCATCTGGAAGCAGTGTTTTGCATGACGCGCAGTTATAGAAAGGACGCTGTCTGACCCGCAAGCAGACCGGAGGAAGGAAATCCCGACGTCTCCAGGTAACAGAAAGTTAACCTCTGTGCCCGTAGTCCCCAGGGAATAATAAGAACAGCATGTGGGCGTTATTCATGATAAGAAATGTGAAAAAACAAAGACCTGTTAATCTGGACCTACAGACCATCCGGTTCCCCATCACGGCGATAGCGTCCATTCTCCATCGCGTTTCCGGTGTGATCACCTTTGTTGCAGTGGGCATCCTGCTGTGGCTTCTGGGTACCAGCCTCTCTTCCCCTGAAGGTTTCGAGCAAGCTTCCGCGATTATGGGCAGCTTCTTCGTCAAATTTATCATGTGGGGCATCCTTACCGCTCTGGCGTATCACGTCGTCGTAGGTATTCGCCACATGATGATGGATTTTGGCTATCTGGAAGAAACATTCGAAGCGGGTAAACGCTCCGCCAAAATCTCCTTTGTTATTACTGTCGTGCTTTCACTTCTCGCAGGAGTCCTCGTATGGTAAGCAACGCCTCCGCATTAGGACGCAATGGCGTACATGATTTCATCCTCGTTCGCGCTACCGCTATCGTCCTGACGCTCTACATCATTTATATGGTCGGTTTTTTCGCTACCAGTGGCGAGCTGACATATGAAGTCTGGATCGGTTTCTTCGCCTCTGCGTTCACCAAAGTGTTCACCCTGCTGGCGCTGTTTTCTATCTTGATCCATGCCTGGATCGGCATGTGGCAGGTGTTGACCGACTACGTTAAACCGCTGGCTTTGCGCCTGATGCTGCAACTGGTGATTGTCGTTGCACTGGTGGTTTACGTGATTTATGGATTCGTTGTGGTGTGGGGTGTGTGATGAAATTGCCAGTCAGAGAATTTGATGCAGTTGTGATTG |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,15 +1,26 @@ | ||
| <?xml version="1.0"?> | ||
| <macros> | ||
| <token name="@TOOL_VERSION@">2.1.3</token> | ||
| <token name="@VERSION_SUFFIX@">2</token> | ||
| <token name="@TOOL_VERSION@">2.1.6</token> | ||
| <token name="@VERSION_SUFFIX@">0</token> | ||
| <token name="@PROFILE@">24.0</token> | ||
| <token name="@INTYPES@">fasta,fasta.gz,fasta.bz2,fastqsanger,fastqsanger.gz,fastqsanger.bz2</token> | ||
| <xml name="input_database"> | ||
| <param label="Select a Kraken2 database" name="kraken2_database" type="select"> | ||
| <options from_data_table="kraken2_databases"> | ||
| <validator message="No Kraken2 database is available" type="no_options" /> | ||
| </options> | ||
| </param> | ||
| <conditional name="kraken2_database"> | ||
| <param type="select" name="database_source" label="Kraken2 database source"> | ||
| <option value="builtin" selected="true">Use a database from the Galaxy server</option> | ||
| <option value="history">Use a database from the history</option> | ||
| </param> | ||
| <when value="builtin"> | ||
| <param label="Select a Kraken2 database" name="builtin_database" type="select"> | ||
| <options from_data_table="kraken2_databases"> | ||
| <validator message="No Kraken2 database is available" type="no_options" /> | ||
| </options> | ||
| </param> | ||
| </when> | ||
| <when value="history"> | ||
| <param type="data" name="custom_database" label="Kraken2 database" format="directory" help="A Kraken2 database is a directory containing the files hash.k2d, opts.k2d and taxo.k2d"/> | ||
|
Contributor
Contributor
Author
Gigabyte scale. E.g. the Mycobacterium genus database from Hall is 7.6 GB (https://zenodo.org/records/8343322). And yes, ideally we need a datatype and tool for creating Kraken2 databases, but there are already several Kraken2 databases available online that I am using for e.g. my M. tuberculosis analyses. I'm happy to work on PRs for the Kraken2 database datatype and the tool that creates them, but I don't think that work should block this PR.
Contributor
Would this be of use for many users? How about just adding this to the data manager? We are just updating the dm anyway #6980
Contributor
Author
The DM is very useful... the use case here is in Galaxy's Workflow Landings API, which is oriented around inputs, not DMs. And also, once there is a kraken2-build tool, custom databases.
Contributor
@mvdbeek when are data manager bundles expected? Would this help here?
Member
Directory datasets are a much better idea
Contributor
Would this then mean that the user uploads multiple GB "just" for a single workflow run? I have not yet used Galaxy's Workflow Landings API - how is this supposed to be used?
||
| </when> | ||
| </conditional> | ||
| </xml> | ||
| <xml name="citations"> | ||
| <citations> | ||
|
|
||
We should wait for a proper fix galaxyproject/galaxy#20857 and use
Yes, we need a fix for that bug, but for now this works, and a fix for the Galaxy bug will likely take some time (months) to be available in a Galaxy release.
I hope that this will be considered a bug fix and go into the current release.
I have referenced your bugfix in a comment in the latest version of the code
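As a reading aid, the `--db` selection in the command block of this diff can be sketched in Python. This is an illustration only, not code from the PR: `resolve_db_path` and its parameters are hypothetical names, and the exactly-one-match check is an added assumption that the shell `ls -d` call in the diff does not enforce.

```python
import glob
import os
import tempfile


def resolve_db_path(database_source, builtin_path=None, extra_files_path=None):
    """Return the value that would be passed to `kraken2 --db`."""
    if database_source == "builtin":
        # Corresponds to '${kraken2_database.builtin_database.fields.path}'
        return builtin_path
    if database_source == "history":
        # Corresponds to: ls -d '<extra_files_path>/dataset_'*/
        pattern = os.path.join(extra_files_path, "dataset_*")
        matches = sorted(p for p in glob.glob(pattern) if os.path.isdir(p))
        if len(matches) != 1:  # assumption: exactly one extracted directory
            raise ValueError(
                f"expected one dataset_* directory, found {len(matches)}"
            )
        return matches[0] + os.sep
    raise ValueError(f"unknown database_source: {database_source!r}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as extra_files:
        os.mkdir(os.path.join(extra_files, "dataset_1"))
        # Prints the resolved directory, ending in "dataset_1" plus a separator.
        print(resolve_db_path("history", extra_files_path=extra_files))
```

Failing loudly when the glob matches zero or several directories is a design choice of this sketch; the shell version would pass whatever `ls -d` prints to `--db`.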