Update kraken2 and add support for custom databases #7257
This PR both updates the kraken2 tool to version 2.1.6 and adds support for custom databases. Custom databases are supplied as datasets of the directory datatype. Because zip files without a common prefix get unpacked to a directory prefixed by the name of the original dataset (see the Galaxy code), the tool uses some shell code to find the complete path to the custom database.
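To illustrate the path lookup described above, here is a minimal sketch and not the PR's actual code. It assumes the database can be recognised by its `hash.k2d` file; the function name `find_k2_db` is hypothetical.

```shell
# Hypothetical helper (not taken from the PR): print the directory that
# holds hash.k2d somewhere beneath the given root directory.
find_k2_db() {
    # $1: root directory to search, e.g. the dataset's extra files directory
    hash_file=$(find "$1" -type f -name 'hash.k2d' | head -n 1)
    if [ -z "$hash_file" ]; then
        echo "no hash.k2d found under $1" >&2
        return 1
    fi
    # The database directory is the one containing hash.k2d
    dirname "$hash_file"
}
```

A caller could then do something like `DB_DIR=$(find_k2_db "$root") && kraken2 --db "$DB_DIR" ...`, where `$root` stands in for wherever Galaxy unpacked the dataset.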
| </param> | ||
| </when> | ||
| <when value="history"> | ||
| <param type="data" name="custom_database" label="Kraken2 database" format="directory" help="A kraken2 is a directory containing the files hash.k2d, opts.k2d and taxo.k2d"/> |
- How big are those typically?
- Should we subclass directory for that?
- We should also add a tool that creates the K2 databases, or?
Gigabyte scale. E.g. the Mycobacterium genus database from Hall is 7.6 GB (https://zenodo.org/records/8343322). And yes, ideally we need a datatype and a tool for creating Kraken2 databases, but there are already several Kraken2 databases available online that I am using for e.g. my M. tuberculosis analyses. I'm happy to work on PRs for the Kraken2 database datatype and the tool that creates them, but I don't think that work should block this PR.
Would this be of use for many users?
How about just adding this to the data manager?
We are just updating the dm anyway #6980
The DM is very useful, but the use case here is Galaxy's Workflow Landings API, which is oriented around inputs, not DMs. Custom databases will also become relevant once there is a kraken2-build tool.
@mvdbeek when are data manager bundles expected? Would this help here?
Directory datasets are a much better idea
> the use case here is in Galaxy's Workflow Landings API
Would this then mean that the user uploads multiple GB "just" for a single workflow run? I have not yet used Galaxy's Workflow Landings API - how is this supposed to be used?
| #if $kraken2_database.database_source == "builtin" | ||
| --db '${kraken2_database.builtin_database.fields.path}' | ||
| #elif $kraken2_database.database_source == "history" | ||
| --db "\$DB_DIR" |
We should wait for a proper fix (galaxyproject/galaxy#20857) and use
| --db "\$DB_DIR" | |
| --db '$kraken2_database.custom_database.extra_files_path' |
Yes, we need a fix for that bug, but for now this works, and a fix for the Galaxy bug will likely take some time (months) to be available in a Galaxy release.
I hope that this will be considered a bug fix and go into the current release.
I have referenced your bugfix in a comment in the latest version of the code.
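As a rough illustration of the trade-off discussed in this thread, the sketch below accepts either layout: a database sitting directly in the extra files path (the form suggested for after galaxyproject/galaxy#20857 is fixed) or one nested a single level down. It is an assumption-laden sketch, not the code in this PR: the names `is_k2_db` and `resolve_k2_db` are hypothetical, and the one-level nesting is assumed from the unpacking behaviour described in the PR summary.

```shell
# Sketch only: is_k2_db and resolve_k2_db are hypothetical names.
# A directory counts as a Kraken2 database if it has the three .k2d files.
is_k2_db() {
    [ -f "$1/hash.k2d" ] && [ -f "$1/opts.k2d" ] && [ -f "$1/taxo.k2d" ]
}

resolve_k2_db() {
    # $1: the dataset's extra files path
    # Case 1: the database files sit directly in the given directory.
    if is_k2_db "$1"; then
        printf '%s\n' "$1"
        return 0
    fi
    # Case 2 (assumed): the files sit one level down, in a subdirectory.
    for candidate in "$1"/*/; do
        candidate=${candidate%/}
        if is_k2_db "$candidate"; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    echo "no Kraken2 database found in $1" >&2
    return 1
}
```

Checking for all three files rather than only `hash.k2d` means an incomplete upload is rejected before `kraken2` runs.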
| <param name="confidence" value=".2"/> | ||
| <conditional name="kraken2_database"> | ||
| <param name="database_source" value="history" /> | ||
| <param name="custom_database" ftype="zip" value="test_db.zip" /> |
Can you unzip this, then use
| <param name="custom_database" ftype="zip" value="test_db.zip" /> | |
| <param name="custom_database" class="Directory" ftype="kraken2_database" value="path_to_directory" /> |
?
(kraken2_database or similar would subclass the directory datatype)
I have added tests for a Directory provided directly and for one provided via a zip. This revealed a bug in my previous version of the code, which is now fixed.
Some notes on the future kraken2 custom database builder tool: The kraken2 db building process involves three steps:
This gives flexibility in how databases are built, and therefore it makes sense to have tools for (1), and some combination of (2), (3) and (4), balancing the need to not continually re-download the taxonomy files with the need to not pollute the history with intermediate products.
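For orientation, the standard custom-database procedure in the Kraken2 manual uses the `kraken2-build` invocations printed by the dry-run sketch below. How these map onto the numbered steps in the note above is an assumption, since the step list is not shown here; `print_k2_build_plan` and its arguments are hypothetical names, and the function only prints the commands rather than running them.

```shell
# Dry-run sketch: prints kraken2-build commands instead of executing them.
# print_k2_build_plan is a hypothetical helper, not part of any tool.
print_k2_build_plan() {
    db=$1       # target database directory
    library=$2  # a reference library name known to kraken2-build, e.g. bacteria
    # Fetch the NCBI taxonomy files into the database directory
    echo "kraken2-build --download-taxonomy --db $db"
    # Fetch a reference library (own FASTA files would use --add-to-library)
    echo "kraken2-build --download-library $library --db $db"
    # Build the database files from taxonomy plus library
    echo "kraken2-build --build --db $db"
}
```

Calling `print_k2_build_plan mydb bacteria` lists the three commands in order. Splitting the taxonomy download from the build is what would let a Galaxy tool reuse the taxonomy across several database builds.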