Migrate Asklepian to use Datapipe products rather than its own MSA #61

Closed
grovesn opened this issue Apr 22, 2021 · 6 comments
@grovesn

grovesn commented Apr 22, 2021

Please can the sequence column of the genome table be changed to hold the aligned sequence instead, so that we can use the sequences in tree building?

Thanks!

SamStudio8 self-assigned this Apr 22, 2021
@SamStudio8
Member

This should be reasonably straightforward, as the naive MSA is already within the scope of Asklepian and I think only the "best ref" genomes are used in the MSA too.

@SamStudio8
Member

@rmcolq suggested that we think about using the datapipe MSA for this. I think it would be good to converge on one canonical source for things and datapipe is officially the producer of the MSA product used by the rest of the consortium. Asklepian only produces an MSA because we originally couldn't run phylopipe1 often enough.

Asklepian is informally considered a "must run" service, so if we use the datapipe MSA product we'd need to guarantee that datapipe can run at least once a day, to ensure the Asklepian tables reach PHE in time for analysis. The informal expectation is that the tables are ingested PHE-side by early morning, which requires CLIMB-COVID to emit the Asklepian products by the end of the working day. RC has confirmed they are willing to handle this.

Datapipe also has a deduplication step that picks a best reference using an algorithm very similar to get_best_ref.py, and it takes care of matching the metadata for the chosen sequences. Seems to me that using the datapipe MSA is the natural choice now, especially as RC has improved the performance of that code compared to the phylopipe1-esque alignment we do inside Asklepian.

@SamStudio8
Member

The plan is to migrate to the datapipe products, but the outputs will need to be compared in testing first. In the short term, therefore, we'll make the smaller change and use the current MSA, so that the genome table is updated to use the alignment data sooner rather than later (this will improve analysis portability on the other end).
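
A first pass at that comparison could be as simple as the sketch below, which checks whether two alignments contain the same headers and, for the headers they share, the same aligned sequence. The function name and the report keys are hypothetical, and it takes already-parsed `{header: sequence}` dicts so as not to assume anything about either pipeline's file layout.

```python
def compare_alignments(ours, theirs):
    """Compare two {header: aligned_sequence} mappings.

    Returns a dict listing headers found on only one side, plus shared
    headers whose aligned sequences differ (compared case-insensitively).
    """
    our_headers, their_headers = set(ours), set(theirs)
    shared = our_headers & their_headers
    return {
        "only_ours": sorted(our_headers - their_headers),
        "only_theirs": sorted(their_headers - our_headers),
        "differing": sorted(
            h for h in shared if ours[h].upper() != theirs[h].upper()
        ),
    }


# Toy data standing in for the Asklepian MSA and the datapipe MSA.
asklepian_msa = {"a": "ACGT-", "b": "AC-TT"}
datapipe_msa = {"a": "acgt-", "b": "ACGTT", "c": "ACGTA"}
report = compare_alignments(asklepian_msa, datapipe_msa)
```

An empty report on all three keys would suggest the two products are interchangeable for the genome table; anything else gives a list of headers to look at by hand.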

@SamStudio8
Member

SamStudio8/asklepian@43ffb67 uses the MSA rather than the best_ref.fasta to generate the genomes table. As they use the exact same headers they are interchangeable.
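
To illustrate why the swap is small, here is a minimal sketch (not the code in the commit above) of turning a FASTA into table rows keyed on the header. The function names and the two-column layout are assumptions for illustration only; the actual genome table schema isn't shown in this thread. Because the rows are keyed on the header, pointing the same code at the MSA instead of the unaligned file changes only the sequence column.

```python
import csv
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_genome_table(fasta_handle, out_handle):
    """Write one row per FASTA record; the column names are hypothetical."""
    writer = csv.writer(out_handle, lineterminator="\n")
    writer.writerow(["header", "sequence"])
    count = 0
    for header, seq in read_fasta(fasta_handle):
        writer.writerow([header, seq])
        count += 1
    return count


# Toy aligned input; gap characters are kept as-is in the sequence column.
msa = io.StringIO(">sample1\nACGT--AC\nGT\n>sample2\nAC--GTAC\nGT\n")
table = io.StringIO()
rows_written = write_genome_table(msa, table)
```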

SamStudio8 changed the title from "Change genome table to contain aligned sequences" to "Migrate Asklepian to use Datapipe products rather than its own MSA" Apr 22, 2021
SamStudio8 added the enhancement and p:low labels Apr 22, 2021
@SamStudio8
Member

As we've already discussed it here, I've updated this issue to track the longer-term change for Asklepian to use Datapipe products.

@SamStudio8
Member

Bumping to backlog #62

2 participants