Migrate Asklepian to use Datapipe products rather than its own MSA #61

grovesn · 2021-04-22T09:18:59Z

Please can we have the genome table's sequence column changed to contain the aligned sequences instead, so we can use them in tree building?

Thanks!

SamStudio8 · 2021-04-22T09:44:54Z

This should be reasonably straightforward, as the naive MSA is already within the scope of Asklepian, and I think only the "best ref" genomes are used in the MSA too.

SamStudio8 · 2021-04-22T10:04:51Z

@rmcolq suggested that we think about using the datapipe MSA for this. I think it would be good to converge on one canonical source for things and datapipe is officially the producer of the MSA product used by the rest of the consortium. Asklepian only produces an MSA because we originally couldn't run phylopipe1 often enough.

Asklepian is informally considered a "must run" service, so if we use the datapipe MSA product we'd need to guarantee it can run at least once a day to ensure the Asklepian tables reach PHE in time for analysis. The informal expectation is that the tables should be ingested PHE-side by early morning, which requires CLIMB-COVID to emit Asklepian products by the end of the working day. RC has confirmed they are willing to handle this.

Datapipe also has a deduplication step that picks a best reference using a very similar algorithm to get_best_ref.py. It also takes care of matching the metadata for the chosen sequences. Seems to me that leveraging the MSA from datapipe is a natural choice now, as RC has also improved the performance of that code compared to the phylopipe1-esque alignment we do inside Asklepian.

SamStudio8 · 2021-04-22T10:14:17Z

The plan is to migrate to the datapipe products, but this will need testing to compare the outputs first. In the short term, therefore, we'll make the smaller change and use the current MSA, so that the genome table is updated to use the alignment data sooner rather than later (as it will improve analysis portability on the other end).
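As a rough illustration of what comparing the outputs could involve, here is a minimal sketch that diffs two alignments already loaded as `{header: aligned_sequence}` dictionaries. The function name `compare_msas`, the `MsaDiff` type, and the toy records are all hypothetical and are not taken from Asklepian or datapipe.

```python
from typing import Dict, List, NamedTuple


class MsaDiff(NamedTuple):
    """Summary of how two alignments differ (hypothetical helper type)."""
    only_in_asklepian: List[str]
    only_in_datapipe: List[str]
    differing: List[str]


def compare_msas(asklepian: Dict[str, str], datapipe: Dict[str, str]) -> MsaDiff:
    """Compare two {header: aligned_sequence} mappings record by record."""
    shared = asklepian.keys() & datapipe.keys()
    return MsaDiff(
        # Headers present in one alignment but missing from the other.
        only_in_asklepian=sorted(asklepian.keys() - datapipe.keys()),
        only_in_datapipe=sorted(datapipe.keys() - asklepian.keys()),
        # Shared headers whose aligned sequences are not identical.
        differing=sorted(h for h in shared if asklepian[h] != datapipe[h]),
    )


# Toy records for illustration only; these are not real sequences.
asklepian_msa = {"sample/1": "ACGT-", "sample/2": "ACGTA", "sample/3": "AC-TA"}
datapipe_msa = {"sample/1": "ACGT-", "sample/2": "ACGTT", "sample/4": "ACGTA"}

diff = compare_msas(asklepian_msa, datapipe_msa)
print(diff)
```

A real comparison would probably also need to account for the two pipelines possibly choosing a different best reference for the same sample, which this sketch would simply report as a sequence difference.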

SamStudio8 · 2021-04-22T15:40:05Z

SamStudio8/asklepian@43ffb67 uses the MSA rather than the best_ref.fasta to generate the genomes table. As the two files use exactly the same headers, they are interchangeable.
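The interchangeability claim rests on the header sets being identical, which is easy to sanity check. The following is a minimal sketch, not the code in the commit: `read_fasta` and `headers_match` are hypothetical helpers, and the records are toy data standing in for best_ref.fasta and the MSA.

```python
from typing import Dict, Iterable, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}; the header is the text after '>'."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so append to the current record.
            records[header] += line
    return records


def headers_match(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """True when both files contain exactly the same set of headers."""
    return set(a) == set(b)


# Toy data for illustration only: the aligned record is padded with a gap.
best_ref = read_fasta([">sample/1", "ACGT", ">sample/2", "ACG"])
msa = read_fasta([">sample/1", "ACGT", ">sample/2", "ACG-"])
print(headers_match(best_ref, msa))
```

For real files, the same functions can be fed an open file handle, since iterating a text file yields its lines.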

SamStudio8 · 2021-04-22T15:47:32Z

As we've already discussed it here, I've updated this issue to track the longer-term change for Asklepian to use Datapipe products.

SamStudio8 · 2021-05-17T14:17:07Z

Bumping to backlog #62

SamStudio8 self-assigned this Apr 22, 2021

SamStudio8 added the outbound-pha label Apr 22, 2021

SamStudio8 changed the title ~~Change genome table to contain aligned sequences~~ Migrate Asklepian to use Datapipe products rather than its own MSA Apr 22, 2021

SamStudio8 added the enhancement and p:low labels Apr 22, 2021

This was referenced Apr 22, 2021

Improve performance of Asklepian export #17

Closed

[asklepian] Compress outputs #37

Closed

SamStudio8 closed this as completed May 17, 2021

SamStudio8 mentioned this issue May 17, 2021

Backlog tasks #62

Open
