[asklepian] Compress outputs #37

Closed · SamStudio8 opened this issue Mar 22, 2021 · 32 comments

SamStudio8 commented Mar 22, 2021

Fields to add to the PHA genomes table:
  • adm1
  • published_date
  • collection_pillar

SamStudio8 added the enhancement, question, metadata and outbound-pha labels on Mar 22, 2021

SamStudio8 commented Mar 22, 2021

Checking whether collection_pillar is strictly necessary. I think it may be acting as a proxy here for something else; ideally, that information would come from PHA as it's more likely to be correct.

NG confirms this is the field they want. Will proceed to that spec.

SamStudio8 removed the question label on Mar 26, 2021
SamStudio8 self-assigned this on Mar 29, 2021

SamStudio8 commented

make_genomes_table.py is responsible for pulling the metadata from the core metadata file and zipping it to the genomes. adm1 and collection_pillar are trivial; however, published_date is not in scope. Ideally we'd achieve this request with a Majora dataview, but they are still too slow at whole-dataset scale (SamStudio8/majora2#27).

Given the amount of work required to address the performance constraints of Majora's MDV API endpoint, we'll need to tide this over with something. The easiest solution will be to pull all pairs of published_name and published_date and mix these into the genome table. Ideally, long term, everything would move to a faster version of the MDV endpoint.
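
For illustration only, a minimal sketch of the kind of join described above. The field names come from this issue, but the file layout and helper names are hypothetical rather than the actual make_genomes_table.py code:

    import csv

    def load_published_dates(pairs_csv):
        # Hypothetical input: a CSV of published_name,published_date pairs pulled from Majora.
        with open(pairs_csv) as fh:
            return {row["published_name"]: row["published_date"] for row in csv.DictReader(fh)}

    def annotate_genome_rows(genome_rows, published_dates):
        # adm1 and collection_pillar already come from the core metadata file; here we only
        # mix the published_date into each genome table row, keyed on published_name.
        for row in genome_rows:
            row["published_date"] = published_dates.get(row["published_name"], "")
            yield row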

SamStudio8 commented

As it happens, the get pag endpoint used to kick off Asklepian should have all the metadata in scope -- this would be a good stepping stone towards my ideal solution: we'd cut out the core metadata table and leverage the API instead. In a parallel universe where I have time to re-architect the MDV API, it would be quite easy to switch get pag over to use it.

SamStudio8 commented

Suspicion confirmed, the get pag API has everything we need.

SamStudio8 commented

SamStudio8/asklepian@db3387b adds an updated genome table script that will push a test_v2 copy of the genome table until we are ready to switch over.

SamStudio8 commented Mar 29, 2021

Changes deployed ready for tomorrow's Asklepian.

  • Confirm test table working as expected (2021-03-30 PM) -- confirmed 20210330@1800
  • Notify NG and test output
  • Notify CJ
  • Switch test_v2 table to be default output and remove v1 script - 20210420

SamStudio8 commented Apr 1, 2021

As per discussion with DG, compressing the v2 genome table as of today (#43).

SamStudio8 changed the title from "[asklepian] Add fields to PHA genomes table" to "[asklepian] Add fields to PHA genomes table, compress outputs" on Apr 15, 2021

SamStudio8 commented

CJ's team has picked this up now. Hopefully we can make the switch soon.

SamStudio8 commented Apr 20, 2021

SamStudio8/asklepian@71ca455 deprecates the v1 genome table and removes the test_ prefix from the v2 table. The v1 genome table will not be generated from 2021-04-21.

The v2 genome table will not be automatically deleted (as it usually would be), in case we need to resend it or drop the new columns to recreate the v1 table for whatever reason. Once we're happy, we can return to deleting it as usual.

  • Check table was ingested successfully -- 20210423
  • Return to deleting genome table after sending to Azure as default
  • Also move variant table to compressed format

SamStudio8 commented Apr 22, 2021

Ingest failed on the other side; engineers are investigating. See JIRA tickets EDGE-2004, EDGE-2152 and DA-7013.

SamStudio8 commented

Chased this up with the engineers on the other side. It appears the compression was not taken into account? Regardless, the ingest issue appears to be resolved now.

SamStudio8 commented

NG confirms the genomes table ingested has the Sequence field as MSA (#61) so the ingest must be the latest data, hooray! 🎉 🦜

SamStudio8 commented

Going to chase up compression on the variant table this week to try and close this.

SamStudio8 changed the title from "[asklepian] Add fields to PHA genomes table, compress outputs" to "[asklepian] Compress outputs" on May 18, 2021

SamStudio8 commented

Moving this to the backlog (#62) as the change process on the other side is moving so slowly.

SamStudio8 commented

Discussed this with CG and have agreed to compress the variant table starting with tomorrow's run (20210604). I will add the gzip step to the Asklepian go.sh after today's (20210603) run has completed and notify CG. CG will update the ingest pipeline on their end to expect a gzipped input (like the genome table) after the 20210603 ingest completes later today. We will monitor the pipeline closely tomorrow to ensure continuity.

SamStudio8 commented

Change implemented by SamStudio8/asklepian@064817e. Output filename will now be suffixed with .gz: variant_table_$DATESTAMP.csv.gz. CG notified and acknowledged.
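
For illustration, the same idea sketched in Python rather than the actual go.sh change: streaming the rows straight through the gzip module produces the .csv.gz at source. The function and row format here are hypothetical:

    import csv
    import gzip

    def write_variant_table(rows, datestamp):
        # Hypothetical writer: stream variant rows straight into a gzip container so the
        # output is produced compressed at source, e.g. variant_table_20210604.csv.gz.
        out_name = "variant_table_%s.csv.gz" % datestamp
        with gzip.open(out_name, "wt", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerows(rows)
        return out_name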

SamStudio8 removed the waiting label on Jun 3, 2021

SamStudio8 commented

CG confirms partner change has been performed on their side. Green light for tomorrow 🚀

SamStudio8 commented

Compressed variant table written and sent. Variant table step was around 15 minutes faster compared to yesterday, and the compression ratio in the new CSV is around 7x.

SamStudio8 commented

The reinflated CSV is where it is supposed to be on CLIMB-COVID, and the downstream asklepian-db step has run successfully. Spoken to CG on the other end and the gzipped variant table is processing on the other side! 🔥 🚀

SamStudio8 commented

So apparently gzip files and Apache Spark are not friends (http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=RGkJacJAW3tTOQQA@mail.gmail.com%3E) and this change has caused performance trouble on the other side, as Spark is not able to split up the input for efficiency. That link mentions Snappy-compressed files are splittable, so we can try that.
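
For context, a quick way to see the problem on the Spark side (this is PHA's pipeline, not ours, so this is only an assumed PySpark sketch with a placeholder filename): a gzipped CSV arrives as a single partition, so one task has to churn through the whole table.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("splittability-check").getOrCreate()

    # gzip is not a splittable codec, so the whole file lands in one partition
    # and a single task does all the work.
    df = spark.read.csv("variant_table_20210604.csv.gz", header=True)
    print(df.rdd.getNumPartitions())  # expect 1 for the .gz input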

SamStudio8 commented

We're rolling this change back for the weekend and will experiment with Snappy compression (or alternatives) next week.

SamStudio8 commented

Reverted on our side by SamStudio8/asklepian@e4511de.

SamStudio8 commented

Reverted by CG on the other side.

SamStudio8 commented

Installed python-snappy as it has a module that binds the snappy library for easy use from the CLI, because obviously snappy is so hipster they can't possibly just distribute a binary. Sent over test_variant_table_20210604.csv.snappy just to try it out.

SamStudio8 commented

As a quick sanity check, the round trip of cat unsnappy | python -m snappy -c > snappy followed by python -m snappy -d snappy > unsnappy does give us the same file back.
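
The same sanity check via the python-snappy API rather than the CLI, with placeholder filenames; as far as I can tell this corresponds to the stream/framing format the CLI writes by default:

    import snappy  # python-snappy

    # Compress and decompress with the stream (framed) format, then check the bytes match.
    with open("variant_table.csv", "rb") as src, open("variant_table.csv.snappy", "wb") as dst:
        snappy.stream_compress(src, dst)

    with open("variant_table.csv.snappy", "rb") as src, open("roundtrip.csv", "wb") as dst:
        snappy.stream_decompress(src, dst)

    with open("variant_table.csv", "rb") as a, open("roundtrip.csv", "rb") as b:
        assert a.read() == b.read()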

SamStudio8 commented

Naturally that file did not work, because of course there are different, poorly documented codecs for snappy. Sent a replacement generated with -t hadoop_snappy, which is more likely to work according to some stranger on Stack Overflow.

SamStudio8 commented Jun 5, 2021

CG reports the hadoop_snappy file was not splittable either. This SO article (https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable) conflicts with yesterday's reading and says that whole files compressed with Snappy won't be splittable after all. Given this was supposed to be a sticky plaster before we could get to implementing the incremental tables, I don't want to spend too long delving into wtf is going on here, and I didn't really like the half-finished look of Snappy anyway.

Interestingly, another SO answer (https://stackoverflow.com/a/25888475/2576437) mentions that bzip2 and LZ4 (via https://github.com/fingltd/4mc) are supposed to be splittable, and those are totally normal compression algorithms.

SamStudio8 commented

CG confirms bzip2 is splittable 🎉 Problematically, it also seems to be the slowest compression option we've tried. SN will do a couple of naive time tests to see what the impact of swapping to bzip2 would be. It may be that a small compression-time penalty on the CLIMB side to speed up the PHA Spark side will be the best compromise.
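
A naive timing harness along those lines might look like the sketch below, assuming plain CSV inputs and Python's stdlib gzip/bz2 modules rather than whatever tooling is actually used:

    import bz2
    import gzip
    import time

    def time_compress(opener, src_path, dst_path):
        # Naive wall-clock timing of a straight stream copy through the given compressor.
        start = time.perf_counter()
        with open(src_path, "rb") as src, opener(dst_path, "wb") as dst:
            for chunk in iter(lambda: src.read(1 << 20), b""):
                dst.write(chunk)
        return time.perf_counter() - start

    print("gzip :", time_compress(gzip.open, "genome_table.csv", "genome_table.csv.gz"))
    print("bzip2:", time_compress(bz2.open, "genome_table.csv", "genome_table.csv.bz2"))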

SamStudio8 commented

The genome table example for bzip2 has been running for significantly longer than gzip did. From the bzip2 manual (quoted below), it would seem that the genomic strings are quite likely the worst-case input for compression.

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1.

My suggestion is that we continue to gzip the genome table for transfer to PHE. Even though the PHE ingest will be unsplit, it remains reasonably fast and stable (the table grows linearly), and we save precious time and I/O by having the table compressed at source.

I'll do some variant table tests when I get the final wall time of the bzip2 test.

SamStudio8 commented

Genome table takes 22m to process and gzip, 112m to process and bzip2. Will try the variant table now.

SamStudio8 commented

79m to process and gzip the variant table, 88m to process and bzip2 it. The roughly 10m delay on this side is certainly worth the penalty, given there is an order of magnitude (or so) difference in processing the variant table on the other side as a splittable format (or not). Will discuss with CG.

SamStudio8 commented

Closing due to lack of interest
