[asklepian] Compress outputs #37

Closed · SamStudio8 opened this issue Mar 22, 2021 · 32 comments

SamStudio8 commented Mar 22, 2021

Fields to add to the PHA genomes table:
  • adm1
  • published_date
  • collection_pillar

SamStudio8 added the enhancement, question, metadata and outbound-pha labels on Mar 22, 2021

SamStudio8 commented Mar 22, 2021

Checking whether collection_pillar is strictly necessary. I think it may be acting as a proxy here for something else; ideally, that information would come from PHA as it's more likely to be correct.

NG confirms this is the field they want. Will proceed to that spec.

SamStudio8 removed the question label on Mar 26, 2021
SamStudio8 self-assigned this on Mar 29, 2021

SamStudio8 commented

make_genomes_table.py is responsible for pulling the metadata from the core metadata file and zipping it to the genomes. adm1 and collection_pillar are trivial; however, published_date is not in scope. Ideally we'd achieve this request with a Majora dataview, but they are still too slow at whole-dataset scale (SamStudio8/majora2#27).

Given the amount of work required to address the performance constraints of Majora's MDV API endpoint, we'll need to tide this over with something. The easiest solution will be to pull all pairs of published_name and published_date and mix these into the genome table. Ideally, long term, everything would move to a faster version of the MDV endpoint.
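
For illustration only, a minimal sketch of the kind of join described above. The field names come from this issue, but the file layout and helper names are hypothetical rather than the actual make_genomes_table.py code:

    import csv

    def load_published_dates(pairs_csv):
        # Hypothetical input: a CSV of published_name,published_date pairs pulled from Majora.
        with open(pairs_csv) as fh:
            return {row["published_name"]: row["published_date"] for row in csv.DictReader(fh)}

    def annotate_genome_rows(genome_rows, published_dates):
        # adm1 and collection_pillar already come from the core metadata file; here we only
        # mix the published_date into each genome table row, keyed on published_name.
        for row in genome_rows:
            row["published_date"] = published_dates.get(row["published_name"], "")
            yield row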

SamStudio8 commented

As it happens, the get pag endpoint used to kick off Asklepian should have all the metadata in scope -- this would be a good stepping stone towards my ideal solution: we'd cut out the core metadata table and leverage the API instead. In a parallel universe where I have time to re-architect the MDV API, it would be quite easy to switch get pag over to use it.

SamStudio8 commented

Suspicion confirmed, the get pag API has everything we need.

SamStudio8 commented

SamStudio8/asklepian@db3387b adds an updated genome table script that will push a test_v2 copy of the genome table until we are ready to switch over.

SamStudio8 commented Mar 29, 2021

Changes deployed ready for tomorrow's Asklepian.

  • Confirm test table working as expected (2021-03-30 PM) -- confirmed 20210330@1800
  • Notify NG and test output
  • Notify CJ
  • Switch test_v2 table to be default output and remove v1 script - 20210420

SamStudio8 commented Apr 1, 2021

As per discussion with DG, compressing the v2 genome table as of today (#43).

SamStudio8 changed the title from "[asklepian] Add fields to PHA genomes table" to "[asklepian] Add fields to PHA genomes table, compress outputs" on Apr 15, 2021

SamStudio8 commented

CJ's team has picked this up now. Hopefully we can make the switch soon.

SamStudio8 commented Apr 20, 2021

SamStudio8/asklepian@71ca455 deprecates the v1 genome table and removes the test_ prefix from the v2 table. The v1 genome table will not be generated from 2021-04-21.

The v2 genome table will not be automatically deleted (as it usually would be), in case we need to resend it or drop the new columns to recreate the v1 table for whatever reason. Once we're happy, we can return to deleting it as usual.

  • Check table was ingested successfully -- 20210423
  • Return to deleting genome table after sending to Azure as default
  • Also move variant table to compressed format

SamStudio8 commented Apr 22, 2021

Ingest failed on the other side; engineers are investigating. See JIRA tickets EDGE-2004, EDGE-2152 and DA-7013.

SamStudio8 commented

Chased this up with the engineers on the other side. It appears the compression was not taken into account? Regardless, the ingest issue appears to be resolved now.

SamStudio8 commented

NG confirms the genomes table ingested has the Sequence field as MSA (#61) so the ingest must be the latest data, hooray! 🎉 🦜

SamStudio8 commented

Going to chase up compression on the variant table this week to try and close this.

SamStudio8 changed the title from "[asklepian] Add fields to PHA genomes table, compress outputs" to "[asklepian] Compress outputs" on May 18, 2021

SamStudio8 commented

Moving this to the backlog (#62) as the change process on the other side is moving so slowly.

SamStudio8 commented

Discussed this with CG and have agreed to compress the variant table starting with tomorrow's run (20210604). I will add the gzip step to the Asklepian go.sh after today's (20210603) run has completed and notify CG. CG will update the ingest pipeline on their end to expect a gzipped input (like the genome table) after the 20210603 ingest completes later today. We will monitor the pipeline closely tomorrow to ensure continuity.

SamStudio8 commented

Change implemented by SamStudio8/asklepian@064817e. Output filename will now be suffixed with .gz: variant_table_$DATESTAMP.csv.gz. CG notified and acknowledged.
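
For illustration, the same idea sketched in Python rather than the actual go.sh change: streaming the rows straight through the gzip module produces the .csv.gz at source. The function and row format here are hypothetical:

    import csv
    import gzip

    def write_variant_table(rows, datestamp):
        # Hypothetical writer: stream variant rows straight into a gzip container so the
        # output is produced compressed at source, e.g. variant_table_20210604.csv.gz.
        out_name = "variant_table_%s.csv.gz" % datestamp
        with gzip.open(out_name, "wt", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerows(rows)
        return out_name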

SamStudio8 removed the waiting label on Jun 3, 2021

SamStudio8 commented

CG confirms partner change has been performed on their side. Green light for tomorrow 🚀

SamStudio8 commented

Compressed variant table written and sent. Variant table step was around 15 minutes faster compared to yesterday, and the compression ratio in the new CSV is around 7x.

SamStudio8 commented

The reinflated CSV is where it is supposed to be on CLIMB-COVID, and the downstream asklepian-db step has run successfully. Spoken to CG on the other end and the gzipped variant table is processing on the other side! 🔥 🚀

SamStudio8 commented

So apparently gzip files and Apache Spark are not friends (http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=RGkJacJAW3tTOQQA@mail.gmail.com%3E) and this change has caused performance trouble on the other side, as Spark is not able to split up the input for efficiency. That link mentions Snappy-compressed files are splittable, so we can try that.
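
For context, a quick way to see the problem on the Spark side (this is PHA's pipeline, not ours, so this is only an assumed PySpark sketch with a placeholder filename): a gzipped CSV arrives as a single partition, so one task has to churn through the whole table.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("splittability-check").getOrCreate()

    # gzip is not a splittable codec, so the whole file lands in one partition
    # and a single task does all the work.
    df = spark.read.csv("variant_table_20210604.csv.gz", header=True)
    print(df.rdd.getNumPartitions())  # expect 1 for the .gz input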

SamStudio8 commented

We're rolling this change back for the weekend and will experiment with Snappy compression (or alternatives) next week.

SamStudio8 commented

Reverted on our side by SamStudio8/asklepian@e4511de.

SamStudio8 commented

Reverted by CG on the other side.

SamStudio8 commented

Installed python-snappy as it has a module that binds the snappy library for easy use from the CLI, because obviously snappy is so hipster they can't possibly just distribute a binary. Sent over test_variant_table_20210604.csv.snappy just to try it out.

SamStudio8 commented

As a quick sanity check, the round trip of cat unsnappy | python -m snappy -c > snappy followed by python -m snappy -d snappy > unsnappy does give us the same file back.
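
The same sanity check via the python-snappy API rather than the CLI, with placeholder filenames; as far as I can tell this corresponds to the stream/framing format the CLI writes by default:

    import snappy  # python-snappy

    # Compress and decompress with the stream (framed) format, then check the bytes match.
    with open("variant_table.csv", "rb") as src, open("variant_table.csv.snappy", "wb") as dst:
        snappy.stream_compress(src, dst)

    with open("variant_table.csv.snappy", "rb") as src, open("roundtrip.csv", "wb") as dst:
        snappy.stream_decompress(src, dst)

    with open("variant_table.csv", "rb") as a, open("roundtrip.csv", "rb") as b:
        assert a.read() == b.read()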

SamStudio8 commented

Naturally that file did not work, because of course there are different, poorly documented codecs for snappy. Sent a replacement generated with -t hadoop_snappy, which is more likely to work according to some stranger on Stack Overflow.

SamStudio8 commented Jun 5, 2021

CG reports the hadoop_snappy file was not splittable either. This SO article (https://stackoverflow.com/questions/32382352/is-snappy-splittable-or-not-splittable) conflicts with yesterday's reading and says that whole files compressed with Snappy won't be splittable after all. Given this was supposed to be a sticky plaster before we could get to implementing the incremental tables, I don't want to spend too long delving into wtf is going on here, and I didn't really like the half-finished look of Snappy anyway.

Interestingly, another SO answer (https://stackoverflow.com/a/25888475/2576437) mentions that bzip2 and LZ4 (via https://github.com/fingltd/4mc) are supposed to be splittable, and those are totally normal compression algorithms.

SamStudio8 commented

CG confirms bzip2 is splittable 🎉 Problematically, it also seems to be the slowest compression option we've tried. SN will do a couple of naive time tests to see what the impact of swapping to bzip2 would be. It may be that a small compression-time penalty on the CLIMB side to speed up the PHA Spark side will be the best compromise.
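
A naive timing harness along those lines might look like the sketch below, assuming plain CSV inputs and Python's stdlib gzip/bz2 modules rather than whatever tooling is actually used:

    import bz2
    import gzip
    import time

    def time_compress(opener, src_path, dst_path):
        # Naive wall-clock timing of a straight stream copy through the given compressor.
        start = time.perf_counter()
        with open(src_path, "rb") as src, opener(dst_path, "wb") as dst:
            for chunk in iter(lambda: src.read(1 << 20), b""):
                dst.write(chunk)
        return time.perf_counter() - start

    print("gzip :", time_compress(gzip.open, "genome_table.csv", "genome_table.csv.gz"))
    print("bzip2:", time_compress(bz2.open, "genome_table.csv", "genome_table.csv.bz2"))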

SamStudio8 commented

The genome table example for bzip2 has been running for significantly longer than gzip did. From the bzip2 manual (quoted below), it would seem that the genomic strings are quite likely the worst-case input for compression.

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal. Versions 0.9.5 and above fare much better than previous versions in this respect. The ratio between worst-case and average-case compression time is in the region of 10:1.

My suggestion is that we continue to gzip the genome table for transfer to PHE. Even though the PHE ingest will be unsplit, it remains reasonably fast and stable (the table grows linearly), and we save precious time and I/O by having the table compressed at source.

I'll do some variant table tests when I get the final wall time of the bzip2 test.

SamStudio8 commented

Genome table takes 22m to process and gzip, 112m to process and bzip2. Will try the variant table now.

SamStudio8 commented

79m to process and gzip the variant table, 88m to process and bzip2 it. The roughly 10m delay on this side is certainly worth the penalty, given there is an order of magnitude (or so) difference in processing the variant table on the other side as a splittable format (or not). Will discuss with CG.

SamStudio8 commented

Closing due to lack of interest
