[asklepian] Compress outputs #37
Checking whether NG confirms this is the field they want. Will proceed to that spec.
make_genomes_table.py is responsible for pulling the metadata from the core metadata file and zipping it to the genomes. Given the amount of work required to address the performance constraints of Majora's MDV API endpoint, we'll need to tide this over with something. Easiest solution will be to pull all pairs of …
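For context, a minimal sketch of the kind of metadata-to-genome join described above. The field names, file paths, and helper functions are illustrative assumptions, not the actual contents of make_genomes_table.py:

```python
# Illustrative sketch only: join metadata onto genome sequences by sample ID and
# write one wide CSV. Field names, paths, and helpers are assumptions, not the
# actual contents of make_genomes_table.py.
import csv

def load_metadata(metadata_path, key_field="central_sample_id"):
    """Map sample ID -> metadata row for the fields we want to push downstream."""
    with open(metadata_path, newline="") as fh:
        return {row[key_field]: row for row in csv.DictReader(fh)}

def read_fasta(fasta_path):
    """Yield (sample_id, sequence) pairs from a FASTA file."""
    name, seq = None, []
    with open(fasta_path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if name:
                    yield name, "".join(seq)
                name, seq = line[1:].split()[0], []
            else:
                seq.append(line)
    if name:
        yield name, "".join(seq)

def make_genomes_table(metadata_path, fasta_path, out_path, fields):
    metadata = load_metadata(metadata_path)
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(fields + ["sequence"])
        for sample_id, sequence in read_fasta(fasta_path):
            row = metadata.get(sample_id)
            if row:  # only emit genomes we can pair with metadata
                writer.writerow([row.get(f, "") for f in fields] + [sequence])
```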
As it happens, the …
Suspicion confirmed, the …
SamStudio8/asklepian@db3387b adds an updated genome table script that will push a …
Changes deployed, ready for tomorrow's Asklepian.
As per discussion with DG, compressing the v2 genomics table as of today (#43).
CJ's team has picked this up now. Hopefully we can make the switch soon.
SamStudio8/asklepian@71ca455 deprecates the v1 genome table and removes the test_ prefix from the v2 table. The v1 genome table will not be generated as of 2021-04-21. The v2 genomes table will not be automatically deleted (as it usually would be), in case we need to resend it or drop the columns to recreate the v1 table for whatever reason. Once we're happy, we can return to deleting it as usual.
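Should the v1 layout ever need to be recreated from the retained v2 table, that amounts to dropping the v2-only column(s). A minimal sketch with a placeholder column name, since the actual added field isn't spelled out in this thread:

```python
# Sketch: recreate a v1-style table from the retained v2 CSV by dropping the
# v2-only column(s). "extra_field" is a placeholder; the real added column
# name is not given in this thread.
import csv

V2_ONLY_COLUMNS = {"extra_field"}  # hypothetical

with open("genomes_table_v2.csv", newline="") as src, \
        open("genomes_table_v1.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    kept = [c for c in reader.fieldnames if c not in V2_ONLY_COLUMNS]
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
```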
Ingest failed on the other side; engineers investigating. See JIRA.
Chased this up with the engineers on the other side. Appears the compression was not taken into account? Regardless, the issue with ingest appears to be resolved now.
NG confirms the genomes table ingested has the …
Going to chase up compression on the variant table this week to try and close this.
Moving this to backlog #62 as the change process on the other side is moving so slowly.
Discussed this with CG and have agreed to compress the variant table starting with tomorrow's run (20210604). I will add the …
Change implemented in SamStudio8/asklepian@064817e. The output filename will now be suffixed with …
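For reference, a minimal sketch of writing a table as gzip-compressed CSV on the fly; filenames and the header are placeholders, not the actual asklepian change:

```python
# Sketch: write the variant table as gzip-compressed CSV instead of plain text.
# Filenames and the header are placeholders; the real change is in
# SamStudio8/asklepian@064817e.
import csv
import gzip

def write_variant_table(rows, out_path="variant_table.csv.gz"):
    # "wt" opens the gzip stream in text mode so csv.writer can write strings.
    with gzip.open(out_path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_id", "position", "ref", "alt"])  # illustrative header
        writer.writerows(rows)

write_variant_table([("SAMPLE-1", 3037, "C", "T")])
```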
CG confirms the partner change has been performed on their side. Green light for tomorrow 🚀
Compressed variant table written and sent. The variant table step was around 15 minutes faster than yesterday, and the compression ratio of the new CSV is around 7x.
Reinflated CSV is where it is supposed to be on CLIMB-COVID, downstream …
So apparently gzip files and Apache Spark are not friends (http://mail-archives.apache.org/mod_mbox/spark-user/201310.mbox/%3CCANDWdjY2hN-=jXTSNZ8JHZ=G-S+ZKLNze=RGkJacJAW3tTOQQA@mail.gmail.com%3E) and this change has caused performance trouble on the other side, as Spark is not able to split up the input for efficiency. That link mentions Snappy compressed files are splittable, so we can try that.
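The effect can be seen directly from the partition count; a hedged PySpark sketch with placeholder paths: a gzip-compressed CSV cannot be split, so the whole file is read by a single task, whereas a splittable codec such as bzip2 can be partitioned across tasks.

```python
# Sketch: check how many partitions Spark gives each input. A gzip-compressed
# CSV is not splittable, so the whole file is handled by one task; bzip2 (and
# plain text) can be split across tasks. Paths here are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("splittability-check").getOrCreate()

gz_df = spark.read.csv("variant_table.csv.gz", header=True)
print("gzip partitions :", gz_df.rdd.getNumPartitions())   # typically 1

bz2_df = spark.read.csv("variant_table.csv.bz2", header=True)
print("bzip2 partitions:", bz2_df.rdd.getNumPartitions())  # can be > 1

spark.stop()
```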
We're rolling this change back for the weekend and will experiment with Snappy compression (or alternatives) next week.
Reverted on our side by SamStudio8/asklepian@e4511de.
Reverted by CG on the other side.
Installed …
As a quick sanity check, the …
Naturally, that file did not work because of course there are different, poorly documented codecs for snappy. Sent a replacement generated with …
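For illustration, a minimal python-snappy sketch of two of the incompatible snappy flavours in play (raw block compression versus the framing/stream format); which flavour was actually sent here isn't stated, and Hadoop's SnappyCodec uses yet another container on top of these:

```python
# Sketch: snappy files come in (at least) two incompatible on-disk flavours from
# python-snappy alone: raw block compression and the snappy framing/stream
# format. Hadoop's SnappyCodec wraps blocks in yet another container, so a
# consumer expecting one flavour will reject the others.
import snappy

data = b"sample_id,position,ref,alt\n" * 1000  # stand-in for a table chunk

# Raw block format: just the compressed bytes, no framing at all.
raw_blob = snappy.compress(data)

# Framing/stream format, written via the file-object API.
with open("table.csv", "wb") as fh:
    fh.write(data)
with open("table.csv", "rb") as src, open("table.csv.snappy", "wb") as dst:
    snappy.stream_compress(src, dst)
```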
CG reports the … Interestingly, another SO answer (https://stackoverflow.com/a/25888475/2576437) mentions bzip2 and LZ4 (via https://github.com/fingltd/4mc) are supposed to be splittable, and those are totally normal compression algorithms.
CG confirms bzip2 is splittable 🎉 Problematically, it also seems to be the slowest compression option we've tried. SN will do a couple of naive …
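Along the lines of those naive tests, a hedged sketch of a wall-clock comparison using Python's standard-library gzip and bz2 modules (paths are placeholders):

```python
# Sketch: naive wall-clock comparison of gzip vs bzip2 on an existing table.
# Paths are placeholders; this only measures compression time, not Spark-side
# read performance.
import bz2
import gzip
import shutil
import time

def time_codec(open_func, src_path, dst_path):
    """Compress src_path into dst_path with the given codec and return seconds taken."""
    start = time.monotonic()
    with open(src_path, "rb") as src, open_func(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, length=1024 * 1024)
    return time.monotonic() - start

print("gzip :", time_codec(gzip.open, "variant_table.csv", "variant_table.csv.gz"))
print("bzip2:", time_codec(bz2.open, "variant_table.csv", "variant_table.csv.bz2"))
```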
The genome table example for …
My suggestion is we continue to … I'll do some variant table tests when I get the final wall time of the bzip2 test.
Genome table takes 22m to process and …
79m to process and …
Closing due to lack of interest |