
Note that highest compression levels are not useful in cloud storage #182

Open · jeromekelleher opened this issue on Dec 4, 2024 · 6 comments

@jeromekelleher (Contributor)

An interesting and counterintuitive observation we should make is that trying to achieve the highest possible compression for call_genotype is actually pointless. From @benjeffery's experiments on S3, we can decode to RAM at ~42GiB/s with about 32 cores using the standard compressor settings, which emphasise compression ratio. This corresponds to a network throughput of only about 100MiB/s, which is nowhere near what the instance is capable of. So, if we used a faster codec with a slightly lower compression ratio, I think we could increase that throughput (or at least achieve it with fewer cores).

The neat thing is that S3 charges by object access, not by data volume, so it costs the same to store the slightly larger chunks as the smaller ones.
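
To make the tradeoff concrete, here's a rough, self-contained sketch (not the benchmark code from the experiments; it uses random int8 data as a stand-in for a call_genotype chunk, and the codec settings are just illustrative) comparing a high-ratio Blosc/zstd configuration with a faster Blosc/lz4 one:

```python
import time

import numcodecs
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a call_genotype chunk: small integers, fairly compressible.
data = rng.integers(0, 3, size=(10_000, 1_000, 2), dtype=np.int8)

codecs = {
    "high ratio (zstd, clevel=7)": numcodecs.Blosc(
        cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE
    ),
    "fast (lz4, clevel=1)": numcodecs.Blosc(
        cname="lz4", clevel=1, shuffle=numcodecs.Blosc.BITSHUFFLE
    ),
}

for name, codec in codecs.items():
    compressed = codec.encode(data)
    start = time.perf_counter()
    codec.decode(compressed)
    elapsed = time.perf_counter() - start
    ratio = data.nbytes / len(compressed)
    print(
        f"{name}: ratio {ratio:.1f}x, "
        f"single-core decode {data.nbytes / elapsed / 2**30:.2f} GiB/s"
    )
```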

@benjeffery (Contributor)

Two important points:

  • 42GiB/s was achieved with 32 processes using 96 cores. Quicker decompression would free that compute for useful work on the genotypes while keeping the same high throughput.

  • S3 charges for both access and storage, so the ideal compression ratio depends on how often you expect to read the data. If you read the data a lot, your costs are dominated by access costs and you don't mind slightly larger chunks. If the use is archival, high compression matters more (a rough cost sketch follows below).
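
To put rough numbers on that second point, here's a back-of-the-envelope sketch. The prices, dataset sizes and chunk count are assumptions made up for the example, not figures from the experiments:

```python
# Rough, illustrative cost model -- the prices below are assumptions for the
# sake of the example, not figures from the experiments.
STORAGE_USD_PER_GB_MONTH = 0.023  # assumed S3 Standard storage price
GET_USD_PER_1000 = 0.0004         # assumed S3 GET request price

def monthly_cost(size_gb, n_chunks, full_reads_per_month):
    storage = size_gb * STORAGE_USD_PER_GB_MONTH
    access = full_reads_per_month * n_chunks * GET_USD_PER_1000 / 1000
    return storage + access

# Hypothetical dataset: 1000 GB at the highest compression ratio vs 1200 GB with
# a faster, lower-ratio codec; the chunk count (hence request cost) is the same.
N_CHUNKS = 1_000_000
for reads in (0, 10, 100):
    high = monthly_cost(1000, N_CHUNKS, reads)
    fast = monthly_cost(1200, N_CHUNKS, reads)
    print(f"{reads:>3} full reads/month: high-ratio ${high:.2f}, fast ${fast:.2f}")
```

In this toy model the extra storage cost of the faster codec is a fixed few dollars a month, while the access cost scales with how much you read, so the more you read the less the compression ratio matters.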

@jeromekelleher (Contributor, Author)

> 42GiB/s was achieved with 32 processes using 96 cores.

Is each of these processes using more than one thread? If not, then one process = one core, right? I'm assuming the higher performance on the 96-core machine is due to more memory bandwidth or cache or something.

@jeromekelleher transferred this issue from sgkit-dev/sgkit-publication on Dec 5, 2024
@benjeffery (Contributor)

Each process uses more than one thread to fetch and multiple threads to decompress. The notebooks are in the airlock, so hopefully we can go over them on Monday.
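
For reference, the shape of that pattern within a single process is roughly the sketch below. This is not the notebook code: the in-memory `store` stands in for S3, and the worker counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

import numcodecs
import numpy as np

codec = numcodecs.Blosc(cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE)
rng = np.random.default_rng(0)

# Pre-built compressed "objects" standing in for chunks stored in S3.
store = {
    f"chunk/{i}": codec.encode(
        rng.integers(0, 3, size=(1_000, 1_000, 2), dtype=np.int8)
    )
    for i in range(16)
}

def fetch(key):
    # In the real setup this would be an S3 GET (e.g. via s3fs or boto3).
    return store[key]

# One pool of threads fetches compressed chunks, another decompresses them.
with ThreadPoolExecutor(max_workers=8) as fetchers, \
        ThreadPoolExecutor(max_workers=4) as decoders:
    compressed = fetchers.map(fetch, store)
    decoded = list(decoders.map(codec.decode, compressed))

print(f"decoded {len(decoded)} chunks")
```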

@benjeffery (Contributor)

I should also add that the 96 cores are logical cores; only 48 are physical.

@tomwhite (Contributor) commented Dec 6, 2024

This sounds great! I'm interested to see how this can translate to processing with Cubed.

Is the machine on EC2 (if so, which instance type is it?), or are you reading from S3 from a machine outside AWS? Also, which compressors are you using?

@benjeffery (Contributor)

@tomwhite I've just added the notebooks in #184. This is an EC2 machine on AWS, and the files are compressed with blosc (details in the notebook).
