
Note that highest compression levels are not useful in cloud storage #182

Open · jeromekelleher opened this issue on Dec 4, 2024 · 6 comments

@jeromekelleher (Contributor)

An interesting and counterintuitive observation we should make is that trying to achieve the highest possible compression for call_genotype is actually pointless. From @benjeffery's experiments on S3, we can decode to RAM at ~42GiB/s with about 32 cores using the standard compressor settings, which emphasise compression ratio. This corresponds to a network throughput of only about 100MiB/s, which is nowhere near what the instance is capable of. So, if we used a faster codec with a slightly lower compression ratio, I think we could increase that throughput (or at least achieve it with fewer cores).

The neat thing is that S3 charges by object access, not by data volume, so it costs the same to store the slightly larger chunks as the smaller ones.
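
To make the tradeoff concrete, here's a rough, self-contained sketch (not the benchmark code from the experiments; it uses random int8 data as a stand-in for a call_genotype chunk, and the codec settings are just illustrative) comparing a high-ratio Blosc/zstd configuration with a faster Blosc/lz4 one:

```python
import time

import numcodecs
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a call_genotype chunk: small integers, fairly compressible.
data = rng.integers(0, 3, size=(10_000, 1_000, 2), dtype=np.int8)

codecs = {
    "high ratio (zstd, clevel=7)": numcodecs.Blosc(
        cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE
    ),
    "fast (lz4, clevel=1)": numcodecs.Blosc(
        cname="lz4", clevel=1, shuffle=numcodecs.Blosc.BITSHUFFLE
    ),
}

for name, codec in codecs.items():
    compressed = codec.encode(data)
    start = time.perf_counter()
    codec.decode(compressed)
    elapsed = time.perf_counter() - start
    ratio = data.nbytes / len(compressed)
    print(
        f"{name}: ratio {ratio:.1f}x, "
        f"single-core decode {data.nbytes / elapsed / 2**30:.2f} GiB/s"
    )
```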

@benjeffery (Contributor)

Two important points:

  • 42GiB/s was achieved with 32 processes using 96 cores. Quicker decompression would free that compute for useful work on the genotypes while keeping the same high throughput.

  • S3 charges for both access and storage, so the ideal compression ratio depends on how often you expect to read the data. If you read the data a lot, your costs are dominated by access costs and you don't mind slightly larger chunks. If the use is archival, high compression matters more (a rough cost sketch follows below).
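
To put rough numbers on that second point, here's a back-of-the-envelope sketch. The prices, dataset sizes and chunk count are assumptions made up for the example, not figures from the experiments:

```python
# Rough, illustrative cost model -- the prices below are assumptions for the
# sake of the example, not figures from the experiments.
STORAGE_USD_PER_GB_MONTH = 0.023  # assumed S3 Standard storage price
GET_USD_PER_1000 = 0.0004         # assumed S3 GET request price

def monthly_cost(size_gb, n_chunks, full_reads_per_month):
    storage = size_gb * STORAGE_USD_PER_GB_MONTH
    access = full_reads_per_month * n_chunks * GET_USD_PER_1000 / 1000
    return storage + access

# Hypothetical dataset: 1000 GB at the highest compression ratio vs 1200 GB with
# a faster, lower-ratio codec; the chunk count (hence request cost) is the same.
N_CHUNKS = 1_000_000
for reads in (0, 10, 100):
    high = monthly_cost(1000, N_CHUNKS, reads)
    fast = monthly_cost(1200, N_CHUNKS, reads)
    print(f"{reads:>3} full reads/month: high-ratio ${high:.2f}, fast ${fast:.2f}")
```

In this toy model the extra storage cost of the faster codec is a fixed few dollars a month, while the access cost scales with how much you read, so the more you read the less the compression ratio matters.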

@jeromekelleher (Contributor, Author)

> 42GiB/s was achieved with 32 processes using 96 cores.

Is each of these processes using more than one thread? If not, then one process = one core, right? I'm assuming the higher performance on the 96-core machine is due to more memory bandwidth or cache or something.

@jeromekelleher transferred this issue from sgkit-dev/sgkit-publication on Dec 5, 2024
@benjeffery (Contributor)

Each process uses more than one thread to fetch and multiple threads to decompress. The notebooks are in the airlock, so hopefully we can go over them on Monday.
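
For reference, the shape of that pattern within a single process is roughly the sketch below. This is not the notebook code: the in-memory `store` stands in for S3, and the worker counts are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor

import numcodecs
import numpy as np

codec = numcodecs.Blosc(cname="zstd", clevel=7, shuffle=numcodecs.Blosc.BITSHUFFLE)
rng = np.random.default_rng(0)

# Pre-built compressed "objects" standing in for chunks stored in S3.
store = {
    f"chunk/{i}": codec.encode(
        rng.integers(0, 3, size=(1_000, 1_000, 2), dtype=np.int8)
    )
    for i in range(16)
}

def fetch(key):
    # In the real setup this would be an S3 GET (e.g. via s3fs or boto3).
    return store[key]

# One pool of threads fetches compressed chunks, another decompresses them.
with ThreadPoolExecutor(max_workers=8) as fetchers, \
        ThreadPoolExecutor(max_workers=4) as decoders:
    compressed = fetchers.map(fetch, store)
    decoded = list(decoders.map(codec.decode, compressed))

print(f"decoded {len(decoded)} chunks")
```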

@benjeffery (Contributor)

I should also add that the 96 cores are logical cores; only 48 are physical.

@tomwhite (Contributor) commented Dec 6, 2024

This sounds great! I'm interested to see how this can translate to processing with Cubed.

Is the machine on EC2 (if so, which instance type is it?), or are you reading from S3 from a machine outside AWS? Also, which compressors are you using?

@benjeffery (Contributor)

@tomwhite I've just added the notebooks in #184. This is an EC2 machine on AWS, and the files are compressed with blosc (details in the notebook).
