vcztools view: write performance #94
Comments
Reducing the amount of gperftools shows a lot of activity in the Finally, I checked to see how often Python was doubling the encoding buffer (see here). I recorded 27 doublings, which is probably not significant since there are over 21k variants.
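For readers unfamiliar with the pattern, here is a minimal sketch of a "double the buffer and retry" loop that also counts the doublings. It is not the vcztools code: `encode_with_doubling` and the `encode_into` callable (which returns the number of bytes written, or -1 when the buffer is too small) are hypothetical names used only for illustration.

```python
def encode_with_doubling(encode_into, initial_size=1024):
    """Retry a hypothetical encoder with a buffer that doubles on overflow.

    encode_into(buf) is assumed to return the number of bytes written,
    or -1 if buf was too small. Returns (encoded_bytes, doublings).
    """
    size = initial_size
    doublings = 0
    while True:
        buf = bytearray(size)
        written = encode_into(buf)
        if written >= 0:
            return bytes(buf[:written]), doublings
        # Buffer was too small: double it and try again.
        size *= 2
        doublings += 1
```

Because the size grows geometrically, the number of retries is logarithmic in the largest record, which is consistent with a small doubling count relative to the number of variants.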
I don't think there's a lot we can do here, as I kept a close eye on write performance when I was developing the C code. Basically, this is about as fast as it will get without making the C code a lot more complicated. The "good enough" metric that I was going for here is to produce VCF text at a rate greater than bcftools can consume it in a pipeline. I think we're probably within that limit here? 42MB/s is disappointing though; I wonder if this is mostly due to PL fields or something.
I tried deleting the PL field from the VCZ version:
There is a slight improvement. When I run this, vcztools writes the output in bursts. I think vcztools spends time decoding the chunks and then writing them. vcztools may benefit from parallelism here: reading chunks from multiple arrays simultaneously, and reading the next set of chunks while writing output.
Yes, some sort of double-buffering approach where we decode the next variant chunk in the background while the current chunk is being written to output would definitely improve things a lot. The initial latency is still pretty horrible, but I think that's a function of our current chunk-size defaults which are too big in the variants dimension.
Just collecting some notes here... PEP-703 explains the challenges in achieving true parallelism in Python. However, Zarr supports multi-threaded parallelism by releasing the global interpreter lock whenever possible during compression and decompression operations (source). Therefore, we should be able to achieve the desired parallelism by using multiple threads to perform tasks. Ideally, we do not want to use multiple processes due to the additional overhead in starting a new Python process (~50 ms) and the need to share memory.
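As a rough illustration of the thread-based approach, the sketch below decompresses the chunks of several fields concurrently with a thread pool. It is an assumption-laden stand-in, not vcztools or Zarr code: `load_fields` is a hypothetical helper, and `zlib.decompress` stands in for whatever codec the real arrays use.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def load_fields(compressed_by_field, max_workers=4):
    """Decompress one chunk per field concurrently.

    compressed_by_field maps a field name to compressed bytes; zlib is
    only a stand-in for the real codec (an assumption for this sketch).
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit every field first so the decompressions can overlap.
        futures = {
            name: pool.submit(zlib.decompress, blob)
            for name, blob in compressed_by_field.items()
        }
        return {name: fut.result() for name, fut in futures.items()}
```

Threads share memory, so the decoded buffers need no copying between workers, which is the advantage over multiple processes noted above.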
I think a decode thread operating on a double buffer system (i.e., we decode into one buffer while the main thread writes out VCF from the other) would work well here, as we are dominated by decompression time and this does thread well.
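A minimal sketch of that idea, assuming a bounded queue as the hand-off between the decode thread and the writer. The names `decode_chunks` and `write_double_buffered` are hypothetical, and `zlib.decompress` is only a stand-in for decoding a real variant chunk.

```python
import queue
import threading
import zlib


def decode_chunks(compressed_chunks, out_queue):
    """Background producer: decode each chunk and hand it to the writer."""
    for blob in compressed_chunks:
        # Blocks while the queue is full, so decoding stays only
        # slightly ahead of the writer.
        out_queue.put(zlib.decompress(blob))
    out_queue.put(None)  # sentinel: no more chunks


def write_double_buffered(compressed_chunks, write):
    """Write decoded chunks in order while the next one is decoded."""
    buffers = queue.Queue(maxsize=1)  # at most one decoded chunk waits here
    worker = threading.Thread(
        target=decode_chunks, args=(compressed_chunks, buffers), daemon=True
    )
    worker.start()
    while (chunk := buffers.get()) is not None:
        write(chunk)  # main thread writes while the worker decodes
    worker.join()
```

The bounded queue caps how many decoded chunks are alive at once, which matters when a single decoded chunk is large relative to available RAM.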
I think you are right. I noticed that Python's memory consumption was exceeding the physical RAM available on my device, so I changed the variant chunk size to 1,000. With this smaller chunk size, Python's memory consumption stays within the amount of RAM available on my device. This improves the performance a lot and makes the output less bursty.
The profiler still shows the most activity in encoding, decoding, and converting types:
Description
As documented in #93, vcztools view is not running as fast as bcftools view on real genome data. I reproduce the performance data below.
This issue tracks understanding the performance gap and implementing optimizations to close it.
Results
cd performance
python -m compare 1
Profiling
Using gperftools' CPU profiler: