Issues with GVC: Performance, Integration, and Usability Concerns for Fair Benchmarking #2

I am benchmarking tools that balance "compression and accessibility" of genotype data, and I have run into several issues while trying to use GVC. These issues make a fair comparison with recent tools difficult, and they give users little reason to adopt GVC.

It seems that GVC is more of an "interdisciplinary experiment" (exploring the application of image encoding techniques in genomics) than a user-friendly bioinformatics tool.

1. How difficult would it be to integrate JBIG-KIT into the setup.sh script so that it is compiled automatically? Why does the user have to download and compile it separately? And why do users have to write jbigkit.py themselves instead of it being included directly in the gvc/codec directory?

2. The np.bool alias was deprecated in NumPy 1.20 and removed in NumPy 1.24 (bool or np.bool_ should be used instead), so GVC raises a runtime exception when run against such a version. If GVC cannot support recent Python and NumPy versions, providing a Dockerfile with the following entry point might be a more user-friendly option:
ENTRYPOINT ["python", "-m", "gvc"]
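
As a stopgap for benchmarking, a small compatibility shim can restore the alias before GVC is imported. This is a minimal sketch, assuming the failure is an AttributeError raised by np.bool on a NumPy release that no longer provides it. It is a workaround on the user's side, not part of GVC, and the array below is only an illustration.

```python
import numpy as np

# np.bool was deprecated in NumPy 1.20 and removed in NumPy 1.24.
# If the attribute is missing, point it at NumPy's boolean scalar type.
if not hasattr(np, "bool"):
    np.bool = np.bool_

# Code written against the old alias now runs again.
mask = np.zeros(4, dtype=np.bool)
print(mask.dtype)  # bool
```

The shim has to run before any module that references np.bool is imported. Replacing np.bool with bool in the source would be the cleaner long-term fix.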

3. GVC seems unusually slow. I'm not sure whether this is a known issue, but it is a critical problem. For example, creating an archive from a VCF.GZ with 1 sample and 13 million variant sites (file size: 60 MB) took 734.227s with GVC, while PLINK_AVX v2.0 took only 4.802s. GTC, the tool compared against in your paper, took only 31.619s.
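
To make the gap concrete, here is a minimal sketch of how such wall-clock timings can be taken and compared. The helper names time_command and slowdown are hypothetical, and the command being timed is a placeholder, since the exact invocations used for each tool are not listed here. Only the three timings are taken from the measurements above.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(argv, check=True)
    return time.perf_counter() - start


def slowdown(seconds, baseline_seconds):
    """Return how many times slower `seconds` is than `baseline_seconds`."""
    return seconds / baseline_seconds


# Placeholder command; substitute the real compressor invocation.
elapsed = time_command([sys.executable, "-c", "pass"])

# Ratios computed from the timings reported above (in seconds).
gvc, plink, gtc = 734.227, 4.802, 31.619
print(f"GVC vs PLINK: {slowdown(gvc, plink):.1f}x")  # 152.9x
print(f"GVC vs GTC:   {slowdown(gvc, gtc):.1f}x")    # 23.2x
```

On this input, GVC is roughly 153 times slower than PLINK and roughly 23 times slower than GTC.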

As stated in your paper:
"We believe that random access times below 0.5 s are not noticeable. Note that GVC is mostly written in Python (except for JBIG and the transformation steps which are written in C), whereas GTC is written entirely in C, introducing some overhead to GVC with regard to the run time."

In reality, even a small increase in compression ratio seems to come at the cost of an enormous increase in processing time, which is unreasonable.

4. GVC generates a large number of sub-files stored in metadata, which is impractical. A large number of small files severely degrades disk performance, especially in bioinformatics, where large genomic files are part of daily research.
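
The number of files can be quantified with a short script such as the one below. This is a sketch under assumptions: summarize_tree is a hypothetical helper, and "archive_dir" is a placeholder for wherever GVC writes its output.

```python
import os


def summarize_tree(root):
    """Return (file_count, total_bytes) for all files found under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path):
                file_count += 1
                total_bytes += os.path.getsize(path)
    return file_count, total_bytes


if __name__ == "__main__":
    # "archive_dir" is a hypothetical path; point it at the real output directory.
    count, size = summarize_tree("archive_dir")
    print(f"{count} files, {size} bytes")
```

Reporting the file count alongside the total size makes it easier to compare against tools that write a single archive file.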

5. Can GVC automatically determine optimal parameters based on the number of samples, or at least recommend run parameters? The default configuration does not enable --sort-rows and --sort-cols. What is the reasoning behind this decision?
