
Issues with GVC: Performance, Integration, and Usability Concerns for Fair Benchmarking #2

@Zhangliubin


I am benchmarking tools that balance "compression and accessibility" of genotypes, and I have encountered several issues while trying to use GVC. These issues may prevent fair comparisons with the latest research, and they leave users without adequate reasons to adopt GVC.

It seems that GVC is more of an "interdisciplinary experiment"—exploring the application of image encoding techniques in genomics—rather than being a user-friendly bioinformatics tool.

  1. How difficult would it be to integrate JBIG-KIT into the setup.sh script so that it is compiled automatically? Why is it necessary for the user to download and compile it separately? Why must users write jbigkit.py themselves instead of finding it directly in the gvc/codec directory?

  2. The np.bool alias was deprecated in numpy 1.20 and removed in numpy 1.24 (np.bool_ or the built-in bool replace it), which causes runtime exceptions in GVC. If GVC cannot be made compatible with recent mainstream Python and numpy versions, providing a Dockerfile with the following entry point might be a more user-friendly option:
    ENTRYPOINT ["python", "-m", "gvc"]

  3. The speed of GVC seems unusually slow. I'm not sure if this is a known issue, but it is definitely a critical problem. For example, creating an archive for a VCF.GZ with 1 sample and 13 million variant sites (file size: 60 MB) took 734.227s using GVC, while PLINK_AVX v2.0 only took 4.802s. The GTC tool, compared in your paper, took only 31.619s, which is much faster.
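
Regarding the numpy problem in item 2, a stopgap that avoids editing the GVC sources is to restore the missing alias before GVC is imported. The helper name `ensure_bool_alias` is my own and not part of GVC or numpy; this is only a sketch, and the cleaner fix is probably to replace `np.bool` with `bool` or `np.bool_` wherever GVC uses it.

```python
import types


def ensure_bool_alias(np_module):
    """Add `bool` as an alias of `bool_` on a numpy-like module if it is missing.

    Hypothetical helper: it only touches the module when the attribute is
    absent, so releases that still (or again) provide `np.bool` are untouched.
    """
    if not hasattr(np_module, "bool"):
        np_module.bool = np_module.bool_
    return np_module


# Stand-in object that mimics a numpy release without the alias
# (not real numpy, just enough to show the behaviour).
fake_np = types.SimpleNamespace(bool_=bool)
ensure_bool_alias(fake_np)
assert fake_np.bool is fake_np.bool_

# Intended real usage, executed before importing gvc:
#   import numpy as np
#   ensure_bool_alias(np)
```

Patching a third-party module at runtime is fragile, so I would treat this as a workaround for benchmarking only.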

As stated in your paper:
"We believe that random access times below 0.5 s are not noticeable. Note that GVC is mostly written in Python (except for JBIG and the transformation steps which are written in C), whereas GTC is written entirely in C, introducing some overhead to GVC with regard to the run time."

In reality, even a small increase in compression ratio seems to come at the cost of an enormous increase in processing time, which is unreasonable.
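
To put the timings reported above in proportion, here is a quick calculation of the relative slowdown. It uses only the three wall-clock numbers from my single test file, so it is an illustration rather than a rigorous benchmark; the function name `slowdown` is my own.

```python
# Wall-clock seconds for creating an archive from the same VCF.GZ
# (1 sample, about 13 million variant sites), as reported in this issue.
timings_s = {
    "GVC": 734.227,
    "GTC": 31.619,
    "PLINK_AVX v2.0": 4.802,
}


def slowdown(tool, baseline, timings=timings_s):
    """Return how many times slower `tool` is than `baseline`."""
    return timings[tool] / timings[baseline]


print(f"GVC vs GTC:   {slowdown('GVC', 'GTC'):.1f}x")             # 23.2x
print(f"GVC vs PLINK: {slowdown('GVC', 'PLINK_AVX v2.0'):.1f}x")  # 152.9x
```

So on this input GVC is roughly 23 times slower than GTC and roughly 153 times slower than PLINK.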

  4. GVC generates a large number of sub-files stored in metadata, which is unacceptable. The sheer number of sub-files severely impacts disk performance, especially in bioinformatics, where large genomic files are a common part of daily research.

  5. Does GVC have the capability to automatically determine the optimal parameters based on the number of samples, or recommend running parameters? The default configuration for GVC does not enable --sort-rows and --sort-cols. What is the reasoning behind this decision?
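
For the sub-file concern, this is roughly how the file count and total size of an output location can be measured. It assumes the archive is written as a directory tree on disk; `summarize_tree` is a name I made up, and the demo runs on a throwaway directory rather than on real GVC output.

```python
import os
import tempfile


def summarize_tree(root):
    """Return (number of regular files, total size in bytes) under `root`."""
    n_files = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            n_files += 1
            total_bytes += os.path.getsize(os.path.join(dirpath, name))
    return n_files, total_bytes


# Demo on a temporary directory with two small files (10 and 20 bytes).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    for size, rel in [(10, "a.bin"), (20, os.path.join("sub", "b.bin"))]:
        with open(os.path.join(root, rel), "wb") as fh:
            fh.write(b"x" * size)
    demo_count, demo_bytes = summarize_tree(root)

print(demo_count, demo_bytes)  # 2 30
```

Pointing the same function at the output of each benchmarked tool would give comparable file counts alongside the compression ratios.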
