Basic regions support #16

tomwhite · 2024-07-08T11:18:06Z

We should add the equivalent of the bcftools -r/--regions option to filter by regions.

The work in sgkit-dev/sgkit#658 (which was never merged) could form the basis for the implementation.

Also, it would be simpler to implement -t/--targets first, since unlike -r/--regions it only needs to check the variant position, not the variant length. To support -r/--regions we probably need to store the variant length as a separate Zarr array in order to do efficient queries.
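To make the difference concrete, here is a minimal NumPy sketch. It assumes 1-based inclusive coordinates on a single contig and a hypothetical `variant_length` array holding the REF allele length of each variant; the function names and toy data are illustrative only and are not taken from vcztools. A targets-style test looks only at the position, while a regions-style test also keeps a deletion that starts before the region but extends into it.

```python
import numpy as np

# Toy data for one contig (1-based positions, sorted).
variant_position = np.array([100, 195, 200, 250, 301])
# Hypothetical per-variant length array (length of the REF allele).
variant_length = np.array([1, 10, 1, 1, 1])


def targets_mask(position, start, end):
    """Position-only test: keep variants whose POS lies in [start, end]."""
    return (position >= start) & (position <= end)


def regions_mask(position, length, start, end):
    """Overlap test: keep variants whose span [POS, POS + length - 1]
    intersects [start, end]."""
    variant_end = position + length - 1
    return (position <= end) & (variant_end >= start)


t = targets_mask(variant_position, 200, 300)
r = regions_mask(variant_position, variant_length, 200, 300)

print(np.flatnonzero(t))  # variants selected by the position-only test
print(np.flatnonzero(r))  # variants selected by the overlap test
```

The 10-base variant at position 195 is selected only by the overlap test, which is why the position array alone is not enough for -r/--regions.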


jeromekelleher · 2024-07-08T19:36:43Z

Agreed. See sgkit-dev/vcf-zarr-spec#21 and sgkit-dev/vcf-zarr-spec#22 for previous thoughts on how we support this efficiently.

I hadn't thought of storing a length array, that's an excellent idea. Would that really suffice to answer overlap queries efficiently?

tomwhite · 2024-07-09T10:56:18Z

> I hadn't thought of storing a length array, that's an excellent idea. Would that really suffice to answer overlap queries efficiently?

Yes. I would actually store an array of variant end positions, and then use pyranges to efficiently compute the overlap.

For -t/--targets, pyranges is not needed since regular NumPy slicing is sufficient.
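As a rough illustration of the slicing approach for -t/--targets, the sketch below uses `np.searchsorted` to turn a target interval into a slice. It assumes the positions belong to a single contig, are sorted in ascending order, and that the interval is 1-based and inclusive; `targets_slice` is a hypothetical helper name, not an existing vcztools function.

```python
import numpy as np

# Sorted 1-based variant positions for a single contig (toy data).
variant_position = np.array([100, 195, 200, 250, 301])


def targets_slice(position, start, end):
    """Return a slice selecting variants with start <= POS <= end.

    Relies on `position` being sorted, so two binary searches suffice.
    """
    lo = np.searchsorted(position, start, side="left")
    hi = np.searchsorted(position, end, side="right")
    return slice(int(lo), int(hi))


sl = targets_slice(variant_position, 200, 300)
selected = variant_position[sl]
print(sl, selected)
```

Because the result is a contiguous slice, the same slice can be applied to any other array that shares the variants dimension.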

tomwhite · 2024-07-11T09:24:38Z

Bioframe is another option for overlap queries.

jeromekelleher · 2024-07-11T10:11:33Z

Thanks @tomwhite. I think the first step is to do -t as you say, as that can just read in the variant_position array in full as a first pass. I think we do need to consider the latency issues when we have, e.g., a whole genome stored in one Zarr and much smaller variant chunk sizes (which we may need: currently latency and memory usage are terrible in vcztools view for large datasets).

I think we do need to consider what would be an efficient index that we could store in a single chunk, or a small number of chunks, and that would allow us to implement range queries (sgkit-dev/vcf-zarr-spec#21, sgkit-dev/vcf-zarr-spec#23).
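One possible shape for such an index, sketched here purely as an assumption and not as the design discussed in the linked spec issues, is a per-chunk summary of the smallest start and largest end position. A range query then reads only this small summary and decides which variant chunks need to be fetched. The names `build_chunk_index` and `candidate_chunks` are hypothetical, and the sketch assumes a single contig with 1-based inclusive coordinates.

```python
import numpy as np


def build_chunk_index(position, end, chunk_size):
    """Summarise each variant chunk as (min start, max end)."""
    # Index of the first variant in each chunk.
    chunk_starts = np.arange(0, len(position), chunk_size)
    chunk_min_start = np.minimum.reduceat(position, chunk_starts)
    chunk_max_end = np.maximum.reduceat(end, chunk_starts)
    return chunk_min_start, chunk_max_end


def candidate_chunks(chunk_min_start, chunk_max_end, start, end):
    """Return indices of chunks that may contain variants overlapping
    the interval [start, end]."""
    return np.flatnonzero((chunk_min_start <= end) & (chunk_max_end >= start))


# Toy data: six variants stored in chunks of two.
variant_position = np.array([100, 195, 200, 250, 301, 400])
variant_end = np.array([100, 204, 200, 250, 301, 400])

index = build_chunk_index(variant_position, variant_end, chunk_size=2)
chunks = candidate_chunks(*index, 200, 300)
print(chunks)
```

The summary holds two values per chunk, so it stays small even with small variant chunk sizes; the exact overlap test still has to be applied within each candidate chunk.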

tomwhite mentioned this issue Jul 11, 2024

Add basic support for '--targets/-t' #18

Merged

jeromekelleher closed this as completed in #18 Jul 18, 2024

tomwhite mentioned this issue Jul 22, 2024

Support '-r/--regions' #32

Merged
