Basic regions support #16

Closed
tomwhite opened this issue Jul 8, 2024 · 4 comments · Fixed by #18

Comments

@tomwhite
Contributor

tomwhite commented Jul 8, 2024

We should add the equivalent of the bcftools -r/--regions option to filter by regions.

The work in sgkit-dev/sgkit#658 (which was never merged) could form the basis for the implementation.

Also, it would be simpler to implement -t/--targets first, since unlike -r/--regions it only needs to check the variant position, and not the variant length. To support -r/--regions we probably need to store the variant length as a separate Zarr array in order to do efficient queries.
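
To illustrate the difference between the two options, here is a minimal NumPy sketch. It assumes a single contig, 1-based inclusive coordinates, and a hypothetical `variant_length` array holding the length of each variant; the function names are made up for illustration and are not part of vcztools.

```python
import numpy as np


def targets_mask(variant_position, start, end):
    """Select variants whose POS lies in [start, end] (position check only)."""
    pos = np.asarray(variant_position)
    return (pos >= start) & (pos <= end)


def regions_mask(variant_position, variant_length, start, end):
    """Select variants that overlap [start, end], using the variant length."""
    pos = np.asarray(variant_position)
    # Last base covered by each variant (hypothetical length array).
    variant_end = pos + np.asarray(variant_length) - 1
    return (pos <= end) & (variant_end >= start)


variant_position = np.array([100, 150, 200, 300])
variant_length = np.array([1, 60, 1, 5])

t = targets_mask(variant_position, 205, 302)
r = regions_mask(variant_position, variant_length, 205, 302)
```

The variant at position 150 starts before the interval but extends into it, so only the length-aware check picks it up.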

@jeromekelleher
Contributor

Agreed. See sgkit-dev/vcf-zarr-spec#21 and sgkit-dev/vcf-zarr-spec#22 for previous thoughts on how we support this efficiently.

I hadn't thought of storing a length array, that's an excellent idea. Would that really suffice to answer overlap queries efficiently?

@tomwhite
Contributor Author

tomwhite commented Jul 9, 2024

> I hadn't thought of storing a length array, that's an excellent idea. Would that really suffice to answer overlap queries efficiently?

Yes. I would actually store an array of variant end positions, and then use pyranges to efficiently compute the overlap.

For -t/--targets, pyranges is not needed since regular NumPy slicing is sufficient.
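
A rough sketch of both ideas follows. It assumes positions are sorted within a single contig and that a `variant_end` array is available alongside `variant_position`. The overlap check uses plain NumPy broadcasting as a stand-in for pyranges, so it shows the query logic rather than an efficient implementation; `target_slice` and `overlap_any` are hypothetical names.

```python
import numpy as np


def target_slice(variant_position, start, end):
    """Slice of variants with start <= POS <= end (positions sorted ascending)."""
    lo = np.searchsorted(variant_position, start, side="left")
    hi = np.searchsorted(variant_position, end, side="right")
    return slice(int(lo), int(hi))


def overlap_any(variant_position, variant_end, regions):
    """Boolean mask of variants overlapping at least one (start, end) region."""
    regions = np.asarray(regions)
    starts = regions[:, 0][:, None]
    ends = regions[:, 1][:, None]
    # Compare every region against every variant: shape (n_regions, n_variants).
    hits = (variant_position[None, :] <= ends) & (variant_end[None, :] >= starts)
    return hits.any(axis=0)


variant_position = np.array([100, 150, 200, 300, 450])
variant_end = np.array([100, 209, 200, 304, 450])

sl = target_slice(variant_position, 150, 300)
selected = variant_position[sl]
mask = overlap_any(variant_position, variant_end, [(205, 210), (440, 460)])
```

The broadcasting step is quadratic in the worst case, which is where an interval library such as pyranges would help for many regions.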

@tomwhite
Contributor Author

Bioframe is another option for overlap queries.

@jeromekelleher
Contributor

Thanks @tomwhite. I think the first step is to do -t as you say, as that can just read in the variant_position array in full as a first pass. I think we do need to consider the latency issues when we have, e.g., a whole genome stored in one Zarr and much smaller variant chunk sizes (which we may need: currently latency and memory usage are terrible in vcztools view for large datasets).

We should also consider what efficient index we could store in a single chunk, or a small number of chunks, that would allow us to implement range queries (sgkit-dev/vcf-zarr-spec#21, sgkit-dev/vcf-zarr-spec#23).
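
One possible shape for such an index, purely as an illustration and not necessarily what the linked spec issues propose: keep one small summary row per variant chunk (first position and maximum end position), then use it to decide which chunks need to be read at all. The sketch assumes a single contig with sorted positions, and all names are hypothetical.

```python
import numpy as np


def build_chunk_index(variant_position, variant_end, chunk_size):
    """Summarise each variant chunk as (first position, maximum end position)."""
    bounds = range(0, len(variant_position), chunk_size)
    chunk_start = np.array([variant_position[i] for i in bounds])
    chunk_max_end = np.array([variant_end[i:i + chunk_size].max() for i in bounds])
    return chunk_start, chunk_max_end


def chunks_for_region(chunk_start, chunk_max_end, start, end):
    """Indices of chunks that may contain a variant overlapping [start, end]."""
    return np.flatnonzero((chunk_start <= end) & (chunk_max_end >= start))


variant_position = np.array([100, 150, 200, 300, 450, 500])
variant_end = np.array([100, 209, 200, 304, 450, 500])

chunk_start, chunk_max_end = build_chunk_index(variant_position, variant_end, 2)
candidates = chunks_for_region(chunk_start, chunk_max_end, 205, 210)
far = chunks_for_region(chunk_start, chunk_max_end, 460, 470)
```

The index only rules chunks out: a candidate chunk can still turn out to contain no overlapping variant, so the per-variant check is still needed after loading it.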

2 participants