Python API for Zarr to cyvcf2 #112
Comments
I think it's out of scope for vcztools, as we're trying to keep this tightly focused on emulating bcftools. What you suggest is a great idea though - I guess the natural place for it is in cyvcf2? It would be a bit of work, and we'd need to convince Brent that it was worthwhile.
Yeah, I figured as much. I wonder if it would be out of scope for cyvcf2 too though, as that is a Cython wrapper around htslib and expects VCF/BCF as input? Maybe we need a zarr2bio package 😅?
Yeah, it would depend on Brent's enthusiasm a lot. I guess the fundamental thing you need to do is efficiently iterate over the arrays in variant order, and select by position. We could imagine breaking this out into its own package, which vcztools and other Python libs could import? Given that, putting a cyvcf2-like wrapper on top should be easy, if fiddly.
We could also just export that iteration API from this package I guess...
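To make the "iterate in variant order, select by position" idea concrete, here is a minimal sketch. It is not an existing vcztools API: `iter_variants` and its parameters are hypothetical, plain Python lists stand in for Zarr arrays (which support the same slicing), and it assumes a single contig. The field names `variant_position` and `variant_allele` follow the VCF Zarr naming convention.

```python
from typing import Dict, Iterator, Optional, Sequence, Tuple


def iter_variants(
    arrays: Dict[str, Sequence],
    chunk_size: int,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> Iterator[Tuple[int, Dict[str, object]]]:
    """Yield (variant_index, record) pairs in variant order, one chunk at a time.

    `arrays` maps field names to equal-length sequences indexed by variant.
    Only variants with start <= position <= end are yielded (hypothetical API).
    """
    num_variants = len(arrays["variant_position"])
    for chunk_start in range(0, num_variants, chunk_size):
        chunk_stop = min(chunk_start + chunk_size, num_variants)
        # Slice every field once per chunk, so each chunk is read a single time.
        chunk = {name: arr[chunk_start:chunk_stop] for name, arr in arrays.items()}
        for offset, pos in enumerate(chunk["variant_position"]):
            if start is not None and pos < start:
                continue
            if end is not None and pos > end:
                continue
            record = {name: values[offset] for name, values in chunk.items()}
            yield chunk_start + offset, record


# Lists standing in for Zarr arrays; the values are made up for illustration.
store = {
    "variant_position": [100, 250, 300, 720, 1500],
    "variant_allele": [["A", "T"], ["G", "C"], ["T", "TA"], ["C", "G"], ["A", "G"]],
}
selected = list(iter_variants(store, chunk_size=2, start=200, end=800))
```

A real implementation would probably skip whole chunks that fall outside the requested region instead of filtering record by record; the sketch only shows the shape of the interface.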
I guess that was one of my first thoughts; if there is an exposable Python API in vcztools that iterates over variants, other libraries could easily build on that to convert to whatever. vcztools could still focus on bcftools compliance, where the API is a convenience for other libraries.
Yes, I think you're right. This is a non-trivial thing that would be really useful to have available, particularly with background fetching of chunks and caching etc. Great idea!
I guess we could offer a duck-typed, best-effort layer on top of this that aims to behave like cyvcf2? It would be fiddly, as the API is a bit quirky and could be a moving target.
So to conclude, if I start playing around with this, should I a) draft an API in a separate branch of vcztools and b) do the conversion to cyvcf2 in an auxiliary library (personal at the moment)?
Let's draft in vcztools please; as long as there are no additional dependencies I'm happy to add it here.
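As a rough illustration of background chunk fetching, the sketch below keeps a small cache of in-flight chunks so that later chunks load while the current one is being consumed. Everything here is an assumption for illustration: `iter_chunks_prefetched`, `load_chunk` and `lookahead` are hypothetical names, and `fake_load` stands in for reading a chunk from a Zarr store.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Iterator, List


def iter_chunks_prefetched(
    load_chunk: Callable[[int], List[int]],
    num_chunks: int,
    lookahead: int = 2,
) -> Iterator[List[int]]:
    """Yield chunks in order while up to `lookahead` later chunks load in threads."""
    pending: Dict[int, Future] = {}  # cache of requested but not yet consumed chunks
    with ThreadPoolExecutor(max_workers=max(1, lookahead)) as pool:
        for index in range(num_chunks):
            # Request this chunk and the next `lookahead` chunks, each only once.
            for ahead in range(index, min(index + lookahead + 1, num_chunks)):
                if ahead not in pending:
                    pending[ahead] = pool.submit(load_chunk, ahead)
            # Wait only if the chunk is still loading, then drop it from the cache.
            yield pending.pop(index).result()


loaded = []


def fake_load(index: int) -> List[int]:
    # Stand-in for a slow read from a (possibly remote) store.
    loaded.append(index)
    return [index * 10, index * 10 + 1]


chunks = list(iter_chunks_prefetched(fake_load, num_chunks=4))
```

Threads are used here because chunk reads are typically I/O-bound; whether that is the right choice for a real store is an open design question.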
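A duck-typed layer could start as small as the sketch below. `VariantLike` and `make_variant` are hypothetical; the record only mirrors four attribute names that `cyvcf2.Variant` also exposes (`CHROM`, `POS`, `REF`, `ALT`) and makes no attempt to cover the rest of that API. It assumes unused allele slots are padded with empty strings.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class VariantLike:
    """Hypothetical record mirroring a few attribute names of cyvcf2.Variant."""

    CHROM: str
    POS: int
    REF: str
    ALT: List[str]


def make_variant(
    contig_ids: Sequence[str],
    contig_index: int,
    position: int,
    alleles: Sequence[str],
) -> VariantLike:
    """Build a record from per-variant values taken from the arrays."""
    # Assumption: unused allele slots are padded with empty strings; drop them.
    alts = [allele for allele in alleles[1:] if allele != ""]
    return VariantLike(
        CHROM=contig_ids[contig_index],
        POS=position,
        REF=alleles[0],
        ALT=alts,
    )


# Made-up values for illustration.
variant = make_variant(["chr1", "chr2"], 1, 250, ["G", "C", ""])
```

Downstream code that only reads these attributes would not need to know whether the record came from htslib or from Zarr, which is the point of the duck-typing approach.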
Having briefly skimmed through the code, my understanding is that vcztools will target similar functionality to bcftools, converting Zarr to VCF. Would other output formats fit in this tool or should that end up elsewhere? I'm thinking of cyvcf2.cyvcf2.Variant output, as this would allow downstream applications that rely on cyvcf2 to directly access Zarr archives through a Python API. The use case I'm thinking of right now is ancestral allele reconstruction with fastDFE for tsinfer analyses (see Sendrowski/fastDFE#13). I guess it would be better for vcztools to focus mainly on bcftools functionality and that other file format outputs end up in a separate toolkit?