Python API for Zarr to cyvcf2 #112

Open
percyfal opened this issue Dec 5, 2024 · 9 comments

Comments

@percyfal

percyfal commented Dec 5, 2024

Having briefly skimmed the code, my understanding is that vcztools will target functionality similar to bcftools, converting Zarr to VCF. Would other output formats fit in this tool, or should that end up elsewhere? I'm thinking of cyvcf2.cyvcf2.Variant output, as this would allow downstream applications that rely on cyvcf2 to directly access Zarr archives through a Python API. The use case I'm thinking of right now is ancestral allele reconstruction with fastDFE for tsinfer analyses (see Sendrowski/fastDFE#13). I guess it would be better for vcztools to focus mainly on bcftools functionality, and for other file format outputs to end up in a separate toolkit?

@jeromekelleher
Contributor

I think it's out of scope for vcztools, as we're trying to keep this tightly focused on emulating bcftools.

What you suggest is a great idea though - I guess the natural place for it is in cyvcf2? It would be a bit of work, and we'd need to convince Brent that it was worthwhile.

@percyfal
Author

percyfal commented Dec 6, 2024

Yeah, I figured as much. I wonder if it would be out of scope for cyvcf2 too, though, as that is a Cython wrapper around htslib and expects VCF/BCF as input? Maybe we need a zarr2bio package 😅 ?

@jeromekelleher
Contributor

jeromekelleher commented Dec 6, 2024

Yeah, it would depend on Brent's enthusiasm a lot.

I guess the fundamental thing you need to do is efficiently iterate over the arrays in variant order, and select by position. We could imagine breaking this out into its own package, which vcztools and other Python libs could import? Given that, putting a cyvcf2-like wrapper on top should be easy, if fiddly.

@jeromekelleher
Contributor

We could also just export that iteration API from this package I guess...
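To make the idea concrete, here is a minimal sketch of what such an iteration API could look like. It is not vcztools code: `iter_variants` and its parameters are hypothetical names, plain NumPy arrays stand in for Zarr arrays (both support the same slicing), and it assumes a single contig with the `variant_position` and `variant_allele` fields.

```python
import numpy as np


def iter_variants(arrays, chunk_size, start=None, end=None):
    """Yield one dict per variant, reading the arrays one chunk at a time.

    Hypothetical helper for illustration only.
    `arrays` maps field names to array-likes that share the same first
    (variants) dimension and must include "variant_position".
    `start` and `end` are inclusive position bounds; None means unbounded.
    """
    positions = arrays["variant_position"]
    num_variants = positions.shape[0]
    for offset in range(0, num_variants, chunk_size):
        stop = min(offset + chunk_size, num_variants)
        # Read positions first so chunks outside the region cost one read only.
        pos_chunk = np.asarray(positions[offset:stop])
        keep = np.ones(len(pos_chunk), dtype=bool)
        if start is not None:
            keep &= pos_chunk >= start
        if end is not None:
            keep &= pos_chunk <= end
        if not keep.any():
            continue
        # Load the remaining fields for this chunk only when needed.
        chunk = {
            name: np.asarray(arr[offset:stop]) for name, arr in arrays.items()
        }
        for j in np.flatnonzero(keep):
            yield {name: values[j] for name, values in chunk.items()}


# Toy data standing in for a VCF Zarr store (assumption: one contig).
demo = {
    "variant_position": np.array([10, 20, 30, 40, 50]),
    "variant_allele": np.array(
        [["A", "T"], ["C", "G"], ["G", "A"], ["T", "C"], ["A", "G"]]
    ),
}
selected = [
    int(v["variant_position"])
    for v in iter_variants(demo, chunk_size=2, start=20, end=40)
]
```

Filtering on positions before touching the other fields is the design choice that matters here, since fields such as genotypes are far larger than the position array.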

@percyfal
Author

percyfal commented Dec 6, 2024

I guess that was one of my first thoughts; if there is an exposable Python API in vcztools that iterates over variants, other libraries could easily build on that to convert to whatever. vcztools could still focus on bcftools compliance, where the API is a convenience for other libraries.

@jeromekelleher
Contributor

Yes, I think you're right. This is a nontrivial thing that would be really useful to have available, particularly with background fetching of chunks and caching etc. Great idea!
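As a rough illustration of background chunk fetching, the sketch below loads a few chunks ahead of the consumer using a thread pool from the standard library. `prefetch_chunks`, `load_chunk` and `depth` are hypothetical names, not part of vcztools, and caching is left out.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def prefetch_chunks(load_chunk, num_chunks, depth=2):
    """Yield load_chunk(0), load_chunk(1), ... in order.

    Hypothetical helper for illustration only. Up to `depth` chunks are
    loaded ahead of the consumer in worker threads, so slow reads overlap
    with processing of the current chunk.
    """
    with ThreadPoolExecutor(max_workers=depth) as pool:
        pending = deque()
        next_index = 0
        # Fill the pipeline with the first `depth` requests.
        while next_index < num_chunks and len(pending) < depth:
            pending.append(pool.submit(load_chunk, next_index))
            next_index += 1
        while pending:
            future = pending.popleft()
            # Keep the pipeline full before blocking on the oldest request.
            if next_index < num_chunks:
                pending.append(pool.submit(load_chunk, next_index))
                next_index += 1
            yield future.result()


def fake_load(index):
    # Stand-in for reading one chunk from a store.
    return list(range(index * 3, index * 3 + 3))


chunks = list(prefetch_chunks(fake_load, num_chunks=3))
```

Results come back in chunk order regardless of which read finishes first, which keeps variant order intact for whatever sits on top.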

@jeromekelleher
Contributor

I guess we could offer a duck-typed, best-effort layer on top of this that aims to behave like cyvcf2? It would be fiddly, as the API is a bit quirky and could be a moving target.
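A duck-typed layer could start as small as the sketch below, which mirrors only the `CHROM`, `POS`, `REF` and `ALT` attributes of a cyvcf2 variant. `VariantLike` and `make_variant` are hypothetical names, and the assumption that unused allele slots are padded with empty strings should be checked against the actual data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VariantLike:
    """Hypothetical record exposing a few cyvcf2-style attribute names."""

    CHROM: str
    POS: int
    REF: str
    ALT: List[str]


def make_variant(contig, position, alleles):
    """Build a VariantLike from per-variant values.

    Assumes `alleles` lists the reference allele first, followed by
    alternate alleles, with unused slots padded by empty strings.
    """
    alleles = [str(a) for a in alleles]
    return VariantLike(
        CHROM=str(contig),
        POS=int(position),
        REF=alleles[0],
        ALT=[a for a in alleles[1:] if a != ""],
    )


record = make_variant("chr1", 1234, ["A", "T", ""])
```

Downstream code that only reads these four attributes would work unchanged; anything beyond that (INFO fields, genotypes, filters) would need to be added attribute by attribute.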

@percyfal
Author

percyfal commented Dec 6, 2024

So to conclude, if I start playing around with this, should I a) draft an API in a separate branch of vcztools, and b) do the conversion to cyvcf2 in an auxiliary library (personal at the moment)?

@jeromekelleher
Contributor

Let's draft in vcztools please; as long as there are no additional dependencies, I'm happy to add it here.
