GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).
- Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
- Generate haplotypes up to 1,000 times faster than reading a FASTA file
- Generate tracks up to 450 times faster than reading a BigWig
- Supports indels and re-aligns tracks to haplotypes that have them
- Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig
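To illustrate what generating a haplotype with indels and re-aligning a track involves, here is a toy sketch in pure Python. It is not GenVarLoader's implementation or API: the function name `apply_variants`, the `(pos, ref, alt)` tuple layout, and the choice to fill inserted bases by repeating the preceding track value are all assumptions made for illustration.

```python
def apply_variants(reference, track, variants):
    """Toy illustration (not GenVarLoader code).

    `variants` is a list of (pos, ref, alt) tuples with 0-based positions,
    sorted by position and non-overlapping, with non-empty ref and alt
    alleles (VCF-style). Returns the haplotype sequence and a track with
    one value per haplotype base.
    """
    hap, hap_track = [], []
    cursor = 0
    for pos, ref, alt in variants:
        # Copy the unchanged reference stretch before this variant.
        hap.append(reference[cursor:pos])
        hap_track.extend(track[cursor:pos])
        assert reference[pos:pos + len(ref)] == ref, "ref allele mismatch"
        hap.append(alt)
        ref_vals = track[pos:pos + len(ref)]
        shared = min(len(ref), len(alt))
        # Bases present in both alleles keep their track values.
        hap_track.extend(ref_vals[:shared])
        # Assumption: inserted bases repeat the last shared value.
        # For deletions this multiplier is <= 0, so nothing is added
        # and the deleted positions' values are simply dropped.
        hap_track.extend([ref_vals[shared - 1]] * (len(alt) - shared))
        cursor = pos + len(ref)
    # Copy whatever remains after the last variant.
    hap.append(reference[cursor:])
    hap_track.extend(track[cursor:])
    return "".join(hap), hap_track


haplotype, aligned = apply_variants(
    "ACGTAC",
    [0, 1, 2, 3, 4, 5],
    [(1, "C", "CTT"), (3, "TA", "T")],  # one insertion, one deletion
)
print(haplotype, aligned)
```

The point of the sketch is that the haplotype and its track always end up the same length, even though insertions and deletions shift coordinates relative to the reference.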
Documentation is available here. See our preprint for benchmarking and implementation details.
pip install genvarloader
PyTorch is not included as a dependency because installing it may require special instructions for your platform.
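Because PyTorch has to be installed separately, a quick check like the one below can confirm whether it is importable in the current environment. This is a minimal sketch using only the standard library; the helper name `torch_available` is hypothetical and not part of GenVarLoader.

```python
import importlib.util


def torch_available() -> bool:
    """Return True if the `torch` package can be found, without importing it."""
    return importlib.util.find_spec("torch") is not None


if torch_available():
    print("PyTorch found.")
else:
    print("PyTorch not found; install it following the PyTorch instructions for your platform.")
```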
- Clone the repo.
- Assuming you have Pixi, install pre-commit hooks with `pixi run -e dev pre-commit`. If you forget to do this, your PR will likely fail to pass CI checks.
- Activate and use the appropriate Pixi environment for your needs. A decent catch-all is `dev`, but you might need a different environment if using a GPU.
All the tests are designed to use pytest (sans Rust extension code) and live under `tests/`. These tests ensure the code works as intended, so they must all pass before any features are merged into `main` and subsequently released. These tests will automatically run on every PR, and failing tests will block PRs from being merged.
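For contributors new to pytest, a test is just a function whose name starts with `test_` and that uses plain `assert` statements. The sketch below shows the general shape only; the file name `tests/test_example.py` and the helper `reverse_complement` are hypothetical and do not exist in the repository.

```python
# tests/test_example.py (hypothetical file name, for illustration only)

# Translation table mapping each DNA base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Hypothetical helper under test: reverse-complement a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def test_reverse_complement_palindrome():
    # "ACGT" is its own reverse complement.
    assert reverse_complement("ACGT") == "ACGT"


def test_reverse_complement_roundtrip():
    # Applying the operation twice returns the original sequence.
    seq = "AACGTTTG"
    assert reverse_complement(reverse_complement(seq)) == seq
```

pytest collects functions like these automatically when it is pointed at the directory that contains them.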
If your PR has merge conflicts, this is usually because the `main` branch received updates while you've been working on it. In this case, please rebase your branch via `git rebase main` to resolve merge conflicts, rather than using a merge commit via `git merge main`.
Note
Do not edit the version number in `pyproject.toml`. This is handled automatically by GitHub Actions.