mcvickerlab/GenVarLoader

Features

GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).

  • Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
  • Generate haplotypes up to 1,000 times faster than reading a FASTA file
  • Generate tracks up to 450 times faster than reading a BigWig
  • Supports indels and re-aligns tracks to haplotypes that have them
  • Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

Documentation is available here. See our preprint for benchmarking and implementation details.

Installation

pip install genvarloader

A PyTorch dependency is not included since it may require special instructions.
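
Because PyTorch is installed separately, it can be useful to confirm that both packages are visible in the active environment before training. The snippet below is a minimal sketch using only the Python standard library; the helper name has_module is hypothetical and is not part of GenVarLoader.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the top-level module `name` can be found in this environment."""
    # find_spec returns None when a top-level module cannot be located.
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "genvarloader" is installed via pip; "torch" must be installed separately.
    for pkg in ("genvarloader", "torch"):
        status = "found" if has_module(pkg) else "missing"
        print(f"{pkg}: {status}")
```

If torch is reported as missing, install it by following the PyTorch instructions for your platform and hardware.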

Contributing

  1. Clone the repo.
  2. Assuming you have Pixi installed, install the pre-commit hooks by running pixi run -e dev pre-commit. If you skip this step, your PR will likely fail CI checks.
  3. Activate and use the appropriate Pixi environment for your needs. A decent catch-all is dev but you might need a different environment if using a GPU.

All tests use pytest (excluding the Rust extension code) and live under tests/. They verify that the code works as intended, so they must all pass before any feature is merged into main and subsequently released. The tests run automatically on every PR, and failing tests block the PR from being merged.

If your PR has merge conflicts, it is usually because the main branch received updates while you were working on your branch. In this case, please resolve the conflicts by rebasing your branch with git rebase main rather than creating a merge commit with git merge main.

Note

Do not edit the version number in pyproject.toml. This is handled automatically by GitHub Actions.

About

Dataloader for applying sequence models to personalized genomics
