mcvickerlab/GenVarLoader: Dataloader for applying sequence models to personalized genomics

Features

GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).

Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
Generate haplotypes up to 1,000 times faster than reading a FASTA file
Generate tracks up to 450 times faster than reading a BigWig
Supports indels and re-aligns tracks to haplotypes that have them
Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig
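
To make the haplotype and indel features above more concrete, here is a minimal pure-Python sketch. It is not GenVarLoader's implementation or API. The toy variant representation (0-based position, REF allele, ALT allele) and the function name apply_variants are made up for illustration. The sketch applies variants to a reference on the fly and re-aligns a per-base track to the resulting haplotype. Filling inserted bases with the track value at the variant's first reference base is an assumption chosen for simplicity.

```python
from typing import List, Sequence, Tuple

# Hypothetical toy variant: (0-based start on the reference, REF allele, ALT allele).
Variant = Tuple[int, str, str]


def apply_variants(
    reference: str, track: Sequence[float], variants: List[Variant]
) -> Tuple[str, List[float]]:
    """Build one haplotype and a track re-aligned to it.

    Variants must be sorted by position and must not overlap.
    """
    if len(reference) != len(track):
        raise ValueError("track must have one value per reference base")

    hap: List[str] = []
    hap_track: List[float] = []
    cursor = 0  # next reference position that has not been copied yet

    for pos, ref, alt in variants:
        if pos < cursor:
            raise ValueError("variants must be sorted and non-overlapping")
        if reference[pos : pos + len(ref)] != ref:
            raise ValueError(f"REF allele does not match reference at {pos}")

        # Copy the unchanged stretch before this variant.
        hap.append(reference[cursor:pos])
        hap_track.extend(track[cursor:pos])

        # Emit the ALT allele. Positions shared with REF keep their own track
        # value; extra inserted bases reuse the value at the variant's first
        # reference base (an assumption of this sketch).
        hap.append(alt)
        for i in range(len(alt)):
            src = pos + i if i < len(ref) else pos
            hap_track.append(track[src])

        # Skip the REF allele; for deletions this drops the deleted track values.
        cursor = pos + len(ref)

    # Copy whatever remains after the last variant.
    hap.append(reference[cursor:])
    hap_track.extend(track[cursor:])
    return "".join(hap), hap_track


reference = "ACGTACGT"
track = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
# One insertion, one deletion, and one SNP.
variants = [(1, "C", "CTT"), (4, "AC", "A"), (7, "T", "G")]
haplotype, hap_track = apply_variants(reference, track, variants)
print(haplotype)  # ACTTGTAGG
print(hap_track)  # [0.0, 1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 6.0, 7.0]
```

The haplotype and its track always have the same length, so a model can consume them together even when indels shift coordinates relative to the reference. Only the reference and the variants need to be stored; no personalized sequence is written to disk.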

Documentation is available here. See our preprint for benchmarking and implementation details.

Installation

pip install genvarloader

A PyTorch dependency is not included since it may require special instructions.
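
Because PyTorch is installed separately, it can be useful to check whether it is present before running code that depends on it. The following is a small sketch using only the Python standard library; the helper name torch_available is hypothetical and not part of GenVarLoader.

```python
import importlib.util


def torch_available() -> bool:
    """Return True if a top-level module named 'torch' can be located.

    find_spec only looks the module up; it does not import it.
    """
    return importlib.util.find_spec("torch") is not None


if not torch_available():
    print(
        "PyTorch was not found. Install it separately, following the "
        "PyTorch installation instructions for your platform."
    )
```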

Contributing

Clone the repo.
Assuming you have Pixi, install the pre-commit hooks by running pixi run -e dev pre-commit. If you forget to do this, your PR will likely fail CI checks.
Activate and use the appropriate Pixi environment for your needs. A decent catch-all is dev, but you might need a different environment if you are using a GPU.

All tests use pytest (excluding the Rust extension code) and live under tests/. They ensure the code works as intended, so they must all pass before any feature is merged into main and subsequently released. The tests run automatically on every PR, and failing tests block the PR from being merged.

If your PR has merge conflicts, it is usually because the main branch received updates while you were working on your branch. In this case, please rebase your branch with git rebase main to resolve the conflicts, rather than creating a merge commit with git merge main.

Note

Do not edit the version number in pyproject.toml. This is handled automatically by GitHub Actions.

