Seqr Loading Pipeline


This repository contains the pipelines and infrastructure for loading genomic data from VCF into ClickHouse to support queries from the seqr application.


📁 Repository Structure

v03_pipeline/api/

Contains the interface layer to the seqr application.

  • api/model.py defines pydantic models for the REST interface.
  • api/app.py specifies an aiohttp webserver that handles data-loading requests.
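
As a rough sketch of how the pydantic and aiohttp pieces fit together (the model, fields, and route here are hypothetical, not the pipeline's actual interface):

from aiohttp import web
from pydantic import BaseModel, ValidationError

# Hypothetical request model; see api/model.py for the real ones.
class LoadingRequest(BaseModel):
    callset_path: str
    dataset_type: str

async def loading_pipeline_enqueue(request: web.Request) -> web.Response:
    # Validate the JSON body against the pydantic model before accepting the job.
    try:
        loading_request = LoadingRequest(**await request.json())
    except ValidationError as e:
        raise web.HTTPBadRequest(text=str(e))
    # ...hand the validated request off to the pipeline worker here...
    return web.json_response({'accepted': loading_request.model_dump()})

app = web.Application()
app.add_routes([web.post('/loading_pipeline_enqueue', loading_pipeline_enqueue)])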

v03_pipeline/bin/

Scripts or command-line utilities used for setup or task execution.

  • bin/pipeline_worker.py — manages asynchronous jobs requested by seqr.
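
In spirit, the worker is a long-running loop that picks up queued requests and launches the pipeline for each one. A minimal sketch, assuming a simple in-process queue and a hypothetical wrapper task (the real worker's queueing and task wiring differ):

import queue

import luigi

# Hypothetical stand-in for however requests actually reach the worker.
REQUEST_QUEUE: queue.Queue = queue.Queue()

class RunLoadingPipeline(luigi.WrapperTask):
    # Hypothetical wrapper; the real pipeline terminates at WriteSuccessFile.
    callset_path = luigi.Parameter()

    def requires(self):
        return []  # would return the real terminal task here

def main() -> None:
    while True:
        try:
            callset_path = REQUEST_QUEUE.get(timeout=5)
        except queue.Empty:
            continue
        luigi.build([RunLoadingPipeline(callset_path=callset_path)], local_scheduler=True)

if __name__ == '__main__':
    main()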

v03_pipeline/deploy/

Dockerfiles for the loading pipeline itself and any annotation utilities. Kubernetes manifests are managed separately in seqr-helm.

v03_pipeline/lib/

Core logic and shared libraries.

  • annotations defines the Hail logic that re-formats and standardizes fields.
  • methods wraps Hail-defined genomics methods for QC.
  • misc contains standalone utility modules.
    • misc/clickhouse hosts the logic that manages Parquet ingestion into ClickHouse itself.
  • core defines key constants, enums, and configuration.
  • reference_datasets manages parsing of raw reference sources into Hail Tables.
  • tasks specifies the Luigi-defined pipeline. Note that Luigi pipelines are declared through their requirements, so the pipeline is, effectively, defined in reverse (see the sketch after this list).
    • WriteSuccessFile is the last task, defining a requires() method that runs the pipeline either locally or on scalable compute.
    • WriteImportedCallset is the first task, importing a VCF into a Hail MatrixTable, an "imported callset".
  • test holds a few utilities used by the tests, which are dispersed throughout the rest of the repository.
  • paths.py defines the paths for all intermediate and output files of the pipeline.
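
To make "defined in reverse" concrete, here is a toy two-task Luigi pipeline (not the pipeline's actual task classes): building the last task pulls in everything upstream through requires().

import luigi

class WriteImportedCallsetTask(luigi.Task):
    # First step: in the real pipeline this imports the VCF into a Hail MatrixTable.
    def output(self):
        return luigi.LocalTarget('imported_callset.txt')

    def run(self):
        with self.output().open('w') as f:
            f.write('imported')

class WriteSuccessFileTask(luigi.Task):
    # Last step: declaring the first step in requires() is what defines the pipeline.
    def requires(self):
        return WriteImportedCallsetTask()

    def output(self):
        return luigi.LocalTarget('_SUCCESS')

    def run(self):
        with self.output().open('w') as f:
            f.write('done')

if __name__ == '__main__':
    luigi.build([WriteSuccessFileTask()], local_scheduler=True)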

v03_pipeline/ops/

Manual operations scripts.

v03_pipeline/var/

Static configuration and test files.


⚙️ Setup for Local Development

The production pipeline runs on Python 3.11.

Clone the repo and install the Python requirements:

git clone https://github.com/broadinstitute/seqr-loading-pipelines.git
cd seqr-loading-pipelines
pip install -r requirements.txt
pip install -r requirements-dev.txt

Install and start ClickHouse with the provided test configuration:

curl https://clickhouse.com/ | sh
./clickhouse server --config-file=./v03_pipeline/var/clickhouse_config/test-clickhouse.xml

Run an Individual Test

nosetests v03_pipeline/lib/misc/math_test.py

Formatting and Linting

ruff format .
ruff check .

🚪 Schema Entrypoints

  • The expected fields and types are defined in dataset_type.py as the col_fields, entry_fields, and row_fields properties (see the sketch after this list). Examples of the SNV_INDEL/MITO/SV/GCNV callset schemas may be found in the tests.
  • The VEP schema is defined in JSON within the vep*.json config files, then parsed into Hail in lib/annotations/vep.py.
  • Examples of exported parquets may be found in lib/tasks/exports/*_parquet_test.py.
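
For orientation, the col, entry, and row fields correspond to the three axes of a Hail MatrixTable (samples, genotypes, and variants). A minimal sketch of inspecting them on an imported callset, assuming an illustrative file path:

import hail as hl

# Import a callset; in the pipeline this happens in the WriteImportedCallset task.
mt = hl.import_vcf('callset.vcf.bgz', reference_genome='GRCh38')

mt.describe()          # prints the full schema
print(mt.col.dtype)    # col fields: per-sample (e.g. s)
print(mt.entry.dtype)  # entry fields: per-genotype (e.g. GT, AD, GQ)
print(mt.row.dtype)    # row fields: per-variant (e.g. locus, alleles, filters)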

🚶‍♂️ ClickHouse Loader Walkthrough

  • The ClickHouse loader follows the pattern established in the Making a Large Data Load Resilient blog post (a sketch of the pattern follows this list):
    • Rows are first loaded into a staging database that mirrors the production TABLEs and MATERIALIZED VIEWs.
    • After all entries are inserted, we validate the inserted row count and finalize the per-project allele frequency aggregation.
    • Partitions are then atomically moved from the staging database to production.
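
A condensed sketch of the staging-then-move pattern, using the clickhouse-connect Python client against a local server; the database, table, column, and partition names here are illustrative, not the loader's actual schema:

import clickhouse_connect

# Connect to the locally running ClickHouse server (see the setup section above).
client = clickhouse_connect.get_client(host='localhost')

client.command('CREATE DATABASE IF NOT EXISTS staging')
client.command('CREATE DATABASE IF NOT EXISTS production')

# A toy production table, partitioned by project so that a single project's
# partition can be swapped in atomically.
client.command(
    'CREATE TABLE IF NOT EXISTS production.entries '
    '(project String, key UInt64, gt Int8) '
    'ENGINE = MergeTree PARTITION BY project ORDER BY key'
)

# 1. Stage: create an empty structural copy of the production table and load into it.
client.command('CREATE TABLE IF NOT EXISTS staging.entries AS production.entries')
client.command("INSERT INTO staging.entries VALUES ('proj_a', 1, 1), ('proj_a', 2, 0)")

# 2. Validate: confirm the staged row count matches what was inserted.
assert client.command("SELECT count() FROM staging.entries WHERE project = 'proj_a'") == 2

# 3. Publish: atomically move the staged partition into the production table.
client.command("ALTER TABLE staging.entries MOVE PARTITION 'proj_a' TO TABLE production.entries")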
