Rene Snajder committed May 19, 2021
commit 197b977 (1 parent: 311aa08)
Showing 6 changed files with 320 additions and 307 deletions.
@@ -1,11 +1,11 @@
dist: xenial
language: python
python: 3.7
branches:
  only:
    - main

install:
  - pip install meth5

script: true
@@ -1,167 +1,179 @@
# MetH5Format 0.3.1

[![GitHub license](https://img.shields.io/github/license/snajder-r/meth5format.svg)](https://github.com/snajder-r/meth5format/blob/master/LICENSE)
[![DOI](https://zenodo.org/badge/303672813.svg)](https://zenodo.org/badge/latestdoi/303672813)
[![Language](https://img.shields.io/badge/Language-Python3.7+-yellow.svg)](https://www.python.org/)
[![Build Status](https://travis-ci.com/snajder-r/meth5format.svg?branch=main)](https://travis-ci.com/snajder-r/meth5format)
[![Code style: black](https://img.shields.io/badge/code%20style-black-black.svg?style=flat)](https://github.com/snajder-r/black "Black (modified)")

[![PyPI version](https://badge.fury.io/py/meth5.svg)](https://badge.fury.io/py/meth5)
[![PyPI downloads](https://pepy.tech/badge/meth5)](https://pepy.tech/project/meth5)
[![Anaconda Version](https://img.shields.io/conda/v/snajder-r/meth5?color=blue)](https://anaconda.org/snajder-r/meth5)
[![Anaconda Downloads](https://anaconda.org/snajder-r/meth5/badges/downloads.svg)](https://anaconda.org/snajder-r/meth5)

MetH5 is an HDF5-based container format for methylation calls from long reads.

In the current version, the MetH5 format can store the following information:
* Log-likelihood ratio of each methylation call
* Genomic coordinates (start and end) of each methylation call
* The read name associated with each call
* Read grouping (i.e. annotation such as samples or haplotypes)

## Installation

Through pip:

```
pip install meth5
```
Through anaconda:
```
conda install -c snajder-r meth5
```
## Usage
### Creating a MetH5 file from nanopolish methylation calls
Assuming you have nanopolish methylation calls with filenames `*.tsv`, you can create a MetH5 file with the following command:
```
meth5 create_h5 --input_dir INPUT_DIR/ --output_file OUTPUT_FILE.m5
```
To annotate reads with a read grouping (for example, samples or haplotypes), run:
```
meth5 annotate_reads --m5file M5FILE.m5 --read_groups_key READ_GROUPS_KEY --read_group_file READ_GROUP_FILE
```
Here `READ_GROUPS_KEY` is the key under which the annotation is stored (you can store multiple read annotations),
and `READ_GROUP_FILE` is a tab-delimited file containing read name and read group. For example:
```
read_name group
7741f9ee-ad41-42a4-99b2-290c66960410 1
4f18b48e-a1d3-49ad-ace3-cfb96b78ad79 2
...
```
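
A read group file in this layout can be written with Python's standard `csv` module. The following is only a minimal sketch: the function name `write_read_groups` and the example mapping are illustrative and not part of `meth5`, and it assumes the header row shown in the example above is expected.

```python
import csv
import io


def write_read_groups(read_groups, handle):
    """Write a {read_name: group} mapping as a tab-delimited table."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    # Header row, following the example file layout (assumed to be required)
    writer.writerow(["read_name", "group"])
    for read_name, group in read_groups.items():
        writer.writerow([read_name, group])


# Illustrative mapping of read names to group labels
example_groups = {
    "7741f9ee-ad41-42a4-99b2-290c66960410": 1,
    "4f18b48e-a1d3-49ad-ace3-cfb96b78ad79": 2,
}

# Written to an in-memory buffer here; pass an open file handle in practice
buffer = io.StringIO()
write_read_groups(example_groups, buffer)
read_group_table = buffer.getvalue()
```

To produce a file on disk, open it with `open(path, "w", newline="")` and pass the handle instead of the buffer.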
### Quick start for python API
Here is an example of how to access methylation values from a MetH5 file:
```python
from meth5.meth5 import MetH5File

# `filename` is the path to an existing MetH5 file
with MetH5File(filename, mode="r") as m:
    # List chromosomes in the MetH5 file
    m.get_chromosomes()
    # Access chromosome 7
    chr7 = m["chr7"]
    # Get number of chunks
    chr7.get_number_of_chunks()
    # Get a container that manages the values of chunk 3
    # (note that the data is not yet loaded into memory)
    values = chr7.get_chunk(3)
    # Get the log-likelihood ratios in the container as a numpy array of shape (n,)
    llrs = values.get_llrs()
    # Get the genomic start and end locations for each methylation call in the
    # chunk as a numpy array of shape (n,2)
    ranges = values.get_ranges()
    # Compute methylation rate (beta-score of methylation) for each genomic location,
    # as well as the respective coordinates
    met_rates, met_rate_ranges = values.get_llr_site_rate()
    # You can also compute other aggregates if you like
    met_count, met_count_ranges = values.get_llr_site_aggregate(aggregation_fun=lambda llrs: (llrs > 2).sum())
    # Instead of accessing chunk-wise, you can query a genomic range
    values = chr7.get_values_in_range(36852906, 37449223)
```
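
To make the idea behind the per-site aggregates concrete, here is a minimal NumPy sketch that groups calls by genomic start position and computes the fraction of confident calls that are methylated. This is not the `meth5` implementation: the function name `site_methylation_rate`, the threshold of 2 (borrowed from the lambda above), and the rule that calls with `|LLR|` at or below the threshold are ignored are all assumptions made for illustration.

```python
import numpy as np


def site_methylation_rate(llrs, starts, llr_threshold=2.0):
    """Per-site fraction of confident calls that are methylated.

    Assumption: a call is methylated if llr > threshold, unmethylated if
    llr < -threshold, and ignored otherwise.
    """
    llrs = np.asarray(llrs, dtype=float)
    starts = np.asarray(starts)
    # Map every call to the index of its unique site
    sites, site_index = np.unique(starts, return_inverse=True)
    methylated = np.bincount(
        site_index, weights=(llrs > llr_threshold).astype(float), minlength=len(sites)
    )
    confident = np.bincount(
        site_index, weights=(np.abs(llrs) > llr_threshold).astype(float), minlength=len(sites)
    )
    # Sites without any confident call get NaN instead of a division by zero
    rates = np.full(len(sites), np.nan)
    np.divide(methylated, confident, out=rates, where=confident > 0)
    return rates, sites


# Made-up calls: two sites with confident calls, one site with only an uncertain call
rates, sites = site_methylation_rate(
    llrs=[3.5, -4.0, 0.5, 2.5, 3.0, 1.0],
    starts=[100, 100, 100, 250, 250, 300],
)
```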

More detailed API documentation is in the works. Stay tuned!

### Sparse methylation matrix

In addition to accessing methylation calls in their unraveled form, the `meth5` library can also represent
the methylation calls as a sparse matrix. Because the values are already stored in the MetH5 file in the same way a
coordinate sparse matrix would be stored in memory, this is a very cheap operation. Example:

```python
from meth5.meth5 import MetH5File

with MetH5File(filename, mode="r") as m:
    values = m["chr7"].get_values_in_range(36852906, 37449223)

    # The parameter "read_read_names" allows us to choose whether we want to load the actual
    # read names into memory. It's slightly more expensive than not reading them, so only load them
    # if you are interested in them
    matrix = values.to_sparse_methylation_matrix(read_read_names=True)

    # This is a scipy.sparse.csc_matrix matrix of dimension (r,s), containing the log-likelihood ratios of methylation
    # where r is the number of reads covering the genomic range we selected, and s is the number of unique genomic
    # ranges for which we have methylation calls. Since an LLR of 0 means total uncertainty, a 0 indicates no call.
    matrix.met_matrix

    # A numpy array of shape (s, ) containing the start position for each unique genomic range
    matrix.genomic_coord
    # A numpy array of shape (s, ) containing the end position for each unique genomic range
    matrix.genomic_coord_end

    # A numpy array of shape (r, ) containing the read names
    matrix.read_names

    # Get a submatrix containing only the first 10 genomic locations
    submatrix = matrix.get_submatrix(0, 10)

    # Get a submatrix containing only the reads in the provided list of read names
    submatrix = matrix.get_submatrix_from_read_names(allowed_read_names)
```
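
Why this conversion is cheap can be sketched with plain NumPy: each stored call already carries a value (the LLR), a row key (the read ID), and a column key (the genomic range), which is the triplet layout of a coordinate (COO) sparse matrix. The helper below is illustrative only: `to_coo_triplets` is not a `meth5` function, and as a simplification it identifies a column by the start position alone.

```python
import numpy as np


def to_coo_triplets(llrs, read_ids, starts):
    """Turn per-call arrays into COO (data, row, col) triplets."""
    # Compact row index: one row per distinct read
    reads, row = np.unique(np.asarray(read_ids), return_inverse=True)
    # Compact column index: one column per distinct start position
    sites, col = np.unique(np.asarray(starts), return_inverse=True)
    data = np.asarray(llrs, dtype=float)
    return data, row, col, reads, sites


# Made-up calls from two reads covering three sites
data, row, col, reads, sites = to_coo_triplets(
    llrs=[2.5, -3.0, 4.0, -1.5],
    read_ids=[7, 7, 9, 9],
    starts=[100, 250, 100, 300],
)
shape = (len(reads), len(sites))  # (r, s), matching the dimensions described above
```

If SciPy is available, these triplets can be passed to `scipy.sparse.coo_matrix((data, (row, col)), shape=shape)` and converted with `.tocsc()`.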

## The MetH5 Format

A MetH5 file is an HDF5 container that stores methylation calls for long reads. The structure of the HDF5 file is as follows:

```
/
├─ chromosomes
│ ├─ CHROMOSOME_NAME1
│ │ ├─ llr (float dataset of shape (n,))
│ │ ├─ read_id (int dataset of shape (n,))
│ │ ├─ range (int dataset of shape (n,2))
│ │ └─ chunk_ranges (dataset of shape (c, 2))
│ │
│ ├─ CHROMOSOME_NAME2
│ │ └─ ...
│ └─ ...
└─ reads
  ├─ read_name_mapping (string dataset of shape (r,))
  └─ read_groups
    ├─ READ_GROUP_KEY1 (int dataset of shape (r,))
    ├─ READ_GROUP_KEY2 (int dataset of shape (r,))
    └─ ...
```

Where `n` is the number of methylation calls in the respective chromosome, `c` is the number of chunks, and `r` is the total number of reads across all chromosomes.
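
As an illustration of how these datasets relate, the sketch below resolves the read name and read group of each call with NumPy index lookups. It assumes that `read_id` values are positions into `read_name_mapping` and into the `read_groups` datasets; the layout suggests this but does not state it, and the arrays are made-up stand-ins rather than data read from a real file.

```python
import numpy as np

# Stand-ins for /reads/read_name_mapping and /reads/read_groups/READ_GROUP_KEY1, shape (r,)
read_name_mapping = np.array(["read_a", "read_b", "read_c"])
read_group = np.array([1, 2, 1])

# Stand-ins for one chromosome's per-call datasets, shape (n,)
read_id = np.array([0, 2, 2, 1])
llr = np.array([3.1, -2.4, 0.3, 5.0])

# Index lookups expand the per-read annotations to one entry per call
call_read_names = read_name_mapping[read_id]
call_read_groups = read_group[read_id]

# Example: mean LLR over all calls that belong to read group 1
mean_llr_group1 = llr[call_read_groups == 1].mean()
```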

---

## Citing

The repository is archived at Zenodo. If you use `meth5`, please cite as follows:

Rene Snajder. (2021, May 18). snajder-r/meth5. Zenodo. https://doi.org/10.5281/zenodo.4772327

## Authors and contributors

* Rene Snajder (@snajder-r): rene.snajder(at)dkfz-heidelberg.de