Commit 197b977

Versipy auto bump-up

Author: Rene Snajder
1 parent 311aa08 commit 197b977

6 files changed: +320 -307 lines changed

.travis.yml

+11 -11
@@ -1,11 +1,11 @@
1-
dist: xenial
2-
language: python
3-
python: 3.7
4-
branches:
5-
only:
6-
- main
7-
8-
install:
9-
- pip install meth5
10-
11-
script: true
1+
dist: xenial
2+
language: python
3+
python: 3.7
4+
branches:
5+
only:
6+
- main
7+
8+
install:
9+
- pip install meth5
10+
11+
script: true

README.md

+179 -167
@@ -1,167 +1,179 @@
1-
from re import match# MetH5Format 0.3.1
2-
3-
[![GitHub license](https://img.shields.io/github/license/snajder-r/meth5format.svg)](https://github.com/snajder-r/meth5format/blob/master/LICENSE)
4-
[![DOI](https://zenodo.org/badge/303672813.svg)](https://zenodo.org/badge/latestdoi/303672813)
5-
[![Language](https://img.shields.io/badge/Language-Python3.7+-yellow.svg)](https://www.python.org/)
6-
[![Build Status](https://travis-ci.com/snajder-r/meth5format.svg?branch=main)](https://travis-ci.com/snajder-r/meth5format)
7-
[![Code style: black](https://img.shields.io/badge/code%20style-black-black.svg?style=flat)](https://github.com/snajder-r/black "Black (modified)")
8-
9-
10-
[![PyPI version](https://badge.fury.io/py/meth5.svg)](https://badge.fury.io/py/meth5)
11-
[![PyPI downloads](https://pepy.tech/badge/meth5)](https://pepy.tech/project/meth5)
12-
[![Anaconda Version](https://img.shields.io/conda/v/snajder-r/meth5?color=blue)](https://anaconda.org/snajder-r/meth5)
13-
[![Anaconda Downloads](https://anaconda.org/snajder-r/meth5/badges/downloads.svg)](https://anaconda.org/snajder-r/meth5)
14-
15-
MetH5 is an HDF5-based container format for methylation calls from long reads.
16-
17-
In the current version, the MetH5 format can store the following information:
18-
* Log-likelihood ratio of each methylation call
19-
* Genomic coordinates (start and end) of each methylation call
20-
* The read name associated with each call
21-
* Read grouping (i.e. annotation such as samples or haplotypes)
22-
23-
## Installation
24-
25-
Through pip:
26-
27-
```
28-
pip install meth5
29-
```
30-
31-
Through anaconda:
32-
33-
```
34-
conda install -c snajder-r meth5
35-
```
36-
37-
## Usage
38-
39-
### Creating a MetH5 file from nanopolish methylation calls
40-
41-
Assuming you have nanopolish methylation calls with filenames `*.tsv`, you can create a MetH5 file with the following command:
42-
43-
```
44-
meth5 create_h5 --input_dir INPUT_DIR/ --output_file OUTPUT_FILE.m5
45-
```
46-
47-
To annotate reads with a read grouping (for example, samples or haplotypes), run:
48-
49-
```
50-
meth5 annotate_reads --m5file M5FILE.m5 --read_groups_key READ_GROUPS_KEY --read_group_file READ_GROUP_FILE
51-
```
52-
53-
Where the `READ_GROUPS_KEY` is the key under which you want to store the annotation (you can store multiple read annotations),
54-
and `READ_GROUP_FILE` is a tab-delimited file containing read name and read group. For example:
55-
56-
```
57-
read_name group
58-
7741f9ee-ad41-42a4-99b2-290c66960410 1
59-
4f18b48e-a1d3-49ad-ace3-cfb96b78ad79 2
60-
...
61-
```
62-
63-
### Quick start for python API
64-
65-
Here is an example of how to access methylation values from a MetH5 file:
66-
67-
```python
68-
from meth5.meth5 import MetH5File
69-
70-
with MetH5File(filename, mode="r") as m:
71-
# List chromosomes in the MetH5 file
72-
m.get_chromosomes()
73-
74-
# Access chromosome 7
75-
chr7 = m["chr7"]
76-
77-
# Get number of chunks
78-
chr7.get_number_of_chunks()
79-
80-
# Get a container that manages the values of chunk 3
81-
# (note that the data is not yet loaded into memory)
82-
values = chr7.get_chunk(3)
83-
84-
# Get the log-likelihood ratios in the container as a numpy array of shape (n,)
85-
llrs = values.get_llrs()
86-
87-
# Get the genomic start and end locations for each methylation call in the
88-
# chunk as a numpy array of shape (n,2)
89-
ranges = values.get_ranges()
90-
91-
# Compute methylation rate (beta-score of methylation) for each genomic location,
92-
# as well as the respective coordinates
93-
met_rates, met_rate_ranges = values.get_llr_site_rate()
94-
95-
# You can also compute other aggregates if you like
96-
met_count, met_count_ranges = values.get_llr_site_aggregate(aggregation_fun=lambda llrs: (llrs>2).sum())
97-
98-
# Instead of accessing chunk wise, you can query a genomic range
99-
values = chr7.get_values_in_range(36852906, 37449223)
100-
```
101-
102-
A more detailed API documentation is in the works. Stay tuned!
103-
104-
### Sparse methylation matrix
105-
106-
In addition to accessing methylation calls in their unraveled form, the `meth5` library also contains a way to represent
107-
the methylation calls as a sparse matrix. Seeing how the values are already stored in the MetH5 file in the same way a
108-
coordinate sparse matrix would be stored in memory, this is a very cheap operation. Example:
109-
110-
```python
111-
from meth5.meth5 import MetH5File
112-
113-
with MetH5File(filename, mode="r") as m:
114-
values = m["chr7"].get_values_in_range(36852906, 37449223)
115-
116-
# The parameter "read_read_names" allows us to choose whether we want to load the actual
117-
# read names into memory. It's slightly more expensive than not reading it, so only load them
118-
# if you are interested in them
119-
matrix = values.to_sparse_methylation_matrix(read_read_names=True)
120-
121-
# This is a scipy.sparse.csc_matrix matrix of dimension (r,s), containing the log-likelihood ratios of methylation
122-
# where r is the number of reads covering the genomic range we selected, and s is the number of unique genomic
123-
# ranges for which we have methylation calls. Since an LLR of 0 means total uncertainty, a 0 indicates no call.
124-
matrix.met_matrix
125-
126-
# A numpy array of shape (s, ) containing the start position for each unique genomic range
127-
matrix.genomic_coord
128-
# A numpy array of shape (s, ) containing the end position for each unique genomic range
129-
matrix.genomic_coord_end
130-
131-
# A numpy array of shape (r, ) containing the read names
132-
matrix.read_names
133-
134-
# Get a submatrix containing only the first 10 genomic locations
135-
submatrix = matrix.get_submatrix(0, 10)
136-
137-
# Get a submatrix containing only the reads in the provided list of read names
138-
submatrix = matrix.get_submatrix_from_read_names(allowed_read_names)
139-
```
140-
141-
142-
143-
## The MetH5 Format
144-
145-
A MetH5 file is an HDF5 container that stores methylation calls for long reads. The structure of the HDF5 file is as follows:
146-
147-
```
148-
/
149-
├─ chromosomes
150-
│ ├─ CHROMOSOME_NAME1
151-
│ │ ├─ llr (float dataset of shape (n,))
152-
│ │ ├─ read_id (int dataset of shape (n,))
153-
│ │ ├─ range (int dataset of shape (n,2))
154-
│ │ └─ chunk_ranges (dataset of shape (c, 2))
155-
│ │
156-
│ ├─ CHROMOSOME_NAME2
157-
│ │ └─ ...
158-
│ └─ ...
159-
└─ reads
160-
├─ read_name_mapping (string dataset of shape (r,))
161-
└─ read_groups
162-
├─ READ_GROUP_KEY1 (int dataset of shape (r,))
163-
├─ READ_GROUP_KEY2 (int dataset of shape (r,))
164-
└─ ...
165-
```
166-
167-
Where `n` is the number of methylation calls in the respective chromosome, `c` is the number of chunks, and `r` is the total number of reads across all chromosomes.
1+
# MetH5Format 0.3.1
2+
3+
[![GitHub license](https://img.shields.io/github/license/snajder-r/meth5format.svg)](https://github.com/snajder-r/meth5format/blob/master/LICENSE)
4+
[![DOI](https://zenodo.org/badge/303672813.svg)](https://zenodo.org/badge/latestdoi/303672813)
5+
[![Language](https://img.shields.io/badge/Language-Python3.7+-yellow.svg)](https://www.python.org/)
6+
[![Build Status](https://travis-ci.com/snajder-r/meth5format.svg?branch=main)](https://travis-ci.com/snajder-r/meth5format)
7+
[![Code style: black](https://img.shields.io/badge/code%20style-black-black.svg?style=flat)](https://github.com/snajder-r/black "Black (modified)")
8+
9+
10+
[![PyPI version](https://badge.fury.io/py/meth5.svg)](https://badge.fury.io/py/meth5)
11+
[![PyPI downloads](https://pepy.tech/badge/meth5)](https://pepy.tech/project/meth5)
12+
[![Anaconda Version](https://img.shields.io/conda/v/snajder-r/meth5?color=blue)](https://anaconda.org/snajder-r/meth5)
13+
[![Anaconda Downloads](https://anaconda.org/snajder-r/meth5/badges/downloads.svg)](https://anaconda.org/snajder-r/meth5)
14+
15+
MetH5 is an HDF5-based container format for methylation calls from long reads.
16+
17+
In the current version, the MetH5 format can store the following information:
18+
* Log-likelihood ratio of each methylation call
19+
* Genomic coordinates (start and end) of each methylation call
20+
* The read name associated with each call
21+
* Read grouping (i.e. annotation such as samples or haplotypes)
22+
23+
## Installation
24+
25+
Through pip:
26+
27+
```
28+
pip install meth5
29+
```
30+
31+
Through anaconda:
32+
33+
```
34+
conda install -c snajder-r meth5
35+
```
36+
37+
## Usage
38+
39+
### Creating a MetH5 file from nanopolish methylation calls
40+
41+
Assuming you have nanopolish methylation calls with filenames `*.tsv`, you can create a MetH5 file with the following command:
42+
43+
```
44+
meth5 create_h5 --input_dir INPUT_DIR/ --output_file OUTPUT_FILE.m5
45+
```
46+
47+
To annotate reads with a read grouping (for example, samples or haplotypes), run:
48+
49+
```
50+
meth5 annotate_reads --m5file M5FILE.m5 --read_groups_key READ_GROUPS_KEY --read_group_file READ_GROUP_FILE
51+
```
52+
53+
Where the `READ_GROUPS_KEY` is the key under which you want to store the annotation (you can store multiple read annotations),
54+
and `READ_GROUP_FILE` is a tab-delimited file containing read name and read group. For example:
55+
56+
```
57+
read_name group
58+
7741f9ee-ad41-42a4-99b2-290c66960410 1
59+
4f18b48e-a1d3-49ad-ace3-cfb96b78ad79 2
60+
...
61+
```
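
As a rough illustration of the read group file layout shown above, the following sketch parses such a file with the Python standard library. The helper `parse_read_groups` is hypothetical and not part of `meth5`. It assumes a header row with the column names `read_name` and `group` and integer group labels, as in the example.

```python
import csv
import io


def parse_read_groups(handle):
    """Return {read_name: group} from a tab-delimited file with a header row.

    Hypothetical helper for illustration only; assumes the columns are
    named "read_name" and "group" and that groups are integers.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["read_name"]: int(row["group"]) for row in reader}


# In-memory stand-in for a READ_GROUP_FILE on disk
example = io.StringIO(
    "read_name\tgroup\n"
    "7741f9ee-ad41-42a4-99b2-290c66960410\t1\n"
    "4f18b48e-a1d3-49ad-ace3-cfb96b78ad79\t2\n"
)
read_groups = parse_read_groups(example)
```

For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` object.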
62+
63+
### Quick start for python API
64+
65+
Here is an example of how to access methylation values from a MetH5 file:
66+
67+
```python
68+
from meth5.meth5 import MetH5File
69+
70+
with MetH5File(filename, mode="r") as m:
71+
# List chromosomes in the MetH5 file
72+
m.get_chromosomes()
73+
74+
# Access chromosome 7
75+
chr7 = m["chr7"]
76+
77+
# Get number of chunks
78+
chr7.get_number_of_chunks()
79+
80+
# Get a container that manages the values of chunk 3
81+
# (note that the data is not yet loaded into memory)
82+
values = chr7.get_chunk(3)
83+
84+
# Get the log-likelihood ratios in the container as a numpy array of shape (n,)
85+
llrs = values.get_llrs()
86+
87+
# Get the genomic start and end locations for each methylation call in the
88+
# chunk as a numpy array of shape (n,2)
89+
ranges = values.get_ranges()
90+
91+
# Compute methylation rate (beta-score of methylation) for each genomic location,
92+
# as well as the respective coordinates
93+
met_rates, met_rate_ranges = values.get_llr_site_rate()
94+
95+
# You can also compute other aggregates if you like
96+
met_count, met_count_ranges = values.get_llr_site_aggregate(aggregation_fun=lambda llrs: (llrs>2).sum())
97+
98+
# Instead of accessing chunk wise, you can query a genomic range
99+
values = chr7.get_values_in_range(36852906, 37449223)
100+
```
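
To make the idea of a per-site methylation rate more concrete, here is a minimal pure-Python sketch of one common way to derive it from log-likelihood ratios. It is not the `meth5` implementation: the function name `site_methylation_rate`, the threshold of 2.0, and the rule of ignoring calls with `|LLR|` at or below the threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def site_methylation_rate(llrs, ranges, llr_threshold=2.0):
    """Per-site fraction of confident calls that are methylated.

    Assumed convention (not taken from meth5): a call is methylated if
    llr > threshold, unmethylated if llr < -threshold, and ignored otherwise.
    Sites without any confident call are omitted from the result.
    """
    counts = defaultdict(lambda: [0, 0])  # site -> [methylated, confident]
    for llr, site in zip(llrs, ranges):
        if abs(llr) <= llr_threshold:
            continue  # uncertain call, skip it
        counts[site][1] += 1
        if llr > 0:
            counts[site][0] += 1
    return {site: met / total for site, (met, total) in sorted(counts.items())}


# Six calls covering two genomic sites, given as (start, end) tuples
llrs = [3.1, -4.0, 0.5, 2.5, -2.2, 6.0]
ranges = [(100, 101), (100, 101), (100, 101), (250, 252), (250, 252), (250, 252)]
rates = site_methylation_rate(llrs, ranges)
```

Here the first site has one methylated and one unmethylated confident call (the call with LLR 0.5 is ignored), and the second site has two methylated calls out of three.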
101+
102+
A more detailed API documentation is in the works. Stay tuned!
103+
104+
### Sparse methylation matrix
105+
106+
In addition to accessing methylation calls in their unraveled form, the `meth5` library also contains a way to represent
107+
the methylation calls as a sparse matrix. Seeing how the values are already stored in the MetH5 file in the same way a
108+
coordinate sparse matrix would be stored in memory, this is a very cheap operation. Example:
109+
110+
```python
111+
from meth5.meth5 import MetH5File
112+
113+
with MetH5File(filename, mode="r") as m:
114+
values = m["chr7"].get_values_in_range(36852906, 37449223)
115+
116+
# The parameter "read_read_names" allows us to choose whether we want to load the actual
117+
# read names into memory. It's slightly more expensive than not reading it, so only load them
118+
# if you are interested in them
119+
matrix = values.to_sparse_methylation_matrix(read_read_names=True)
120+
121+
# This is a scipy.sparse.csc_matrix matrix of dimension (r,s), containing the log-likelihood ratios of methylation
122+
# where r is the number of reads covering the genomic range we selected, and s is the number of unique genomic
123+
# ranges for which we have methylation calls. Since an LLR of 0 means total uncertainty, a 0 indicates no call.
124+
matrix.met_matrix
125+
126+
# A numpy array of shape (s, ) containing the start position for each unique genomic range
127+
matrix.genomic_coord
128+
# A numpy array of shape (s, ) containing the end position for each unique genomic range
129+
matrix.genomic_coord_end
130+
131+
# A numpy array of shape (r, ) containing the read names
132+
matrix.read_names
133+
134+
# Get a submatrix containing only the first 10 genomic locations
135+
submatrix = matrix.get_submatrix(0, 10)
136+
137+
# Get a submatrix containing only the reads in the provided list of read names
138+
submatrix = matrix.get_submatrix_from_read_names(allowed_read_names)
139+
```
140+
141+
142+
143+
## The MetH5 Format
144+
145+
A MetH5 file is an HDF5 container that stores methylation calls for long reads. The structure of the HDF5 file is as follows:
146+
147+
```
148+
/
149+
├─ chromosomes
150+
│ ├─ CHROMOSOME_NAME1
151+
│ │ ├─ llr (float dataset of shape (n,))
152+
│ │ ├─ read_id (int dataset of shape (n,))
153+
│ │ ├─ range (int dataset of shape (n,2))
154+
│ │ └─ chunk_ranges (dataset of shape (c, 2))
155+
│ │
156+
│ ├─ CHROMOSOME_NAME2
157+
│ │ └─ ...
158+
│ └─ ...
159+
└─ reads
160+
├─ read_name_mapping (string dataset of shape (r,))
161+
└─ read_groups
162+
├─ READ_GROUP_KEY1 (int dataset of shape (r,))
163+
├─ READ_GROUP_KEY2 (int dataset of shape (r,))
164+
└─ ...
165+
```
166+
167+
Where `n` is the number of methylation calls in the respective chromosome, `c` is the number of chunks, and `r` is the total number of reads across all chromosomes.
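
The flat per-call datasets (`llr`, `read_id`, `range`) resemble the triplets of a coordinate (COO) sparse matrix. The following sketch shows that correspondence in plain Python using small made-up arrays. The helper `to_coo` is hypothetical and only illustrates the idea; it does not reflect how `meth5` builds its matrix internally.

```python
def to_coo(llrs, read_ids, ranges):
    """Build COO triplets (rows, cols, values) from flat per-call arrays.

    Hypothetical helper: rows index the distinct reads, columns index the
    distinct genomic ranges, and values are the log-likelihood ratios.
    """
    reads = sorted(set(read_ids))
    sites = sorted(set(ranges))
    row_of = {read: i for i, read in enumerate(reads)}
    col_of = {site: j for j, site in enumerate(sites)}
    rows = [row_of[read] for read in read_ids]
    cols = [col_of[site] for site in ranges]
    return rows, cols, list(llrs), reads, sites


# Made-up data: four calls from three reads over two genomic ranges
llrs = [3.1, -4.0, 2.5, -2.2]
read_ids = [7, 9, 7, 12]
ranges = [(100, 101), (100, 101), (250, 252), (250, 252)]
rows, cols, vals, reads, sites = to_coo(llrs, read_ids, ranges)
```

If SciPy is available, these triplets can be handed to `scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(len(reads), len(sites)))` to obtain a matrix of shape `(r, s)`.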
168+
169+
---
170+
171+
## Citing
172+
173+
The repository is archived at Zenodo. If you use `meth5`, please cite as follows:
174+
175+
Rene Snajder. (2021, May 18). snajder-r/meth5. Zenodo. https://doi.org/10.5281/zenodo.4772327
176+
177+
## Authors and contributors
178+
179+
* Rene Snajder (@snajder-r): rene.snajder(at)dkfz-heidelberg.de
