Hi-cGAN is designed to predict Hi-C contact matrices from one-dimensional chromatin feature data, e.g. from ChIP-seq experiments.
The network architecture is inspired by [pix2pix from Isola et al.](https://doi.org/10.1109/CVPR.2017.632), amended by custom embedding networks to embed the one-dimensional chromatin feature data into grayscale images.

Hi-cGAN was created in 2020/2021 as part of a master's thesis at Albert-Ludwigs University, Freiburg, Germany. It is provided under the [GPLv3 license](https://github.com/MasterprojectRK/Hi-cGAN/blob/main/LICENSE).

## Installation
Hi-cGAN has been designed for Linux operating systems (tested under Ubuntu 20.04 and CentOS 7.9.2009). Other operating systems are not supported and probably won't work.
Simply `git clone` this repository into an empty folder of your choice.
It is recommended to use conda or another package manager to install
the following dependencies into an empty environment:

tensorflow-gpu | 2.2.0
tqdm | 4.50.2

Other versions *might* work, but are untested and might cause dependency
conflicts. Updating to tensorflow 2.3.x should be possible but has not been tested. Using tensorflow without GPU support is possible, but will be very slow and is thus not recommended.
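
As an illustration, a minimal setup could look like the sketch below. The environment name, Python version, and channel are assumptions, and only the package pins visible in the (abbreviated) table above are listed; add the remaining dependencies from the full table accordingly.

```bash
# Sketch only: environment name, Python version and channel are assumptions,
# and the remaining dependencies from the table above still need to be added.
conda create -n hicgan python=3.7
conda activate hicgan
conda install tensorflow-gpu=2.2.0 tqdm=4.50.2
git clone https://github.com/MasterprojectRK/Hi-cGAN.git
```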
## Input data requirements
* Hi-C matrix / matrices in cooler format for training.

Hi-cGAN uses a sliding window approach to generate training samples (and test samples).

Synopsis: `python training.py [parameters and options]`
Parameters / Options (see the example invocation after this list):

- --trainmatrices | -tm
  - required
  - Hi-C matrices for training
  - must be in cooler format
  - use this option multiple times to specify more than one matrix (e.g. `-tm matrix1.cool -tm matrix2.cool`)
  - the first matrix belongs to the first training chromatin feature path and so on, see below
- --trainChroms | -tchroms
  - required
  - chromosomes for training
  - specify without leading "chr" and separated by spaces, e.g. "1 3 5 11"
  - these chromosomes must be present in all training matrices
- --trainChromPaths | -tcp
  - required
  - path where the chromatin features for training reside
  - the program will look for bigwig files in this folder; subfolders are not considered
  - file extension must be "bigwig", "bigWig" or "bw"
  - specify one trainChromPath for each training matrix, in the desired order
  - chromatin features for training and prediction must have the same base names
- --valMatrices | -vm
  - required
  - Hi-C matrices for validation
  - must be in cooler format
  - use this option multiple times to specify more than one matrix
- --valChroms | -vchroms
  - required
  - same as trainChroms, just for validation
- --valChromPaths | -vcp
  - required
  - same as trainChromPaths, just for validation
- --windowsize | -ws
  - required
  - window size in bins for the submatrices in the sliding window approach
  - choose from 64, 128, 256
  - default: 64
  - choose a reasonable value according to the matrix bin size
  - if the matrix has a bin size of 5 kbp, then a windowsize of 64 corresponds to an actual window size of 64 * 5 kbp = 320 kbp
- --outfolder | -o
  - required
  - folder where the output will be stored
  - must be writable and have several hundred MB of free storage space
- --epochs | -ep
  - required
  - number of epochs for training
- --batchsize | -bs
  - required
  - batch size for training
  - integer between 1 and 256
  - default: 32
  - mind the memory limits of your GPU
  - in a test environment with 15 GB GPU memory, batch sizes 32, 4 and 2 were safely within limits for window sizes 64, 128 and 256, respectively
- --lossWeightPixel | -lwp
  - optional
  - loss weight for the L1 or L2 loss in the generator
  - float >= 1e-10
  - default: 100.0
- --lossWeightDisc | -lwd
  - optional
  - loss weight for the discriminator error
  - float >= 1e-10
  - default: 0.5
- --lossTypePixel | -ltp
  - optional
  - type of per-pixel loss to use for the generator
  - choose from "L1" (mean absolute error) or "L2" (mean squared error)
  - default: L1
- --lossWeightTv | -lwt
  - optional
  - loss weight for the total variation loss of the generator
  - float >= 0.0
  - default: 1e-10
  - higher values mean more smoothing
- --lossWeightAdv | -lwa
  - optional
  - loss weight for the adversarial loss in the generator
  - float >= 1e-10
  - default: 1.0
- --learningRateGen | -lrg
  - optional
  - learning rate for the Adam optimizer of the generator
  - float in 1e-10...1.0
  - default: 2e-5
- --learningRateDisc | -lrd
  - optional
  - learning rate for the Adam optimizer of the discriminator
  - float in 1e-10...1.0
  - default: 1e-6
- --beta1 | -b1
  - optional
  - beta1 parameter for the Adam optimizers (generator and discriminator)
  - float in 1e-2...1.0
  - default: 0.5
- --flipsamples | -fs
  - optional
  - flip training matrices and chromatin features (data augmentation)
  - boolean
  - default: False
- --embeddingType | -emb
  - optional
  - type of embedding to use for generator and discriminator
  - choose from 'CNN' (convolutional neural network), 'DNN' (dense neural network by [Farré et al.](https://doi.org/10.1186/s12859-018-2286-z)) or 'mixed' (generator: CNN, discriminator: DNN)
  - default: CNN
  - CNN is recommended
- --pretrainedIntroModel | -ptm
  - optional
  - undocumented, developer use only
- --figuretype | -ft
  - optional
  - figure type for all plots
  - choose from png, pdf, svg
  - default: png
- --recordsize | -rs
  - optional
  - approximate size (number of samples) of the TFRecord files used in the data pipeline for training
  - can be tweaked to balance the load between RAM / GPU / CPU
  - integer >= 10
  - default: 2000
- --plotFrequency | -pfreq
  - optional
  - update and save the loss-over-epoch plots after this number of epochs
  - integer >= 1
  - default: 10
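
A minimal invocation might look like the sketch below. All file and folder names are placeholders (they are not shipped with the repository), and the quoting of the chromosome lists assumes the options accept a single space-separated string, as in the examples above.

```bash
# Sketch only: matrix files, feature folders, chromosomes and epoch count are placeholders.
python training.py \
  -tm GM12878.cool \
  -tcp chromatin_features/GM12878/ \
  -tchroms "1 3 5 11" \
  -vm GM12878.cool \
  -vcp chromatin_features/GM12878/ \
  -vchroms "2 4" \
  -ws 64 -bs 32 -ep 100 \
  -o training_output/
```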
Returns:
* The following files will be stored in the chosen output path (option `-o`)

This script will predict Hi-C matrices using chromatin features and a trained generator model.

Synopsis: `python predict.py [parameters and options]`
Parameters / Options (see the example invocation after this list):

- --trainedModel | -trm
  - required
  - trained generator model to predict from, in HDF5 (.h5) format
  - generated by training.py above
- --testChromPath | -tcp
  - required
  - same as trainChromPaths, just for testing / prediction
  - the number and base names of the bigwig files in this path must be the same as for training
- --testChroms | -tchroms
  - required
  - chromosomes for testing (to be predicted)
  - must be available in all bigwig files
  - input format: without "chr" and separated by spaces, e.g. "8 12 21"
- --outfolder | -o
  - required
  - output path for the predicted Hi-C matrices (in cooler format)
  - default: current path
- --multiplier | -mul
  - optional
  - multiplier for better visualization of results
  - integer >= 1
  - default: 1000
- --binsize | -b
  - required
  - bin size for binning the chromatin features
  - usually equal to the bin size used for training (but not mandatory)
  - integer >= 1000
- --batchsize | -bs
  - optional
  - batch size for prediction
  - the same considerations as for training.py hold
  - integer >= 1
  - default: 32
- --windowsize | -ws
  - required
  - window size for prediction
  - choose from 64, 128, 256
  - must be the same as for training
  - could in the future be detected from the trained model; for now, just enter the appropriate value
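
As with training, a hypothetical invocation is sketched below. The trained model path and feature folder are placeholders, the window size must match the one used for training, and the bin size of 5000 is only an example.

```bash
# Sketch only: model file, feature folder and chromosomes are placeholders.
python predict.py \
  -trm training_output/trained_generator.h5 \
  -tcp chromatin_features/K562/ \
  -tchroms "8 12 21" \
  -b 5000 -ws 64 \
  -o predictions/
```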
Returns:
* Predicted matrix in cooler format, defined for the specified test chromosomes.
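
For a quick sanity check of the result, the cooler command-line tools (installed alongside the cooler Python package) can be used. This is only a sketch: the file name below is a placeholder for whatever predict.py writes to the output folder, and the chromosome name in the region argument must match how chromosomes are named inside the predicted file.

```bash
# Sketch only: replace predicted.cool with the actual file written by predict.py.
cooler info predictions/predicted.cool                          # bin size, chromosome table, metadata
cooler dump -t pixels -r 8 predictions/predicted.cool | head    # first few contacts on chromosome 8
```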