Commit 7694c43

Ralf committed: update readme for readability
1 parent 2ca079e commit 7694c43

1 file changed: +167 -87 lines changed

README.md

@@ -5,10 +5,12 @@ designed to predict Hi-C contact matrices from one-dimensional
 chromatin feature data, e. g. from ChIP-seq experiments.
 The network architecture is inspired by [pix2pix from Isola et al.](https://doi.org/10.1109/CVPR.2017.632), amended by custom embedding networks to embed the one-dimensional chromatin feature data into grayscale images.

-Hi-cGAN was created in 2020/2021 as part of a master thesis at Albert-Ludwigs university, Freiburg, Germany. It is provided under the [GPLv3 license](https://github.com/MasterprojectRK/Hi-cGAN/blob/main/LICENSE).
+Hi-cGAN was created in 2020/2021 as part of a master thesis at Albert-Ludwigs university, Freiburg, Germany. It is provided under the [GPLv3 license](https://github.com/MasterprojectRK/Hi-cGAN/blob/main/LICENSE).

 ## Installation

+Hi-cGAN has been designed for Linux operating systems (tested under Ubuntu 20.04 and CentOS 7.9.2009). Other operating systems are not supported and probably won't work.
+
 Simply `git clone` this repository into an empty folder of your choice.
 It is recommended to use conda or another package manager to install
 the following dependencies into an empty environment:
@@ -29,7 +31,8 @@ tensorflow-gpu | 2.2.0
 tqdm | 4.50.2

 Other versions *might* work, but are untested and might cause dependency
-conflicts. Using tensorflow without GPU support is possible, but will be very slow and is thus not recommended.
+conflicts. Updating to tensorflow 2.3.x should be possible but has not been tested. Using tensorflow without GPU support is possible, but will be very slow and is thus not recommended.
+

 ## Input data requirements
 * Hi-C matrix / matrices in cooler format for training.
@@ -57,69 +60,128 @@ Hi-cGAN is using a sliding window approach to generate training samples (and tes

 Synopsis: `python training.py [parameters and options]`
 Parameters / Options:
-* --trainmatrices, -tm [required]
-Hi-C matrices for training. Must be in cooler format. Use this option multiple times to specify more than one matrix (e.g. `-tm matrix1.cool -tm matrix2.cool`). First matrix belongs to first training chromatin feature path and so on, see below.
-* --trainChroms, -tchroms [required]
-Chromosomes for training. Specify without leading "chr" and separated by spaces,
-e.g. "1 3 5 11". These chromosomes must be present in all train matrices.
-* --trainChromPaths, -tcp [required]
-Path where chromatin features for training reside.
-The program will look for bigwig files in this folder, subfolders are not considered.
-Specify one trainChromPath for each training matrix, in the desired order,
-see above.
-Note that the chromatin features for training and prediction must have the same names.
-* --valMatrices, -vm, [required]
-Hi-C matrices for validation. Must be in cooler format. Use this option multiple times to specify more than one matrix.
-* --valChroms, -vchroms [required]
-Same as trainChroms, just for validation
-* --valChromPaths, -vcp [required]
-Same as trainChromPaths, just for validation
-* --windowsize, -ws [required]
-Windowsize in bins for submatrices in sliding window approach. 64, 128 and 256 are supported. If the matrix has a bin size of 5kbp, then a windowsize of 64 corresponds to an actual windowsize of 64*5kbp = 320kbp
-Default: 64.
-* --outfolder, -o [required]
-Folder where output will be stored.
-Must be writable and have several 100s of MB of free storage space.
-* --epochs, -ep [required]
-Number of epochs for training.
-* --batchsize, -bs [required]
-Batch size for training. Choose integer between 1 and 256.
-Mind the memory limits of your GPU; in a test environment with 15GB GPU memory, batchsizes 32,4,2 were safely within limits for windowsizes 64,128,256, respectively.
-* --lossWeightPixel, -lwp
-Loss weight for the L1 or L2 loss in the generator.
-Floating point value, default: 100.
-* --lossWeightDisc, -lwd
-loss weight for the discriminator error, floating point value, default: 0.5
-* --lossTypePixel, -ltp
-Type of per-pixel loss to use for the generator; choose from L1 (mean abs. error) or L2 (mean squared error). Default: L1.
-* --lossWeightTv, -lwt
-loss weight for Total-Variation-loss of generator; higher value - more smoothing.
-Default: 1e-10.
-* --lossWeightAdv, -lwa
-loss weight for adversarial loss in the generator.
-Default: 1.0
-* --learningRateGen, -lrg
-Learning rate for the Adam optimizer of the Generator.
-Default: 2e-5.
-* --learningRateDisc, -lrd
-Learning rate for the Adam optimizer of the Discriminator.
-Default: 1e-6
-* --beta1, -b1
-beta1 parameter for the Adam optimizers (Generator and Discriminator). Default 0.5.
-* --flipsamples, -fs
-Flip training matrices and chromatin features (data augmentation). Default: False.
-* --embeddingType, -emb
-Type of embedding to use for generator and discriminator.
-Choose from 'CNN' (convolutional neural network), 'DNN' (dense neural network by [Farré et al.](https://doi.org/10.1186/s12859-018-2286-z)), or 'mixed' (Generator - CNN, Discriminator - DNN).
-Default: CNN
-* --pretrainedIntroModel, -ptm
-Undocumented, developer use only.
-* --figuretype, -ft
-Figure type for all plots, choose from png, pdf, svg. Default: png.
-* --recordsize, -rs
-Approx. size (number of samples) of the tfRecords used in the data pipeline for training. Can be tweaked to balance the load between RAM / GPU / CPU. Default: 2000.
-* --plotFrequency, -pfreq
-Update and save loss over epoch plots after this number of epochs
+- --trainmatrices | -tm
+- required
+- Hi-C matrices for training
+- must be in cooler format
+- use this option multiple times to specify more than one matrix (e.g. `-tm matrix1.cool -tm matrix2.cool`)
+- first matrix belongs to first training chromatin feature path and so on, see below
+- --trainChroms | -tchroms
+- required
+- chromosomes for training
+- specify without leading "chr" and separated by spaces,
+e.g. "1 3 5 11"
+- these chromosomes must be present in all train matrices
+- --trainChromPaths | -tcp
+- required
+- path where chromatin features for training reside
+- program will look for bigwig files in this folder, subfolders are not considered
+- file extension must be "bigwig", "bigWig" or "bw"
+- specify one trainChromPath for each training matrix, in the desired order
+- chromatin features for training and prediction must have the same base names
+- --valMatrices | -vm
+- required
+- Hi-C matrices for validation
+- must be in cooler format.
+- use this option multiple times to specify more than one matrix
+- --valChroms | -vchroms
+- required
+- same as trainChroms, just for validation
+- --valChromPaths | -vcp
+- required
+- same as trainChromPaths, just for validation
+- --windowsize | -ws
+- required
+- window size in bins for submatrices in sliding window approach
+- choose from 64, 128, 256
+- default: 64
+- choose reasonable value according to matrix bin size
+- if the matrix has a bin size of 5kbp, then a windowsize of 64 corresponds to an actual windowsize of 64*5kbp = 320kbp
+- --outfolder | -o
+- required
+- folder where output will be stored
+- must be writable and have several 100s of MB of free storage space
+- --epochs | -ep
+- required
+- number of epochs for training
+- --batchsize | -bs
+- required
+- batch size for training
+- integer between 1 and 256
+- default: 32
+- mind the memory limits of your GPU
+- in a test environment with 15GB GPU memory, batchsizes 32,4,2 were safely within limits for windowsizes 64,128,256, respectively
+- --lossWeightPixel | -lwp
+- optional
+- loss weight for the L1 or L2 loss in the generator
+- float >= 1e-10
+- default: 100.0
+- --lossWeightDisc | -lwd
+- optional
+- loss weight for the discriminator error
+- float >= 1e-10
+- default: 0.5
+- --lossTypePixel | -ltp
+- optional
+- type of per-pixel loss to use for the generator
+- choose from "L1" (mean abs. error) or "L2" (mean squared error)
+- default: L1
+- --lossWeightTv | -lwt
+- optional
+- loss weight for Total-Variation-loss of generator
+- float >= 0.0
+- default: 1e-10
+- higher value - more smoothing
+- --lossWeightAdv | -lwa
+- optional
+- loss weight for adversarial loss in the generator
+- float >= 1e-10
+- default: 1.0
+- --learningRateGen | -lrg
+- optional
+- learning rate for the Adam optimizer of the generator
+- float in 1e-10...1.0
+- default: 2e-5
+- --learningRateDisc | -lrd
+- optional
+- learning rate for the Adam optimizer of the discriminator
+- float in 1e-10...1.0
+- default: 1e-6
+- --beta1 | -b1
+- optional
+- beta1 parameter for the Adam optimizers (generator and discriminator)
+- float in 1e-2...1.0
+- default 0.5.
+- --flipsamples | -fs
+- optional
+- flip training matrices and chromatin features (data augmentation)
+- boolean
+- default: False
+- --embeddingType | -emb
+- optional
+- type of embedding to use for generator and discriminator
+- choose from 'CNN' (convolutional neural network), 'DNN' (dense neural network by [Farré et al.](https://doi.org/10.1186/s12859-018-2286-z)), or 'mixed' (Generator - CNN, Discriminator - DNN)
+- default: CNN
+- CNN is recommended
+- --pretrainedIntroModel | -ptm
+- optional
+- undocumented, developer use only
+- --figuretype | -ft
+- optional
+- figure type for all plots
+- choose from png, pdf, svg
+- default: png
+- --recordsize | -rs
+- optional
+- approx. size (number of samples) of the tfRecords used in the data pipeline for training
+- can be tweaked to balance the load between RAM / GPU / CPU
+- integer >= 10
+- default: 2000
+- --plotFrequency | -pfreq
+- optional
+- update and save loss over epoch plots after this number of epochs
+- integer >= 1
+- default: 10

 Returns:
 * The following files will be stored in the chosen output path (option `-o`)
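
The training options documented in this hunk can be illustrated with a small sketch. It only assembles an argument list for `training.py` from the short flags listed above and checks the value ranges the README states; it is not part of Hi-cGAN. The helper names (`build_training_argv`, `window_span_bp`) and all file and folder names are hypothetical, and the epoch count of 100 is an arbitrary placeholder.

```python
import shlex

# Values listed for --windowsize in the README text above.
SUPPORTED_WINDOW_SIZES = (64, 128, 256)


def window_span_bp(windowsize_bins, binsize_bp):
    """Genomic span of one sliding window in base pairs."""
    return windowsize_bins * binsize_bp


def build_training_argv(train_matrices, train_chrom_paths, train_chroms,
                        val_matrices, val_chrom_paths, val_chroms,
                        outfolder, epochs, windowsize=64, batchsize=32):
    """Assemble an argument list for training.py (hypothetical helper)."""
    if windowsize not in SUPPORTED_WINDOW_SIZES:
        raise ValueError(f"windowsize must be one of {SUPPORTED_WINDOW_SIZES}")
    if not 1 <= batchsize <= 256:
        raise ValueError("batchsize must be an integer between 1 and 256")
    if len(train_matrices) != len(train_chrom_paths):
        raise ValueError("need one trainChromPath per training matrix")
    if len(val_matrices) != len(val_chrom_paths):
        raise ValueError("need one valChromPath per validation matrix")

    argv = ["python", "training.py"]
    # Each flag is repeated once per matrix; the i-th matrix pairs with
    # the i-th chromatin feature path.
    for matrix in train_matrices:
        argv += ["-tm", matrix]
    for path in train_chrom_paths:
        argv += ["-tcp", path]
    for matrix in val_matrices:
        argv += ["-vm", matrix]
    for path in val_chrom_paths:
        argv += ["-vcp", path]
    # Chromosomes are passed as one space-separated string without "chr".
    argv += ["-tchroms", " ".join(train_chroms)]
    argv += ["-vchroms", " ".join(val_chroms)]
    argv += ["-ws", str(windowsize), "-bs", str(batchsize),
             "-ep", str(epochs), "-o", outfolder]
    return argv


# Hypothetical file and folder names, for illustration only.
argv = build_training_argv(
    train_matrices=["matrix1.cool", "matrix2.cool"],
    train_chrom_paths=["features1/", "features2/"],
    train_chroms=["1", "3", "5", "11"],
    val_matrices=["matrix_val.cool"],
    val_chrom_paths=["features_val/"],
    val_chroms=["19"],
    outfolder="trained_model/",
    epochs=100,
)
print(shlex.join(argv))           # shell-quoted command line
print(window_span_bp(64, 5_000))  # 64 bins * 5 kbp = 320000 bp
```

The new list marks `--windowsize` and `--batchsize` as required while also stating defaults, so the sketch passes both explicitly rather than relying on either reading.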
@@ -134,28 +196,46 @@ This script will predict Hi-C matrices using chromatin features and a trained ge

 Synopsis: `python predict.py [parameters and options]`
 Parameters / Options:
-* --trainedModel, -trm [required]
-Trained generator model to predict from, h5py format.
-Generated by training.py above.
-* --testChromPath, -tcp [required]
-Same as trainChromPaths, just for testing / prediction.
-The number and names of bigwig files in this path must be the same as for training.
-* --testChroms, -tchroms [required]
-Chromosomes for testing (to be predicted). Must be available in all bigwig files.
-Input format is the same as above, e.g. "8 12 21"
-* --outfolder, -o
-Output path for predicted Hi-C matrices (in cooler format). Default: current path
-* --multiplier, -mul
-Multiplier for better visualization of results.
-Integer value greater equal 1, default: 1000.
-* --binsize, -b [required]
-Binsize for binning the proteins. Usually equal to binsize for training (but not mandatory)
-* --batchsize, -bs
-Batchsize for prediction. Same considerations as for training.py hold.
-* --windowsize, -ws [required]
-Windowsize for prediction. Must be the same as for training.
-Could in future be detected from trained model.
-For now, just enter the appropriate value (64, 128, 256).
+- --trainedModel | -trm
+- required
+- trained generator model to predict from, h5py format
+- generated by training.py above
+- --testChromPath | -tcp
+- required
+- Same as trainChromPaths, just for testing / prediction
+- number and base names of bigwig files in this path must be the same as for training
+- --testChroms | -tchroms
+- required
+- chromosomes for testing (to be predicted)
+- must be available in all bigwig files
+- input format: without "chr" and separated by spaces, e.g. "8 12 21"
+- --outfolder | -o
+- required
+- output path for predicted Hi-C matrices (in cooler format)
+- default: current path
+- --multiplier | -mul
+- optional
+- multiplier for better visualization of results
+- integer >= 1
+- default: 1000
+- --binsize | -b
+- required
+- bin size for binning the proteins
+- usually equal to binsize for training (but not mandatory)
+- integer >= 1000
+* --batchsize | -bs
+- optional
+- batch size for prediction
+- same considerations as for training.py hold
+- integer >= 1
+- default: 32
+- --windowsize | -ws
+- required
+- window size for prediction
+- choose from 64, 128, 256
+- must be the same as for training
+- could in future be detected from trained model
+- for now, just enter the appropriate value

 Returns:
 * Predicted matrix in cooler format, defined for the specified test chromosomes.
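
A matching sketch for the prediction options in this hunk: it assembles an argument list for `predict.py` and enforces the constraints the README states, including the rule that the window size must equal the one used for training. The helper `build_predict_argv`, its `trained_windowsize` parameter, and the model and folder names are assumptions made for illustration; the README does not say what the trained model file is called.

```python
import shlex

# Values listed for --windowsize in the README text above.
SUPPORTED_WINDOW_SIZES = (64, 128, 256)


def build_predict_argv(trained_model, test_chrom_path, test_chroms,
                       binsize, windowsize, trained_windowsize,
                       outfolder=".", multiplier=1000, batchsize=32):
    """Assemble an argument list for predict.py (hypothetical helper)."""
    if windowsize not in SUPPORTED_WINDOW_SIZES:
        raise ValueError(f"windowsize must be one of {SUPPORTED_WINDOW_SIZES}")
    if windowsize != trained_windowsize:
        # The README requires the same window size as used for training.
        raise ValueError("windowsize must be the same as for training")
    if binsize < 1000:
        raise ValueError("binsize must be an integer >= 1000")
    if multiplier < 1:
        raise ValueError("multiplier must be an integer >= 1")
    if batchsize < 1:
        raise ValueError("batchsize must be an integer >= 1")

    return [
        "python", "predict.py",
        "-trm", trained_model,
        "-tcp", test_chrom_path,
        # One space-separated string without "chr", e.g. "8 12 21".
        "-tchroms", " ".join(test_chroms),
        "-b", str(binsize),
        "-ws", str(windowsize),
        "-o", outfolder,
        "-mul", str(multiplier),
        "-bs", str(batchsize),
    ]


# Hypothetical model and folder names, for illustration only.
predict_argv = build_predict_argv(
    trained_model="trained_model/generator.h5",
    test_chrom_path="features_test/",
    test_chroms=["8", "12", "21"],
    binsize=5_000,
    windowsize=64,
    trained_windowsize=64,
    outfolder="predictions/",
)
print(shlex.join(predict_argv))
```

The new list marks `--outfolder` as required while also giving a default (current path); the sketch always passes `-o` so that either reading works.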
