RepeatModeler
=============
Arian Smit, Robert Hubley - Institute for Systems Biology
RepeatModeler is a de novo repeat family identification and modeling
package. At the heart of RepeatModeler are two de novo repeat-finding
programs ( RECON and RepeatScout ) which employ complementary computational
methods for identifying repeat element boundaries and family relationships
from sequence data. RepeatModeler assists in automating the runs of RECON
and RepeatScout given a genomic database and uses the output to build,
refine and classify consensus models of putative interspersed repeats.
Prerequisites
-------------
Perl
Available at http://www.perl.org/get.html. Developed and tested
with version 5.8.8.
RepeatMasker & Libraries
Developed and tested with open-4.0.7. The program is available at
http://www.repeatmasker.org/RMDownload.html and is distributed with
a minimal library set ( Dfam_consensus ). A larger set of libraries is
available from http://www.girinst.org.
RECON - De Novo Repeat Finder, Bao Z. and Eddy S.R.
Developed and tested with our patched version of RECON ( 1.08 ).
The 1.08 version fixes problems with running RECON on 64-bit machines and
supplies a workaround for a division-by-zero bug along with some buffer
overrun fixes. The program is available at:
http://www.repeatmasker.org/RECON-1.08.tar.gz.
The original version is available at http://eddylab.org/software/recon/.
RepeatScout - De Novo Repeat Finder, Price A.L., Jones N.C. and Pevzner P.A.
Developed and tested with our multiple sequence version of RepeatScout
( 1.0.5 ). This version is available at
http://www.repeatmasker.org/RepeatScout-1.0.5.tar.gz
TRF - Tandem Repeat Finder, G. Benson et al.
You can obtain a free copy at http://tandem.bu.edu/trf/trf.html.
RepeatModeler was developed using TRF 4.0.4.
And one or both of the following:
RMBlast - A modified version of NCBI Blast for use with RepeatMasker
and RepeatModeler. Precompiled binaries and source can be found at
http://www.repeatmasker.org/RMBlast.html
or
WUBlast/ABBlast - Sequence Search Engine, W. Gish et al.
Developed and tested with 2.0 [04-May-2006]. The latest versions
of ABBlast may be downloaded from: http://blast.advbiocomp.com/licensing/
Installation
------------
1. Uncompress and expand the distribution archive:
Typically: tar -zxvf RepeatModeler-open-#.#.#.tar.gz
or
gunzip RepeatModeler-open-#.#.#.tar.gz
tar -xvf RepeatModeler-open-#.#.#.tar
2. Configure for your site:
Automatic:
1. Run the "configure" script contained in the RepeatModeler
distribution as:
perl ./configure
By Hand:
1. Copy the file "RepModelConfig.pm.tmpl" to "RepModelConfig.pm"
2. Edit the configuration parameters with a text editor.
Example Run
-----------
In this example we first downloaded elephant sequences
from GenBank ( approx. 11MB ) into a file called elephant.fa.
1. Create a Database for RepeatModeler
RepeatModeler uses an ABBlast/WUBlast XDF database or an NCBI
database ( depending on the search engine used ) as input to the
repeat modeling pipeline. A utility is provided to assist
the user in creating a single database from several
types of input structures.
<RepeatModelerPath>/BuildDatabase -name elephant elephant.fa
Run "BuildDatabase" without any options in order to see the
full documentation on this utility. There are several options
which make it easier to import multiple sequence files into
one database.
NOTE: It is a good idea to place your data files on a local disk
and run this program suite from there rather than over NFS.
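
Before building the database it can help to confirm that the input
file holds roughly what you expect ( the elephant example above is
about 11MB of sequence ). The following is a minimal shell sketch and
is not part of the RepeatModeler distribution; the function name
fasta_stats is hypothetical, and it assumes a plain, uncompressed
FASTA file.

```shell
# fasta_stats FILE
# Print the number of sequences and the total number of residues in a
# FASTA file. Hypothetical helper, not shipped with RepeatModeler.
fasta_stats() {
    awk '
        # A header line starts a new sequence.
        /^>/ { nseq++; next }
        # Any other line is sequence data; ignore stray whitespace.
        { gsub(/[ \t\r]/, ""); nbases += length($0) }
        END { printf "%d sequences, %d bases\n", nseq, nbases }
    ' "$1"
}
```

Typical use would be "fasta_stats elephant.fa" just before running
BuildDatabase.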
2. Run RepeatModeler
RepeatModeler runs several compute-intensive programs on the
input sequence. For best results, run it on a machine with
a moderate amount of memory and several processors. Our typical
setup was a 4-CPU P4 at 2.4 GHz with 3 GB of memory, running Red Hat Linux.
nohup <RepeatModelerPath>/RepeatModeler -database elephant >& run.out &
The nohup is used on our machines when running long ( > 3-4 hour )
jobs. The output is saved to a file and the process is backgrounded.
For typical runtimes ( can be > 2 days with this configuration )
see the run statistics section of this file.
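
The ">&" redirection in the command above is csh/bash syntax. Under a
plain POSIX sh the same idea can be sketched as follows. The helper
name run_detached is hypothetical and is not something provided by
RepeatModeler.

```shell
# run_detached LOGFILE COMMAND [ARGS...]
# Start COMMAND in the background so that it survives logout, with
# stdout and stderr collected in LOGFILE. Hypothetical helper.
run_detached() {
    log=$1
    shift
    # "> file 2>&1" is the portable spelling of csh/bash ">& file".
    nohup "$@" > "$log" 2>&1 &
}
```

For example: run_detached run.out <RepeatModelerPath>/RepeatModeler -database elephant
Afterwards "$!" holds the process id of the background job.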
3. Interpret the results
This development version of RepeatModeler produces a large
amount of output. The raw output is directed to a working directory
named RM_<PID>.<DATE>, e.g. "RM_5098.MonMar141305172005", and remains
after each run for debugging purposes. On completion of the
run, two files are generated:
<database_name>-families.fa : Consensus sequences
<database_name>-families.stk : Seed alignments
The seed alignment file is in a Dfam-compatible Stockholm format and
may be uploaded to the new open Dfam_consensus database using
util/dfamConsensusTool.pl.
See http://www.repeatmasker.org/RepeatModeler/dfamConsensusTool for
details.
The FASTA file is useful for running quick custom library searches
using RepeatMasker. For example:
<RepeatMaskerPath>/RepeatMasker -lib <database_name>-families.fa mySequence.fa
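
For a quick overview of the consensus library before using it, the
FASTA headers can be tallied by classification. This sketch assumes
that each header has the form ">name#Class/Subclass ..." ( for example
">rnd-1_family-0#LINE/L1" ). That layout is an assumption about the
output, not something this README specifies, so check it against your
own <database_name>-families.fa. The function name is hypothetical.

```shell
# summarize_families FILE
# Count consensus sequences per classification label, where the label
# is taken to be the text after "#" in the first word of each header.
# Hypothetical helper; the header layout is an assumption.
summarize_families() {
    awk '
        /^>/ {
            n = index($1, "#")
            label = (n > 0) ? substr($1, n + 1) : "(no class)"
            count[label]++
        }
        END { for (c in count) printf "%s\t%d\n", c, count[c] }
    ' "$1" | sort
}
```

Typical use would be "summarize_families elephant-families.fa".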
Other files produced in the working directory include:
RM_<PID>.<DATE>/
   round-1/
      sampleDB-#.fa        : The genomic sample used in this round
      sampleDB-#.fa.lfreq  : The RepeatScout l-mer table
      sampleDB-#.fa.rscons : The RepeatScout generated consensi
      sampleDB-#.fa.rscons.filtered : The simple repeat/low
                                      complexity filtered
                                      version of *.rscons
      consensi.fa          : The final consensi db for this round
      family-#-cons.html   : A visualization of the model
                             refinement process. This can be opened
                             in web browsers that support zooming
                             ( such as Firefox ).
                             This is used to track down problems
                             with Refiner.pl
      index.html           : An HTML index to all the family-#-cons.html
                             files.
   round-2/
      sampleDB-#.fa        : The genomic sample used in this round
      msps.out             : The output of the sample all-vs-all
                             comparison
      summary/             : The RECON output directory
         eles              : The RECON family output
      consensi.fa          : Same as above
      family-#-cons.html   : Same as above
      index.html           : Same as above
   round-3/
      Same as round-2
   ..
   round-n/
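
Given the round-1 ... round-n layout above, the round a run has
reached can be read off the working directory. The following is a
minimal sketch; latest_round is a hypothetical helper name and it
relies only on the directory names shown above.

```shell
# latest_round WORKDIR
# Print the highest N for which WORKDIR/round-N/ exists, or nothing if
# there are no round directories. Hypothetical helper.
latest_round() {
    for d in "$1"/round-*/; do
        # If the pattern matched nothing it is left as-is; skip it.
        [ -d "$d" ] || continue
        d=${d%/}
        # Strip everything up to and including "round-".
        echo "${d##*round-}"
    done | sort -n | tail -n 1
}
```

For example: latest_round RM_5098.MonMar141305172005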
Recover from a failure
If for some reason RepeatModeler fails, you may restart the
analysis from the last round it was working on. The
-recoverDir [ResultDir] option lets you specify the
directory ( i.e. RM_<PID>.<DATE>/ ) in which a previous run of
RepeatModeler was working, and RepeatModeler will automatically
determine how to continue the analysis.
Please see the RELEASE-NOTES file for more details.
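
As an illustration, the sketch below picks the most recently modified
RM_* directory in the current directory and prints a recovery command
for it rather than running it. It assumes that -recoverDir is given
together with -database, which this README does not state explicitly;
check the RepeatModeler help output before relying on it. The function
name recover_command is hypothetical.

```shell
# recover_command DATABASE
# Print ( do not run ) a RepeatModeler command that resumes from the
# newest RM_* working directory. Hypothetical helper; combining
# -recoverDir with -database is an assumption.
recover_command() {
    db=$1
    # Working directories are named RM_<PID>.<DATE> and contain no
    # whitespace, so taking the first line of "ls -t" is safe here.
    workdir=$(ls -td RM_*/ 2>/dev/null | head -n 1)
    if [ -z "$workdir" ]; then
        echo "no RM_* working directory found" >&2
        return 1
    fi
    echo "RepeatModeler -database $db -recoverDir ${workdir%/}"
}
```

For example: recover_command elephant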
RepeatModeler Statistics
------------------------
Benchmarks and statistics for runs of RepeatModeler on several sample
genomes.
RepeatModeler 1.0.3 ( RECON + RepeatScout ):
            Genome DB    Sample***   Run Time*   Models   Models       % Sample
Genome      Size (bp)    Size (bp)   (hh:mm)     Built    Classified   Masked**
----------  -----------  ----------  ----------  -------  -----------  --------
Marmoset    3.0 Bbp      228 Mbp     55:37       582      564          36.4
Zebrafinch  1.2 Bbp      222 Mbp     66:29       178      75           8.6
* Analysis run on a 4 processor P4, 2.4Ghz, 3GB RAM, machine
running Red Hat Linux.
** Includes simple repeats and low complexity DNA. Results
obtained with RepeatMasker open-3.2.5, WUBlast and
the -lib option.
*** Sample size does not include 40 Mbp used in the RepeatScout analysis.
This 40 Mbp is randomly chosen and may overlap 0-100% of the
sample used in the RECON analysis.
RepeatModeler 1.0.2 ( RECON + RepeatScout ):
            Genome DB    Sample***   Run Time*   Models   Models       % Sample
Genome      Size (bp)    Size (bp)   (hh:mm)     Built    Classified   Masked**
----------  -----------  ----------  ----------  -------  -----------  --------
Human HG18  3.1 Bbp      238 Mbp     46:36       614      611          35.66
Platypus    1.9 Bbp      220 Mbp     76:02       540      457
Zebrafinch  1.3 Bbp      220 Mbp     63:57       233      104          9.41
Sea Urchin  867 Mbp      220 Mbp     40:03       1830     360          33.85
diatom      32,930,227   32,930,227  4:41        128      35           2.86
Rabbit      11,770,949   11,770,949  3:14        83       72           31.30
* Analysis run on a 4 processor P4, 2.4Ghz, 3GB RAM, machine
running Red Hat Linux.
** Includes simple repeats and low complexity DNA. Results
obtained with RepeatMasker open-3.1.9, WUBlast and
the -lib option.
*** Sample size does not include 40 Mbp used in the RepeatScout analysis.
This 40 Mbp is randomly chosen and may overlap 0-100% of the
sample used in the RECON analysis.
RepeatModeler 1.0.0 ( RECON ):
            Genome DB    Sample      Run Time*   Models   Models       % Sample
Genome      Size (bp)    Size (bp)   (hh:mm)     Built    Classified   Masked**
----------  -----------  ----------  ----------  -------  -----------  --------
Opossum     3.5 Billion  319 Mbp     140:52      1137     467          52.55
Armadillo   47,332,136   47,332,136  6:20        121      92           36.07
Platypus    14,768,992   14,768,992  3:46        18       13           40.69
Rabbit      11,770,949   11,770,949  2:17        20       16           28.67
Elephant    11,550,090   11,550,090  1:21        34       28           37.08
* Analysis run on a 4 processor P4, 2.4Ghz, 3GB RAM, machine
running Red Hat Linux.
** Includes simple repeats and low complexity DNA. Results
obtained with RepeatMasker open-3.0.9, WUBlast and
the -lib option.
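
The "Models Built" and "Models Classified" columns in the tables above
can be combined into a classification rate. The following is a small
worked example; pct_classified is a hypothetical helper name.

```shell
# pct_classified BUILT CLASSIFIED
# Print the percentage of built models that were classified, to one
# decimal place. Hypothetical helper.
pct_classified() {
    awk -v built="$1" -v classified="$2" 'BEGIN {
        # Guard against a run that built no models.
        if (built <= 0) exit 1
        printf "%.1f\n", 100 * classified / built
    }'
}
```

For the RepeatModeler 1.0.3 runs, "pct_classified 582 564" prints 96.9
( Marmoset ) and "pct_classified 178 75" prints 42.1 ( Zebrafinch ).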
Caveats
-------
o Genomes with numerous short contigs ( Diatom for example )
will take longer to BLAST than larger genomes with
larger contigs. This is an optimization problem left for
future releases.
o This program is not parallelized. It can only run on
one node. This is something we are considering for future
releases.
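
To judge whether the first caveat applies to an assembly, one can
count how many of its sequences fall below a chosen length. This is
only a sketch; count_short_contigs is a hypothetical helper name, and
the cutoff is left to the reader because this README does not define
what counts as a short contig.

```shell
# count_short_contigs FILE MINLEN
# Report how many sequences in a FASTA file are shorter than MINLEN.
# Hypothetical helper; MINLEN is the caller's choice.
count_short_contigs() {
    awk -v min="$2" '
        # Close out the sequence read so far, if any.
        function tally() { if (seen && len < min) nshort++; len = 0 }
        /^>/ { tally(); seen = 1; total++; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            tally()
            printf "%d of %d sequences shorter than %d bp\n", nshort, total, min
        }
    ' "$1"
}
```

For example: count_short_contigs diatom.fa 1000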
Credits
-------
Arnie Kas for the work done on the original MultAln.pm.
Andy Siegel for statistics consultations.
Thanks so much to Warren Gish for his invaluable assistance
and consultation on his WUBlast program suite.
Alkes Price and Pavel Pevzner for assistance with RepeatScout
and hosting my multi-sequence version of RepeatScout.
This work was supported by the NIH ( R44 HG02244-02),
and the Institute for Systems Biology.
License
-------
This work is licensed under the Open Source License v2.1.
To view a copy of this license, visit
http://www.opensource.org/licenses/osl-2.1.php or
see the LICENSE file contained in this distribution.