@@ -5,222 +5,47 @@ Configuration
*************

The GA4GH reference server has two basic elements to its configuration:
- the `Data repository`_ and the `Configuration file`_. The repository is most easily configured via the `Repository manager`_ command line tool.
+ the `Data repository`_ and the `Configuration file`_.

---------------
Data repository
---------------

- Data is input to the GA4GH server as a directory hierarchy, in which
- the structure of data to be served is represented by the file system.
- At the top level of the data hierarchy there are two required
- directories to hold the top level container types: ``referenceSets`` and
- ``datasets``.
-
- .. todo:: We need to link to the high-level API documentation for descriptions
-    of what the various objects here mean.
-
- +++++++++++++
- ReferenceSets
- +++++++++++++
-
- Within the data directory there must be a directory called ``referenceSets``.
- Within this directory, each directory is interpreted as containing a
- ``ReferenceSet`` with the directory name mapped to the name of the
- reference set. Here is an example of how reference data should be arranged::
-
-     references/
-         GRCh37.json
-         GRCh37/
-             1.fa.gz
-             1.fa.gz.fai
-             1.json
-             2.fa.gz
-             2.fa.gz.fai
-             2.json
-             # More references
-         GRCh38.json
-         GRCh38/
-             1.fa.gz
-             1.fa.gz.fai
-             1.json
-             2.fa.gz
-             2.fa.gz.fai
-             2.json
-             # More references
-
- In this example we have two reference sets, with names ``GRCh37`` and ``GRCh38``.
- Each reference set directory must be accompanied by a file
- in JSON format, which lists the metadata for a given reference. For example,
- the ``GRCh37.json`` file above might look something like
-
- .. code-block:: json
-
-     {
-         "description": "GRCh37 primary assembly",
-         "sourceUri": "TODO",
-         "assemblyId": "TODO",
-         "sourceAccessions": [],
-         "isDerived": false,
-         "ncbiTaxonId": 9606
-     }
-
- Within a reference set directory is a set of files defining the references
- themselves. Each reference object corresponds to three files: the bgzip
- compressed FASTA sequences, the FAI index and a JSON file providing the
- metadata. There must be exactly one sequence per FASTA file, and the
- sequence ID in the FASTA file must be equal to the reference name
- (i.e., the first line in ``1.fa`` above should start with ``>1``.)
-
- The JSON metadata required for a reference is similar to a reference set.
- An example might look something like:
-
- .. code-block:: json
-
-     {
-         "sourceUri": "TODO",
-         "sourceAccessions": [
-             "CM000663.2"
-         ],
-         "sourceDivergence": null,
-         "md5checksum": "bb07c91cda4645ad8e75e375e3d6e5eb",
-         "isDerived": false,
-         "ncbiTaxonId": 9606
-     }
-
-
- ++++++++++
- Datasets
- ++++++++++
-
- The main container for genetic data is the dataset. Within the
- main data directory there must be a directory called ``datasets``.
- Within this directory each subdirectory is interpreted as a
- dataset of that name. For example, we might have something like::
-
-     datasets/
-         1kg-phase1
-             variants/
-                 # Variant data
-             reads/
-                 # Read data
-         1kg-phase3
-             variants/
-                 # Variant data
-             reads/
-                 # Read data
-
- In this case we specify two datasets with name equal to ``1kg-phase1`` and
- ``1kg-phase3``. These directories contain the read and variant data
- within the ``variants`` and ``reads`` directory, respectively.
-
- ++++++++
- Variants
- ++++++++
-
- Each dataset can contain a number of VariantSets, each of which basically
- corresponds to a VCF file. Because VCF files are commonly split by chromosome
- a VariantSet can consist of many VCF files that have consistent metadata.
- Within the ``variants`` directory, each directory is interpreted as a
- variant set with that name. A variant set directory then contains
- one or more indexed VCF/BCF files.
-
- +++++
- Reads
- +++++
-
- A dataset can contain many ReadGroupSets, and each ReadGroupSet contains
- a number of ReadGroups. The ``reads`` directory contains a number of BAM
- files, each of which corresponds to a single ReadGroupSet. ReadGroups are
- then mapped to the ReadGroups that we find within the BAM file.
+ The repository in the GA4GH reference server defines how your data is organised. The
+ repository itself is a SQLite database, which contains information about your
+ datasets, reference sets and so on. Bulk data (such as variants and reads)
+ is not stored in the database, but instead accessed directly from the primary
+ data files at run time. The location of these data files is entirely up
+ to the administrator.

- +++++++
- Example
- +++++++
-
- An example layout might look like::
-
-     ga4gh-data/
-         referencesSet/
-             referenceSet1.json
-             referenceSet1/
-                 1.fa.gz
-                 1.fa.gz.fai
-                 1.json
-                 2.fa.gz
-                 2.fa.gz.fai
-                 2.json
-                 # More references
-         datasets/
-             dataset1/
-                 /variants/
-                     variantSet1/
-                         chr1.vcf.gz
-                         chr1.vcf.gz.tbi
-                         chr2.vcf.gz
-                         chr2.vcf.gz.tbi
-                         # More VCFs
-                     variantSet2/
-                         chr1.bcf
-                         chr1.bcf.csi
-                         chr2.bcf
-                         chr2.bcf.csi
-                         # More BCFs
-                 /reads/
-                     sample1.bam
-                     sample1.bam.bai
-                     sample2.bam
-                     sample2.bam.bai
-                     # More BAMS
-
- .. note:: Any change to the data repository (using the repository manager or
-    otherwise) requires a restart of the server to be picked up by the
-    server. The server does not detect changes in the data repository
-    while running.
-
- ------------------
- Repository manager
- ------------------
-
- The repository manager is a tool provided to abstract away the details of
- building a data repository behind a convenient command line interface. It can
- be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if developing).
- Following are descriptions of the commands that the repo manager exposes.
-
- All of the ``add-*`` commands take a ``--moveMode`` flag which specifies how
- to transfer the given file (or directory) into the data repository. The
- options are ``move`` (moves the file from its original path to the new
- path), ``copy`` (copies the contents of the file into the data repository) and
- ``link`` (creates a symlink in the data repository to the file). The
- default is ``link``.
-
- Many of the ``add-*`` commands take additional flags to specify fields to be
- entered into the ``.json`` files that are created for the given file.
- Utilize the command line help for a particular command to get a list of
- these flags.
+ The repository manager provides an administration interface to the data
+ repository. It can be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if
+ developing). Following are descriptions of the commands that the repo manager
+ exposes.

+++++++
init
+++++++

Initializes a data repository at the path provided. All of the other
- commands require a data repository path as an argument, so this will likely be
+ commands require a data repository file as an argument, so this will likely be
the first command you run.

.. code-block:: bash

-     $ ga4gh_repo init path/to/datarepo
+     $ ga4gh_repo init path/to/repo.db
+

+++++++
- check
+ verify
+++++++

Performs some consistency checks on the given data repository to ensure it is
well-formed.

.. code-block:: bash

-     $ ga4gh_repo check path/to/datarepo
+     $ ga4gh_repo verify path/to/repo.db

+++++++
list
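Because the new repository is a single SQLite file, it can also be inspected with standard SQLite tooling, independently of ``ga4gh_repo list``. The following is a minimal sketch, assuming only that the file is a valid SQLite database; ``list_repo_tables`` is a hypothetical helper, and no particular schema or table names are assumed.

```python
import sqlite3


def list_repo_tables(repo_path):
    """Return the names of the tables found in a SQLite repository file.

    Hypothetical helper: it makes no assumption about the server's schema
    and only reports whatever tables the file actually contains.
    """
    # Open the file read-only through a URI so that inspection can never
    # modify (or accidentally create) the repository.
    connection = sqlite3.connect(f"file:{repo_path}?mode=ro", uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Opening the file read-only is a deliberate choice here: the repository manager remains the only tool that writes to the repository.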
@@ -230,17 +55,7 @@ Lists the contents of the given data repository.

.. code-block:: bash

-     $ ga4gh_repo list path/to/datarepo
-
- +++++++
- destroy
- +++++++
-
- Destroys the given data repository by deleting its directory tree.
-
- .. code-block:: bash
-
-     $ ga4gh_repo destroy path/to/datarepo
+     $ ga4gh_repo list path/to/repo.db

+++++++++++
add-dataset
@@ -250,7 +65,7 @@ Creates a dataset in the given repository with a given name.

.. code-block:: bash

-     $ ga4gh_repo add-dataset path/to/datarepo aDataset
+     $ ga4gh_repo add-dataset path/to/repo.db aDataset

+++++++++++++++
remove-dataset
@@ -260,18 +75,17 @@ Destroys a dataset in the given repository with a given name.

.. code-block:: bash

-     $ ga4gh_repo remove-dataset path/to/datarepo aDataset
+     $ ga4gh_repo remove-dataset path/to/repo.db aDataset

++++++++++++++++
add-referenceset
++++++++++++++++

- Adds a given reference set file to a given data repository. The file must
- have the extension ``.fa.gz``.
+ Adds a given reference set file to a given data repository.

.. code-block:: bash

-     $ ga4gh_repo add-referenceset path/to/datarepo path/to/aReferenceSet.fa.gz
+     $ ga4gh_repo add-referenceset path/to/repo.db path/to/aReferenceSet.fa.gz

++++++++++++++++++++
remove-referenceset
@@ -281,30 +95,30 @@ Removes a given reference set from a given data repository.

.. code-block:: bash

-     $ ga4gh_repo remove-referenceset path/to/datarepo aReferenceSet
+     $ ga4gh_repo remove-referenceset path/to/repo.db aReferenceSet

++++++++++++++++
- add-ontologymap
+ add-ontology
++++++++++++++++

- Adds an Ontology Map, which maps identifiers to ontology terms, to
+ Adds an Ontology Map, which maps identifiers to ontology terms, to
the repository. Ontology maps are tab delimited files with an
identifier/term pair per row.


.. code-block:: bash

-     $ ga4gh_repo add-ontologymap path/to/datarepo path/to/aOntoMap.txt
+     $ ga4gh_repo add-ontology path/to/repo.db path/to/aOntoMap.txt

++++++++++++++++++++
- remove-ontologymap
+ remove-ontology
++++++++++++++++++++

Removes a given Ontology Map from a given data repository.

.. code-block:: bash

-     $ ga4gh_repo remove-ontologymap path/to/datarepo aOntoMap
+     $ ga4gh_repo remove-ontology path/to/repo.db aOntoMap


+++++++++++++++++
@@ -316,7 +130,7 @@ file must have the extension ``.bam``.

.. code-block:: bash

-     $ ga4gh_repo add-readgroupset path/to/datarepo aDataset path/to/aReadGroupSet.bam
+     $ ga4gh_repo add-readgroupset path/to/repo.db aDataset path/to/aReadGroupSet.bam

++++++++++++++++++++
remove-readgroupset
@@ -326,18 +140,19 @@ Removes a read group set from a given data repository and dataset.

.. code-block:: bash

-     $ ga4gh_repo remove-readgroupset path/to/datarepo aDataset aReadGroupSet
+     $ ga4gh_repo remove-readgroupset path/to/repo.db aDataset aReadGroupSet

+++++++++++++++
add-variantset
+++++++++++++++

Adds a variant set directory to a given data repository and dataset. The
- directory should contain file(s) with extension ``.vcf.gz``. If a variant set is annotated it will be added as both a variant set and a variant annotation set.
+ directory should contain file(s) with extension ``.vcf.gz``. If a variant set
+ is annotated it will be added as both a variant set and a variant annotation set.

.. code-block:: bash

-     $ ga4gh_repo add-variantset path/to/datarepo aDataset path/to/aVariantSet
+     $ ga4gh_repo add-variantset path/to/repo.db aDataset path/to/aVariantSet

+++++++++++++++++
remove-variantset
@@ -347,7 +162,7 @@ Removes a variant set from a given data repository and dataset.

.. code-block:: bash

-     $ ga4gh_repo remove-variantset path/to/datarepo aDataset aVariantSet
+     $ ga4gh_repo remove-variantset path/to/repo.db aDataset aVariantSet

------------------
Configuration file
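The repository manager commands documented in the hunks above can be chained to build a repository from scratch. The sketch below only assembles the command lines as argument lists; it does not run them. ``build_repo_commands`` is a hypothetical helper, the subcommand names and argument order are taken from the examples in this diff, and the ordering of the steps (reference set before dataset, dataset before variant set, ``verify`` last) is an assumption rather than a documented requirement.

```python
import shlex


def build_repo_commands(repo_path, dataset, reference_fasta, variant_dir):
    """Return ga4gh_repo invocations, in order, as argument lists.

    Hypothetical helper: the step order is an assumption; only the
    subcommands and their arguments come from the documentation.
    """
    return [
        ["ga4gh_repo", "init", repo_path],
        ["ga4gh_repo", "add-referenceset", repo_path, reference_fasta],
        ["ga4gh_repo", "add-dataset", repo_path, dataset],
        ["ga4gh_repo", "add-variantset", repo_path, dataset, variant_dir],
        ["ga4gh_repo", "verify", repo_path],
    ]


if __name__ == "__main__":
    # Print the commands using the placeholder paths from the docs.
    commands = build_repo_commands(
        "path/to/repo.db",
        "aDataset",
        "path/to/aReferenceSet.fa.gz",
        "path/to/aVariantSet",
    )
    for argv in commands:
        print(shlex.join(argv))
```

Each list can be passed to ``subprocess.run(argv, check=True)`` to execute the step and stop at the first failure.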
@@ -357,12 +172,12 @@ The GA4GH reference server is a `Flask application <http://flask.pocoo.org/>`_
and uses the standard `Flask configuration file mechanisms
<http://flask.pocoo.org/docs/0.10/config/>`_.
Many configuration files will be very simple, and will consist of just
- one directive instructing the server where to look for data; for
+ one directive instructing the server where to find the data repository; for
example, we might have

.. code-block:: python

-     DATA_SOURCE = "/path/to/data/root"
+     DATA_SOURCE = "/path/to/repo.db"

For production deployments, we shouldn't need to add any more configuration
than this, as the other keys have sensible defaults. However,
@@ -413,7 +228,7 @@ RESPONSE_VALIDATION
purposes.

LANDING_MESSAGE_HTML
- The server provides a simple landing page at its root. By setting this
+ The server provides a simple landing page at its root. By setting this
value to point at a file containing an HTML block element it is possible to
customize the landing page. This can be helpful to provide support links
or details about the hosted datasets.
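Since ``DATA_SOURCE`` now names a single repository file rather than a directory, a quick pre-flight check of the configuration file can catch a stale path before the server starts. This is a minimal sketch, not the server's own loading code: ``load_config`` and ``check_data_source`` are hypothetical helpers, and keeping only upper-case names approximates how Flask treats Python configuration files.

```python
import os
import runpy


def load_config(config_path):
    """Execute a Python config file and keep only its upper-case names.

    Hypothetical helper that approximates Flask's handling of config
    files; it is not the server's actual loader.
    """
    namespace = runpy.run_path(config_path)
    return {key: value for key, value in namespace.items() if key.isupper()}


def check_data_source(config):
    """Return the DATA_SOURCE path, raising if it is unset or not a file."""
    data_source = config.get("DATA_SOURCE")
    if data_source is None:
        raise KeyError("DATA_SOURCE is not set in the configuration file")
    if not os.path.isfile(data_source):
        raise FileNotFoundError(f"repository file not found: {data_source}")
    return data_source
```

Checking with ``os.path.isfile`` reflects the change in this diff: a directory at that path, as used by the old layout, would be rejected.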