This repository was archived by the owner on Jan 24, 2018. It is now read-only.

Commit fe1d2f1

Merge pull request #1166 from ga4gh/sql_repo
Sql repo
2 parents 2f32ab0 + b2e6334 commit fe1d2f1


89 files changed, with 3987 additions and 3159 deletions.

.travis.yml (+1 -1)

@@ -32,7 +32,7 @@ before_script:
 script:
 - flake8 *.py tests ga4gh scripts --exclude=ez_setup.py
 - nosetests --with-coverage --cover-package ga4gh
-    --cover-inclusive --cover-min-percentage 85
+    --cover-inclusive --cover-min-percentage 80
     --cover-branches --cover-erase
 - make clean -C docs
 - make -C docs

docs/configuration.rst (+36 -221)

@@ -5,222 +5,47 @@ Configuration
 *************

 The GA4GH reference server has two basic elements to its configuration:
-the `Data repository`_ and the `Configuration file`_. The repository is most easily configured via the `Repository manager`_ command line tool.
+the `Data repository`_ and the `Configuration file`_.

 ---------------
 Data repository
 ---------------

-Data is input to the GA4GH server as a directory hierarchy, in which
-the structure of data to be served is represented by the file system.
-At the top level of the data hierarchy there are two required
-directories to hold the top level container types: ``referenceSets`` and
-``datasets``.
-
-.. todo:: We need to link to the high-level API documentation for descriptions
-   of what the various objects here mean.
-
-+++++++++++++
-ReferenceSets
-+++++++++++++
-
-Within the data directory there must be a directory called ``referenceSets``.
-Within this directory, each directory is interpreted as containing a
-``ReferenceSet`` with the directory name mapped to the name of the
-reference set. Here is an example of how reference data should be arranged::
-
-    references/
-        GRCh37.json
-        GRCh37/
-            1.fa.gz
-            1.fa.gz.fai
-            1.json
-            2.fa.gz
-            2.fa.gz.fai
-            2.json
-            # More references
-        GRCh38.json
-        GRCh38/
-            1.fa.gz
-            1.fa.gz.fai
-            1.json
-            2.fa.gz
-            2.fa.gz.fai
-            2.json
-            # More references
-
-In this example we have two reference sets, with names ``GRCh37`` and ``GRCh38``.
-Each reference set directory must be accompanied by a file
-in JSON format, which lists the metadata for a given reference. For example,
-the ``GRCh37.json`` file above might look something like
-
-.. code-block:: json
-
-    {
-        "description": "GRCh37 primary assembly",
-        "sourceUri": "TODO",
-        "assemblyId": "TODO",
-        "sourceAccessions": [],
-        "isDerived": false,
-        "ncbiTaxonId": 9606
-    }
-
-Within a reference set directory is a set of files defining the references
-themselves. Each reference object corresponds to three files: the bgzip
-compressed FASTA sequences, the FAI index and a JSON file providing the
-metadata. There must be exactly one sequence per FASTA file, and the
-sequence ID in the FASTA file must be equal to the reference name
-(i.e., the first line in ``1.fa`` above should start with ``>1``.)
-
-The JSON metadata required for a reference is similar to a reference set.
-An example might look something like:
-
-.. code-block:: json
-
-    {
-        "sourceUri": "TODO",
-        "sourceAccessions": [
-            "CM000663.2"
-        ],
-        "sourceDivergence": null,
-        "md5checksum": "bb07c91cda4645ad8e75e375e3d6e5eb",
-        "isDerived": false,
-        "ncbiTaxonId": 9606
-    }
-
-
-++++++++++
-Datasets
-++++++++++
-
-The main container for genetic data is the dataset. Within the
-main data directory there must be a directory called ``datasets``.
-Within this directory each subdirectory is interpreted as a
-dataset of that name. For example, we might have something like::
-
-    datasets/
-        1kg-phase1
-            variants/
-                # Variant data
-            reads/
-                # Read data
-        1kg-phase3
-            variants/
-                # Variant data
-            reads/
-                # Read data
-
-In this case we specify two datasets with name equal to ``1kg-phase1`` and
-``1kg-phase3``. These directories contain the read and variant data
-within the ``variants`` and ``reads`` directory, respectively.
-
-++++++++
-Variants
-++++++++
-
-Each dataset can contain a number of VariantSets, each of which basically
-corresponds to a VCF file. Because VCF files are commonly split by chromosome
-a VariantSet can consist of many VCF files that have consistent metadata.
-Within the ``variants`` directory, each directory is interpreted as a
-variant set with that name. A variant set directory then contains
-one or more indexed VCF/BCF files.
-
-+++++
-Reads
-+++++
-
-A dataset can contain many ReadGroupSets, and each ReadGroupSet contains
-a number of ReadGroups. The ``reads`` directory contains a number of BAM
-files, each of which corresponds to a single ReadGroupSet. ReadGroups are
-then mapped to the ReadGroups that we find within the BAM file.
+The repository in the GA4GH reference server defines how your data is organised. The
+repository itself is a SQLite database, which contains information about your
+datasets, reference sets and so on. Bulk data (such as variants and reads)
+is not stored in the database, but instead accessed directly from the primary
+data files at run time. The location of these data files is entirely up
+to the administrator.
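
Annotation on the added paragraph above: since the repository is described as a single SQLite file, it can be inspected with Python's standard ``sqlite3`` module. The following is a minimal sketch, assuming only that the file is a valid SQLite database. The helper name ``list_tables`` is hypothetical, and no table names from the server's actual schema are assumed.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in the SQLite file at db_path."""
    # Caution: sqlite3.connect creates an empty database file when
    # db_path does not exist yet, so check the path beforehand.
    connection = sqlite3.connect(db_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    # Each row is a one-element tuple holding the table name.
    return [name for (name,) in rows]
```

Calling ``list_tables("path/to/repo.db")`` on a repository file would return whichever tables the server's schema defines. The bulk variant and read data would not appear there, since the paragraph above says they stay in the primary data files.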

-+++++++
-Example
-+++++++
-
-An example layout might look like::
-
-    ga4gh-data/
-        referencesSet/
-            referenceSet1.json
-            referenceSet1/
-                1.fa.gz
-                1.fa.gz.fai
-                1.json
-                2.fa.gz
-                2.fa.gz.fai
-                2.json
-                # More references
-        datasets/
-            dataset1/
-                /variants/
-                    variantSet1/
-                        chr1.vcf.gz
-                        chr1.vcf.gz.tbi
-                        chr2.vcf.gz
-                        chr2.vcf.gz.tbi
-                        # More VCFs
-                    variantSet2/
-                        chr1.bcf
-                        chr1.bcf.csi
-                        chr2.bcf
-                        chr2.bcf.csi
-                        # More BCFs
-                /reads/
-                    sample1.bam
-                    sample1.bam.bai
-                    sample2.bam
-                    sample2.bam.bai
-                    # More BAMS
-
-.. note:: Any change to the data repository (using the repository manager or
-   otherwise) requires a restart of the server to be picked up by the
-   server. The server does not detect changes in the data repository
-   while running.
-
-------------------
-Repository manager
-------------------
-
-The repository manager is a tool provided to abstract away the details of
-building a data repository behind a convenient command line interface. It can
-be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if developing).
-Following are descriptions of the commands that the repo manager exposes.
-
-All of the ``add-*`` commands take a ``--moveMode`` flag which specifies how
-to transfer the given file (or directory) into the data repository. The
-options are ``move`` (moves the file from its original path to the new
-path), ``copy`` (copies the contents of the file into the data repository) and
-``link`` (creates a symlink in the data repository to the file). The
-default is ``link``.
-
-Many of the ``add-*`` commands take additional flags to specify fields to be
-entered into the ``.json`` files that are created for the given file.
-Utilize the command line help for a particular command to get a list of
-these flags.
+The repository manager provides an administration interface to the data
+repository. It can be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if
+developing). Following are descriptions of the commands that the repo manager
+exposes.

 +++++++
 init
 +++++++

 Initializes a data repository at the path provided. All of the other
-commands require a data repository path as an argument, so this will likely be
+commands require a data repository file as an argument, so this will likely be
 the first command you run.

 .. code-block:: bash

-    $ ga4gh_repo init path/to/datarepo
+    $ ga4gh_repo init path/to/repo.db
+

 +++++++
-check
+verify
 +++++++

 Performs some consistency checks on the given data repository to ensure it is
 well-formed.

 .. code-block:: bash

-    $ ga4gh_repo check path/to/datarepo
+    $ ga4gh_repo verify path/to/repo.db

 +++++++
 list
@@ -230,17 +55,7 @@ Lists the contents of the given data repository.

 .. code-block:: bash

-    $ ga4gh_repo list path/to/datarepo
-
-+++++++
-destroy
-+++++++
-
-Destroys the given data repository by deleting its directory tree.
-
-.. code-block:: bash
-
-    $ ga4gh_repo destroy path/to/datarepo
+    $ ga4gh_repo list path/to/repo.db

 +++++++++++
 add-dataset
@@ -250,7 +65,7 @@ Creates a dataset in the given repository with a given name.

 .. code-block:: bash

-    $ ga4gh_repo add-dataset path/to/datarepo aDataset
+    $ ga4gh_repo add-dataset path/to/repo.db aDataset

 +++++++++++++++
 remove-dataset
@@ -260,18 +75,17 @@ Destroys a dataset in the given repository with a given name.

 .. code-block:: bash

-    $ ga4gh_repo remove-dataset path/to/datarepo aDataset
+    $ ga4gh_repo remove-dataset path/to/repo.db aDataset

 ++++++++++++++++
 add-referenceset
 ++++++++++++++++

-Adds a given reference set file to a given data repository. The file must
-have the extension ``.fa.gz``.
+Adds a given reference set file to a given data repository.

 .. code-block:: bash

-    $ ga4gh_repo add-referenceset path/to/datarepo path/to/aReferenceSet.fa.gz
+    $ ga4gh_repo add-referenceset path/to/repo.db path/to/aReferenceSet.fa.gz

 ++++++++++++++++++++
 remove-referenceset
@@ -281,30 +95,30 @@ Removes a given reference set from a given data repository.

 .. code-block:: bash

-    $ ga4gh_repo remove-referenceset path/to/datarepo aReferenceSet
+    $ ga4gh_repo remove-referenceset path/to/repo.db aReferenceSet

 ++++++++++++++++
-add-ontologymap
+add-ontology
 ++++++++++++++++

-Adds an Ontology Map, which maps identifiers to ontology terms, to
+Adds an Ontology Map, which maps identifiers to ontology terms, to
 the repository. Ontology maps are tab delimited files with an
 identifier/term pair per row.


 .. code-block:: bash

-    $ ga4gh_repo add-ontologymap path/to/datarepo path/to/aOntoMap.txt
+    $ ga4gh_repo add-ontology path/to/repo.db path/to/aOntoMap.txt
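
Annotation on the ontology map format described above: a tab-delimited file with one identifier/term pair per row can be written and read back with Python's ``csv`` module. This is a sketch under assumptions. The column order (identifier first, term second) is inferred from the wording "identifier/term pair", and the function names are hypothetical rather than part of the server.

```python
import csv


def write_ontology_map(path, pairs):
    """Write (identifier, term) pairs, one tab-delimited row per pair."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerows(pairs)


def read_ontology_map(path):
    """Read a tab-delimited map file into an identifier -> term dict."""
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                # Skip empty lines rather than treating them as errors.
                continue
            if len(row) != 2:
                raise ValueError("expected 2 columns, got %r" % (row,))
            identifier, term = row
            mapping[identifier] = term
    return mapping
```

A file produced this way could then be passed as the last argument of the ``add-ontology`` command shown above. Whether the server expects exactly this column order should be checked against its own parser.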

 ++++++++++++++++++++
-remove-ontologymap
+remove-ontology
 ++++++++++++++++++++

 Removes a given Ontology Map from a given data repository.

 .. code-block:: bash

-    $ ga4gh_repo remove-ontologymap path/to/datarepo aOntoMap
+    $ ga4gh_repo remove-ontology path/to/repo.db aOntoMap


 +++++++++++++++++
@@ -316,7 +130,7 @@ file must have the extension ``.bam``.

 .. code-block:: bash

-    $ ga4gh_repo add-readgroupset path/to/datarepo aDataset path/to/aReadGroupSet.bam
+    $ ga4gh_repo add-readgroupset path/to/repo.db aDataset path/to/aReadGroupSet.bam

 ++++++++++++++++++++
 remove-readgroupset
@@ -326,18 +140,19 @@ Removes a read group set from a given data repository and dataset.

 .. code-block:: bash

-    $ ga4gh_repo remove-readgroupset path/to/datarepo aDataset aReadGroupSet
+    $ ga4gh_repo remove-readgroupset path/to/repo.db aDataset aReadGroupSet

 +++++++++++++++
 add-variantset
 +++++++++++++++

 Adds a variant set directory to a given data repository and dataset. The
-directory should contain file(s) with extension ``.vcf.gz``. If a variant set is annotated it will be added as both a variant set and a variant annotation set.
+directory should contain file(s) with extension ``.vcf.gz``. If a variant set
+is annotated it will be added as both a variant set and a variant annotation set.

 .. code-block:: bash

-    $ ga4gh_repo add-variantset path/to/datarepo aDataset path/to/aVariantSet
+    $ ga4gh_repo add-variantset path/to/repo.db aDataset path/to/aVariantSet

 +++++++++++++++++
 remove-variantset
@@ -347,7 +162,7 @@ Removes a variant set from a given data repository and dataset.

 .. code-block:: bash

-    $ ga4gh_repo remove-variantset path/to/datarepo aDataset aVariantSet
+    $ ga4gh_repo remove-variantset path/to/repo.db aDataset aVariantSet
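
Annotation on the command reference above: the individual ``ga4gh_repo`` subcommands can be strung together to populate a new repository file. The sketch below only builds the argument lists and does not execute them. The subcommand names and argument order are taken from the added lines of this diff. The ordering of the steps, the helper names and the example paths are assumptions.

```python
import shlex


def build_repo_commands(repo_path, dataset, reference_fasta, bam_path, vcf_dir):
    """Return ga4gh_repo invocations, in order, as argument lists."""
    return [
        ["ga4gh_repo", "init", repo_path],
        # Assumed ordering: reference data first, then the dataset contents.
        ["ga4gh_repo", "add-referenceset", repo_path, reference_fasta],
        ["ga4gh_repo", "add-dataset", repo_path, dataset],
        ["ga4gh_repo", "add-readgroupset", repo_path, dataset, bam_path],
        ["ga4gh_repo", "add-variantset", repo_path, dataset, vcf_dir],
        # Finish with the consistency check and a listing of the contents.
        ["ga4gh_repo", "verify", repo_path],
        ["ga4gh_repo", "list", repo_path],
    ]


def render(commands):
    """Render argument lists as shell-quoted command lines, one per line."""
    return "\n".join(
        " ".join(shlex.quote(arg) for arg in command) for command in commands
    )
```

On a machine where ``ga4gh_repo`` is installed, each argument list could be handed to ``subprocess.run(command, check=True)``; printing ``render(...)`` first gives a dry run that can be reviewed before anything is modified.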

 ------------------
 Configuration file
@@ -357,12 +172,12 @@ The GA4GH reference server is a `Flask application <http://flask.pocoo.org/>`_
 and uses the standard `Flask configuration file mechanisms
 <http://flask.pocoo.org/docs/0.10/config/>`_.
 Many configuration files will be very simple, and will consist of just
-one directive instructing the server where to look for data; for
+one directive instructing the server where to find the data repository;
 example, we might have

 .. code-block:: python

-    DATA_SOURCE = "/path/to/data/root"
+    DATA_SOURCE = "/path/to/repo.db"
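
Annotation on the configuration example above: because the configuration file is plain Python, a small pre-flight check can confirm that ``DATA_SOURCE`` points at an existing repository file before the server is started. This is a hypothetical helper, not part of the server. It assumes ``DATA_SOURCE`` is a plain filesystem path, as in the example, and it reads the file with the standard ``runpy`` module rather than through Flask.

```python
import os
import runpy


def check_data_source(config_path):
    """Return DATA_SOURCE from a Python config file, or raise if unusable."""
    # run_path executes the file and returns its module-level names.
    settings = runpy.run_path(config_path)
    data_source = settings.get("DATA_SOURCE")
    if data_source is None:
        raise KeyError("DATA_SOURCE is not set in %s" % config_path)
    if not os.path.isfile(data_source):
        raise FileNotFoundError(
            "repository file not found: %s" % data_source
        )
    return data_source
```

A configuration that omits ``DATA_SOURCE`` or uses a value that is not a file path would be rejected by this check even if the server itself accepts it, so the check is only suitable for deployments that follow the file-path form shown above.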

 For production deployments, we shouldn't need to add any more configuration
 than this, as the other keys have sensible defaults. However,
@@ -413,7 +228,7 @@ RESPONSE_VALIDATION
     purposes.

 LANDING_MESSAGE_HTML
-    The server provides a simple landing page at its root. By setting this
+    The server provides a simple landing page at its root. By setting this
     value to point at a file containing an HTML block element it is possible to
     customize the landing page. This can be helpful to provide support links
     or details about the hosted datasets.
