Sql repo #1166
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -5,222 +5,47 @@ Configuration | |
************* | ||
|
||
The GA4GH reference server has two basic elements to its configuration: | ||
the `Data repository`_ and the `Configuration file`_. The repository is most easily configured via the `Repository manager`_ command line tool. | ||
the `Data repository`_ and the `Configuration file`_. | ||
|
||
--------------- | ||
Data repository | ||
--------------- | ||
|
||
Data is input to the GA4GH server as a directory hierarchy, in which | ||
the structure of data to be served is represented by the file system. | ||
At the top level of the data hierarchy there are two required | ||
directories to hold the top level container types: ``referenceSets`` and | ||
``datasets``. | ||
|
||
.. todo:: We need to link to the high-level API documentation for descriptions | ||
of what the various objects here mean. | ||
|
||
+++++++++++++ | ||
ReferenceSets | ||
+++++++++++++ | ||
|
||
Within the data directory there must be a directory called ``referenceSets``. | ||
Within this directory, each directory is interpreted as containing a | ||
``ReferenceSet`` with the directory name mapped to the name of the | ||
reference set. Here is an example of how reference data should be arranged:: | ||
|
||
references/ | ||
GRCh37.json | ||
GRCh37/ | ||
1.fa.gz | ||
1.fa.gz.fai | ||
1.json | ||
2.fa.gz | ||
2.fa.gz.fai | ||
2.json | ||
# More references | ||
GRCh38.json | ||
GRCh38/ | ||
1.fa.gz | ||
1.fa.gz.fai | ||
1.json | ||
2.fa.gz | ||
2.fa.gz.fai | ||
2.json | ||
# More references | ||
|
||
In this example we have two reference sets, with names ``GRCh37`` and ``GRCh38``. | ||
Each reference set directory must be accompanied by a file | ||
in JSON format, which lists the metadata for a given reference. For example, | ||
the ``GRCh37.json`` file above might look something like | ||
|
||
.. code-block:: json | ||
|
||
{ | ||
"description": "GRCh37 primary assembly", | ||
"sourceUri": "TODO", | ||
"assemblyId": "TODO", | ||
"sourceAccessions": [], | ||
"isDerived": false, | ||
"ncbiTaxonId": 9606 | ||
} | ||
|
||
Within a reference set directory is a set of files defining the references | ||
themselves. Each reference object corresponds to three files: the bgzip | ||
compressed FASTA sequences, the FAI index and a JSON file providing the | ||
metadata. There must be exactly one sequence per FASTA file, and the | ||
sequence ID in the FASTA file must be equal to the reference name | ||
(i.e., the first line in ``1.fa`` above should start with ``>1``.) | ||
|
||
The JSON metadata required for a reference is similar to a reference set. | ||
An example might look something like: | ||
|
||
.. code-block:: json | ||
|
||
{ | ||
"sourceUri": "TODO", | ||
"sourceAccessions": [ | ||
"CM000663.2" | ||
], | ||
"sourceDivergence": null, | ||
"md5checksum": "bb07c91cda4645ad8e75e375e3d6e5eb", | ||
"isDerived": false, | ||
"ncbiTaxonId": 9606 | ||
} | ||
|
||
|
||
++++++++++ | ||
Datasets | ||
++++++++++ | ||
|
||
The main container for genetic data is the dataset. Within the | ||
main data directory there must be a directory called ``datasets``. | ||
Within this directory each subdirectory is interpreted as a | ||
dataset of that name. For example, we might have something like:: | ||
|
||
datasets/ | ||
1kg-phase1 | ||
variants/ | ||
# Variant data | ||
reads/ | ||
# Read data | ||
1kg-phase3 | ||
variants/ | ||
# Variant data | ||
reads/ | ||
# Read data | ||
|
||
In this case we specify two datasets with name equal to ``1kg-phase1`` and | ||
``1kg-phase3``. These directories contain the variant and read data | |
within the ``variants`` and ``reads`` directories, respectively. | |
|
||
++++++++ | ||
Variants | ||
++++++++ | ||
|
||
Each dataset can contain a number of VariantSets, each of which basically | ||
corresponds to a VCF file. Because VCF files are commonly split by chromosome | ||
a VariantSet can consist of many VCF files that have consistent metadata. | ||
Within the ``variants`` directory, each directory is interpreted as a | ||
variant set with that name. A variant set directory then contains | ||
one or more indexed VCF/BCF files. | ||
|
||
+++++ | ||
Reads | ||
+++++ | ||
|
||
A dataset can contain many ReadGroupSets, and each ReadGroupSet contains | ||
a number of ReadGroups. The ``reads`` directory contains a number of BAM | ||
files, each of which corresponds to a single ReadGroupSet. ReadGroups are | ||
then mapped to the ReadGroups that we find within the BAM file. | ||
The repository in the GA4GH reference server defines how your data is organised. The | ||
repository itself is a SQLite database, which contains information about your | ||
datasets, reference sets and so on. Bulk data (such as variants and reads) | ||
is not stored in the database, but is instead accessed directly from the primary | |
data files at run time. The location of these data files is entirely up | |
to the administrator. | ||
Review comment: ...but in some cases still requires write access.
Reply: We would need to be precise about how and when here. It would just be confusing to say that we'll sometimes need write access without fully qualifying. |
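The split described above (metadata in SQLite, bulk data left in place on disk) can be sketched as follows. This is a minimal illustration only: the table names, column names and the ``data_file_locations`` helper are hypothetical and are not taken from the server's actual schema.

```python
import sqlite3

# Hypothetical schema for illustration; the server's real tables may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dataset (id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
conn.execute(
    "CREATE TABLE VariantSet ("
    " id INTEGER PRIMARY KEY,"
    " datasetId INTEGER REFERENCES Dataset(id),"
    " name TEXT,"
    " dataUrl TEXT)"  # path to the primary data file; the bulk data stays there
)
conn.execute("INSERT INTO Dataset (name) VALUES (?)", ("aDataset",))
conn.execute(
    "INSERT INTO VariantSet (datasetId, name, dataUrl) VALUES (1, ?, ?)",
    ("aVariantSet", "/any/path/chr1.vcf.gz"),
)


def data_file_locations(connection, dataset_name):
    """Return the registered data file paths for one dataset."""
    rows = connection.execute(
        "SELECT v.dataUrl FROM VariantSet v"
        " JOIN Dataset d ON v.datasetId = d.id"
        " WHERE d.name = ? ORDER BY v.id",
        (dataset_name,),
    )
    return [row[0] for row in rows]
```

Only the path is recorded in the database, so the administrator remains free to keep the VCF or BAM files wherever is convenient.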
||
|
||
+++++++ | ||
Example | ||
+++++++ | ||
|
||
An example layout might look like:: | ||
|
||
ga4gh-data/ | ||
referenceSets/ | |
referenceSet1.json | ||
referenceSet1/ | ||
1.fa.gz | ||
1.fa.gz.fai | ||
1.json | ||
2.fa.gz | ||
2.fa.gz.fai | ||
2.json | ||
# More references | ||
datasets/ | ||
dataset1/ | ||
/variants/ | ||
variantSet1/ | ||
chr1.vcf.gz | ||
chr1.vcf.gz.tbi | ||
chr2.vcf.gz | ||
chr2.vcf.gz.tbi | ||
# More VCFs | ||
variantSet2/ | ||
chr1.bcf | ||
chr1.bcf.csi | ||
chr2.bcf | ||
chr2.bcf.csi | ||
# More BCFs | ||
/reads/ | ||
sample1.bam | ||
sample1.bam.bai | ||
sample2.bam | ||
sample2.bam.bai | ||
# More BAMS | ||
|
||
.. note:: Any change to the data repository (using the repository manager or | ||
otherwise) requires a restart of the server to be picked up by the | ||
server. The server does not detect changes in the data repository | ||
while running. | ||
|
||
------------------ | ||
Repository manager | ||
------------------ | ||
|
||
The repository manager is a tool provided to abstract away the details of | ||
building a data repository behind a convenient command line interface. It can | ||
be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if developing). | ||
Following are descriptions of the commands that the repo manager exposes. | ||
|
||
All of the ``add-*`` commands take a ``--moveMode`` flag which specifies how | ||
to transfer the given file (or directory) into the data repository. The | ||
options are ``move`` (moves the file from its original path to the new | ||
path), ``copy`` (copies the contents of the file into the data repository) and | ||
``link`` (creates a symlink in the data repository to the file). The | ||
default is ``link``. | ||
|
||
Many of the ``add-*`` commands take additional flags to specify fields to be | ||
entered into the ``.json`` files that are created for the given file. | ||
Utilize the command line help for a particular command to get a list of | ||
these flags. | ||
The repository manager provides an administration interface to the data | |
repository. It can be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if | ||
developing). Following are descriptions of the commands that the repo manager | ||
exposes. | ||
|
||
+++++++ | ||
init | ||
+++++++ | ||
|
||
Initializes a data repository at the path provided. All of the other | ||
commands require a data repository path as an argument, so this will likely be | ||
commands require a data repository file as an argument, so this will likely be | ||
the first command you run. | ||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo init path/to/datarepo | ||
$ ga4gh_repo init path/to/repo.db | ||
|
||
|
||
+++++++ | ||
check | ||
verify | ||
+++++++ | ||
|
||
Performs some consistency checks on the given data repository to ensure it is | ||
Review comment: We should be more specific about what this is and why one would run it. What is a warning, fatal error? From below:
|
||
well-formed. | ||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo check path/to/datarepo | ||
$ ga4gh_repo verify path/to/repo.db | ||
|
||
+++++++ | ||
list | ||
|
@@ -230,17 +55,7 @@ Lists the contents of the given data repository. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo list path/to/datarepo | ||
|
||
+++++++ | ||
destroy | ||
+++++++ | ||
|
||
Destroys the given data repository by deleting its directory tree. | ||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo destroy path/to/datarepo | ||
$ ga4gh_repo list path/to/repo.db | ||
|
||
+++++++++++ | ||
add-dataset | ||
|
@@ -250,7 +65,7 @@ Creates a dataset in the given repository with a given name. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo add-dataset path/to/datarepo aDataset | ||
$ ga4gh_repo add-dataset path/to/repo.db aDataset | ||
|
||
+++++++++++++++ | ||
remove-dataset | ||
|
@@ -260,18 +75,17 @@ Destroys a dataset in the given repository with a given name. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo remove-dataset path/to/datarepo aDataset | ||
$ ga4gh_repo remove-dataset path/to/repo.db aDataset | ||
Review comment: The optional arguments should be documented as well.
Reply: I think the time for these comments was in the #1158 PR @david4096. This was never meant to be a full and final update of the docs. The stated intention was to make them less inaccurate. We definitely want to do a full run through of the docs once this has been merged, but I don't think it makes sense to block the whole PR on it. |
||
|
||
++++++++++++++++ | ||
add-referenceset | ||
++++++++++++++++ | ||
|
||
Adds a given reference set file to a given data repository. The file must | ||
have the extension ``.fa.gz``. | ||
Adds a given reference set file to a given data repository. | ||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo add-referenceset path/to/datarepo path/to/aReferenceSet.fa.gz | ||
$ ga4gh_repo add-referenceset path/to/repo.db path/to/aReferenceSet.fa.gz | ||
|
||
++++++++++++++++++++ | ||
remove-referenceset | ||
|
@@ -281,30 +95,30 @@ Removes a given reference set from a given data repository. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo remove-referenceset path/to/datarepo aReferenceSet | ||
$ ga4gh_repo remove-referenceset path/to/repo.db aReferenceSet | ||
|
||
++++++++++++++++ | ||
add-ontologymap | ||
add-ontology | ||
Review comment: This has gone back and forth once or twice, just making sure we land where we want to be: "map" or not "map". Looks like this is the correct signature according to the code. |
||
++++++++++++++++ | ||
|
||
Adds an Ontology Map, which maps identifiers to ontology terms, to | ||
Adds an Ontology Map, which maps identifiers to ontology terms, to | ||
the repository. Ontology maps are tab delimited files with an | ||
identifier/term pair per row. | ||
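As a rough sketch of the file format just described, the following reads a tab-delimited ontology map into a dictionary. The ``read_ontology_map`` function is hypothetical (it is not part of the repository manager), the two rows are example data, and skipping malformed rows is an assumption about the desired behaviour.

```python
import csv
import io


def read_ontology_map(handle):
    """Parse tab-delimited identifier/term rows into a dict."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            # Assumption: blank or malformed rows are skipped, not fatal.
            continue
        identifier, term = row
        mapping[identifier] = term
    return mapping


# Example data: two identifier/term pairs, one per row.
example = io.StringIO("SO:0001583\tmissense_variant\nSO:0001587\tstop_gained\n")
ontology_map = read_ontology_map(example)
```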
|
||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo add-ontologymap path/to/datarepo path/to/aOntoMap.txt | ||
$ ga4gh_repo add-ontology path/to/repo.db path/to/aOntoMap.txt | ||
|
||
++++++++++++++++++++ | ||
remove-ontologymap | ||
remove-ontology | ||
++++++++++++++++++++ | ||
|
||
Removes a given Ontology Map from a given data repository. | ||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo remove-ontologymap path/to/datarepo aOntoMap | ||
$ ga4gh_repo remove-ontology path/to/repo.db aOntoMap | ||
|
||
|
||
+++++++++++++++++ | ||
|
@@ -316,7 +130,7 @@ file must have the extension ``.bam``. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo add-readgroupset path/to/datarepo aDataset path/to/aReadGroupSet.bam | ||
$ ga4gh_repo add-readgroupset path/to/repo.db aDataset path/to/aReadGroupSet.bam | ||
|
||
++++++++++++++++++++ | ||
remove-readgroupset | ||
|
@@ -326,18 +140,19 @@ Removes a read group set from a given data repository and dataset. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo remove-readgroupset path/to/datarepo aDataset aReadGroupSet | ||
$ ga4gh_repo remove-readgroupset path/to/repo.db aDataset aReadGroupSet | ||
|
||
+++++++++++++++ | ||
add-variantset | ||
+++++++++++++++ | ||
|
||
Adds a variant set directory to a given data repository and dataset. The | ||
directory should contain file(s) with extension ``.vcf.gz``. If a variant set is annotated it will be added as both a variant set and a variant annotation set. | ||
directory should contain file(s) with extension ``.vcf.gz``. If a variant set | ||
is annotated it will be added as both a variant set and a variant annotation set. | ||
Review comment: Are we providing indexes for them? We should document that no two VCFs can be on the same contig.
Reply: We should create an issue holding the documentation requirements. This is not related to the actual changes in the PR. |
||
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo add-variantset path/to/datarepo aDataset path/to/aVariantSet | ||
$ ga4gh_repo add-variantset path/to/repo.db aDataset path/to/aVariantSet | ||
|
||
+++++++++++++++++ | ||
remove-variantset | ||
|
@@ -347,7 +162,7 @@ Removes a variant set from a given data repository and dataset. | |
|
||
.. code-block:: bash | ||
|
||
$ ga4gh_repo remove-variantset path/to/datarepo aDataset aVariantSet | ||
$ ga4gh_repo remove-variantset path/to/repo.db aDataset aVariantSet | ||
|
||
------------------ | ||
Configuration file | ||
|
@@ -357,12 +172,12 @@ The GA4GH reference server is a `Flask application <http://flask.pocoo.org/>`_ | |
and uses the standard `Flask configuration file mechanisms | ||
<http://flask.pocoo.org/docs/0.10/config/>`_. | ||
Many configuration files will be very simple, and will consist of just | ||
one directive instructing the server where to look for data; for | ||
one directive instructing the server where to find the data repository; | ||
example, we might have | ||
|
||
.. code-block:: python | ||
|
||
DATA_SOURCE = "/path/to/data/root" | ||
DATA_SOURCE = "/path/to/repo.db" | ||
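Flask configuration files are ordinary Python files, and only uppercase names in them are treated as configuration keys. The sketch below imitates that behaviour with the standard library alone, so it runs without Flask installed. The ``load_config`` helper is hypothetical and is not how the server itself loads its configuration.

```python
import os
import runpy
import tempfile


def load_config(path):
    """Execute a Python config file and keep only its uppercase names."""
    namespace = runpy.run_path(path)
    return {key: value for key, value in namespace.items() if key.isupper()}


with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "config.py")
    with open(config_path, "w") as handle:
        # The lowercase name is ignored, mirroring Flask's convention.
        handle.write('DATA_SOURCE = "/path/to/repo.db"\nhelper = 1\n')
    config = load_config(config_path)
```

In a real deployment the same file would instead be passed to the server through Flask's own configuration mechanisms.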
|
||
For production deployments, we shouldn't need to add any more configuration | ||
than this, as the other keys have sensible defaults. However, | ||
|
@@ -413,7 +228,7 @@ RESPONSE_VALIDATION | |
purposes. | ||
|
||
LANDING_MESSAGE_HTML | ||
The server provides a simple landing page at its root. By setting this | ||
The server provides a simple landing page at its root. By setting this | ||
value to point at a file containing an HTML block element it is possible to | ||
customize the landing page. This can be helpful to provide support links | ||
or details about the hosted datasets. | ||
|
Review comment: Typo: data "are" organized.
Reply: Depends on who you ask... we should rephrase this in any case if we're going for the registry db nomenclature.