This repository was archived by the owner on Jan 24, 2018. It is now read-only.

Sql repo #1166

Merged
merged 65 commits into from
May 5, 2016
Commits
769c813
Initial work on SQL DB for data repo.
jeromekelleher Apr 11, 2016
6c9eeac
Incremental support for FileSystemDataRepository.
jeromekelleher Apr 11, 2016
9fb9aeb
Changed test referenceSets to use single FASTA.
jeromekelleher Apr 9, 2016
017ce51
Updates to Ontologies and VariantAnnotationSet.
jeromekelleher Apr 11, 2016
7e97010
Initial repo manager add-readgroupset.
jeromekelleher Apr 10, 2016
0192a22
Included sequence annotations in DB.
jeromekelleher Apr 11, 2016
e32f7ec
Resolved conflicts in reads code.
jeromekelleher Apr 11, 2016
089c3e4
Refactored repoman CLI and started unit tests.
jeromekelleher Apr 11, 2016
2965783
Temporarily drop min-coverage to 80%
jeromekelleher Apr 11, 2016
fe8ad0a
Refactor for variant annotations code.
jeromekelleher Apr 12, 2016
7a7009d
Moved Variant Annotation Sets into VariantSets.
jeromekelleher Apr 13, 2016
87879c4
Finished refactor of variant annotations.
jeromekelleher Apr 14, 2016
b8cc2bb
Merge pull request #1079 from jeromekelleher/sql-repo-experiment
david4096 Apr 14, 2016
920a622
Added UNIQUE indexes to some tables.
jeromekelleher Apr 14, 2016
13e0906
Added URL support for ReadGroupSets.
jeromekelleher Apr 15, 2016
1445e43
Added primary keys, unique and foreign keys.
jeromekelleher Apr 15, 2016
5b8b095
Added CASCADE delete for datasets.
jeromekelleher Apr 15, 2016
2e6544c
added remove-readgroupset and tidied tests.
jeromekelleher Apr 15, 2016
2d0e64f
Miscellaneous fixes from review.
jeromekelleher Apr 18, 2016
2730605
Merge pull request #1104 from jeromekelleher/finish-repomanager
dcolligan Apr 21, 2016
f9afea4
Add feature set add / delete to repo manager
dcolligan Apr 20, 2016
96a88ae
Merge pull request #1125 from dcolligan/1109_featureset
jeromekelleher Apr 22, 2016
46b9eed
Changed Ontology to OntologyTermMap.
jeromekelleher Apr 18, 2016
c41460b
Added ontology add/delete/list functionality.
jeromekelleher Apr 22, 2016
f45c1a8
Merge pull request #1117 from jeromekelleher/ontologies-update
jeromekelleher Apr 26, 2016
e6a1e92
Add version check to data repo
dcolligan Apr 21, 2016
c573843
Merge pull request #1128 from dcolligan/1116_versioning
dcolligan Apr 26, 2016
88d2c61
Store various readGroup(Set) fields in db
dcolligan Apr 22, 2016
63e514e
Merge pull request #1140 from dcolligan/1129_file_handles
dcolligan Apr 26, 2016
d1bd433
Support for Variants/VariantAnnotations.
jeromekelleher Apr 22, 2016
c2b5718
Merge pull request #1126 from jeromekelleher/sql-repo-variants
jeromekelleher Apr 27, 2016
a67d66f
Merge remote-tracking branch 'upstream/master' into update-sql-repo
jeromekelleher Apr 27, 2016
51e47c6
Merge pull request #1153 from jeromekelleher/update-sql-repo
dcolligan Apr 27, 2016
35a6776
Throw exception if DB does not exist
dcolligan Apr 26, 2016
5b0d32a
Merge pull request #1148 from dcolligan/1141_db_error
dcolligan Apr 27, 2016
136e030
Updated download example data script.
jeromekelleher Apr 27, 2016
c228a2d
Merge pull request #1156 from jeromekelleher/sql-repo-example-data
david4096 Apr 27, 2016
dade81e
Removed the FileSystemDataRepository.
jeromekelleher Apr 22, 2016
3fc072b
Merge pull request #1155 from jeromekelleher/remove-filesystemdataset2
jeromekelleher Apr 28, 2016
03aa396
Updated documentation for SQL repo.
jeromekelleher Apr 27, 2016
f15fb7f
Merge pull request #1158 from jeromekelleher/sql-repo-documentation
dcolligan Apr 28, 2016
a50daa7
Fix for issue 1164, ensures isDerived is boolean.
jeromekelleher Apr 28, 2016
d73cff3
Merge pull request #1165 from jeromekelleher/1164-referencesets
jeromekelleher Apr 28, 2016
7bddff6
Add options to repo manager's add-referenceset
dcolligan Apr 28, 2016
507e296
Merge pull request #1184 from dcolligan/1150_ars_options
dcolligan Apr 29, 2016
426f899
Remove unintended db.commit
david4096 Apr 29, 2016
e93914a
Change type of ncbiTaxonId to integer
david4096 Apr 29, 2016
26f12c5
Set the name of a VASet to its localId
david4096 Apr 29, 2016
1f027a4
Prepare compliance data updated
david4096 Apr 29, 2016
157e775
Merge pull request #1189 from david4096/commit_ref
david4096 Apr 29, 2016
148737b
Duplicate removal test for objects under datasets
dcolligan May 2, 2016
4603de8
Merge pull request #1195 from dcolligan/1181_duplicate_test
david4096 May 2, 2016
34a4e5f
Miscellaneous review fixes.
jeromekelleher May 4, 2016
765015d
Merge pull request #1207 from jeromekelleher/review-fixes-2
david4096 May 4, 2016
ff3b773
Merge remote-tracking branch 'upstream/master' into sql-merge-update
jeromekelleher May 4, 2016
dc34d2d
Merge pull request #1208 from jeromekelleher/sql-merge-update
david4096 May 4, 2016
5495fad
Fixed #1209, search across multiple read groups within a single readg…
macieksmuga May 4, 2016
6aae314
Merge branch 'sql_repo' into 1209_multiple_read_groups_fix
macieksmuga May 4, 2016
8e2cdd9
Merge pull request #1212 from macieksmuga/1209_multiple_read_groups_fix
dcolligan May 5, 2016
f45f5f1
Add top-level exception handler to repo manager
dcolligan May 2, 2016
fd2aa74
Merge pull request #1193 from dcolligan/1172_exception
dcolligan May 5, 2016
71c031a
Repo manager end to end tests enabled
dcolligan May 3, 2016
3f36131
Merge pull request #1198 from dcolligan/1178_enable_tests
dcolligan May 5, 2016
f2a7388
Reinstate TestVerify
dcolligan May 2, 2016
b2e6334
Merge pull request #1194 from dcolligan/1182_test_verify
dcolligan May 5, 2016
2 changes: 1 addition & 1 deletion .travis.yml
@@ -32,7 +32,7 @@ before_script:
script:
- flake8 *.py tests ga4gh scripts --exclude=ez_setup.py
- nosetests --with-coverage --cover-package ga4gh
--cover-inclusive --cover-min-percentage 85
--cover-inclusive --cover-min-percentage 80
--cover-branches --cover-erase
- make clean -C docs
- make -C docs
257 changes: 36 additions & 221 deletions docs/configuration.rst
@@ -5,222 +5,47 @@ Configuration
*************

The GA4GH reference server has two basic elements to its configuration:
the `Data repository`_ and the `Configuration file`_. The repository is most easily configured via the `Repository manager`_ command line tool.
the `Data repository`_ and the `Configuration file`_.

---------------
Data repository
---------------

Data is input to the GA4GH server as a directory hierarchy, in which
the structure of data to be served is represented by the file system.
At the top level of the data hierarchy there are two required
directories to hold the top level container types: ``referenceSets`` and
``datasets``.

.. todo:: We need to link to the high-level API documentation for descriptions
of what the various objects here mean.

+++++++++++++
ReferenceSets
+++++++++++++

Within the data directory there must be a directory called ``referenceSets``.
Within this directory, each directory is interpreted as containing a
``ReferenceSet`` with the directory name mapped to the name of the
reference set. Here is an example of how reference data should be arranged::

references/
GRCh37.json
GRCh37/
1.fa.gz
1.fa.gz.fai
1.json
2.fa.gz
2.fa.gz.fai
2.json
# More references
GRCh38.json
GRCh38/
1.fa.gz
1.fa.gz.fai
1.json
2.fa.gz
2.fa.gz.fai
2.json
# More references

In this example we have two reference sets, with names ``GRCh37`` and ``GRCh38``.
Each reference set directory must be accompanied by a file
in JSON format, which lists the metadata for a given reference. For example,
the ``GRCh37.json`` file above might look something like

.. code-block:: json

{
"description": "GRCh37 primary assembly",
"sourceUri": "TODO",
"assemblyId": "TODO",
"sourceAccessions": [],
"isDerived": false,
"ncbiTaxonId": 9606
}

Within a reference set directory is a set of files defining the references
themselves. Each reference object corresponds to three files: the bgzip
compressed FASTA sequences, the FAI index and a JSON file providing the
metadata. There must be exactly one sequence per FASTA file, and the
sequence ID in the FASTA file must be equal to the reference name
(i.e., the first line in ``1.fa`` above should start with ``>1``.)

The JSON metadata required for a reference is similar to a reference set.
An example might look something like:

.. code-block:: json

{
"sourceUri": "TODO",
"sourceAccessions": [
"CM000663.2"
],
"sourceDivergence": null,
"md5checksum": "bb07c91cda4645ad8e75e375e3d6e5eb",
"isDerived": false,
"ncbiTaxonId": 9606
}


++++++++++
Datasets
++++++++++

The main container for genetic data is the dataset. Within the
main data directory there must be a directory called ``datasets``.
Within this directory each subdirectory is interpreted as a
dataset of that name. For example, we might have something like::

datasets/
1kg-phase1
variants/
# Variant data
reads/
# Read data
1kg-phase3
variants/
# Variant data
reads/
# Read data

In this case we specify two datasets with name equal to ``1kg-phase1`` and
``1kg-phase3``. These directories contain the read and variant data
within the ``variants`` and ``reads`` directory, respectively.

++++++++
Variants
++++++++

Each dataset can contain a number of VariantSets, each of which basically
corresponds to a VCF file. Because VCF files are commonly split by chromosome
a VariantSet can consist of many VCF files that have consistent metadata.
Within the ``variants`` directory, each directory is interpreted as a
variant set with that name. A variant set directory then contains
one or more indexed VCF/BCF files.

+++++
Reads
+++++

A dataset can contain many ReadGroupSets, and each ReadGroupSet contains
a number of ReadGroups. The ``reads`` directory contains a number of BAM
files, each of which corresponds to a single ReadGroupSet. ReadGroups are
then mapped to the ReadGroups that we find within the BAM file.
The repository in the GA4GH reference server defines how your data is organised. The
Member

Typo: data "are" organized.

Contributor Author

Depends on who you ask... we should rephrase this in any case if we're going for the registry db nomenclature.

repository itself is a SQLite database, which contains information about your
datasets, reference sets and so on. Bulk data (such as variants and reads)
is not stored in database, but instead accessed directly from the primary
data files at run time. The locations of these data files is entirely up
to the administrator.
Member

...but in some cases still requires write access.

Contributor Author

We would need to be precise about how and when here. It would just be confusing to say that we'll sometimes need write access without fully qualifying.
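
The registry model described in the added paragraph (a SQLite file that records datasets and the locations of their data files, while the bulk data stays in the primary files) can be sketched roughly as follows. This is an illustration only: the table names, column names and the helper functions `open_registry` and `add_read_group_set` are hypothetical and are not taken from the server's actual schema. The `UNIQUE` constraint and `ON DELETE CASCADE` mirror the intent of the commits "Added UNIQUE indexes to some tables" and "Added CASCADE delete for datasets".

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only,
# not the reference server's actual schema.
SCHEMA = """
CREATE TABLE Dataset (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE ReadGroupSet (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    datasetId INTEGER NOT NULL REFERENCES Dataset(id) ON DELETE CASCADE,
    dataUrl TEXT NOT NULL,
    UNIQUE (datasetId, name)
);
"""


def open_registry(path=":memory:"):
    """Open a registry database and create the illustrative tables."""
    db = sqlite3.connect(path)
    # SQLite only enforces foreign keys when this pragma is switched on.
    db.execute("PRAGMA foreign_keys = ON")
    db.executescript(SCHEMA)
    return db


def add_read_group_set(db, dataset_name, name, data_url):
    """Record where a BAM file lives; the file itself is never copied."""
    row = db.execute(
        "SELECT id FROM Dataset WHERE name = ?", (dataset_name,)).fetchone()
    if row is None:
        raise ValueError("unknown dataset: " + dataset_name)
    db.execute(
        "INSERT INTO ReadGroupSet (name, datasetId, dataUrl) VALUES (?, ?, ?)",
        (name, row[0], data_url))
    db.commit()


db = open_registry()
db.execute("INSERT INTO Dataset (name) VALUES (?)", ("aDataset",))
add_read_group_set(db, "aDataset", "aReadGroupSet", "/data/aReadGroupSet.bam")
urls = [r[0] for r in db.execute("SELECT dataUrl FROM ReadGroupSet")]

# Removing the dataset also removes the rows that belong to it.
db.execute("DELETE FROM Dataset WHERE name = ?", ("aDataset",))
remaining = db.execute("SELECT COUNT(*) FROM ReadGroupSet").fetchone()[0]
```

Only the path to the BAM file is stored, so the administrator remains free to keep the data files anywhere, as the paragraph states.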


+++++++
Example
+++++++

An example layout might look like::

ga4gh-data/
referencesSet/
referenceSet1.json
referenceSet1/
1.fa.gz
1.fa.gz.fai
1.json
2.fa.gz
2.fa.gz.fai
2.json
# More references
datasets/
dataset1/
/variants/
variantSet1/
chr1.vcf.gz
chr1.vcf.gz.tbi
chr2.vcf.gz
chr2.vcf.gz.tbi
# More VCFs
variantSet2/
chr1.bcf
chr1.bcf.csi
chr2.bcf
chr2.bcf.csi
# More BCFs
/reads/
sample1.bam
sample1.bam.bai
sample2.bam
sample2.bam.bai
# More BAMS

.. note:: Any change to the data repository (using the repository manager or
otherwise) requires a restart of the server to be picked up by the
server. The server does not detect changes in the data repository
while running.

------------------
Repository manager
------------------

The repository manager is a tool provided to abstract away the details of
building a data repository behind a convenient command line interface. It can
be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if developing).
Following are descriptions of the commands that the repo manager exposes.

All of the ``add-*`` commands take a ``--moveMode`` flag which specifies how
to transfer the given file (or directory) into the data repository. The
options are ``move`` (moves the file from its original path to the new
path), ``copy`` (copies the contents of the file into the data repository) and
``link`` (creates a symlink in the data repository to the file). The
default is ``link``.

Many of the ``add-*`` commands take additional flags to specify fields to be
entered into the ``.json`` files that are created for the given file.
Utilize the command line help for a particular command to get a list of
these flags.
The repository manager provides an administration interface to the data
repository. It can be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if
developing). Following are descriptions of the commands that the repo manager
exposes.

+++++++
init
+++++++

Initializes a data repository at the path provided. All of the other
commands require a data repository path as an argument, so this will likely be
commands require a data repository file as an argument, so this will likely be
the first command you run.

.. code-block:: bash

$ ga4gh_repo init path/to/datarepo
$ ga4gh_repo init path/to/repo.db


+++++++
check
verify
+++++++

Performs some consistency checks on the given data repository to ensure it is
Member
david4096 Apr 28, 2016

We should be more specific about what this is and why one would run it. What is a warning, fatal error?

From below:

        Checks that the data pointed to in the repository works and
        we don't have any broken URLs, missing files, etc.

well-formed.

.. code-block:: bash

$ ga4gh_repo check path/to/datarepo
$ ga4gh_repo verify path/to/repo.db
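
As a rough illustration of the kind of check the reviewer quotes above ("we don't have any broken URLs, missing files, etc."), the sketch below reports which local data files are missing. It is not the repo manager's implementation: `find_missing_files` is a hypothetical helper, and it only covers local paths, not URLs.

```python
import os
import tempfile


def find_missing_files(paths):
    """Return the local paths that do not exist on disk (hypothetical helper)."""
    return [p for p in paths if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as tmp:
    # One file that exists and one that does not.
    present = os.path.join(tmp, "sample1.bam")
    open(present, "wb").close()
    absent = os.path.join(tmp, "sample2.bam")

    missing = find_missing_files([present, absent])
    missing_names = [os.path.basename(p) for p in missing]
```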

+++++++
list
@@ -230,17 +55,7 @@ Lists the contents of the given data repository.

.. code-block:: bash

$ ga4gh_repo list path/to/datarepo

+++++++
destroy
+++++++

Destroys the given data repository by deleting its directory tree.

.. code-block:: bash

$ ga4gh_repo destroy path/to/datarepo
$ ga4gh_repo list path/to/repo.db

+++++++++++
add-dataset
@@ -250,7 +65,7 @@ Creates a dataset in the given repository with a given name.

.. code-block:: bash

$ ga4gh_repo add-dataset path/to/datarepo aDataset
$ ga4gh_repo add-dataset path/to/repo.db aDataset

+++++++++++++++
remove-dataset
@@ -260,18 +75,17 @@ Destroys a dataset in the given repository with a given name.

.. code-block:: bash

$ ga4gh_repo remove-dataset path/to/datarepo aDataset
$ ga4gh_repo remove-dataset path/to/repo.db aDataset
Member

The optional arguments should be documented as well.

Contributor Author

I think the time for these comments was in the #1158 PR @david4096. This was never meant to be a full and final update of the docs. The stated intention was to make them less inaccurate. We definitely want to do a full run through of the docs once this has been merged, but I don't think it makes sense to block the whole PR on it.


++++++++++++++++
add-referenceset
++++++++++++++++

Adds a given reference set file to a given data repository. The file must
have the extension ``.fa.gz``.
Adds a given reference set file to a given data repository.

.. code-block:: bash

$ ga4gh_repo add-referenceset path/to/datarepo path/to/aReferenceSet.fa.gz
$ ga4gh_repo add-referenceset path/to/repo.db path/to/aReferenceSet.fa.gz

++++++++++++++++++++
remove-referenceset
@@ -281,30 +95,30 @@ Removes a given reference set from a given data repository.

.. code-block:: bash

$ ga4gh_repo remove-referenceset path/to/datarepo aReferenceSet
$ ga4gh_repo remove-referenceset path/to/repo.db aReferenceSet

++++++++++++++++
add-ontologymap
add-ontology
Member
david4096 Apr 28, 2016

This has gone back and forth once or twice, so I'm just making sure we land where we want to be: "map" or not "map". It looks like this is the correct signature according to the code.

Contributor Author

As discussed in #1117, and in preparation for #1147 we want Ontology.

++++++++++++++++

Adds an Ontology Map, which maps identifiers to ontology terms, to
Adds an Ontology Map, which maps identifiers to ontology terms, to
the repository. Ontology maps are tab delimited files with an
identifier/term pair per row.


.. code-block:: bash

$ ga4gh_repo add-ontologymap path/to/datarepo path/to/aOntoMap.txt
$ ga4gh_repo add-ontology path/to/repo.db path/to/aOntoMap.txt

++++++++++++++++++++
remove-ontologymap
remove-ontology
++++++++++++++++++++

Removes a given Ontology Map from a given data repository.

.. code-block:: bash

$ ga4gh_repo remove-ontologymap path/to/datarepo aOntoMap
$ ga4gh_repo remove-ontology path/to/repo.db aOntoMap


+++++++++++++++++
@@ -316,7 +130,7 @@ file must have the extension ``.bam``.

.. code-block:: bash

$ ga4gh_repo add-readgroupset path/to/datarepo aDataset path/to/aReadGroupSet.bam
$ ga4gh_repo add-readgroupset path/to/repo.db aDataset path/to/aReadGroupSet.bam

++++++++++++++++++++
remove-readgroupset
@@ -326,18 +140,19 @@ Removes a read group set from a given data repository and dataset.

.. code-block:: bash

$ ga4gh_repo remove-readgroupset path/to/datarepo aDataset aReadGroupSet
$ ga4gh_repo remove-readgroupset path/to/repo.db aDataset aReadGroupSet

+++++++++++++++
add-variantset
+++++++++++++++

Adds a variant set directory to a given data repository and dataset. The
directory should contain file(s) with extension ``.vcf.gz``. If a variant set is annotated it will be added as both a variant set and a variant annotation set.
directory should contain file(s) with extension ``.vcf.gz``. If a variant set
is annotated it will be added as both a variant set and a variant annotation set.
Member
david4096 Apr 28, 2016

Are we providing indexes for them? We should document that no two VCFs can be on the same contig.

Contributor Author

We should create an issue holding the documentation requirements. This is not related to the actual changes in the PR.


.. code-block:: bash

$ ga4gh_repo add-variantset path/to/datarepo aDataset path/to/aVariantSet
$ ga4gh_repo add-variantset path/to/repo.db aDataset path/to/aVariantSet

+++++++++++++++++
remove-variantset
@@ -347,7 +162,7 @@ Removes a variant set from a given data repository and dataset.

.. code-block:: bash

$ ga4gh_repo remove-variantset path/to/datarepo aDataset aVariantSet
$ ga4gh_repo remove-variantset path/to/repo.db aDataset aVariantSet
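
The commands documented in this section can be chained to build a repository from scratch. The sketch below assembles the invocations as argument lists and prints them instead of running them. The subcommand names and positional argument order are taken from the updated examples above. The ordering of the steps and the helper name `build_repo_commands` are assumptions, and optional flags are not shown.

```python
import shlex


def build_repo_commands(repo, dataset, reference_fasta, bam, vcf_dir):
    """Return the ga4gh_repo invocations, in order, as argv lists.

    The step ordering is an assumption made for illustration.
    """
    return [
        ["ga4gh_repo", "init", repo],
        ["ga4gh_repo", "add-referenceset", repo, reference_fasta],
        ["ga4gh_repo", "add-dataset", repo, dataset],
        ["ga4gh_repo", "add-readgroupset", repo, dataset, bam],
        ["ga4gh_repo", "add-variantset", repo, dataset, vcf_dir],
        ["ga4gh_repo", "verify", repo],
        ["ga4gh_repo", "list", repo],
    ]


commands = build_repo_commands(
    "path/to/repo.db", "aDataset",
    "path/to/aReferenceSet.fa.gz",
    "path/to/aReadGroupSet.bam",
    "path/to/aVariantSet")

for argv in commands:
    # Printed rather than executed; pass each argv to
    # subprocess.check_call to actually run it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Because the server does not detect changes to the repository while running, it would need a restart after a script like this has been run against a live repository.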

------------------
Configuration file
@@ -357,12 +172,12 @@ The GA4GH reference server is a `Flask application <http://flask.pocoo.org/>`_
and uses the standard `Flask configuration file mechanisms
<http://flask.pocoo.org/docs/0.10/config/>`_.
Many configuration files will be very simple, and will consist of just
one directive instructing the server where to look for data; for
one directive instructing the server where to find the data repository;
example, we might have

.. code-block:: python

DATA_SOURCE = "/path/to/data/root"
DATA_SOURCE = "/path/to/repo.db"

For production deployments, we shouldn't need to add any more configuration
than this, as the other keys have sensible defaults. However,
@@ -413,7 +228,7 @@ RESPONSE_VALIDATION
purposes.

LANDING_MESSAGE_HTML
The server provides a simple landing page at its root. By setting this
The server provides a simple landing page at its root. By setting this
value to point at a file containing an HTML block element it is possible to
customize the landing page. This can be helpful to provide support links
or details about the hosted datasets.