ga4gh · dcolligan · May 5, 2016 · Apr 11, 2016 · Apr 11, 2016 · Apr 9, 2016
diff --git a/.travis.yml b/.travis.yml
@@ -32,7 +32,7 @@ before_script:
 script: 
   - flake8 *.py tests ga4gh scripts --exclude=ez_setup.py
   - nosetests --with-coverage --cover-package ga4gh
-              --cover-inclusive --cover-min-percentage 85
+              --cover-inclusive --cover-min-percentage 80
               --cover-branches --cover-erase
   - make clean -C docs
   - make -C docs
diff --git a/docs/configuration.rst b/docs/configuration.rst
@@ -5,222 +5,47 @@ Configuration
 *************
 
 The GA4GH reference server has two basic elements to its configuration:
-the `Data repository`_ and the `Configuration file`_.  The repository is most easily configured via the `Repository manager`_ command line tool.
+the `Data repository`_ and the `Configuration file`_.
 
 ---------------
 Data repository
 ---------------
 
-Data is input to the GA4GH server as a directory hierarchy, in which
-the structure of data to be served is represented by the file system.
-At the top level of the data hierarchy there are two required
-directories to hold the top level container types: ``referenceSets`` and
-``datasets``.
-
-.. todo:: We need to link to the high-level API documentation for descriptions
-    of what the various objects here mean.
-
-+++++++++++++
-ReferenceSets
-+++++++++++++
-
-Within the data directory there must be a directory called ``referenceSets``.
-Within this directory, each directory is interpreted as containing a
-``ReferenceSet`` with the directory name mapped to the name of the
-reference set.  Here is an example of how reference data should be arranged::
-
-    references/
-        GRCh37.json
-        GRCh37/
-            1.fa.gz
-            1.fa.gz.fai
-            1.json
-            2.fa.gz
-            2.fa.gz.fai
-            2.json
-            # More references
-        GRCh38.json
-        GRCh38/
-            1.fa.gz
-            1.fa.gz.fai
-            1.json
-            2.fa.gz
-            2.fa.gz.fai
-            2.json
-            # More references
-
-In this example we have two reference sets, with names ``GRCh37`` and ``GRCh38``.
-Each reference set directory must be accompanied by a file
-in JSON format, which lists the metadata for a given reference. For example,
-the ``GRCh37.json`` file above might look something like
-
-.. code-block:: json
-
-    {
-        "description": "GRCh37 primary assembly",
-        "sourceUri": "TODO",
-        "assemblyId": "TODO",
-        "sourceAccessions": [],
-        "isDerived": false,
-        "ncbiTaxonId": 9606
-    }
-
-Within a reference set directory is a set of files defining the references
-themselves. Each reference object corresponds to three files: the bgzip
-compressed FASTA sequences, the FAI index and a JSON file providing the
-metadata. There must be exactly one sequence per FASTA file, and the
-sequence ID in the FASTA file must be equal to the reference name
-(i.e., the first line in ``1.fa`` above should start with ``>1``.)
-
-The JSON metadata required for a reference is similar to a reference set.
-An example might look something like:
-
-.. code-block:: json
-
-    {
-        "sourceUri": "TODO",
-        "sourceAccessions": [
-            "CM000663.2"
-        ],
-        "sourceDivergence": null,
-        "md5checksum": "bb07c91cda4645ad8e75e375e3d6e5eb",
-        "isDerived": false,
-        "ncbiTaxonId": 9606
-    }
-
-
-++++++++++
-Datasets
-++++++++++
-
-The main container for genetic data is the dataset. Within the
-main data directory there must be a directory called ``datasets``.
-Within this directory each subdirectory is interpreted as a
-dataset of that name. For example, we might have something like::
-
-    datasets/
-        1kg-phase1
-            variants/
-                # Variant data
-            reads/
-                # Read data
-        1kg-phase3
-            variants/
-                # Variant data
-            reads/
-                # Read data
-
-In this case we specify two datasets with name equal to ``1kg-phase1`` and
-``1kg-phase3``. These directories contain the read and variant data
-within the ``variants`` and ``reads`` directory, respectively.
-
-++++++++
-Variants
-++++++++
-
-Each dataset can contain a number of VariantSets, each of which basically
-corresponds to a VCF file. Because VCF files are commonly split by chromosome
-a VariantSet can consist of many VCF files that have consistent metadata.
-Within the ``variants`` directory, each directory is interpreted as a
-variant set with that name. A variant set directory then contains
-one or more indexed VCF/BCF files.
-
-+++++
-Reads
-+++++
-
-A dataset can contain many ReadGroupSets, and each ReadGroupSet contains
-a number of ReadGroups. The ``reads`` directory contains a number of BAM
-files, each of which corresponds to a single ReadGroupSet. ReadGroups are
-then mapped to the ReadGroups that we find within the BAM file.
+The repository in the GA4GH reference server defines how your data is organised. The
+repository itself is a SQLite database, which contains information about your
+datasets, reference sets and so on. Bulk data (such as variants and reads)
+is not stored in the database, but is instead accessed directly from the primary
+data files at run time. The location of these data files is entirely up
+to the administrator.
 
-+++++++
-Example
-+++++++
-
-An example layout might look like::
-
-    ga4gh-data/
-        referencesSet/
-            referenceSet1.json
-            referenceSet1/
-                1.fa.gz
-                1.fa.gz.fai
-                1.json
-                2.fa.gz
-                2.fa.gz.fai
-                2.json
-                # More references
-        datasets/
-            dataset1/
-                /variants/
-                    variantSet1/
-                        chr1.vcf.gz
-                        chr1.vcf.gz.tbi
-                        chr2.vcf.gz
-                        chr2.vcf.gz.tbi
-                        # More VCFs
-                    variantSet2/
-                        chr1.bcf
-                        chr1.bcf.csi
-                        chr2.bcf
-                        chr2.bcf.csi
-                        # More BCFs
-                /reads/
-                    sample1.bam
-                    sample1.bam.bai
-                    sample2.bam
-                    sample2.bam.bai
-                    # More BAMS
-
-.. note:: Any change to the data repository (using the repository manager or
-    otherwise) requires a restart of the server to be picked up by the
-    server.  The server does not detect changes in the data repository
-    while running.
-
-------------------
-Repository manager
-------------------
-
-The repository manager is a tool provided to abstract away the details of
-building a data repository behind a convenient command line interface.  It can
-be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if developing).
-Following are descriptions of the commands that the repo manager exposes.
-
-All of the ``add-*`` commands take a ``--moveMode`` flag which specifies how
-to transfer the given file (or directory) into the data repository.  The
-options are ``move`` (moves the file from its original path to the new
-path), ``copy`` (copies the contents of the file into the data repository) and
-``link`` (creates a symlink in the data repository to the file).  The
-default is ``link``.
-
-Many of the ``add-*`` commands take additional flags to specify fields to be
-entered into the ``.json`` files that are created for the given file.
-Utilize the command line help for a particular command to get a list of
-these flags.
+The repository manager provides an administration interface to the data
+repository. It can be accessed via ``ga4gh_repo`` (or ``python repo_dev.py`` if
+developing). Following are descriptions of the commands that the repo manager
+exposes.
 
 +++++++
 init
 +++++++
 
 Initializes a data repository at the path provided.  All of the other
-commands require a data repository path as an argument, so this will likely be
+commands require a data repository file as an argument, so this will likely be
 the first command you run.
 
 .. code-block:: bash
 
-    $ ga4gh_repo init path/to/datarepo
+    $ ga4gh_repo init path/to/repo.db
+
 
 +++++++
-check
+verify
 +++++++
 
 Performs some consistency checks on the given data repository to ensure it is
 well-formed.
 
 .. code-block:: bash
 
-    $ ga4gh_repo check path/to/datarepo
+    $ ga4gh_repo verify path/to/repo.db
 
 +++++++
 list
@@ -230,17 +55,7 @@ Lists the contents of the given data repository.
 
 .. code-block:: bash
 
-    $ ga4gh_repo list path/to/datarepo
-
-+++++++
-destroy
-+++++++
-
-Destroys the given data repository by deleting its directory tree.
-
-.. code-block:: bash
-
-    $ ga4gh_repo destroy path/to/datarepo
+    $ ga4gh_repo list path/to/repo.db
 
 +++++++++++
 add-dataset
@@ -250,7 +65,7 @@ Creates a dataset in the given repository with a given name.
 
 .. code-block:: bash
 
-    $ ga4gh_repo add-dataset path/to/datarepo aDataset
+    $ ga4gh_repo add-dataset path/to/repo.db aDataset
 
 +++++++++++++++
 remove-dataset
@@ -260,18 +75,17 @@ Destroys a dataset in the given repository with a given name.
 
 .. code-block:: bash
 
-    $ ga4gh_repo remove-dataset path/to/datarepo aDataset
+    $ ga4gh_repo remove-dataset path/to/repo.db aDataset
 
 ++++++++++++++++
 add-referenceset
 ++++++++++++++++
 
-Adds a given reference set file to a given data repository.  The file must
-have the extension ``.fa.gz``.
+Adds a given reference set file to a given data repository.
 
 .. code-block:: bash
 
-    $ ga4gh_repo add-referenceset path/to/datarepo path/to/aReferenceSet.fa.gz
+    $ ga4gh_repo add-referenceset path/to/repo.db path/to/aReferenceSet.fa.gz
 
 ++++++++++++++++++++
 remove-referenceset
@@ -281,30 +95,30 @@ Removes a given reference set from a given data repository.
 
 .. code-block:: bash
 
-    $ ga4gh_repo remove-referenceset path/to/datarepo aReferenceSet
+    $ ga4gh_repo remove-referenceset path/to/repo.db aReferenceSet
 
 ++++++++++++++++
-add-ontologymap
+add-ontology
 ++++++++++++++++
 
-Adds an Ontology Map, which maps identifiers to ontology terms, to 
+Adds an Ontology Map, which maps identifiers to ontology terms, to
 the repository. Ontology maps are tab delimited files with an
 identifier/term pair per row.
 
 
 .. code-block:: bash
 
-    $ ga4gh_repo add-ontologymap path/to/datarepo path/to/aOntoMap.txt
+    $ ga4gh_repo add-ontology path/to/repo.db path/to/aOntoMap.txt
 
 ++++++++++++++++++++
-remove-ontologymap
+remove-ontology
 ++++++++++++++++++++
 
 Removes a given Ontology Map from a given data repository.
 
 .. code-block:: bash
 
-    $ ga4gh_repo remove-ontologymap path/to/datarepo aOntoMap
+    $ ga4gh_repo remove-ontology path/to/repo.db aOntoMap
 
 
 +++++++++++++++++
@@ -316,7 +130,7 @@ file must have the extension ``.bam``.
 
 .. code-block:: bash
 
-    $ ga4gh_repo add-readgroupset path/to/datarepo aDataset path/to/aReadGroupSet.bam
+    $ ga4gh_repo add-readgroupset path/to/repo.db aDataset path/to/aReadGroupSet.bam
 
 ++++++++++++++++++++
 remove-readgroupset
@@ -326,18 +140,19 @@ Removes a read group set from a given data repository and dataset.
 
 .. code-block:: bash
 
-    $ ga4gh_repo remove-readgroupset path/to/datarepo aDataset aReadGroupSet
+    $ ga4gh_repo remove-readgroupset path/to/repo.db aDataset aReadGroupSet
 
 +++++++++++++++
 add-variantset
 +++++++++++++++
 
 Adds a variant set directory to a given data repository and dataset.  The
-directory should contain file(s) with extension ``.vcf.gz``. If a variant set is annotated it will be added as both a variant set and a variant annotation set.
+directory should contain file(s) with extension ``.vcf.gz``. If a variant set
+is annotated it will be added as both a variant set and a variant annotation set.
 
 .. code-block:: bash
 
-    $ ga4gh_repo add-variantset path/to/datarepo aDataset path/to/aVariantSet
+    $ ga4gh_repo add-variantset path/to/repo.db aDataset path/to/aVariantSet
 
 +++++++++++++++++
 remove-variantset
@@ -347,7 +162,7 @@ Removes a variant set from a given data repository and dataset.
 
 .. code-block:: bash
 
-    $ ga4gh_repo remove-variantset path/to/datarepo aDataset aVariantSet
+    $ ga4gh_repo remove-variantset path/to/repo.db aDataset aVariantSet
 
 ------------------
 Configuration file
@@ -357,12 +172,12 @@ The GA4GH reference server is a `Flask application <http://flask.pocoo.org/>`_
 and uses the standard `Flask configuration file mechanisms
 <http://flask.pocoo.org/docs/0.10/config/>`_.
 Many configuration files will be very simple, and will consist of just
-one directive instructing the server where to look for data; for
+one directive instructing the server where to find the data repository; for
 example, we might have
 
 .. code-block:: python
 
-    DATA_SOURCE = "/path/to/data/root"
+    DATA_SOURCE = "/path/to/repo.db"
 
 For production deployments, we shouldn't need to add any more configuration
 than this, as the other keys have sensible defaults. However,
@@ -413,7 +228,7 @@ RESPONSE_VALIDATION
     purposes.
 
 LANDING_MESSAGE_HTML
-    The server provides a simple landing page at its root. By setting this 
+    The server provides a simple landing page at its root. By setting this
     value to point at a file containing an HTML block element it is possible to
     customize the landing page. This can be helpful to provide support links
     or details about the hosted datasets.
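
Note on the new repository format: the updated docs describe the repository as a single SQLite file, so it should be possible to inspect one with nothing but the Python standard library. The following is a minimal sketch under that assumption, written for Python 3. The helper name ``list_repo_tables`` is hypothetical and not part of ``ga4gh_repo``; the sketch makes no assumption about which tables the server creates and only lists whatever is present.

```python
import sqlite3
from pathlib import Path


def list_repo_tables(repo_path):
    """Return the table names found in a SQLite file, sorted by name.

    Hypothetical helper for illustration only; it is not part of the
    GA4GH server or of ga4gh_repo.
    """
    # Build a file: URI and open it read-only so that inspecting the
    # repository cannot modify it.
    uri = Path(repo_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


# Example usage, assuming a repository was created beforehand with
# `ga4gh_repo init path/to/repo.db`:
#
#     print(list_repo_tables("path/to/repo.db"))
```

Opening the file read-only is a deliberate choice here: the repository manager remains the only tool that writes to the file, and a typo in the path raises an error instead of silently creating an empty database.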