Merge pull request ga4gh#1236 from jeromekelleher/ontology-docs

Ontology docs
ejacox · May 20, 2016 · 44d7355
2 parents ef4f084 + e5553f2
commit 44d7355

Showing 4 changed files with 77 additions and 57 deletions.
diff --git a/docs/configuration.rst b/docs/configuration.rst
@@ -57,58 +57,60 @@ when creating a new GA4GH repository.
     $ ga4gh_repo init registry.db
 
 +++++++
-verify
+list
 +++++++
 
-The ``verify`` command is used to check that the integrity of the
-data in a repository. The command checks each container object in turn
-and ensures that it can read data from it. Read errors can occur for
-any number of reasons (for example, a VCF file may have been moved
-to another location since it was added to the registry), and the
-``verify`` command allows an administrator to check that all is
-well in their repository.
+The ``list`` command is used to print the contents of a repository
+to the screen. It is an essential tool for administrators to
+understand the structure of the repository that they are managing.
 
-.. note:: The ``verify`` command is under development and will
+.. note:: The ``list`` command is under development and will
    be much more sophisticated in the future. In particular, the output
    of this command should improve considerably in the near future.
 
 .. argparse::
    :module: ga4gh.cli
    :func: getRepoManagerParser
    :prog: ga4gh_repo
-   :path: verify
+   :path: list
    :nodefault:
 
 **Examples:**
 
 .. code-block:: bash
 
-    $ ga4gh_repo verify registry.db
+    $ ga4gh_repo list registry.db
+
 
 +++++++
-list
+verify
 +++++++
 
-The ``list`` command is used to print the contents of a repository
-to the screen. It is an essential tool for administrators to
-understand the structure of the repository that they are managing.
+The ``verify`` command is used to check the integrity of the
+data in a repository. The command checks each container object in turn
+and ensures that it can read data from it. Read errors can occur for
+any number of reasons (for example, a VCF file may have been moved
+to another location since it was added to the registry), and the
+``verify`` command allows an administrator to check that all is
+well in their repository.
 
-.. note:: The ``list`` command is under development and will
+.. note:: The ``verify`` command is under development and will
    be much more sophisticated in the future. In particular, the output
    of this command should improve considerably in the near future.
 
 .. argparse::
    :module: ga4gh.cli
    :func: getRepoManagerParser
    :prog: ga4gh_repo
-   :path: list
+   :path: verify
    :nodefault:
 
 **Examples:**
 
 .. code-block:: bash
 
-    $ ga4gh_repo list registry.db
+    $ ga4gh_repo verify registry.db
+
 
 +++++++++++
 add-dataset
@@ -169,7 +171,13 @@ Adds a reference set used in the 1000 Genomes project using the name
 add-ontology
 ++++++++++++++++
 
-.. todo:: add docs for adding ontologies.
+Adds a new ontology to the repository. The ontology supplied must be a text
+file in `OBO format
+<http://owlcollab.github.io/oboformat/doc/obo-syntax.html>`_. If you wish to
+serve sequence or variant annotations from a repository, a sequence ontology
+(SO) instance is required to translate ontology term names held in annotations
+to ontology IDs. Sequence ontology definitions can be downloaded from
+the `Sequence Ontology site <https://github.com/The-Sequence-Ontology/SO-Ontologies>`_.
 
 .. argparse::
    :module: ga4gh.cli
@@ -184,6 +192,9 @@ add-ontology
 
     $ ga4gh_repo add-ontology registry.db path/to/so-xp.obo
 
+Adds the sequence ontology ``so-xp.obo`` to the repository using the
+default naming rules.
+
 +++++++++++++++
 add-variantset
 +++++++++++++++

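The name-to-ID translation described in the new ``add-ontology`` text can be illustrated with a minimal Python sketch. It is not the server's implementation: the function name ``parse_obo_term_names`` and the inline sample are hypothetical, and it handles only the ``id`` and ``name`` tags of ``[Term]`` stanzas in an OBO file.

```python
def parse_obo_term_names(lines):
    """Map term names to term IDs for the [Term] stanzas of an OBO file.

    Hypothetical helper for illustration only; the OBO tags it ignores
    (synonyms, is_a, is_obsolete, ...) matter in real use.
    """
    name_to_id = {}
    stanza = None   # type of the current stanza, e.g. "Term" or "Typedef"
    fields = {}     # tag -> first value seen in the current stanza

    def store(stanza_type, stanza_fields):
        # Only [Term] stanzas that carry both an id and a name are kept.
        if stanza_type == "Term" and "id" in stanza_fields and "name" in stanza_fields:
            name_to_id[stanza_fields["name"]] = stanza_fields["id"]

    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and whole-line comments
        if line.startswith("[") and line.endswith("]"):
            store(stanza, fields)  # close the previous stanza, if any
            stanza = line[1:-1]
            fields = {}
        elif stanza is not None and ":" in line:
            tag, _, value = line.partition(":")
            value = value.split(" ! ")[0].strip()  # drop trailing comments
            fields.setdefault(tag.strip(), value)
    store(stanza, fields)  # close the final stanza
    return name_to_id


# Made-up two-term sample in OBO syntax, not a real ontology file.
SAMPLE = """\
format-version: 1.2

[Term]
id: SO:0000024
name: sarcin_like_RNA_motif

[Term]
id: SO:0001583
name: missense_variant

[Typedef]
id: part_of
name: part_of
"""

if __name__ == "__main__":
    for name, term_id in sorted(parse_obo_term_names(SAMPLE.splitlines()).items()):
        print(name, term_id)
```

An annotation that holds the term name ``missense_variant`` could then be resolved to ``SO:0001583`` with a dictionary lookup; the ``[Typedef]`` stanza is skipped because it does not define a term.
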
diff --git a/docs/demo.rst b/docs/demo.rst
@@ -59,8 +59,8 @@ Now we can download some example data, which we'll use for our demo:
 
 .. code-block:: bash
 
-    (ga4gh-env) $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.0.tar
-    (ga4gh-env) $ tar -xvf ga4gh-example-data-v4.0.tar
+    (ga4gh-env) $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.1.tar
+    (ga4gh-env) $ tar -xvf ga4gh-example-data-v4.1.tar
 
 After extracting the data, we can then run the ``ga4gh_server`` application:
 
@@ -231,13 +231,13 @@ Repo administrator CLI
 
 The CLI has methods for adding and removing Feature Sets, Read Group
 Sets, Variant Sets, etc. Before we can begin adding files we must first
-initialize an empty registry database. The directory that this database 
-is in should be readable and writable by the current user, as well as the 
+initialize an empty registry database. The directory that this database
+is in should be readable and writable by the current user, as well as the
 user running the server.
 
 .. code-block:: bash
 
-    ga4gh_repo init registry.db
+    $ ga4gh_repo init registry.db
 
 This command will create a file ``registry.db`` in the current working
 directory. This file should stay relatively small (a few MB for
@@ -249,7 +249,8 @@ description using the ``--description`` flag.
 
 .. code-block:: bash
 
-    ga4gh_repo add-dataset registry.db 1kgenomes --description "Variants from the 1000 Genomes project and GENCODE genes annotations"
+    $ ga4gh_repo add-dataset registry.db 1kgenomes \
+        --description "Variants from the 1000 Genomes project and GENCODE genes annotations"
 
 Add a Reference Set
 -------------------
@@ -260,56 +261,62 @@ used for the 1000 Genomes VCF.
 
 .. code-block:: bash
 
-    wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
+    $ wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
 
 This file is provided in ``.gz`` format. We will decompress it and then,
 with samtools installed on the system, recompress it using
 ``bgzip``.
 
 .. code-block:: bash
 
-    gunzip hs37d5.fa.gz
-    bgzip hs37d5.fa
+    $ gunzip hs37d5.fa.gz
+    $ bgzip hs37d5.fa
 
 This may take a few minutes depending on your system as this file is
 around 3GB. Next, we will add the reference set.
 
 .. code-block:: bash
 
-    ga4gh_repo add-referenceset registry.db /full/path/to/hs37d5.fa.gz \
+    $ ga4gh_repo add-referenceset registry.db /full/path/to/hs37d5.fa.gz \
       -d "NCBI37 assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37 \
       --sourceUri "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
 
 A number of optional command line flags have been added. We will be
 referring to the name of this reference set ``NCBI37`` when we later add
 the variant set.
 
-Add an ontology TODO
---------------------
+Add an ontology
+---------------
 
 Ontologies provide a source for parsing variant annotations, as well as
-organizing feature types into ontology terms. This is a custom format
-created for this server.
+organizing feature types into ontology terms. A `sequence ontology
+<http://www.sequenceontology.org/>`_ instance must be added to the repository
+to translate ontology term names in sequence and variant annotations to IDs.
+Sequence ontology definitions can be downloaded from the `Sequence Ontology
+site <https://github.com/The-Sequence-Ontology/SO-Ontologies>`_.
 
 .. code-block:: bash
 
-    ga4gh_repo add-ontology registry.db /full/path/to/sequence_ontology.txt
+    $ wget https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp.obo
+    $ ga4gh_repo add-ontology registry.db /full/path/to/so-xp.obo -n so-xp
 
 Add sequence annotations
 ------------------------
 
 The GENCODE Genes dataset provides annotations for features on the
-reference assembly. The server uses a custom storage format for sequence 
-annotations, you can download a prepared set 
+reference assembly. The server uses a custom storage format for sequence
+annotations; you can download a prepared set
 `here <https://ga4ghstore.blob.core.windows.net/testing/gencode_v24lift37.db>`__.
-It can be added to the registry using the following command. Notice 
-we have told the registry to associate the reference set added above 
+It can be added to the registry using the following command. Notice
+we have told the registry to associate the reference set added above
 with these annotations.
 
 .. code-block:: bash
 
-    ga4gh_repo add-featureset registry.db 1kgenomes /full/path/to/gencode.v24lift37.annotation.db --referenceSetName NCBI37
-    
+    $ ga4gh_repo add-featureset registry.db 1kgenomes /full/path/to/gencode.v24lift37.annotation.db \
+        --referenceSetName NCBI37 --ontologyName so-xp
+
+
 .. todo:: Demonstrate how to generate your own sequence annotations database.
 
 Add the 1000 Genomes VCFs
@@ -321,20 +328,21 @@ release.
 
 .. code-block:: bash
 
-    wget -m ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ -nd -P release -l 1
+    $ wget -m ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ -nd -P release -l 1
     $ rm release/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
 
 These files are already compressed and indexed. For the server to make use
-of the files in this directory we must move the `wgs` file, since it covers 
-chromosomes that are represented elsewhere and overlapping VCF are not 
+of the files in this directory we must move the ``wgs`` file, since it covers
+chromosomes that are represented elsewhere and overlapping VCFs are not
 currently supported. This file could be added as a separate variant set.
 
 We can now add the directory to the registry using the following command.
 Again, notice we have referred to the reference set by name.
 
 .. code-block:: bash
 
-    ga4gh_repo add-variantset registry.db 1kgenomes /full/path/to/release/ --name phase3-release --referenceSetName NCBI37
+    $ ga4gh_repo add-variantset registry.db 1kgenomes /full/path/to/release/ \
+        --name phase3-release --referenceSetName NCBI37
 
 Add a BAM as a Read Group Set
 -----------------------------
@@ -345,8 +353,11 @@ We will first download the index and then add it to the registry.
 
 .. code-block:: bash
 
-    wget http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
-    ga4gh_repo add-readgroupset registry.db 1kgenomes "http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam" -I "HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai" --referenceSetName NCBI37
+    $ wget http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
+    $ ga4gh_repo add-readgroupset registry.db 1kgenomes \
+        -I HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
+        --referenceSetName NCBI37 \
+        http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
 
 This might take a moment as some metadata about the file will be
 retrieved from S3.

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -62,8 +62,8 @@ Download and unpack the example data:
 
 .. code-block:: bash
 
-  $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.0.tar
-  $ tar -xf ga4gh-example-data-v4.0.tar
+  $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.1.tar
+  $ tar -xf ga4gh-example-data-v4.1.tar
 
 Create the WSGI file at ``/srv/ga4gh/application.wsgi`` and write the following
 contents:
@@ -172,7 +172,7 @@ Troubleshooting
 Server errors will be output to the web server's error log by default (in Apache on
 Debian/Ubuntu, for example, this is ``/var/log/apache2/error.log``). Each client
 request will be logged to the web server's access log (in Apache on Debian/Ubuntu
-this is ``/var/log/apache2/access.log``). 
+this is ``/var/log/apache2/access.log``).
 
 For more server configuration options see :ref:`Configuration`.
 

diff --git a/ga4gh/cli.py b/ga4gh/cli.py
@@ -2051,18 +2051,16 @@ def getParser(cls):
 
         addOntologyParser = addSubparser(
             subparsers, "add-ontology",
-            "Adds an ontology to the repo. Currently ontology support "
-            "consists of a map between ontology term IDs and names "
-            "stored in a tab-delimited text file. For example, in "
-            "Sequence Ontology, we map from the term ID 'SO:0000024' "
-            "to the name 'sarcin_like_RNA_motif'. ")
+            "Adds an ontology in OBO format to the repo. Currently, "
+            "a sequence ontology (SO) instance is required to translate "
+            "ontology term names held in annotations to ontology IDs. "
+            "Sequence ontology files can be found at "
+            "https://github.com/The-Sequence-Ontology/SO-Ontologies")
         addOntologyParser.set_defaults(runner="addOntology")
         cls.addRepoArgument(addOntologyParser)
         cls.addFilePathArgument(
             addOntologyParser,
-            "The path to the text file used to define the ontology term "
-            "map to use. This must be a tab-delimited text file consisting "
-            "of ontology term IDs and names.")
+            "The path of the OBO file defining this ontology.")
         cls.addNameOption(addOntologyParser, "ontology")
 
         removeOntologyParser = addSubparser(