Merge pull request ga4gh#1236 from jeromekelleher/ontology-docs
Ontology docs
david4096 committed May 20, 2016
2 parents ef4f084 + e5553f2 commit 44d7355
Showing 4 changed files with 77 additions and 57 deletions.
49 changes: 30 additions & 19 deletions docs/configuration.rst
@@ -57,58 +57,60 @@ when creating a new GA4GH repository.
$ ga4gh_repo init registry.db
+++++++
verify
list
+++++++

The ``verify`` command is used to check the integrity of the
data in a repository. The command checks each container object in turn
and ensures that it can read data from it. Read errors can occur for
any number of reasons (for example, a VCF file may have been moved
to another location since it was added to the registry), and the
``verify`` command allows an administrator to check that all is
well in their repository.
The ``list`` command is used to print the contents of a repository
to the screen. It is an essential tool for administrators to
understand the structure of the repository that they are managing.

.. note:: The ``verify`` command is under development and will
.. note:: The ``list`` command is under development and will
be much more sophisticated in the future. In particular, the output
of this command should improve considerably in the near future.

.. argparse::
:module: ga4gh.cli
:func: getRepoManagerParser
:prog: ga4gh_repo
:path: verify
:path: list
:nodefault:

**Examples:**

.. code-block:: bash
$ ga4gh_repo verify registry.db
$ ga4gh_repo list registry.db
+++++++
list
verify
+++++++

The ``list`` command is used to print the contents of a repository
to the screen. It is an essential tool for administrators to
understand the structure of the repository that they are managing.
The ``verify`` command is used to check the integrity of the
data in a repository. The command checks each container object in turn
and ensures that it can read data from it. Read errors can occur for
any number of reasons (for example, a VCF file may have been moved
to another location since it was added to the registry), and the
``verify`` command allows an administrator to check that all is
well in their repository.

.. note:: The ``list`` command is under development and will
.. note:: The ``verify`` command is under development and will
be much more sophisticated in the future. In particular, the output
of this command should improve considerably in the near future.

.. argparse::
:module: ga4gh.cli
:func: getRepoManagerParser
:prog: ga4gh_repo
:path: list
:path: verify
:nodefault:

**Examples:**

.. code-block:: bash
$ ga4gh_repo list registry.db
$ ga4gh_repo verify registry.db
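
The kind of check that ``verify`` describes (visit each container in turn and confirm that data can be read from it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ``verify_paths`` is a hypothetical helper, not part of the ``ga4gh`` code base, it assumes Python 3, and it tests plain file readability rather than the container objects the real command inspects.

```python
import os
import tempfile


def verify_paths(paths):
    """Try to read one byte from each path.

    Returns a list of (path, error message) tuples for paths that could
    not be read. Hypothetical helper for illustration only.
    """
    failures = []
    for path in paths:
        try:
            with open(path, "rb") as handle:
                handle.read(1)
        except OSError as error:
            # A moved or deleted file ends up here.
            failures.append((path, str(error)))
    return failures


with tempfile.TemporaryDirectory() as tmp:
    # One file that exists and one that has been "moved" away.
    present = os.path.join(tmp, "present.vcf")
    with open(present, "wb") as handle:
        handle.write(b"##fileformat=VCFv4.1\n")
    missing = os.path.join(tmp, "moved.vcf")

    failures = verify_paths([present, missing])
    failed_paths = [path for path, _ in failures]
```

Here only the path that no longer exists is reported, which mirrors the moved-VCF case mentioned in the description.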
+++++++++++
add-dataset
@@ -169,7 +171,13 @@ Adds a reference set used in the 1000 Genomes project using the name
add-ontology
++++++++++++++++

.. todo:: add docs for adding ontologies.
Adds a new ontology to the repository. The ontology supplied must be a text
file in `OBO format
<http://owlcollab.github.io/oboformat/doc/obo-syntax.html>`_. If you wish to
serve sequence or variant annotations from a repository, a sequence ontology
(SO) instance is required to translate ontology term names held in annotations
to ontology IDs. Sequence ontology definitions can be downloaded from
the `Sequence Ontology site <https://github.com/The-Sequence-Ontology/SO-Ontologies>`_.

.. argparse::
:module: ga4gh.cli
@@ -184,6 +192,9 @@ add-ontology
$ ga4gh_repo add-ontology registry.db path/to/so-xp.obo
Adds the sequence ontology ``so-xp.obo`` to the repository using the
default naming rules.
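
To make the name-to-ID translation concrete, here is a minimal sketch of how term names could be mapped to term IDs from the ``[Term]`` stanzas of an OBO file. It is not the server's parser: ``parse_obo_terms`` is a hypothetical function, and it assumes each stanza lists ``id:`` before ``name:`` and ignores every other tag. The example term is the one quoted in the previous ``add-ontology`` help text.

```python
def parse_obo_terms(lines):
    """Return a dict mapping term names to term IDs.

    Only [Term] stanzas are read; other stanzas (for example [Typedef])
    and all tags other than id/name are skipped. Illustrative only.
    """
    name_to_id = {}
    in_term = False
    term_id = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; reset the state for it.
            in_term = (line == "[Term]")
            term_id = None
        elif in_term and line.startswith("id:"):
            term_id = line[len("id:"):].strip()
        elif in_term and line.startswith("name:") and term_id is not None:
            name_to_id[line[len("name:"):].strip()] = term_id
    return name_to_id


example = """\
format-version: 1.2

[Term]
id: SO:0000024
name: sarcin_like_RNA_motif

[Typedef]
id: part_of
name: part_of
""".splitlines()

terms = parse_obo_terms(example)
print(terms["sarcin_like_RNA_motif"])  # SO:0000024
```

A lookup table of this shape is what lets an annotation that only carries a term name be reported with its ontology ID.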

+++++++++++++++
add-variantset
+++++++++++++++
Expand Down
65 changes: 38 additions & 27 deletions docs/demo.rst
@@ -59,8 +59,8 @@ Now we can download some example data, which we'll use for our demo:

.. code-block:: bash
(ga4gh-env) $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.0.tar
(ga4gh-env) $ tar -xvf ga4gh-example-data-v4.0.tar
(ga4gh-env) $ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.1.tar
(ga4gh-env) $ tar -xvf ga4gh-example-data-v4.1.tar
After extracting the data, we can then run the ``ga4gh_server`` application:

@@ -231,13 +231,13 @@ Repo administrator CLI

The CLI has methods for adding and removing Feature Sets, Read Group
Sets, Variant Sets, etc. Before we can begin adding files we must first
initialize an empty registry database. The directory that this database
is in should be readable and writable by the current user, as well as the
initialize an empty registry database. The directory that this database
is in should be readable and writable by the current user, as well as the
user running the server.

.. code-block:: bash
ga4gh_repo init registry.db
$ ga4gh_repo init registry.db
This command will create a file ``registry.db`` in the current working
directory. This file should stay relatively small (a few MB for
@@ -249,7 +249,8 @@ description using the ``--description`` flag.

.. code-block:: bash
ga4gh_repo add-dataset registry.db 1kgenomes --description "Variants from the 1000 Genomes project and GENCODE genes annotations"
$ ga4gh_repo add-dataset registry.db 1kgenomes \
--description "Variants from the 1000 Genomes project and GENCODE genes annotations"
Add a Reference Set
-------------------
@@ -260,56 +261,62 @@ used for the 1000 Genomes VCF.

.. code-block:: bash
wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
$ wget ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz
This file is provided in ``.gz`` format. We will decompress it and
then, with samtools installed on the system, recompress it using
``bgzip``.

.. code-block:: bash
gunzip hs37d5.fa.gz
bgzip hs37d5.fa
$ gunzip hs37d5.fa.gz
$ bgzip hs37d5.fa
This may take a few minutes depending on your system as this file is
around 3GB. Next, we will add the reference set.

.. code-block:: bash
ga4gh_repo add-referenceset registry.db /full/path/to/hs37d5.fa.gz \
$ ga4gh_repo add-referenceset registry.db /full/path/to/hs37d5.fa.gz \
-d "NCBI37 assembly of the human genome" --ncbiTaxonId 9606 --name NCBI37 \
--sourceUri "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5.fa.gz"
A number of optional command line flags have been added. We will be
referring to the name of this reference set ``NCBI37`` when we later add
the variant set.

Add an ontology TODO
--------------------
Add an ontology
---------------

Ontologies provide a source for parsing variant annotations, as well as
organizing feature types into ontology terms. This is a custom format
created for this server.
organizing feature types into ontology terms. A `sequence ontology
<http://www.sequenceontology.org/>`_ instance must be added to the repository
to translate ontology term names in sequence and variant annotations to IDs.
Sequence ontology definitions can be downloaded from the `Sequence Ontology
site <https://github.com/The-Sequence-Ontology/SO-Ontologies>`_.

.. code-block:: bash
ga4gh_repo add-ontology registry.db /full/path/to/sequence_ontology.txt
$ wget https://raw.githubusercontent.com/The-Sequence-Ontology/SO-Ontologies/master/so-xp.obo
$ ga4gh_repo add-ontology registry.db /full/path/to/so-xp.obo -n so-xp
Add sequence annotations
------------------------

The GENCODE Genes dataset provides annotations for features on the
reference assembly. The server uses a custom storage format for sequence
annotations; you can download a prepared set
reference assembly. The server uses a custom storage format for sequence
annotations; you can download a prepared set
`here <https://ga4ghstore.blob.core.windows.net/testing/gencode_v24lift37.db>`__.
It can be added to the registry using the following command. Notice
we have told the registry to associate the reference set added above
It can be added to the registry using the following command. Notice
we have told the registry to associate the reference set added above
with these annotations.

.. code-block:: bash
ga4gh_repo add-featureset registry.db 1kgenomes /full/path/to/gencode.v24lift37.annotation.db --referenceSetName NCBI37
$ ga4gh_repo add-featureset registry.db 1kgenomes /full/path/to/gencode.v24lift37.annotation.db \
--referenceSetName NCBI37 --ontologyName so-xp
.. todo:: Demonstrate how to generate your own sequence annotations database.

Add the 1000 Genomes VCFs
@@ -321,20 +328,21 @@ release.

.. code-block:: bash
wget -m ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ -nd -P release -l 1
$ wget -m ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ -nd -P release -l 1
rm release/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz
These files are already compressed and indexed. For the server to make use
of the files in this directory we must remove the `wgs` file, since it covers
chromosomes that are represented elsewhere and overlapping VCFs are not
of the files in this directory we must remove the `wgs` file, since it covers
chromosomes that are represented elsewhere and overlapping VCFs are not
currently supported. This file could be added as a separate variant set.

We can now add the directory to the registry using the following command.
Again, notice we have referred to the reference set by name.

.. code-block:: bash
ga4gh_repo add-variantset registry.db 1kgenomes /full/path/to/release/ --name phase3-release --referenceSetName NCBI37
$ ga4gh_repo add-variantset registry.db 1kgenomes /full/path/to/release/ \
--name phase3-release --referenceSetName NCBI37
Add a BAM as a Read Group Set
-----------------------------
@@ -345,8 +353,11 @@ We will first download the index and then add it to the registry.

.. code-block:: bash
wget http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
ga4gh_repo add-readgroupset registry.db 1kgenomes "http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam" -I "HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai" --referenceSetName NCBI37
$ wget http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
$ ga4gh_repo add-readgroupset registry.db 1kgenomes \
-I HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai \
--referenceSetName NCBI37 \
http://s3.amazonaws.com/1000genomes/phase3/data/HG00096/alignment/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
This might take a moment as some metadata about the file will be
retrieved from S3.
Expand Down
6 changes: 3 additions & 3 deletions docs/installation.rst
@@ -62,8 +62,8 @@ Download and unpack the example data:

.. code-block:: bash
$ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.0.tar
$ tar -xf ga4gh-example-data-v4.0.tar
$ wget https://github.com/ga4gh/server/releases/download/data/ga4gh-example-data-v4.1.tar
$ tar -xf ga4gh-example-data-v4.1.tar
Create the WSGI file at ``/srv/ga4gh/application.wsgi`` and write the following
contents:
@@ -172,7 +172,7 @@ Troubleshooting
Server errors will be output to the web server's error log by default (in Apache on
Debian/Ubuntu, for example, this is ``/var/log/apache2/error.log``). Each client
request will be logged to the web server's access log (in Apache on Debian/Ubuntu
this is ``/var/log/apache2/access.log``).
this is ``/var/log/apache2/access.log``).

For more server configuration options see :ref:`Configuration`

Expand Down
14 changes: 6 additions & 8 deletions ga4gh/cli.py
@@ -2051,18 +2051,16 @@ def getParser(cls):

addOntologyParser = addSubparser(
subparsers, "add-ontology",
"Adds an ontology to the repo. Currently ontology support "
"consists of a map between ontology term IDs and names "
"stored in a tab-delimited text file. For example, in "
"Sequence Ontology, we map from the term ID 'SO:0000024' "
"to the name 'sarcin_like_RNA_motif'. ")
"Adds an ontology in OBO format to the repo. Currently, "
"a sequence ontology (SO) instance is required to translate "
"ontology term names held in annotations to ontology IDs. "
"Sequence ontology files can be found at "
"https://github.com/The-Sequence-Ontology/SO-Ontologies")
addOntologyParser.set_defaults(runner="addOntology")
cls.addRepoArgument(addOntologyParser)
cls.addFilePathArgument(
addOntologyParser,
"The path to the text file used to define the ontology term "
"map to use. This must be a tab-delimited text file consisting "
"of ontology term IDs and names.")
"The path of the OBO file defining this ontology.")
cls.addNameOption(addOntologyParser, "ontology")

removeOntologyParser = addSubparser(
Expand Down
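
For readers unfamiliar with how such a subcommand is wired up, the following standalone sketch shows roughly equivalent behaviour using plain ``argparse``. It does not use the project's ``addSubparser``, ``addRepoArgument``, ``addFilePathArgument`` or ``addNameOption`` helpers, whose bodies are not shown in this diff; the argument names ``registryPath`` and ``filePath`` and the abbreviated help strings are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser with a single add-ontology subcommand (illustrative)."""
    parser = argparse.ArgumentParser(prog="ga4gh_repo")
    subparsers = parser.add_subparsers(dest="subcommand")

    add_ontology = subparsers.add_parser(
        "add-ontology",
        help="Adds an ontology in OBO format to the repo.")
    # Positional arguments: the registry, then the OBO file to add.
    add_ontology.add_argument(
        "registryPath", help="The path of the registry database.")
    add_ontology.add_argument(
        "filePath", help="The path of the OBO file defining this ontology.")
    # Optional name; the real default naming rules are not reproduced here.
    add_ontology.add_argument(
        "-n", "--name", default=None, help="The name of the ontology.")
    # Same pattern as the diff: record which runner handles the command.
    add_ontology.set_defaults(runner="addOntology")
    return parser


args = build_parser().parse_args(
    ["add-ontology", "registry.db", "path/to/so-xp.obo", "-n", "so-xp"])
```

Parsing the demo command line this way yields the runner name together with the registry path, the OBO path and the chosen ontology name.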
