Merge pull request chop-dbhi#353 from cbmi/varify-data-warehouse
Update varify to leverage varify-data-warehouse
bruth committed Aug 11, 2014
2 parents 61ce444 + 90568e9 commit 8b3ccf8
Showing 223 changed files with 94 additions and 28,313 deletions.
148 changes: 2 additions & 146 deletions README.md
@@ -2,6 +2,8 @@

[![Build Status](https://travis-ci.org/cbmi/varify.png?branch=master)](https://travis-ci.org/cbmi/varify) [![Coverage Status](https://coveralls.io/repos/cbmi/varify/badge.png)](https://coveralls.io/r/cbmi/varify)

More information on the data models, commands, and loader can be found on the [varify-data-warehouse repository](https://github.com/cbmi/varify-data-warehouse).

## Need some help?
Join our chat room and speak with our dev team: http://www.hipchat.com/gZcKr0p3y

@@ -198,149 +200,3 @@ make sass collect watch
```

_Note, the `sass` target is called first to ensure the compiled files exist before attempting to collect them._

## Pipeline

The following describes the steps to execute the loading pipeline, the performance of the pipeline, and the process behind it.


#### NOTE: All VCF files being loaded into Varify must be annotated with the [CBMi fork of SnpEff](https://github.com/CBMi-BiG/snpEff). The key difference is that the CBMi fork attempts to generate valid HGVS names for insertions and deletions, including those that require "walking and rolling" to identify the correct indel frame, while the standard SnpEff version contains only a partial implementation of HGVS notation [as noted here](http://snpeff.sourceforge.net/SnpEff_manual.html#filters).

### Retrieving Test Data

We have provided a [set of test data](https://github.com/cbmi/varify-demo-data) for testing the load pipeline or for use as sample data when first standing up your Varify instance. To use the test data, run the commands below.

```bash
wget https://github.com/cbmi/varify-demo-data/archive/0.1.tar.gz -O varify-demo-data-0.1.tar.gz
tar -zxf varify-demo-data-0.1.tar.gz
gunzip varify-demo-data-0.1/CEU.trio.2010_03.genotypes.annotated.vcf.gz
```

At this point, the VCF and MANIFEST in the `varify-demo-data-0.1` directory are ready for loading in the pipeline. If you just want to load this test data, pass the `varify-demo-data-0.1` directory as the argument to the `samples queue` command in the _Queue Samples_ step below.

#### Tmux (optional)

Since the pipeline can take a while to load large collections (see the Performance section below), you may want to follow the Tmux steps to attach to and detach from the load process.

[Tmux](http://robots.thoughtbot.com/post/2641409235/a-tmux-crash-course) is like [screen](http://www.gnu.org/software/screen/), just newer. It is useful for detaching/reattaching sessions with long running processes.

**New Session**

```bash
tmux
```

**Existing Session**

```bash
tmux attach -t 0 # first session
```

### Activate Environment

```bash
source bin/activate
```

#### Define RQ_QUEUES

For this example, we will assume you have `redis-server` running on `localhost:6379` against the database with index 0. If you have Redis running elsewhere, simply update the settings below with the address and DB you wish to use. Open your `local_settings.py` file and add the following setting:

```python
RQ_QUEUES = {
    'default': {
        'HOST': 'localhost',
        'PORT': 6379,
        'DB': 0,
    },
    'samples': {
        'HOST': 'localhost',
        'PORT': 6379,
        'DB': 0,
    },
    'variants': {
        'HOST': 'localhost',
        'PORT': 6379,
        'DB': 0,
    },
}
```

#### Queue Samples

Optionally specify a directory; otherwise, the command recursively scans all directories defined in the `VARIFY_SAMPLE_DIRS` setting in the Varify project.

```bash
./bin/manage.py samples queue [directory]
```
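For reference, the scan that `samples queue` performs can be pictured roughly as follows. This is an illustrative sketch, not Varify's actual implementation: the function name and the assumption that a loadable sample directory contains a VCF plus a `MANIFEST` file are hypothetical, inferred from the demo-data layout above.

```python
import os

def find_loadable_sample_dirs(roots):
    """Rough sketch: walk each configured sample directory and collect
    subdirectories containing both a VCF and a MANIFEST file. The
    file-name conventions here are assumptions for illustration only."""
    loadable = []
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            has_vcf = any(f.endswith(('.vcf', '.vcf.gz')) for f in filenames)
            has_manifest = 'MANIFEST' in filenames
            if has_vcf and has_manifest:
                loadable.append(dirpath)
    return loadable
```

Passing an explicit directory restricts the scan to that directory instead of everything under `VARIFY_SAMPLE_DIRS`.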

#### Kick Off Workers

You can technically start as many workers of each type as you like to load data in parallel, but this may cause undesired database contention that could actually slow down the loading process. A single worker for `variants` is generally preferred, and two or three are suitable for the `default` type.

```bash
./bin/manage.py rqworker variants &
./bin/manage.py rqworker default &
```

Note that these workers run forever. If only a single sample is being loaded, the `--burst` argument can be used to terminate a worker once there are no more items left in its queue.

#### Monitor Workers

You can monitor the workers and the queues using `rq-dashboard` or `rqinfo`. Information on setting up and using those tools can be found [here](http://python-rq.org/docs/monitoring/).

#### Post-Load

After the batch of samples has been loaded, two more commands need to be executed to update the annotations and cohort frequencies. These are performed _post-load_ for performance reasons.

```bash
./bin/manage.py variants load --evs --1000g --sift --polyphen2 > variants.load.txt 2>&1 &
./bin/manage.py samples allele-freqs > samples.allele-freqs.txt 2>&1 &
```

### Performance

- File size: 610 MB
- Variant count: 1,794,055

#### Baseline

Iteration over the flat file (no parsing) with batch counting (every 1,000)

- Time: 80 seconds
- Memory: 0

#### Baseline VCF

Iteration over the VCF file, parsed using PyVCF

- Time: 41 minutes (extrapolated)
- Memory: 246 KB

### Parallelized Queue/Worker Process

#### Summary of Workflow

1. Fill Queue
2. Spawn Worker(s)
3. Consume Job(s)
- Validate Input
- (work)
- Validate Output
- Commit
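The steps above can be sketched in miniature. This is an illustrative stand-in, with Python's in-process `queue.Queue` playing the role of the Redis-backed queues and a trivial transform standing in for the real loading work:

```python
from queue import Empty, Queue

def fill_queue(jobs):
    """Step 1: fill the queue with pending jobs."""
    q = Queue()
    for job in jobs:
        q.put(job)
    return q

def work(q, committed):
    """Steps 2-3: a worker consumes jobs until the queue is empty
    (burst-style), validating input and output around the actual work."""
    while True:
        try:
            job = q.get_nowait()
        except Empty:
            return
        if not isinstance(job.get('variants'), list):  # validate input
            continue
        processed = [v.upper() for v in job['variants']]  # (work)
        if not processed:  # validate output
            continue
        committed.extend(processed)  # commit

q = fill_queue([{'variants': ['chr1:100a>g']}, {'variants': None}])
committed = []
work(q, committed)  # the malformed second job is dropped at input validation
```

In the real pipeline, multiple workers would consume from the same shared queue, which is where the parallelism constraints below come in.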

#### Parallelism Constraints

The COPY command is a single statement, which means the data being loaded is all-or-nothing. If multiple samples are loaded in parallel, it is likely they will have overlapping variants.

To prevent integrity errors, workers will need to consult one or more
centralized caches to check if the current variant has been _addressed_
already. If this is the case, the variant will be skipped by the worker.

This raises a second issue: a downstream job may depend on data that does not
yet exist because another worker has not yet committed its data. In this case,
non-matching jobs are queued up in the `deferred` queue, which can be run at a
later time, after the `default` queue is empty, or in parallel with the
`default` queue.
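The skip-if-addressed check and the deferral behavior can be illustrated with a small sketch. Here a plain `set` stands in for the centralized cache of _addressed_ variants, and lists stand in for the database and the Redis-backed `deferred` queue; the function names are hypothetical:

```python
addressed = set()   # centralized cache of variants already committed
deferred = []       # stand-in for the `deferred` queue
committed = []      # stand-in for the database table

def load_variant(variant):
    """Skip any variant another worker has already addressed, avoiding
    integrity errors from duplicate inserts."""
    if variant in addressed:
        return False
    addressed.add(variant)
    committed.append(variant)
    return True

def annotate_variant(variant):
    """Downstream job: if the variant row does not exist yet, defer the
    job for a later pass instead of failing."""
    if variant not in addressed:
        deferred.append(variant)
        return False
    return True

load_variant('chr1:100A>G')
load_variant('chr1:100A>G')      # overlapping variant from a second sample: skipped
annotate_variant('chr2:200C>T')  # dependency not committed yet: deferred
```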
3 changes: 3 additions & 0 deletions requirements.txt
@@ -12,6 +12,9 @@ rq-dashboard>=0.3.1
django-rq-dashboard
requests==2.2.0

# Data models and loader
git+git://github.com/cbmi/varify-data-warehouse.git

# Templatetags for tweaking rendered HTML via form fields and such
django-widget-tweaks

7 changes: 7 additions & 0 deletions setup.py
@@ -35,6 +35,7 @@
'django-reversion==1.6.6',
'diff-match-patch',
'pyvcf>=0.6.5',
'vdw'
]

if sys.version_info < (2, 7):
@@ -47,6 +48,12 @@
'tests.*']),
'include_package_data': True,
'install_requires': install_requires,
# This is a hack to get setuptools to install the latest version of the
# varify-data-warehouse from the github repo. Until varify-data-warehouse
# is released on pypi, we need to continue to install from github.
'dependency_links': [
'https://github.com/cbmi/varify-data-warehouse/archive/master.zip#egg=vdw', # noqa
],
'test_suite': 'test_suite',
'tests_require': ['httpretty'],
'author': '',
3 changes: 1 addition & 2 deletions test_suite.py
@@ -10,10 +10,9 @@
if not apps:
apps = [
'assessments',
'sample_load_process',
'south_tests',
'geneset_form',
'commands',
'resources'
]

management.call_command('test', *apps)
11 changes: 7 additions & 4 deletions tests/cases/assessments/tests.py
@@ -1,13 +1,16 @@
import json
from restlib2.http import codes
from ..base import AuthenticatedBaseTestCase
from varify.assessments.models import Assessment, Pathogenicity,\
ParentalResult, AssessmentCategory
from varify.samples.models import Result
from vdw.assessments.models import Assessment, Pathogenicity, ParentalResult, \
AssessmentCategory
from vdw.samples.models import Result


class AssessmentResourceTestCase(AuthenticatedBaseTestCase):
fixtures = ['initial_data.json']
# TODO: Trim this test data down, it loads a lot more than is really
# necessary. All that is really needed is a single result with all the
# related data needed for that result to exist.
fixtures = ['test_data.json']

def setUp(self):
# Create and record some data
4 changes: 2 additions & 2 deletions tests/cases/geneset_form/tests.py
@@ -1,7 +1,7 @@
from django.test import TestCase
from varify.genes.forms import GeneSetBulkForm
from varify.genes.models import Gene, GeneSet
from varify.genome.models import Chromosome
from vdw.genes.models import Gene, GeneSet
from vdw.genome.models import Chromosome


class GeneSetBulkFormTestCase(TestCase):
File renamed without changes.
File renamed without changes.
1 change: 1 addition & 0 deletions tests/cases/resources/tests/__init__.py
@@ -0,0 +1 @@
from .gene_rank import * # noqa
