Merge branch 'devel' into frames_from_files_fixes

[ci skip]
markovmodel · Oct 21, 2016 · 5d52505
2 parents e6a69d9 + 4308906
commit 5d52505
Showing 14 changed files with 145 additions and 78 deletions.
diff --git a/.github/ISSUE_TEMPLATE.md b/.github/ISSUE_TEMPLATE.md
@@ -0,0 +1,8 @@
+Thanks for submitting an issue!
+
+Here's a quick checklist of what to include:
+
+- [ ] Include a detailed description of the bug or suggestion
+- [ ] `pip list` or `conda list` of the environment you are using (please attach a txt file to the issue).
+- [ ] PyEMMA version and operating system versions
+- [ ] Minimal example if possible, a Python script, zipped input data (if not too large)
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,8 @@
+Thanks for submitting a PR, your contribution is really appreciated!
+
+Here's a quick checklist that should be present in PRs:
+
+- [ ] Make sure to include one or more tests for your change
+- [ ] Add yourself to `AUTHORS`
+- [ ] Add a new entry to the `doc/source/CHANGELOG` (choose any open position to avoid merge conflicts with other PRs).
+      Decide whether your change is a fix or a new feature.
diff --git a/.project b/.project
diff --git a/.pydevproject b/.pydevproject
diff --git a/AUTHORS b/AUTHORS
@@ -0,0 +1,24 @@
+Main Authors
+============
+Benjamin Trendelkamp-Schroer
+Christoph Wehmeyer
+Fabian Paul
+Frank Noe
+Guillermo Pérez-Hernández
+Martin K. Scherer
+Moritz Hoffmann
+Jan-Hendrik Prinz
+
+
+
+Contributors
+============
+Alexandra La Fleur
+Antonia Meys
+Ariel Rokem
+Francesco Bonazzi
+Ismael Rodriguez Espigares
+John Chodera
+Josh Fass
+Stephan Doerr
+@vargaslo
diff --git a/README.rst b/README.rst
@@ -8,7 +8,7 @@ EMMA (Emma's Markov Model Algorithms)
    :target: https://pypi.python.org/pypi/pyemma
 .. image:: https://img.shields.io/pypi/dm/pyemma.svg
    :target: https://pypi.python.org/pypi/pyemma
-.. image:: https://anaconda.org/xavier/binstar/badges/downloads.svg
+.. image:: https://anaconda.org/omnia/badges/downloads.svg
    :target: https://anaconda.org/omnia/pyemma
 .. image:: https://anaconda.org/omnia/pyemma/badges/installer/conda.svg
    :target: https://conda.anaconda.org/omnia

diff --git a/doc/source/CHANGELOG.rst b/doc/source/CHANGELOG.rst
@@ -1,12 +1,17 @@
 Changelog
 =========
 
-2.2.7 (10-20-16)
+2.2.7 (10-21-16)
 ----------------
 
 **New features**:
 
-- coordinates: for lag < chunksize improved speed (50%) for TICA. #960
+- coordinates:
+   - for lag < chunksize improved speed (50%) for TICA. #960
+   - new config variable "coordinates_check_output" to test for "NaN" and "inf" values in
+     iterator output for every chunk. The option is disabled by default. It gives insight
+     during debugging where faulty values are introduced into the pipeline. #967
+
 
 **Fixes**:
 
@@ -142,6 +147,7 @@ Service release. Fixes some
   considerable high chunk size as well.
 
 **Fixes**:
+
 - In parallel environments (clusters with shared filesystem) there will be no
   crashes due to the config module, which tried to write files in users home
   directory. Config files are optional by now.
@@ -196,19 +202,18 @@ Service release. Fixes some
     (reported as Warnings).
 
 - coordinates:
-  - Completly re-designed class hierachy (user-code/API unaffected).
-  - Added trajectory info cache to avoid re-computing lengths, dimensions and
-    byte offsets of data sets.
-  - Random access strategies supported (eg. via slices).
-  - FeatureReader supports random access for XTC and TRR (in conjunction with mdtraj-1.6).
-  - Re-design API to support scikit-learn interface (fit, transform).
-  - Pipeline elements (former Transformer class) now uses iterator pattern to
-    obtain data and therefore supports now pipeline trees.
-  - pipeline elements support writing their output to csv files.
-  - TICA/PCA uses covartools to estimate covariance matrices.
-    - This now saves one pass over the data set.
-    - Supports sparsification data on the fly.
-
+    - Completely re-designed class hierarchy (user-code/API unaffected).
+    - Added trajectory info cache to avoid re-computing lengths, dimensions and
+      byte offsets of data sets.
+    - Random access strategies supported (eg. via slices).
+    - FeatureReader supports random access for XTC and TRR (in conjunction with mdtraj-1.6).
+    - Re-design API to support scikit-learn interface (fit, transform).
+    - Pipeline elements (former Transformer class) now use the iterator pattern to
+      obtain data and therefore now support pipeline trees.
+    - pipeline elements support writing their output to csv files.
+    - TICA/PCA uses covartools to estimate covariance matrices:
+        + This now saves one pass over the data set.
+        + Supports sparsification of data on the fly.
 
 **Fixes**:
 
@@ -348,7 +353,7 @@ reorganization of the code.
 - coordinates package: allow metrics to be passed to cluster algorithms.
 - coordinates package: cache trajectory lengths by default
                        (uncached led to 1 pass of reading for non indexed (XTC) formats).
-  This avoids re-reading e.g XTC files to determine their lengths.
+                       This avoids re-reading e.g. XTC files to determine their lengths.
 - coordinates package: enable passing chunk size to readers and pipelines in API.
 - coordinates package: assign_to_centers now allows all supported file formats as centers input.
 - coordinates package: save_traj(s) now handles stride parameter.

diff --git a/pyemma/coordinates/api.py b/pyemma/coordinates/api.py
@@ -984,7 +984,7 @@ def pca(data=None, dim=-1, var_cutoff=0.95, stride=1, mean=None, skip=0):
     return _param_stage(data, res, stride=stride)
 
 
-def tica(data=None, lag=10, dim=-1, var_cutoff=0.95, kinetic_map=True, stride=1,
+def tica(data=None, lag=10, dim=-1, var_cutoff=0.95, kinetic_map=True, commute_map=False, stride=1,
          force_eigenvalues_le_one=False, mean=None, remove_mean=True, skip=0):
     r""" Time-lagged independent component analysis (TICA).
 
@@ -1035,6 +1035,10 @@ def tica(data=None, lag=10, dim=-1, var_cutoff=0.95, kinetic_map=True, stride=1,
         distances in the transformed data approximate kinetic distances [4]_.
         This is a good choice when the data is further processed by clustering.
 
+    commute_map : bool, optional, default False
+        Eigenvector_i will be scaled by sqrt(timescale_i / 2). As a result, Euclidean distances in the transformed
+        data will approximate commute distances [5]_.
+
     stride : int, optional, default = 1
         If set to 1, all input data will be used for estimation. Note that this
         could cause this calculation to be very slow for large data sets. Since
@@ -1145,17 +1149,19 @@ def tica(data=None, lag=10, dim=-1, var_cutoff=0.95, kinetic_map=True, stride=1,
        Improvements in Markov State Model Construction Reveal Many Non-Native Interactions in the Folding of NTL9
        J. Chem. Theory. Comput. 9, 2000-2009. doi:10.1021/ct300878a
 
-    .. [4] Noe, F. and C. Clementi. 2015.
-        Kinetic distance and kinetic maps from molecular dynamics simulation
-        (in preparation).
+    .. [4] Noe, F. and Clementi, C. 2015. Kinetic distance and kinetic maps from molecular dynamics simulation.
+        J. Chem. Theory. Comput. doi:10.1021/acs.jctc.5b00553
+
+    .. [5] Noe, F., Banisch, R., Clementi, C. 2016. Commute maps: separating slowly-mixing molecular configurations
+       for kinetic modeling. J. Chem. Theory. Comput. doi:10.1021/acs.jctc.6b00762
 
     """
     from pyemma.coordinates.transform.tica import TICA
     if mean is not None:
         import warnings
         warnings.warn("user provided mean for TICA is deprecated and its value is ignored.")
 
-    res = TICA(lag, dim=dim, var_cutoff=var_cutoff, kinetic_map=kinetic_map,
+    res = TICA(lag, dim=dim, var_cutoff=var_cutoff, kinetic_map=kinetic_map, commute_map=commute_map,
                mean=mean, remove_mean=remove_mean, skip=skip)
     return _param_stage(data, res, stride=stride)
 

diff --git a/pyemma/coordinates/data/_base/datasource.py b/pyemma/coordinates/data/_base/datasource.py
@@ -671,6 +671,14 @@ def next(self):
                 (not self.return_traj_index and len(X) == 0) or (self.return_traj_index and len(X[1]) == 0)
         ):
             X = self._it_next()
+        if config.coordinates_check_output:
+            array = X if not self.return_traj_index else X[1]
+            if not np.all(np.isfinite(array)):
+                # determine position
+                start = self.pos
+                msg = "Found invalid values in chunk in trajectory index {itraj} at chunk [{start}, {stop}]" \
+                    .format(itraj=self.current_trajindex, start=start, stop=start+len(array))
+                raise InvalidDataInStreamException(msg)
         return X
 
     def __iter__(self):
@@ -683,3 +691,6 @@ def __exit__(self, exc_type, exc_val, exc_tb):
         self.close()
         return False
 
+
+class InvalidDataInStreamException(Exception):
+    """Data stream contained NaN or (+/-) infinity"""
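
The per-chunk check added to `next()` can be illustrated outside PyEMMA. The following is a minimal sketch under stated assumptions, not the library's implementation: `checked_chunks` and `InvalidDataError` are hypothetical names, chunks are plain lists of floats, and `math.isfinite` stands in for `np.isfinite`.

```python
import math


class InvalidDataError(Exception):
    """Hypothetical stand-in for InvalidDataInStreamException."""


def checked_chunks(data, chunksize, check_output=False):
    """Yield consecutive chunks of `data`; optionally reject NaN/inf values."""
    for start in range(0, len(data), chunksize):
        chunk = data[start:start + chunksize]
        if check_output and not all(math.isfinite(x) for x in chunk):
            # Report the frame range of the offending chunk.
            stop = start + len(chunk)
            raise InvalidDataError(
                "Found invalid values at chunk [{}, {}]".format(start, stop))
        yield chunk


# The check is opt-in, mirroring the disabled-by-default config option.
clean = list(checked_chunks([0.0, 1.0, 2.0, 3.0], chunksize=2, check_output=True))
```

With the flag left off, non-finite values pass through unchanged, so the check only costs time when it is requested for debugging.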
diff --git a/pyemma/coordinates/tests/test_coordinates_iterator.py b/pyemma/coordinates/tests/test_coordinates_iterator.py
@@ -2,6 +2,7 @@
 import numpy as np
 
 from pyemma.coordinates.data import DataInMemory
+from pyemma.util.contexts import settings
 from pyemma.util.files import TemporaryDirectory
 import os
 from glob import glob
@@ -153,5 +154,25 @@ def test_write_to_csv_propagate_filenames(self):
             for a, e in zip(actual, expected):
                 np.testing.assert_allclose(a, e)
 
+    def test_invalid_data_in_input_nan(self):
+        self.d[0][-1] = np.nan
+        r = DataInMemory(self.d)
+        it = r.iterator()
+        from pyemma.coordinates.data._base.datasource import InvalidDataInStreamException
+        with settings(coordinates_check_output=True):
+            with self.assertRaises(InvalidDataInStreamException):
+                for itraj, X in it:
+                    pass
+
+    def test_invalid_data_in_input_inf(self):
+        self.d[1][-1] = np.inf
+        r = DataInMemory(self.d, chunksize=5)
+        it = r.iterator()
+        from pyemma.coordinates.data._base.datasource import InvalidDataInStreamException
+        with settings(coordinates_check_output=True):
+            with self.assertRaises(InvalidDataInStreamException):
+                for itraj, X in it:
+                    pass
+
 if __name__ == '__main__':
     unittest.main()
diff --git a/pyemma/coordinates/transform/tica.py b/pyemma/coordinates/transform/tica.py
@@ -58,7 +58,7 @@ def _lazy_estimation(func, *args, **kw):
 class TICA(StreamingTransformer):
     r""" Time-lagged independent component analysis (TICA)"""
 
-    def __init__(self, lag, dim=-1, var_cutoff=0.95, kinetic_map=True, epsilon=1e-6,
+    def __init__(self, lag, dim=-1, var_cutoff=0.95, kinetic_map=True, commute_map=False, epsilon=1e-6,
                  mean=None, stride=1, remove_mean=True, skip=0):
         r""" Time-lagged independent component analysis (TICA) [1]_, [2]_, [3]_.
 
@@ -77,6 +77,9 @@ def __init__(self, lag, dim=-1, var_cutoff=0.95, kinetic_map=True, epsilon=1e-6,
         kinetic_map : bool, optional, default True
             Eigenvectors will be scaled by eigenvalues. As a result, Euclidean distances in the transformed data
             approximate kinetic distances [4]_. This is a good choice when the data is further processed by clustering.
+        commute_map : bool, optional, default False
+            Eigenvector_i will be scaled by sqrt(timescale_i / 2). As a result, Euclidean distances in the transformed
+            data will approximate commute distances [5]_.
         epsilon : float
             eigenvalue norm cutoff. Eigenvalues of C0 with norms <= epsilon will be
             cut off. The remaining number of eigenvalues define the size
@@ -122,23 +125,25 @@ def __init__(self, lag, dim=-1, var_cutoff=0.95, kinetic_map=True, epsilon=1e-6,
         .. [3] L. Molgedey and H. G. Schuster. 1994.
            Separation of a mixture of independent signals using time delayed correlations
            Phys. Rev. Lett. 72, 3634.
-        .. [4] Noe, F. and C. Clementi. 2015.
-            Kinetic distance and kinetic maps from molecular dynamics simulation
-            http://arxiv.org/abs/1506.06259
+        .. [4] Noe, F. and Clementi, C. 2015. Kinetic distance and kinetic maps from molecular dynamics simulation.
+            J. Chem. Theory. Comput. doi:10.1021/acs.jctc.5b00553
+        .. [5] Noe, F., Banisch, R., Clementi, C. 2016. Commute maps: separating slowly-mixing molecular configurations
+           for kinetic modeling. J. Chem. Theory. Comput. doi:10.1021/acs.jctc.6b00762
 
         """
         default_var_cutoff = get_default_args(self.__init__)['var_cutoff']
         if dim != -1 and var_cutoff != default_var_cutoff:
             raise ValueError('Trying to set both the number of dimension and the subspace variance. Use either or.')
-
+        if kinetic_map and commute_map:
+            raise ValueError('Trying to use both kinetic_map and commute_map. Use either or.')
         super(TICA, self).__init__()
 
         if dim > -1:
             var_cutoff = 1.0
 
         # empty dummy model instance
         self._model = TICAModel()
-        self.set_params(lag=lag, dim=dim, var_cutoff=var_cutoff, kinetic_map=kinetic_map,
+        self.set_params(lag=lag, dim=dim, var_cutoff=var_cutoff, kinetic_map=kinetic_map, commute_map=commute_map,
                         epsilon=epsilon, mean=mean, stride=stride, remove_mean=remove_mean, skip=skip)
 
     @property
@@ -339,8 +344,17 @@ def _transform_array(self, X):
         """
         X_meanfree = X - self.mean
         Y = np.dot(X_meanfree, self.eigenvectors[:, 0:self.dimension()])
+        if self.kinetic_map and self.commute_map:
+            raise ValueError('Trying to use both kinetic_map and commute_map. Use either or.')
         if self.kinetic_map:  # scale by eigenvalues
             Y *= self.eigenvalues[0:self.dimension()]
+        if self.commute_map:  # scale by (regularized) timescales
+            timescales = self.timescales[0:self.dimension()]
+
+            # dampen timescales smaller than the lag time, as in section 2.5 of ref. [5]
+            regularized_timescales = 0.5 * timescales * np.tanh(np.pi * ((timescales - self.lag) / self.lag) + 1)
+
+            Y *= np.sqrt(regularized_timescales / 2)
         return Y.astype(self.output_type())
 
     @property
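
The damping applied to short timescales in `_transform_array` can be examined for a single component with the standard library alone. This is a sketch, not the PyEMMA code: `regularized_commute_scale` is a hypothetical name, and the clamp at zero is an assumption that is not part of this diff. It is added here because the `tanh` term turns negative for timescales well below the lag time, and a square root of a negative value would otherwise not be real.

```python
import math


def regularized_commute_scale(timescale, lag):
    """Scale factor sqrt(t_reg / 2) for one TICA component (illustrative)."""
    # Damping term as written in the diff: tanh(pi * (t - lag) / lag + 1).
    damp = math.tanh(math.pi * ((timescale - lag) / lag) + 1)
    # Assumption (not in the diff): clamp at zero so the square root stays real.
    damp = max(damp, 0.0)
    regularized_timescale = 0.5 * timescale * damp
    return math.sqrt(regularized_timescale / 2)


# A timescale far above the lag is damped only by the constant factor 0.5;
# one far below the lag is suppressed entirely by the clamp.
slow = regularized_commute_scale(100.0, 1.0)
fast = regularized_commute_scale(0.1, 1.0)
```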

diff --git a/pyemma/msm/api.py b/pyemma/msm/api.py
@@ -1198,7 +1198,7 @@ def bayesian_hidden_markov_model(dtrajs, nstates, lag, nsamples=100, reversible=
 def tpt(msmobj, A, B):
     r""" A->B reactive flux from transition path theory (TPT)
 
-    The returned :class:`ReactiveFlux <msmtools.flux.ReactiveFlux>` object
+    The returned :class:`ReactiveFlux <pyemma.msm.models.ReactiveFlux>` object
     can be used to extract various quantities of the flux, as well as to
     compute A -> B transition pathways, their weights, and to coarse-grain
     the flux onto sets of states.
@@ -1214,29 +1214,29 @@ def tpt(msmobj, A, B):
 
     Returns
     -------
-    tptobj : :class:`ReactiveFlux <pyemma.msm.reactive_flux.ReactiveFlux>` object
+    tptobj : :class:`ReactiveFlux <pyemma.msm.models.ReactiveFlux>` object
         An object containing the reactive A->B flux network
         and several additional quantities, such as the stationary probability,
         committors and set definitions.
 
     See also
     --------
-    :class:`ReactiveFlux <pyemma.msm.reactive_flux.ReactiveFlux>`
+    :class:`ReactiveFlux <pyemma.msm.models.ReactiveFlux>`
         Reactive Flux model
 
 
-    .. autoclass:: pyemma.msm.reactive_flux.ReactiveFlux
+    .. autoclass:: pyemma.msm.models.ReactiveFlux
         :members:
         :undoc-members:
 
         .. rubric:: Methods
 
-        .. autoautosummary:: pyemma.msm.reactive_flux.ReactiveFlux
+        .. autoautosummary:: pyemma.msm.models.ReactiveFlux
            :methods:
 
         .. rubric:: Attributes
 
-        .. autoautosummary:: pyemma.msm.reactive_flux.ReactiveFlux
+        .. autoautosummary:: pyemma.msm.models.ReactiveFlux
             :attributes:
 
     References
@@ -1282,13 +1282,6 @@ def tpt(msmobj, A, B):
         By default (False), T is a transition matrix.
         If set to True, T is a rate matrix.
 
-    Returns
-    -------
-    tpt: msmtools.flux.ReactiveFlux object
-        A python object containing the reactive A->B flux network
-        and several additional quantities, such as stationary probability,
-        committors and set definitions.
-
     Notes
     -----
     The central object used in transition path theory is
@@ -1330,6 +1323,7 @@ def tpt(msmobj, A, B):
         raise ValueError('set A or B defines more states, than given transition matrix.')
 
     # forward committor
+    #msmobj.
     qplus = msmana.committor(T, A, B, forward=True)
     # backward committor
     if msmana.is_reversible(T, mu=mu):