deploy: 2db2ecd

Emad-COMBINE-lab · Feb 11, 2024 · db29d52
commit db29d52
Showing 37 changed files with 5,984 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 37670295bc18dea7fe322d5223f4a00d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/_sources/commands/analysis.rst.txt b/_sources/commands/analysis.rst.txt
@@ -0,0 +1,17 @@
+Analysis
+========
+
+The ``analysis`` module of PPI Origami allows you to validate and analyze datasets.
+
+You can find a description of all the possible analysis commands by running: ::
+
+   ppi_origami analysis --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami analysis COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Analysis
+   :members:
diff --git a/_sources/commands/download.rst.txt b/_sources/commands/download.rst.txt
@@ -0,0 +1,17 @@
+Download
+========
+
+The ``download`` module of PPI Origami allows you to download files from their authoritative sources. PPI Origami works best when you designate one folder on your filesystem for keeping all original, untransformed datasets (we'll call that the "raw folder"). You'll refer to this folder in the ``process`` module, where "raw" files will be transformed and saved in a "processed" folder.
+
+You can find a description of all the possible download commands by running: ::
+
+   ppi_origami download --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami download COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Download
+   :members:
diff --git a/_sources/commands/process.rst.txt b/_sources/commands/process.rst.txt
@@ -0,0 +1,17 @@
+Process
+=======
+
+The ``process`` module of PPI Origami allows you to transform files from their original formats and create new datasets.
+
+You can find a description of all the possible process commands by running: ::
+
+   ppi_origami process --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami process COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Process
+   :members:
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
@@ -0,0 +1,27 @@
+.. PPI Origami documentation master file, created by
+   sphinx-quickstart on Wed Jan 31 13:24:30 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to PPI Origami's documentation!
+=======================================
+
+This is the documentation for PPI Origami, a programme that helps users create and validate datasets of
+protein-protein interactions (PPIs) with cross-validation splits suitable for training and testing PPI inference models.
+
+.. toctree::
+   :maxdepth: 3
+   :caption: Contents:
+
+   theory
+   commands/download
+   commands/process
+   commands/analysis
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/_sources/theory.rst.txt b/_sources/theory.rst.txt
@@ -0,0 +1,43 @@
+Theory
+======
+
+Preparing datasets for the purpose of training models that predict protein-protein interactions is a deceptively fraught
+process.
+
+Several studies have outlined that insufficiently controlling for protein identity between cross-validation splits can
+lead to serious over-fitting of PPI prediction methods [:ref:`1-3 <References>`].
+
+PPI Origami uses the notation from Park and Marcotte, which defines three types of PPI cross-validation datasets [:ref:`1 <References>`]:
+
+- **C3** - Proteins that constitute interactions in one split (*i.e.*: training, validation, or test) are not to be found in any other split.
+- **C2** - No more than one protein in a given interaction may be found in another split.
+- **C1** - No restriction on protein split membership. Interactions are randomly assigned to a split.
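The three definitions above can be sketched as a small check. This is a minimal illustration, not PPI Origami's implementation: the function names ``proteins_in`` and ``split_type`` and the input layout (a dict of split name to a list of protein pairs) are assumptions made for the example.

```python
from itertools import combinations


def proteins_in(interactions):
    """Return the set of proteins appearing in a list of (a, b) pairs."""
    return {protein for pair in interactions for protein in pair}


def split_type(splits):
    """Return 'C3', 'C2', or 'C1' for a dict of split name -> list of (a, b) pairs.

    Hypothetical helper: reports the strictest definition the splits satisfy.
    """
    protein_sets = {name: proteins_in(pairs) for name, pairs in splits.items()}

    # C3: no protein is shared between any two splits.
    if all(protein_sets[x].isdisjoint(protein_sets[y])
           for x, y in combinations(protein_sets, 2)):
        return "C3"

    # C2: within each interaction, at most one protein also occurs in another split.
    for name, pairs in splits.items():
        others = set().union(*(s for n, s in protein_sets.items() if n != name))
        for a, b in pairs:
            if a in others and b in others:
                # Both proteins of this interaction are seen elsewhere.
                return "C1"
    return "C2"
```

For example, ``split_type({"train": [("A", "B")], "test": [("A", "D")]})`` returns ``"C2"``, because protein ``A`` is shared but no interaction has both of its proteins in another split.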
+
+In addition to this, PPI Origami ensures that **C3 datasets** meet the following two criteria outlined in the INTREPPPID
+manuscript.
+
+First, let :math:`P_{\textsf{Tr}}`, :math:`P_{\textsf{Te}}`, and :math:`P_{\textsf{V}}` be the sets of proteins
+present in the interactions found in the training, testing, and validation splits, respectively.
+
+Further, let :math:`\mathcal{P}` be the collection of these protein sets, :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`.
+
+**Criterion 1 - Distinct Protein Identity** The protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}` must be mutually disjoint:
+
+.. math::
+
+   \forall Q, R \in \mathcal{P}, Q \cap R = \varnothing \textsf{ if } Q \neq R.
+
+**Criterion 2 - Distinct Sequence Identity** Any two proteins drawn from different splits must have a sequence similarity of at most 90%:
+
+.. math::
+
+   \forall Q, R \in \mathcal{P}, \;\;\;\; \forall q \in Q, \;\;\;\; \forall r \in R,\;\;\;\; f(q,r) \leq 90\% \;\;\;\; \textsf{ if } \;\;\;\; Q \neq R,
+
+where :math:`f` is some sequence similarity metric. We use UniRef cluster membership for sequence similarity.
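Both criteria can be sketched as a check over the three protein sets. This is a minimal sketch under stated assumptions, not PPI Origami's code: ``check_c3_criteria`` and the ``cluster_of`` mapping (protein ID to cluster ID) are hypothetical, and sharing a cluster is treated as a stand-in for exceeding the similarity threshold.

```python
from itertools import combinations


def check_c3_criteria(protein_sets, cluster_of):
    """Check both criteria for a dict of split name -> set of protein IDs.

    cluster_of is a hypothetical mapping from protein ID to a cluster ID.
    Returns (criterion_1_ok, criterion_2_ok).
    """
    identity_ok = True
    sequence_ok = True
    for q_name, r_name in combinations(protein_sets, 2):
        q, r = protein_sets[q_name], protein_sets[r_name]

        # Criterion 1: the intersection of Q and R must be empty.
        if q & r:
            identity_ok = False

        # Criterion 2 (assumed proxy): no cluster holds proteins from both splits.
        q_clusters = {cluster_of[p] for p in q}
        r_clusters = {cluster_of[p] for p in r}
        if q_clusters & r_clusters:
            sequence_ok = False

    return identity_ok, sequence_ok
```

Under this proxy a violation of Criterion 1 always implies a violation of Criterion 2, since a protein shared between two splits necessarily shares a cluster with itself.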
+
+References
+----------
+
+1. Park, Yungki and Edward M. Marcotte. “`A flaw in the typical evaluation scheme for pair-input computational predictions <https://doi.org/10.1038/nmeth.2259>`_.” *Nature methods* 9 (2012): 1134-1136.
+2. Hamp, Tobias and Burkhard Rost. “`More challenges for machine-learning protein interactions <https://doi.org/10.1093/bioinformatics%2Fbtu857>`_.” *Bioinformatics* 31(10) (2015): 1521-5.
+3. Bernett, Judith, David B. Blumenthal and Markus List. “`Cracking the black box of deep sequence-based protein-protein interaction prediction <https://doi.org/10.1101/2023.01.18.524543>`_.” *bioRxiv* (2023): n. pag.