deploy: c5d9225

Emad-COMBINE-lab · Feb 12, 2024 · 3967a1f · 3967a1f
commit 3967a1f
Show file tree

Hide file tree

Showing 42 changed files with 6,851 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 37670295bc18dea7fe322d5223f4a00d
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
diff --git a/_images/vitable.webp b/_images/vitable.webp
diff --git a/_sources/commands/analysis.rst.txt b/_sources/commands/analysis.rst.txt
@@ -0,0 +1,17 @@
+Analysis
+========
+
+The ``analysis`` module of PPI Origami allows you to validate and analyze datasets.
+
+You can find a description of all the possible download commands by running: ::
+
+   ppi_origami analysis --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami analysis COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Analysis
+   :members:
diff --git a/_sources/commands/download.rst.txt b/_sources/commands/download.rst.txt
@@ -0,0 +1,17 @@
+Download
+========
+
+The ``download`` module of PPI Origami allows you to download files from their authoratative sources. PPI Origami works best when you designate one folder on your filesystem for keeping all original, untransformed datasets (we'll call that the "raw folder"). You'll refer to this folder in the ``process`` module, where "raw" files will be transformed and saved in a "processed" folder.
+
+You can find a description of all the possible download commands by running: ::
+
+   ppi_origami download --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami download COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Download
+   :members:
diff --git a/_sources/commands/process.rst.txt b/_sources/commands/process.rst.txt
@@ -0,0 +1,17 @@
+Process
+=======
+
+The ``process`` module of PPI Origami allows you to transform files from their original formats and create new datasets.
+
+You can find a description of all the possible download commands by running: ::
+
+   ppi_origami process --help
+
+Information specific to arguments of commands can be found by running the command with the help flag: ::
+
+   ppi_origami process COMMAND --help
+
+This information is reproduced on this page.
+
+.. autoclass:: ppi_origami.__main__.Process
+   :members:
diff --git a/_sources/index.rst.txt b/_sources/index.rst.txt
@@ -0,0 +1,29 @@
+.. PPI Origami documentation master file, created by
+   sphinx-quickstart on Wed Jan 31 13:24:30 2024.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to PPI Origami's documentation!
+=======================================
+
+This is the documentation for PPI Origami, a programme that helps users create and validate datasets of
+protein-protein interactions (PPIs) with cross-validation splits suitable for training and testing PPI inference models.
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+   install
+   usage
+   theory
+   commands/download
+   commands/process
+   commands/analysis
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`
diff --git a/_sources/install.rst.txt b/_sources/install.rst.txt
@@ -0,0 +1,25 @@
+Installation
+============
+
+Using ``pip``
+-------------
+
+The easiest way to install PPI Origami is to use `pip <https://pip.pypa.io/en/stable/>`_ to retrieve the PPI Origami
+release from `PyPI <https://pypi.org/project/ppi-origami>`_.
+
+.. code-block:: bash
+
+    pip install ppi-origami
+
+Cloning the repository
+----------------------
+
+You can install PPI Origami by cloning the git repository, and using `poetry <https://python-poetry.org/>`_ to install
+and run the programme.
+
+.. code-block:: bash
+
+    git clone https://github.com/Emad-COMBINE-lab/ppi_origami
+    cd ppi_origami
+    poetry install
+    poetry run ppi_origami --help
diff --git a/_sources/theory.rst.txt b/_sources/theory.rst.txt
@@ -0,0 +1,43 @@
+Theory
+======
+
+Preparing datasets for the purpose of training models that predict protein-protein interactions is a deceptively fraught
+process.
+
+Several studies have outlined that insufficiently controlling for protein identity between cross-validation splits can
+lead to serious over-fitting of PPI prediction methods [:ref:`1-3 <References>`].
+
+PPI Origami uses the notation from Park and Marcotte, which defines three types of PPI cross-validation datasets [:ref:`1 <References>`]:
+
+- **C3** - Proteins that constitute interactions in one split (*i.e.*: training, validation, or test) are not to be found in any other split.
+- **C2** - No more than one protein in a given interaction may be found in another split.
+- **C1** - No restriction on protein split membership. Interactions are randomly assigned to a split.
+
+In addition to this, PPI Origami ensures that **C3 datasets** meet the following two criteria oultined in the INTREPPPID
+manuscript.
+
+First, let's begin by defining :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`, which are the set of proteins
+present in the interactions found in the Training, Testing, and Validation split, respectively.
+
+Further, let's define :math:`\mathcal{P}` as the collection of protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`
+
+**Criterion 1 - Distinct Protein Identity** The protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}` must be mutually disjoint:
+
+.. math::
+
+   \forall Q, R \in \mathcal{P}, Q \cap R = \varnothing \textsf{ if } Q \neq R.
+
+**Criterion 2 - Distinct Sequence Identity** 
+
+.. math::
+
+   \forall Q, R \in \mathcal{P}, \;\;\;\; \forall q \in Q, \;\;\;\; \forall r \in R,\;\;\;\; f(q,r) \leq 90\% \;\;\;\; \textsf{ if } \;\;\;\; Q \neq R,
+
+where :math:`f` is some sequence similarity metric. We use UniRef cluster membership for sequence similarity.
+
+References
+----------
+
+1. Park, Yungki and Edward M. Marcotte. “`A flaw in the typical evaluation scheme for pair-input computational predictions <https://doi.org/10.1038/nmeth.2259>`_.” *Nature methods* 9 (2012): 1134 - 1136.
+2. Hamp, Tobias and Burkhard Rost. “`More challenges for machine-learning protein interactions <https://doi.org/10.1093/bioinformatics%2Fbtu857>`_.” *Bioinformatics* 31 10 (2015): 1521-5 .
+3. Bernett, Judith, David B. Blumenthal and Markus List. “`Cracking the black box of deep sequence-based protein-protein interaction prediction <https://doi.org/10.1101/2023.01.18.524543>`_.” *bioRxiv* (2023): n. pag.
diff --git a/_sources/usage.rst.txt b/_sources/usage.rst.txt
@@ -0,0 +1,138 @@
+Usage Guide
+===========
+
+Get data into the "common" format
+---------------------------------
+
+PPI Origami defines a "common" format from which it can create "strict" datasets (see `Theory <theory.html>`_ for more
+information about strict datasets). This file contains information on binary interactions. No "negative" examples of
+protein pairs which do not interact are to be included in this file.
+
+The common format is simply a `CSV <https://en.wikipedia.org/wiki/Comma-separated_values>`_ file with the following
+four columns:
+
+.. list-table:: Common Format Columns
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Column Name
+     - Example
+     - Description
+   * - ``interaction_id``
+     - ``P21589><P84085.1``
+     - A unique identifier for the interaction.
+   * - ``protein1``
+     - ``P84085``
+     - The UniprotKB accession for the **first** protein in the binary interaction.
+   * - ``protein2``
+     - ``P21589``
+     - The UniprotKB accession for the **second** protein in the binary interaction.
+   * - ``score``
+     - ``string_combined_score:311|string_experimental:312``
+     - Information about confidence scores for the interaction (*e.g.*: `STRING score <https://string-db.org/cgi/info?footer_active_subpage=scores>`_). You can filter edges on the basis of these scores. It follows the format ``key:value``, delimited by ``|``. You can also filter edges ahead of time and leave this column blank.
+
+There are some built-in tools in PPI Origami for converting `STRING <https://string-db.org/>`_ and
+`D-SCRIPT <https://dscript.csail.mit.edu/>`_ datasets to the "common" format. Here's an example of how to do that for
+the STRING dataset:
+
+.. code-block:: bash
+
+    # Download Human edges for STRING v12 into the "raw" folder
+    ppi_origami download string_links_detailed raw --version 12.0 --taxon 9606
+
+    # Download STRING alias data into the "raw" folder
+    ppi_origami download string_aliases raw --version 12.0
+
+    # Download Secondary UniprotKB accession data into the "raw" folder
+    # This will allow us to convert secondary accession codes to primary accessions
+    ppi_origami download uniprot_sec_ac raw
+
+    # Add information about UniprotKB accession codes to the STRING file
+    ppi_origami process string_upkb raw processed 12.0 9606
+
+    # Convert the STRING database to the "common" format
+    ppi_origami process common_format processed string_links_detailed 9606 upkb 12.0
+
+This will result in a file ``processed/common_string_9606.protein.links.detailed.v12.0_upkb.csv.gz``. This is a gzip
+compressed file in the "common" format representing *H. sapiens* data from the STRING v12 dataset. The UniprotKB
+accessions are normalized such that secondary accession codes are converted to primary accession codes.
+
+.. code-block:: bash
+
+    $ zcat processed/common_string_9606.protein.links.detailed.v12.0_upkb.csv.gz | head
+    interaction_id,protein1,protein2,score
+    Q86X27><P84085.1,P84085,Q86X27,string_combined_score:173|string_experimental:134|string_database:|string_textmining:81
+    Q9C0D6><P84085.1,P84085,Q9C0D6,string_combined_score:154|string_experimental:128|string_database:|string_textmining:70
+    P36543><P84085.1,P84085,P36543,string_combined_score:151|string_experimental:49|string_database:|string_textmining:69
+    Q99418><P84085.1,P84085,Q99418,string_combined_score:471|string_experimental:53|string_database:|string_textmining:457
+    Q9NYI0><P84085.1,P84085,Q9NYI0,string_combined_score:201|string_experimental:46|string_database:|string_textmining:197
+    Q8N5M4><P84085.1,P84085,Q8N5M4,string_combined_score:180|string_experimental:125|string_database:|string_textmining:50
+    P14672><P84085.1,P84085,P14672,string_combined_score:181|string_experimental:82|string_database:|string_textmining:133
+    Q9UJY5><P84085.1,P84085,Q9UJY5,string_combined_score:594|string_experimental:296|string_database:|string_textmining:445
+    Q96I51><P84085.1,P84085,Q96I51,string_combined_score:154|string_experimental:58|string_database:|string_textmining:126
+
+Create a strict RAPPPID dataset from the "common" format
+--------------------------------------------------------
+
+Now that we have our data in the "common" format, the rest gets easier. We can create a strict dataset in the RAPPPID
+`HDF5 format <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_.
+
+.. code-block:: bash
+
+    # Download the Uniref90 dataset to the "raw" folder
+    # Uniref data is used to test the similarity of proteins
+    ppi_origami download uniref raw 90
+
+    # Process the Uniref90 dataset and store in the "processed" folder
+    # The Uniref90 files comes as an unwieldy, large XML file
+    # We parse that into a LevelDB database for efficiency
+    ppi_origami process uniref raw processed 90
+
+    # Download sequence data from UniprotKB and build a database
+    # in the "processed" folder for H. sapiens
+    ppi_origami download uniprot_seqs_db processed --taxon 9606
+
+    # Finally, we convert the STRING dataset, with UniprotKB accessions
+    # in the "common" format, into the RAPPPID HDF5 format.
+    ppi_origami process common_to_rapppid processed processed/common_string_9606.protein.links.detailed.v12.0_upkb.csv.gz [1,2,3] \
+            --train_proportion 0.8 --val_proportion 0.1 --test_proportion 0.1 --neg_proportion 1 --uniref_threshold 90 \
+            --score_key string_combined_score --score_threshold 950 --seed 8675309 --taxon 9606
+
+This will create a PPI dataset with as many generated negative example as positive examples. Here, datasets that
+correspond to Park & Marcotte C1, C2, and C3 classes are created. The file name in this case is
+``rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_MullwcrDNdNzBBOEABq_5VIy7UQ=.h5`` . The random
+string of characters at the end is a hash of the parameters used to create the dataset. It is deterministically
+generated so datasets with the same parameters will have the same hash.
+
+Below is a screenshot of the HDF5 file as viewed from the `ViTables <https://vitables.org/index.html>`_ programme.
+
+.. image:: imgs/vitable.webp
+   :height: 376px
+   :width: 500px
+   :alt: Screenshot from ViTables showing the contents of the HDF5 file.
+   :align: center
+
+Create an INTREPPPID dataset from a RAPPPID dataset
+---------------------------------------------------
+
+INTREPPPID datasets incorporate orthology data that RAPPPID datasets do not, and are required for training the
+INTREPPPID PPI inference algorithm.
+
+.. code-block:: bash
+
+    # We'll need sequence data from as many species as possible when generating INTREPPPID datasets
+    # So let's download UniprotKB sequences, but this time without specifying the organism
+    # This takes an hour and a half on my computer.
+    ppi_origami download uniprot_seqs_db processed
+
+    # Download orthology data from the OMA database
+    ppi_origami download oma processed
+
+    # Create a LevelDB database mapping UniProt accession codes to OMA Group IDs
+    ppi_origami process oma_upkb_groups raw processed
+
+    # Download the Uniref90 dataset to the "raw" folder
+    # Uniref data is used to test the similarity of proteins
+    ppi_origami process rapppid_to_intrepppid processed processed/rapppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_MullwcrDNdNzBBOEABq_5VIy7UQ=.h5 \
+        processed/intrepppid_[common_string_9606.protein.links.detailed.v12.0_upkb.csv]_MullwcrDNdNzBBOEABq_5VIy7UQ=.h5 \
+        [1,2,3] --uniref_threshold 90