0 parents
commit db29d52
Showing 37 changed files with 5,984 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Sphinx build info version 1 | ||
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done. | ||
config: 37670295bc18dea7fe322d5223f4a00d | ||
tags: 645f666f9bcd5a90fca523b33c5a78b7 |
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Analysis | ||
======== | ||
|
||
The ``analysis`` module of PPI Origami allows you to validate and analyze datasets. | ||
|
||
You can find a description of all the possible analysis commands by running: :: | ||
|
||
    ppi_origami analysis --help | ||
|
||
Information specific to arguments of commands can be found by running the command with the help flag: :: | ||
|
||
    ppi_origami analysis COMMAND --help | ||
|
||
This information is reproduced on this page. | ||
|
||
.. autoclass:: ppi_origami.__main__.Analysis | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Download | ||
======== | ||
|
||
The ``download`` module of PPI Origami allows you to download files from their authoritative sources. PPI Origami works best when you designate one folder on your filesystem for keeping all original, untransformed datasets (we'll call that the "raw folder"). You'll refer to this folder in the ``process`` module, where "raw" files will be transformed and saved in a "processed" folder. | ||
|
||
You can find a description of all the possible download commands by running: :: | ||
|
||
    ppi_origami download --help | ||
|
||
Information specific to arguments of commands can be found by running the command with the help flag: :: | ||
|
||
    ppi_origami download COMMAND --help | ||
|
||
This information is reproduced on this page. | ||
|
||
.. autoclass:: ppi_origami.__main__.Download | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
Process | ||
======= | ||
|
||
The ``process`` module of PPI Origami allows you to transform files from their original formats and create new datasets. | ||
|
||
You can find a description of all the possible process commands by running: :: | ||
|
||
    ppi_origami process --help | ||
|
||
Information specific to arguments of commands can be found by running the command with the help flag: :: | ||
|
||
    ppi_origami process COMMAND --help | ||
|
||
This information is reproduced on this page. | ||
|
||
.. autoclass:: ppi_origami.__main__.Process | ||
   :members: |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
.. PPI Origami documentation master file, created by | ||
   sphinx-quickstart on Wed Jan 31 13:24:30 2024. | ||
   You can adapt this file completely to your liking, but it should at least | ||
   contain the root `toctree` directive. | ||
|
||
Welcome to PPI Origami's documentation! | ||
======================================= | ||
|
||
This is the documentation for PPI Origami, a programme that helps users create and validate datasets of | ||
protein-protein interactions (PPIs) with cross-validation splits suitable for training and testing PPI inference models. | ||
|
||
.. toctree:: | ||
   :maxdepth: 3 | ||
   :caption: Contents: | ||
|
||
   theory | ||
   commands/download | ||
   commands/process | ||
   commands/analysis | ||
|
||
|
||
Indices and tables | ||
================== | ||
|
||
* :ref:`genindex` | ||
* :ref:`modindex` | ||
* :ref:`search` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,43 @@ | ||
Theory | ||
====== | ||
|
||
Preparing datasets for the purpose of training models that predict protein-protein interactions is a deceptively fraught | ||
process. | ||
|
||
Several studies have shown that insufficiently controlling for protein identity between cross-validation splits can | ||
lead to serious over-fitting of PPI prediction methods [:ref:`1-3 <References>`]. | ||
|
||
PPI Origami uses the notation from Park and Marcotte, which defines three types of PPI cross-validation datasets [:ref:`1 <References>`]: | ||
|
||
- **C3** - Proteins that constitute interactions in one split (*i.e.*, training, validation, or test) must not appear in any other split. | ||
- **C2** - No more than one protein in a given interaction may be found in another split. | ||
- **C1** - No restriction on protein split membership. Interactions are randomly assigned to a split. | ||
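
The three classes above can be illustrated with a minimal sketch. It is not PPI Origami's implementation: the function name ``strictest_class`` and the protein identifiers are hypothetical, and the sketch assumes that a pair's class is decided by counting how many of its two proteins also appear in another split (none for C3, one for C2, both for C1).

```python
def strictest_class(pair, other_split_proteins):
    """Return the strictest class (C3, C2 or C1) that a pair satisfies.

    `pair` is a 2-tuple of protein identifiers and `other_split_proteins`
    is the set of proteins that appear in any other split.
    Both names are hypothetical and used for illustration only.
    """
    # Count how many of the two proteins are also present in another split.
    shared = sum(protein in other_split_proteins for protein in pair)
    # 0 shared proteins -> C3, 1 -> C2, 2 -> C1 (no restriction).
    return {0: "C3", 1: "C2", 2: "C1"}[shared]


# Hypothetical identifiers standing in for the proteins of a training split.
training_proteins = {"P1", "P2", "P3"}

labels = [
    strictest_class(("P4", "P5"), training_proteins),  # neither protein shared
    strictest_class(("P1", "P5"), training_proteins),  # one protein shared
    strictest_class(("P1", "P2"), training_proteins),  # both proteins shared
]
print(labels)
```

A dataset is then C3 only if every pair in every split is labelled C3 with respect to the other splits.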
|
||
In addition to this, PPI Origami ensures that **C3 datasets** meet the following two criteria outlined in the INTREPPPID | ||
manuscript. | ||
|
||
First, let's define :math:`P_{\textsf{Tr}}`, :math:`P_{\textsf{Te}}`, and :math:`P_{\textsf{V}}` as the sets of proteins | ||
present in the interactions found in the training, testing, and validation splits, respectively. | ||
|
||
Further, let's define :math:`\mathcal{P}` as the collection of these protein sets, :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`. | ||
|
||
**Criterion 1 - Distinct Protein Identity** The protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}` must be mutually disjoint: | ||
|
||
.. math:: | ||
|
||
    \forall Q, R \in \mathcal{P}, Q \cap R = \varnothing \textsf{ if } Q \neq R. | ||
|
||
**Criterion 2 - Distinct Sequence Identity** Proteins belonging to different sets must share no more than 90% sequence similarity: | ||
|
||
.. math:: | ||
|
||
    \forall Q, R \in \mathcal{P}, \;\;\;\; \forall q \in Q, \;\;\;\; \forall r \in R,\;\;\;\; f(q,r) \leq 90\% \;\;\;\; \textsf{ if } \;\;\;\; Q \neq R, | ||
|
||
where :math:`f` is some sequence similarity metric. We use UniRef cluster membership for sequence similarity. | ||
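
As a rough illustration of how the two criteria could be checked, here is a minimal sketch. It is not PPI Origami's own code: the function names, the protein identifiers, and the ``cluster_of`` mapping are all hypothetical. It also assumes that Criterion 2 can be approximated by requiring that no sequence cluster contains proteins from two different splits, in the spirit of the UniRef cluster membership mentioned above.

```python
from itertools import combinations


def splits_are_disjoint(protein_sets):
    """Criterion 1: no protein appears in more than one split."""
    return all(q.isdisjoint(r) for q, r in combinations(protein_sets, 2))


def clusters_are_disjoint(protein_sets, cluster_of):
    """Criterion 2 (approximation): no sequence cluster spans two splits.

    `cluster_of` maps each protein identifier to a cluster identifier.
    It is a hypothetical stand-in for a real cluster lookup.
    """
    cluster_sets = [{cluster_of[p] for p in proteins} for proteins in protein_sets]
    return all(q.isdisjoint(r) for q, r in combinations(cluster_sets, 2))


# Hypothetical toy data: three splits and a protein-to-cluster mapping.
p_tr = {"P1", "P2"}
p_te = {"P3"}
p_v = {"P4"}
cluster_of = {"P1": "cluster_A", "P2": "cluster_B", "P3": "cluster_C", "P4": "cluster_B"}

criterion_1 = splits_are_disjoint([p_tr, p_te, p_v])
criterion_2 = clusters_are_disjoint([p_tr, p_te, p_v], cluster_of)
print(criterion_1, criterion_2)
```

In this toy data the splits share no protein, so Criterion 1 holds, but ``P2`` (training) and ``P4`` (validation) fall in the same cluster, so the approximate check for Criterion 2 fails.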
|
||
References | ||
---------- | ||
|
||
1. Park, Yungki and Edward M. Marcotte. “`A flaw in the typical evaluation scheme for pair-input computational predictions <https://doi.org/10.1038/nmeth.2259>`_.” *Nature methods* 9 (2012): 1134-1136. | ||
2. Hamp, Tobias and Burkhard Rost. “`More challenges for machine-learning protein interactions <https://doi.org/10.1093/bioinformatics%2Fbtu857>`_.” *Bioinformatics* 31 (10) (2015): 1521-1525. | ||
3. Bernett, Judith, David B. Blumenthal and Markus List. “`Cracking the black box of deep sequence-based protein-protein interaction prediction <https://doi.org/10.1101/2023.01.18.524543>`_.” *bioRxiv* (2023): n. pag. |