deploy: 2db2ecd
jszym committed Feb 11, 2024
0 parents commit db29d52
Showing 37 changed files with 5,984 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 37670295bc18dea7fe322d5223f4a00d
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
17 changes: 17 additions & 0 deletions _sources/commands/analysis.rst.txt
@@ -0,0 +1,17 @@
Analysis
========

The ``analysis`` module of PPI Origami allows you to validate and analyze datasets.

You can find a description of all the possible analysis commands by running: ::

    ppi_origami analysis --help

Information specific to arguments of commands can be found by running the command with the help flag: ::

    ppi_origami analysis COMMAND --help

This information is reproduced on this page.

.. autoclass:: ppi_origami.__main__.Analysis
    :members:
17 changes: 17 additions & 0 deletions _sources/commands/download.rst.txt
@@ -0,0 +1,17 @@
Download
========

The ``download`` module of PPI Origami allows you to download files from their authoritative sources. PPI Origami works best when you designate one folder on your filesystem for keeping all original, untransformed datasets (we'll call that the "raw folder"). You'll refer to this folder in the ``process`` module, where "raw" files will be transformed and saved in a "processed" folder.
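
As a minimal sketch of that layout, the two folders could be created with Python's standard library. The folder names ``raw`` and ``processed`` and the helper ``make_dataset_folders`` are illustrative assumptions, not names that PPI Origami requires or provides.

```python
from pathlib import Path
import tempfile


def make_dataset_folders(base_dir):
    """Create hypothetical 'raw' and 'processed' folders under base_dir."""
    base = Path(base_dir)
    raw_folder = base / "raw"              # original, untransformed downloads
    processed_folder = base / "processed"  # transformed outputs
    for folder in (raw_folder, processed_folder):
        folder.mkdir(parents=True, exist_ok=True)
    return raw_folder, processed_folder


# Demonstrate in a temporary directory so nothing permanent is created.
with tempfile.TemporaryDirectory() as tmp:
    raw_folder, processed_folder = make_dataset_folders(tmp)
    folders_created = raw_folder.is_dir() and processed_folder.is_dir()
```

Keeping the two folders as siblings under one base directory is only one possible arrangement; any pair of distinct folders would serve the same purpose.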

You can find a description of all the possible download commands by running: ::

    ppi_origami download --help

Information specific to arguments of commands can be found by running the command with the help flag: ::

    ppi_origami download COMMAND --help

This information is reproduced on this page.

.. autoclass:: ppi_origami.__main__.Download
    :members:
17 changes: 17 additions & 0 deletions _sources/commands/process.rst.txt
@@ -0,0 +1,17 @@
Process
=======

The ``process`` module of PPI Origami allows you to transform files from their original formats and create new datasets.

You can find a description of all the possible process commands by running: ::

    ppi_origami process --help

Information specific to arguments of commands can be found by running the command with the help flag: ::

    ppi_origami process COMMAND --help

This information is reproduced on this page.

.. autoclass:: ppi_origami.__main__.Process
    :members:
27 changes: 27 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,27 @@
.. PPI Origami documentation master file, created by
   sphinx-quickstart on Wed Jan 31 13:24:30 2024.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to PPI Origami's documentation!
=======================================

This is the documentation for PPI Origami, a programme that helps users create and validate datasets of
protein-protein interactions (PPIs) with cross-validation splits suitable for training and testing PPI inference models.

.. toctree::
   :maxdepth: 3
   :caption: Contents:

   theory
   commands/download
   commands/process
   commands/analysis


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
43 changes: 43 additions & 0 deletions _sources/theory.rst.txt
@@ -0,0 +1,43 @@
Theory
======

Preparing datasets for the purpose of training models that predict protein-protein interactions is a deceptively fraught
process.

Several studies have shown that failing to control for protein identity across cross-validation splits can
lead to serious over-fitting of PPI prediction methods [:ref:`1-3 <References>`].

PPI Origami uses the notation from Park and Marcotte, which defines three types of PPI cross-validation datasets [:ref:`1 <References>`]:

- **C3** - Proteins that constitute interactions in one split (*i.e.*: training, validation, or test) are not to be found in any other split.
- **C2** - No more than one protein in a given interaction may be found in another split.
- **C1** - No restriction on protein split membership. Interactions are randomly assigned to a split.
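
The three definitions above can be sketched as a small check that reports the strictest category a set of splits satisfies. This is an illustration of the definitions as worded on this page, not PPI Origami's own code; the function names and the single-letter protein identifiers are hypothetical.

```python
def proteins_in(pairs):
    """Return the set of proteins that appear in a list of (protein, protein) pairs."""
    return {protein for pair in pairs for protein in pair}


def strictest_split_type(splits):
    """Return the strictest of 'C3', 'C2', 'C1' that the splits satisfy.

    `splits` maps a split name to a list of (protein, protein) interactions.
    """
    protein_sets = {name: proteins_in(pairs) for name, pairs in splits.items()}
    satisfies_c3 = True
    satisfies_c2 = True
    for name, pairs in splits.items():
        # Union of the proteins seen in every split other than this one.
        others = set()
        for other_name, other_proteins in protein_sets.items():
            if other_name != name:
                others |= other_proteins
        for protein_a, protein_b in pairs:
            shared = (protein_a in others) + (protein_b in others)
            if shared >= 1:
                satisfies_c3 = False  # a protein also appears in another split
            if shared == 2:
                satisfies_c2 = False  # both proteins appear in other splits
    if satisfies_c3:
        return "C3"
    if satisfies_c2:
        return "C2"
    return "C1"


# Hypothetical toy splits, one per category.
c3_example = {"train": [("A", "B")], "test": [("C", "D")]}
c2_example = {"train": [("A", "B")], "test": [("A", "C")]}
c1_example = {"train": [("A", "B"), ("C", "D")], "test": [("A", "C")]}

c3_label = strictest_split_type(c3_example)  # "C3"
c2_label = strictest_split_type(c2_example)  # "C2"
c1_label = strictest_split_type(c1_example)  # "C1"
```

Because every C3 dataset also satisfies C2, and every C2 dataset also satisfies C1, the sketch returns only the strictest label that holds.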

In addition to this, PPI Origami ensures that **C3 datasets** meet the following two criteria outlined in the INTREPPPID
manuscript.

First, let's define :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`, the sets of proteins
present in the interactions found in the training, testing, and validation splits, respectively.

Further, let's define :math:`\mathcal{P}` as the collection of protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}`.

**Criterion 1 - Distinct Protein Identity** The protein sets :math:`\{P_{\textsf{Tr}}, P_{\textsf{Te}}, P_{\textsf{V}}\}` must be mutually disjoint:

.. math::

    \forall Q, R \in \mathcal{P}, Q \cap R = \varnothing \textsf{ if } Q \neq R.

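
Criterion 1 can be checked directly with set operations. The sketch below assumes each split's proteins are already available as a Python ``set`` of identifiers; the function name and the identifiers are hypothetical and are not part of PPI Origami's interface.

```python
from itertools import combinations


def mutually_disjoint(protein_sets):
    """Return True if no protein appears in more than one of the given sets."""
    # Compare every unordered pair of distinct splits exactly once.
    return all(q.isdisjoint(r) for q, r in combinations(protein_sets, 2))


# Hypothetical protein identifiers for the three splits.
p_train = {"P1", "P2", "P3"}
p_test = {"P4", "P5"}
p_valid = {"P6"}

criterion_1_holds = mutually_disjoint([p_train, p_test, p_valid])

# Adding a training protein to the validation set violates the criterion.
criterion_1_broken = mutually_disjoint([p_train, p_test, p_valid | {"P1"}])
```

Here ``criterion_1_holds`` is ``True`` and ``criterion_1_broken`` is ``False``.
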

**Criterion 2 - Distinct Sequence Identity** No pair of proteins drawn from two different protein sets may exceed 90% sequence similarity:

.. math::

    \forall Q, R \in \mathcal{P}, \;\;\;\; \forall q \in Q, \;\;\;\; \forall r \in R,\;\;\;\; f(q,r) \leq 90\% \;\;\;\; \textsf{ if } \;\;\;\; Q \neq R,

where :math:`f` is some sequence similarity metric. We use UniRef cluster membership for sequence similarity.
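
One way to approximate Criterion 2 with cluster membership is to require that no sequence cluster contributes proteins to more than one split. The sketch below assumes a ready-made mapping from protein identifier to cluster identifier; the mapping, the identifiers, and the function names are hypothetical, and how PPI Origami builds or applies such a mapping is not described on this page.

```python
from itertools import combinations


def clusters_of(proteins, cluster_of):
    """Return the set of cluster identifiers covering the given proteins."""
    return {cluster_of[protein] for protein in proteins}


def sequence_distinct(protein_sets, cluster_of):
    """Return True if no cluster has members in more than one protein set."""
    cluster_sets = [clusters_of(proteins, cluster_of) for proteins in protein_sets]
    return all(a.isdisjoint(b) for a, b in combinations(cluster_sets, 2))


# Hypothetical mapping: P1 and P4 fall in the same cluster.
cluster_of = {"P1": "c1", "P2": "c2", "P3": "c3", "P4": "c1"}

separated = sequence_distinct([{"P1", "P4"}, {"P2"}, {"P3"}], cluster_of)
leaking = sequence_distinct([{"P1"}, {"P4"}, {"P2", "P3"}], cluster_of)
```

Here ``separated`` is ``True`` because the two same-cluster proteins share a split, while ``leaking`` is ``False`` because they sit in different splits. Treating shared cluster membership as exceeding the similarity threshold is a conservative stand-in for computing :math:`f` on every pair.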

References
----------

1. Park, Yungki and Edward M. Marcotte. “`A flaw in the typical evaluation scheme for pair-input computational predictions <https://doi.org/10.1038/nmeth.2259>`_.” *Nature Methods* 9 (2012): 1134-1136.
2. Hamp, Tobias and Burkhard Rost. “`More challenges for machine-learning protein interactions <https://doi.org/10.1093/bioinformatics/btu857>`_.” *Bioinformatics* 31, no. 10 (2015): 1521-1525.
3. Bernett, Judith, David B. Blumenthal and Markus List. “`Cracking the black box of deep sequence-based protein-protein interaction prediction <https://doi.org/10.1101/2023.01.18.524543>`_.” *bioRxiv* (2023): n. pag.