Commit 67e729a

deploy: 8c3b985

38 files changed: +6050 −0 lines changed

.buildinfo

+4
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 0c43a31e5cf1e710c67910c0b0d7b0a8
tags: 645f666f9bcd5a90fca523b33c5a78b7

.nojekyll

Whitespace-only changes.

_sources/api.rst.txt

+18
@@ -0,0 +1,18 @@
API
===

Data
----

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataset
   :members:
   :special-members: __init__, __getitem__, __len__

.. autoclass:: intrepppid.data.ppi_oma.IntrepppidDataModule
   :members:
   :special-members: __init__

Network
-------

.. autofunction:: intrepppid.intrepppid_network

_sources/cli.rst.txt

+118
@@ -0,0 +1,118 @@
Command Line Interface
======================

INTREPPPID has a :abbr:`CLI (Command Line Interface)` which can be used to easily train INTREPPPID.

Train
-----

To train the INTREPPPID model as it was trained in the manuscript, use the ``train e2e_rnn_triplet`` command:

.. code:: bash

   $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

.. list-table:: INTREPPPID Manuscript Values for ``e2e_rnn_triplet``
   :widths: 25 25 25 50
   :header-rows: 1

   * - Argument/Flag
     - Default
     - Manuscript Value
     - Description
   * - ``PPI_DATASET_PATH``
     - None
     - See Data
     - Path to the PPI dataset. Must be in the INTREPPPID HDF5 format.
   * - ``SENTENCEPIECE_PATH``
     - None
     - See Data
     - Path to the SentencePiece model.
   * - ``C_TYPE``
     - None
     - ``3``
     - Which dataset in the INTREPPPID HDF5 file to use, selected by its C-type.
   * - ``NUM_EPOCHS``
     - None
     - ``100``
     - Number of epochs to train the model for.
   * - ``BATCH_SIZE``
     - None
     - ``80``
     - The number of samples to use in a batch.
   * - ``--seed``
     - None
     - ``8675309``, ``5353456``, or ``3927704``, depending on the experiment.
     - The random seed. If not specified, chosen at random.
   * - ``--vocab_size``
     - ``250``
     - ``250``
     - The number of tokens in the SentencePiece vocabulary.
   * - ``--trunc_len``
     - ``1500``
     - ``1500``
     - Length at which to truncate sequences.
   * - ``--embedding_size``
     - ``64``
     - ``64``
     - The size of the embeddings.
   * - ``--rnn_num_layers``
     - ``2``
     - ``2``
     - The number of layers in the AWD-LSTM encoder.
   * - ``--rnn_dropout_rate``
     - ``0.3``
     - ``0.3``
     - The DropConnect rate for the AWD-LSTM encoder.
   * - ``--variational_dropout``
     - ``false``
     - ``false``
     - Whether to use variational dropout, as described in the AWD-LSTM manuscript.
   * - ``--bi_reduce``
     - ``last``
     - ``last``
     - Method for reducing the LSTM embeddings of the two directions. Must be one of "concat", "max", "mean", "last".
   * - ``--workers``
     - ``4``
     - ``4``
     - The number of processes to use for the DataLoader.
   * - ``--embedding_droprate``
     - ``0.3``
     - ``0.3``
     - The amount of Embedding Dropout to use (à la AWD-LSTM).
   * - ``--do_rate``
     - ``0.3``
     - ``0.3``
     - The amount of dropout to use in the MLP classifier.
   * - ``--log_path``
     - ``"./logs/e2e_rnn_triplet"``
     - ``"./logs/e2e_rnn_triplet"``
     - The path where logs are saved.
   * - ``--encoder_only_steps``
     - ``-1`` (No Steps)
     - ``-1`` (No Steps)
     - The number of steps during which only the encoder, and not the classifier, is trained.
   * - ``--classifier_warm_up``
     - ``-1`` (No Steps)
     - ``-1`` (No Steps)
     - The number of steps during which only the classifier, and not the encoder, is trained.
   * - ``--beta_classifier``
     - ``4`` (25% contribution of the classifier loss, 75% contribution of the orthologue loss)
     - ``2`` (50% contribution of the classifier loss, 50% contribution of the orthologue loss)
     - Adjusts the weight given to the PPI classification loss relative to the orthologue locality loss. The loss becomes (1/β) × classifier_loss + [1 − (1/β)] × orthologue_loss (see the sketch after this table).
   * - ``--lr``
     - ``1e-2``
     - ``1e-2``
     - Learning rate to use.
   * - ``--use_projection``
     - ``false``
     - ``false``
     - Whether to use a projection network after the encoder.
   * - ``--checkpoint_path``
     - ``log_path / model_name / "chkpt"``
     - ``log_path / model_name / "chkpt"``
     - The location where checkpoints are saved.
   * - ``--optimizer_type``
     - ``ranger21``
     - ``ranger21_xx``
     - The optimizer to use while training. Must be one of ``ranger21``, ``ranger21_xx``, ``adamw``, ``adamw_1cycle``, or ``adamw_cosine``.
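
To make the ``--beta_classifier`` weighting concrete, here is a minimal sketch (a hypothetical helper, not part of the INTREPPPID codebase) of the loss combination described above:

.. code:: python

   def combined_loss(classifier_loss: float, orthologue_loss: float, beta: float) -> float:
       # (1/β) × classifier_loss + [1 − (1/β)] × orthologue_loss
       return (1 / beta) * classifier_loss + (1 - 1 / beta) * orthologue_loss

   # beta=2 weights the two losses equally; the default beta=4 weights the
   # orthologue loss three times as heavily as the classifier loss.
   assert combined_loss(1.0, 0.0, beta=2) == 0.5
   assert combined_loss(1.0, 0.0, beta=4) == 0.25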

_sources/data.rst.txt

+115
@@ -0,0 +1,115 @@
Data
====

Precomputed Datasets
--------------------

You can download precomputed datasets from the sources below:

1. `Zenodo <https://doi.org/10.5281/zenodo.10594149>`_ (DOI: 10.5281/zenodo.10594149)
2. `Internet Archive <https://archive.org/details/intrepppid_datasets.tar>`_

All datasets are made available under the `Creative Commons Attribution-ShareAlike 4.0 International <https://creativecommons.org/licenses/by-sa/4.0/legalcode>`_ license.

Dataset Format
--------------

INTREPPPID requires that datasets be prepared as `HDF5 <https://en.wikipedia.org/wiki/Hierarchical_Data_Format>`_ files.

Each INTREPPPID dataset must have the following hierarchical structure:

.. code::

   intrepppid.h5
   ├── orthologs
   ├── sequences
   ├── splits
   │   ├── test
   │   ├── train
   │   └── val
   └── interactions
       ├── c1
       │   ├── c1_train
       │   ├── c1_val
       │   └── c1_test
       ├── c2
       │   ├── c2_train
       │   ├── c2_val
       │   └── c2_test
       └── c3
           ├── c3_train
           ├── c3_val
           └── c3_test
Only one of the "c" groups under "interactions" needs to be present, so long as it is the dataset you specify in the train step with the ``--c_type`` flag.
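
To sanity-check this layout on disk, you can walk the hierarchy with ``h5py`` (a minimal sketch, assuming a file named ``intrepppid.h5`` as above):

.. code:: python

   import h5py

   # Print every group and table in the file, e.g. "interactions/c3/c3_train"
   with h5py.File("intrepppid.h5", "r") as f:
       f.visit(print)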

Here is the schema for the tables:

.. list-table:: ``orthologs`` schema
   :widths: 25 25 25 50
   :header-rows: 1

   * - Field Name
     - Type
     - Example
     - Description
   * - ``ortholog_group_id``
     - ``Int64``
     - ``1048576``
     - The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the protein in the ``protein_id`` column.
   * - ``protein_id``
     - ``String``
     - ``M7ZLH0``
     - The `UniProt <https://www.uniprot.org/>`_ accession of a protein with OMA Group ID ``ortholog_group_id``.

.. list-table:: ``sequences`` schema
   :widths: 25 25 25 50
   :header-rows: 1

   * - Field Name
     - Type
     - Example
     - Description
   * - ``name``
     - ``String``
     - ``Q9NZE8``
     - The `UniProt <https://www.uniprot.org/>`_ accession that corresponds to the amino acid sequence in the ``sequence`` column.
   * - ``sequence``
     - ``String``
     - ``MAASAFAGAVRAASGILRPLNI``...
     - The amino acid sequence indicated by the ``name`` column.

.. list-table:: Schema for all tables under ``interactions``
   :widths: 25 25 25 50
   :header-rows: 1

   * - Field Name
     - Type
     - Example
     - Description
   * - ``protein_id1``
     - ``String``
     - ``Q9BQB4``
     - The `UniProt <https://www.uniprot.org/>`_ accession of the first protein in the interaction pair.
   * - ``protein_id2``
     - ``String``
     - ``Q9NYF0``
     - The `UniProt <https://www.uniprot.org/>`_ accession of the second protein in the interaction pair.
   * - ``omid_protein_id``
     - ``String``
     - ``C1MTX6``
     - The `UniProt <https://www.uniprot.org/>`_ accession of the anchor protein for the orthologous locality loss.
   * - ``omid_id``
     - ``Int64``
     - ``737336``
     - The `OMA <https://omabrowser.org/oma/home/>`_ Group ID of the anchor protein, from which a positive protein can be chosen for the orthologous locality loss.
   * - ``label``
     - ``Bool``
     - ``False``
     - Label indicating whether ``protein_id1`` and ``protein_id2`` interact with one another.
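
To read one of these tables into a DataFrame, ``pandas.read_hdf`` works if the tables were written in PyTables format (an assumption; adjust the key to the C-type and split you need):

.. code:: python

   import pandas as pd

   # Key path follows the hierarchy shown above (assumed PyTables tables)
   train_df = pd.read_hdf("intrepppid.h5", key="interactions/c3/c3_train")
   print(train_df[["protein_id1", "protein_id2", "label"]].head())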
Everything under the

_sources/guide.rst.txt

+127
@@ -0,0 +1,127 @@
User Guide
==========

Training
--------

The easiest way to start training INTREPPPID is to use the :doc:`CLI <cli>`.

An example of running the training loop with the values used in the INTREPPPID manuscript is as follows:

.. code:: bash

   $ intrepppid train e2e_rnn_triplet DATASET.h5 spm.model 3 100 80 --seed 3927704 --vocab_size 250 --trunc_len 1500 --embedding_size 64 --rnn_num_layers 2 --rnn_dropout_rate 0.3 --variational_dropout false --bi_reduce last --workers 4 --embedding_droprate 0.3 --do_rate 0.3 --log_path logs/e2e_rnn_triplet --beta_classifier 2 --use_projection false --optimizer_type ranger21_xx --lr 1e-2

Checkpoints will be saved in the folder ``logs/e2e_rnn_triplet/model_name/chkpt`` and can be used for inference.

Inference
---------

The easiest way to infer using INTREPPPID is through the website `https://PPI.bio <https://ppi.bio>`_. However, you may wish to infer locally using INTREPPPID for various reasons, e.g. to infer using your own custom checkpoints.

Preparing Data
^^^^^^^^^^^^^^

To infer using INTREPPPID, you'll have to use the :doc:`API <api>`.

The first step is to get the amino acid sequences you want to infer. This can be as simple as defining a list of sequence pairs:

.. code:: python

   sequence_pairs = [
       ("MANQRLS", "MGPLSS"),
       ("MQQNLSS", "MPWNLS"),
   ]

You'll need to encode all the sequences, and you'll need to use the same settings that were used during training. Using the same parameters as the manuscript:

.. code:: python

   from intrepppid.data.ppi_oma import IntrepppidDataset
   import sentencepiece as sp

   trunc_len = 1500
   # SPM_FILE is the path to the SentencePiece model used during training
   spp = sp.SentencePieceProcessor(model_file=SPM_FILE)

   encoded_sequence_pairs = []

   for p1, p2 in sequence_pairs:
       x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
       x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)

       encoded_sequence_pairs.append((x1, x2))

       # Infer interactions here

Alternatively, you may be interested in loading sequences from an INTREPPPID dataset for testing. You can use the :py:class:`intrepppid.data.ppi_oma.IntrepppidDataModule`:

.. code:: python

   from intrepppid.data.ppi_oma import IntrepppidDataModule

   batch_size = 80

   # DATASET_PATH points at an INTREPPPID HDF5 dataset; SPM_FILE at the
   # SentencePiece model used during training.
   data_module = IntrepppidDataModule(
       batch_size=batch_size,
       dataset_path=DATASET_PATH,
       c_type=3,
       trunc_len=1500,
       workers=4,
       vocab_size=250,
       model_file=SPM_FILE,
       seed=8675309,
       sos=False,
       eos=False,
       negative_omid=True,
   )

   data_module.setup()

   for batch in data_module.test_dataloader():
       p1_seq, p2_seq, _, _, _, label = batch
       # Infer interactions here

Load the INTREPPPID network
^^^^^^^^^^^^^^^^^^^^^^^^^^^

We must now instantiate the INTREPPPID network and load its weights.

If you trained INTREPPPID with the manuscript defaults, you don't need to pass any values to :py:func:`intrepppid.intrepppid_network` beyond the required ``steps_per_epoch`` argument:

.. code:: python

   import torch

   from intrepppid import intrepppid_network

   # steps_per_epoch is 0 here because it is not used for inference
   net = intrepppid_network(0)

   net.eval()

   # CHECKPOINT_PATH points at a checkpoint saved during training
   chkpt = torch.load(CHECKPOINT_PATH)

   net.load_state_dict(chkpt['state_dict'])

Infer Interactions
^^^^^^^^^^^^^^^^^^

Putting everything together, you get:

.. code:: python

   for p1, p2 in sequence_pairs:
       x1 = IntrepppidDataset.static_encode(trunc_len, spp, p1)
       x2 = IntrepppidDataset.static_encode(trunc_len, spp, p2)

       y_hat_logits = net(x1, x2)
       # The forward pass returns logits, so you need to activate with sigmoid
       y_hat = torch.sigmoid(y_hat_logits)

Or, if you were using the INTREPPPID Data Module:

.. code:: python

   for batch in data_module.test_dataloader():
       x1, x2, _, _, _, label = batch

       y_hat_logits = net(x1, x2)
       # The forward pass returns logits, so you need to activate with sigmoid
       y_hat = torch.sigmoid(y_hat_logits)
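
``y_hat`` is the predicted probability that the two proteins interact. To turn it into a binary call, a simple 0.5 threshold works (a sketch; the threshold is an assumption, not a value prescribed by the manuscript):

.. code:: python

   # 1 = predicted interaction, 0 = no predicted interaction
   predictions = (y_hat > 0.5).long()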
