Commit 7343b20

Add 0.0.3 revisions in bulk
1 parent 50e0746 commit 7343b20

35 files changed (+33108, -449 lines)

.env.example

Lines changed: 1 addition & 1 deletion
@@ -3,4 +3,4 @@
 # .env is loaded by train.py automatically
 # hydra allows you to reference variables in .yaml configs with special syntax: ${oc.env:MY_VAR}
 
-MY_VAR="/home/user/my/system/path"
+PLINDER_MOUNT="$(pwd)/data/PLINDER"
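
The comment in `.env.example` says that YAML configs can reference environment variables such as the new `PLINDER_MOUNT` through the `${oc.env:...}` syntax. The sketch below illustrates only the idea of that substitution, using the standard library. It is a simplified stand-in, not OmegaConf's or Hydra's actual resolver, and `resolve_env_refs` is a hypothetical helper name introduced for this example.

```python
import os
import re

# Matches references of the form ${oc.env:NAME}.
_ENV_REF = re.compile(r"\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)\}")


def resolve_env_refs(value, env=None):
    """Replace each ${oc.env:NAME} in `value` with the value of NAME.

    Hypothetical illustration only: the real resolution is performed by
    the config library, not by this function.
    """
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name!r} is not set")
        return env[name]

    return _ENV_REF.sub(_substitute, value)


# Example with an explicit mapping instead of the real environment.
example_env = {"PLINDER_MOUNT": "/tmp/data/PLINDER"}
resolved = resolve_env_refs("${oc.env:PLINDER_MOUNT}/splits", example_env)
```

Failing loudly on a missing variable is a design choice of this sketch; it surfaces a forgotten `.env` entry before any data path is used.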

.pre-commit-config.yaml

Lines changed: 8 additions & 8 deletions
@@ -18,18 +18,18 @@ repos:
       - id: check-toml
       - id: check-case-conflict
       - id: check-added-large-files
-        args: ["--maxkb=10000"]
+        args: ["--maxkb=20000"]
 
   # python code formatting
   - repo: https://github.com/psf/black
-    rev: 24.10.0
+    rev: 25.1.0
     hooks:
       - id: black
         args: [--line-length, "99"]
 
   # python import sorting
   - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 6.0.1
     hooks:
       - id: isort
         args: ["--profile", "black", "--filter-files"]
@@ -43,7 +43,7 @@ repos:
 
   # python docstring formatting
   - repo: https://github.com/myint/docformatter
-    rev: v1.7.5
+    rev: eb1df347edd128b30cd3368dddc3aa65edcfac38 # Don't autoupdate until https://github.com/PyCQA/docformatter/issues/293 is fixed
     hooks:
       - id: docformatter
         args:
@@ -73,7 +73,7 @@ repos:
 
   # python check (PEP8), programming errors and code complexity
   - repo: https://github.com/PyCQA/flake8
-    rev: 7.1.1
+    rev: 7.1.2
     hooks:
       - id: flake8
         args:
@@ -87,7 +87,7 @@ repos:
 
   # python security linter
   - repo: https://github.com/PyCQA/bandit
-    rev: "1.8.2"
+    rev: "1.8.3"
     hooks:
       - id: bandit
         args: ["-s", "B101"]
@@ -108,7 +108,7 @@ repos:
 
   # md formatting
   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.21
+    rev: 0.7.22
     hooks:
       - id: mdformat
         args: ["--number"]
@@ -121,7 +121,7 @@ repos:
 
   # word spelling linter
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
+    rev: v2.4.1
     hooks:
       - id: codespell
         args:

Dockerfile

Lines changed: 5 additions & 2 deletions
@@ -20,18 +20,21 @@ RUN mkdir -p /software/flowdock
 WORKDIR /software/flowdock
 
 ## Clone project
-RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock
+RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock
 
 ## Create conda environment
 # RUN conda env create -f environments/flowdock_environment.yaml
 COPY environments/flowdock_environment_docker.yaml /software/flowdock/environments/flowdock_environment_docker.yaml
 RUN conda env create -f environments/flowdock_environment_docker.yaml
 
+# Install ProDy without NumPy dependency
+RUN python -m pip install --no-cache-dir --no-dependencies prody==2.4.1
+
 ## Automatically activate conda environment
 RUN echo "source activate flowdock" >> /etc/profile.d/conda.sh && \
     echo "source /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
     echo "conda activate flowdock" >> ~/.bashrc
 
 ## Default shell and command
 SHELL ["/bin/bash", "-l", "-c"]
-CMD ["/bin/bash"]
+CMD ["/bin/bash"]

README.md

Lines changed: 17 additions & 6 deletions
@@ -76,6 +76,7 @@ cd FlowDock
 mamba env create -f environments/flowdock_environment.yaml
 conda activate FlowDock # NOTE: one still needs to use `conda` to (de)activate environments
 pip3 install -e . # install local project as package
+pip3 install prody==2.4.1 --no-dependencies # install ProDy without NumPy dependency
 ```
 
 Download checkpoints
@@ -159,6 +160,16 @@ mv pdb_2021aug02/ pdbsidechain/
 cd ../
 ```
 
+Lastly, to finetune `FlowDock` using the `PLINDER` dataset, one must first prepare this data for training
+
+```bash
+# fetch PLINDER data (NOTE: requires ~1 hour to download and ~750G of storage)
+export PLINDER_MOUNT="$(pwd)/data/PLINDER"
+mkdir -p "$PLINDER_MOUNT" # create the directory if it doesn't exist
+
+plinder_download -y
+```
+
 ### Generating ESM2 embeddings for each protein (optional, cached input data available on SharePoint)
 
 To generate the ESM2 embeddings for the protein inputs,
@@ -260,10 +271,10 @@ python flowdock/train.py experiment=flowdock_fm
 python flowdock/train.py experiment=flowdock_fm trainer.max_epochs=20 data.batch_size=8
 ```
 
-For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset
+For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset such as [PLINDER](https://www.plinder.sh/)
 
 ```bash
-python flowdock/train.py experiment=flowdock_fm data=my_new_datamodule ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
+python flowdock/train.py experiment=flowdock_fm data=plinder ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
 ```
 
 </details>
@@ -277,7 +288,7 @@ To reproduce `FlowDock`'s evaluation results for structure prediction, please re
 To reproduce `FlowDock`'s evaluation results for binding affinity prediction using the PDBBind dataset
 
 ```bash
-python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt trainer=gpu
+python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt trainer=gpu
 ... # re-run two more times to gather triplicate results
 ```
 
@@ -353,13 +364,13 @@ jupyter notebook notebooks/casp16_binding_affinity_prediction_results_plotting.i
 For example, generate new protein-ligand complexes for a pair of protein sequence and ligand SMILES strings such as those of the PDBBind 2020 test target `6i67`
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 Or, for example, generate new protein-ligand complexes for pairs of protein sequences and (multi-)ligand SMILES strings (delimited via `|`) such as those of the CASP15 target `T1152`
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 If you do not already have a template protein structure available for your target of interest, set `input_template=null` to instead have the sampling script predict the ESMFold structure of your provided `input_protein` sequence before running the sampling pipeline. For more information regarding the input arguments available for sampling, please refer to the config at `configs/sample.yaml`.
@@ -369,7 +380,7 @@ If you do not already have a template protein structure available for your targe
 For instance, one can perform batched prediction as follows:
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 </details>
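
The PLINDER preparation step added to `README.md` exports `PLINDER_MOUNT` and creates the directory before running `plinder_download -y`. For scripts that need the same setup from Python, a rough equivalent might look like the sketch below. `prepare_plinder_mount` is a hypothetical helper, not part of FlowDock or PLINDER. Unlike the README's unconditional `export`, it keeps an existing `PLINDER_MOUNT` value if one is already set.

```python
import os
from pathlib import Path


def prepare_plinder_mount(base_dir=None):
    """Ensure PLINDER_MOUNT is set and that its directory exists.

    Hypothetical helper mirroring the README's `export` + `mkdir -p` steps.
    Assumption: an already-set PLINDER_MOUNT should be respected.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    default_mount = base / "data" / "PLINDER"
    # setdefault only writes the variable when it is not already defined.
    mount = Path(os.environ.setdefault("PLINDER_MOUNT", str(default_mount)))
    # parents=True / exist_ok=True gives the same behavior as `mkdir -p`.
    mount.mkdir(parents=True, exist_ok=True)
    return mount
```

Calling `prepare_plinder_mount()` from the repository root before launching the download would leave the environment in the state the README commands produce, provided `PLINDER_MOUNT` was not already defined.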

configs/data/plinder.yaml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+_target_: flowdock.data.plinder_datamodule.PlinderDataModule
+data_dir: ${paths.data_dir}/PLINDER/
+batch_size: 16 # Needs to be divisible by the number of devices (e.g., if in a distributed setup)
+num_workers: 4
+pin_memory: True
+# overfitting arguments
+overfitting_example_name: null # NOTE: currently not used
+# model arguments
+n_protein_patches: 96
+n_lig_patches: 32
+epoch_frac: 1.0
+edge_crop_size: 400000
+esm_version: ${model.cfg.protein_encoder.esm_version}
+esm_repr_layer: ${model.cfg.protein_encoder.esm_repr_layer}
+# general dataset arguments
+plinder_offline: False
+min_protein_length: 50
+max_protein_length: 750
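
Two constraints in `configs/data/plinder.yaml` are easy to get wrong when overriding values from the command line: `batch_size` must be divisible by the number of devices, and proteins are bounded by `min_protein_length` and `max_protein_length`. The sketch below shows one way such checks could be expressed. `PlinderDataConfig`, `per_device_batch_size`, and `keep_protein` are hypothetical names, and both the even per-device split and the inclusive length bounds are assumptions, not confirmed behavior of `PlinderDataModule`.

```python
from dataclasses import dataclass


@dataclass
class PlinderDataConfig:
    """Hypothetical subset of the plinder.yaml fields, with the same defaults."""

    batch_size: int = 16
    min_protein_length: int = 50
    max_protein_length: int = 750


def per_device_batch_size(cfg, num_devices):
    """Split the global batch evenly across devices, or fail early."""
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if cfg.batch_size % num_devices != 0:
        raise ValueError(
            f"batch_size ({cfg.batch_size}) is not divisible by "
            f"the number of devices ({num_devices})"
        )
    return cfg.batch_size // num_devices


def keep_protein(cfg, sequence):
    """Assumed filter: keep sequences whose length is within the inclusive bounds."""
    return cfg.min_protein_length <= len(sequence) <= cfg.max_protein_length
```

With the defaults above, four devices would each receive a batch of 4, while three devices would raise an error before any training starts.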
