BioinfoMachineLearning
diff --git a/.env.example b/.env.example
Lines changed: 1 addition & 1 deletion
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
Lines changed: 8 additions & 8 deletions
diff --git a/Dockerfile b/Dockerfile
Lines changed: 5 additions & 2 deletions
diff --git a/README.md b/README.md
Lines changed: 17 additions & 6 deletions
diff --git a/configs/data/plinder.yaml b/configs/data/plinder.yaml
Lines changed: 18 additions & 0 deletions
@@ -3,4 +3,4 @@
 # .env is loaded by train.py automatically
 # hydra allows you to reference variables in .yaml configs with special syntax: ${oc.env:MY_VAR}
 
-MY_VAR="/home/user/my/system/path"
+PLINDER_MOUNT="$(pwd)/data/PLINDER"
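
Note on the `.env.example` change above: `$(pwd)` is shell command substitution, and dotenv-style loaders may expand only `${VAR}` references, in which case the value would reach Python as the literal string `$(pwd)/data/PLINDER`. The following is a minimal sketch of a fallback under that assumption. The helper name `resolve_plinder_mount` is hypothetical and is not part of FlowDock or PLINDER.

```python
import os
from pathlib import Path


def resolve_plinder_mount(env=None, cwd=None):
    """Return the PLINDER directory, expanding a literal `$(pwd)` if present.

    Hypothetical helper for illustration; not part of FlowDock or PLINDER.
    """
    env = os.environ if env is None else env
    cwd = Path.cwd() if cwd is None else Path(cwd)
    raw = env.get("PLINDER_MOUNT")
    if not raw:
        # Variable unset or empty: fall back to the default from .env.example.
        return cwd / "data" / "PLINDER"
    # Replace an unexpanded shell substitution with the actual working directory.
    return Path(raw.replace("$(pwd)", str(cwd)))
```

If the variable is exported from a shell instead (as in the README's `export PLINDER_MOUNT="$(pwd)/data/PLINDER"`), the substitution has already happened and the helper returns the path unchanged.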
@@ -18,18 +18,18 @@ repos:
       - id: check-toml
       - id: check-case-conflict
       - id: check-added-large-files
-        args: ["--maxkb=10000"]
+        args: ["--maxkb=20000"]
 
   # python code formatting
   - repo: https://github.com/psf/black
-    rev: 24.10.0
+    rev: 25.1.0
     hooks:
       - id: black
         args: [--line-length, "99"]
 
   # python import sorting
   - repo: https://github.com/PyCQA/isort
-    rev: 5.13.2
+    rev: 6.0.1
     hooks:
       - id: isort
         args: ["--profile", "black", "--filter-files"]
@@ -43,7 +43,7 @@ repos:
 
   # python docstring formatting
   - repo: https://github.com/myint/docformatter
-    rev: v1.7.5
+    rev: eb1df347edd128b30cd3368dddc3aa65edcfac38 # Don't autoupdate until https://github.com/PyCQA/docformatter/issues/293 is fixed
     hooks:
       - id: docformatter
         args:
@@ -73,7 +73,7 @@ repos:
 
   # python check (PEP8), programming errors and code complexity
   - repo: https://github.com/PyCQA/flake8
-    rev: 7.1.1
+    rev: 7.1.2
     hooks:
       - id: flake8
         args:
@@ -87,7 +87,7 @@ repos:
 
   # python security linter
   - repo: https://github.com/PyCQA/bandit
-    rev: "1.8.2"
+    rev: "1.8.3"
     hooks:
       - id: bandit
         args: ["-s", "B101"]
@@ -108,7 +108,7 @@ repos:
 
   # md formatting
   - repo: https://github.com/executablebooks/mdformat
-    rev: 0.7.21
+    rev: 0.7.22
     hooks:
       - id: mdformat
         args: ["--number"]
@@ -121,7 +121,7 @@ repos:
 
   # word spelling linter
   - repo: https://github.com/codespell-project/codespell
-    rev: v2.3.0
+    rev: v2.4.1
     hooks:
       - id: codespell
         args:
 
@@ -20,18 +20,21 @@ RUN mkdir -p /software/flowdock
 WORKDIR /software/flowdock
 
 ## Clone project
-RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock 
+RUN git clone https://github.com/BioinfoMachineLearning/FlowDock /software/flowdock
 
 ## Create conda environment
 # RUN conda env create -f environments/flowdock_environment.yaml
 COPY environments/flowdock_environment_docker.yaml /software/flowdock/environments/flowdock_environment_docker.yaml
 RUN conda env create -f environments/flowdock_environment_docker.yaml
 
+# Install ProDy without NumPy dependency
+RUN python -m pip install --no-cache-dir --no-dependencies prody==2.4.1
+
 ## Automatically activate conda environment
 RUN echo "source activate flowdock" >> /etc/profile.d/conda.sh && \
     echo "source /opt/conda/etc/profile.d/conda.sh" >> ~/.bashrc && \
     echo "conda activate flowdock" >> ~/.bashrc
 
 ## Default shell and command
 SHELL ["/bin/bash", "-l", "-c"]
-CMD ["/bin/bash"]
+CMD ["/bin/bash"]
@@ -76,6 +76,7 @@ cd FlowDock
 mamba env create -f environments/flowdock_environment.yaml
 conda activate FlowDock  # NOTE: one still needs to use `conda` to (de)activate environments
 pip3 install -e . # install local project as package
+pip3 install prody==2.4.1 --no-dependencies  # install ProDy without NumPy dependency
 ```
 
 Download checkpoints
@@ -159,6 +160,16 @@ mv pdb_2021aug02/ pdbsidechain/
 cd ../
 ```
 
+Lastly, to finetune `FlowDock` using the `PLINDER` dataset, one must first prepare this data for training
+
+```bash
+# fetch PLINDER data (NOTE: requires ~1 hour to download and ~750G of storage)
+export PLINDER_MOUNT="$(pwd)/data/PLINDER"
+mkdir -p "$PLINDER_MOUNT" # create the directory if it doesn't exist
+
+plinder_download -y
+```
+
 ### Generating ESM2 embeddings for each protein (optional, cached input data available on SharePoint)
 
 To generate the ESM2 embeddings for the protein inputs,
@@ -260,10 +271,10 @@ python flowdock/train.py experiment=flowdock_fm
 python flowdock/train.py experiment=flowdock_fm trainer.max_epochs=20 data.batch_size=8
 ```
 
-For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset
+For example, override parameters to finetune `FlowDock`'s pretrained weights using a new dataset such as [PLINDER](https://www.plinder.sh/)
 
 ```bash
-python flowdock/train.py experiment=flowdock_fm data=my_new_datamodule ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
+python flowdock/train.py experiment=flowdock_fm data=plinder ckpt_path=checkpoints/esmfold_prior_paper_weights.ckpt
 ```
 
 </details>
@@ -277,7 +288,7 @@ To reproduce `FlowDock`'s evaluation results for structure prediction, please re
 To reproduce `FlowDock`'s evaluation results for binding affinity prediction using the PDBBind dataset
 
 ```bash
-python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt trainer=gpu
+python flowdock/eval.py data.test_datasets=[pdbbind] ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt trainer=gpu
 ... # re-run two more times to gather triplicate results
 ```
 
@@ -353,13 +364,13 @@ jupyter notebook notebooks/casp16_binding_affinity_prediction_results_plotting.i
 For example, generate new protein-ligand complexes for a pair of protein sequence and ligand SMILES strings such as those of the PDBBind 2020 test target `6i67`
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='YNKIVHLLVAEPEKIYAMPDPTVPDSDIKALTTLCDLADRELVVIIGWAKHIPGFSTLSLADQMSLLQSAWMEILILGVVYRSLFEDELVYADDYIMDEDQSKLAGLLDLNNAILQLVKKYKSMKLEKEEFVTLKAIALANSDSMHIEDVEAVQKLQDVLHEALQDYEAGQHMEDPRRAGKMLMTLPLLRQTSTKAVQHFYNKLEGKVPMHKLFLEMLEAKV' input_ligand='"c1cc2c(cc1O)CCCC2"' input_template=data/pdbbind/pdbbind_holo_aligned_esmfold_structures/6i67_holo_aligned_esmfold_protein.pdb sample_id='6i67' out_path='./6i67_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 Or, for example, generate new protein-ligand complexes for pairs of protein sequences and (multi-)ligand SMILES strings (delimited via `|`) such as those of the CASP15 target `T1152`
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling input_receptor='MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIP|MYTVKPGDTMWKIAVKYQIGISEIIAANPQIKNPNLIYPGQKINIPN' input_ligand='"CC(=O)NC1C(O)OC(CO)C(OC2OC(CO)C(OC3OC(CO)C(O)C(O)C3NC(C)=O)C(O)C2NC(C)=O)C1O"' input_template=data/test_cases/predicted_structures/T1152.pdb sample_id='T1152' out_path='./T1152_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=true auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 If you do not already have a template protein structure available for your target of interest, set `input_template=null` to instead have the sampling script predict the ESMFold structure of your provided `input_receptor` sequence before running the sampling pipeline. For more information regarding the input arguments available for sampling, please refer to the config at `configs/sample.yaml`.
@@ -369,7 +380,7 @@ If you do not already have a template protein structure available for your targe
 For instance, one can perform batched prediction as follows:
 
 ```bash
-python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights_EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
+python flowdock/sample.py ckpt_path=checkpoints/esmfold_prior_paper_weights-EMA.ckpt model.cfg.prior_type=esmfold sampling_task=batched_structure_sampling csv_path='./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv' out_path='./T1152_batch_sampled_structures/' n_samples=5 chunk_size=5 num_steps=40 sampler=VDODE sampler_eta=1.0 start_time='1.0' use_template=true separate_pdb=true visualize_sample_trajectories=false auxiliary_estimation_only=false esmfold_chunk_size=null trainer=gpu
 ```
 
 </details>
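
Note on the sampling commands in the README hunks above: the override lists are long enough that a renamed checkpoint (here `_EMA` to `-EMA`) has to be changed in several places. A minimal sketch of assembling the batched-prediction command from a single dictionary follows. The helper `build_sample_command` is hypothetical and not part of FlowDock; the override keys and values are copied from the batched command above.

```python
import shlex


def build_sample_command(overrides, script="flowdock/sample.py"):
    """Assemble `python <script> key=value ...` as an argument list.

    Hypothetical helper for illustration; not part of FlowDock.
    """
    args = ["python", script]
    for key, value in overrides.items():
        # Hydra overrides are plain `key=value` tokens on the command line.
        args.append(f"{key}={value}")
    return args


overrides = {
    "ckpt_path": "checkpoints/esmfold_prior_paper_weights-EMA.ckpt",
    "model.cfg.prior_type": "esmfold",
    "sampling_task": "batched_structure_sampling",
    "csv_path": "./data/test_cases/prediction_inputs/flowdock_batched_inputs.csv",
    "out_path": "./T1152_batch_sampled_structures/",
    "n_samples": 5,
    "chunk_size": 5,
    "num_steps": 40,
    "sampler": "VDODE",
    "sampler_eta": 1.0,
    "start_time": "1.0",
    "use_template": "true",
    "separate_pdb": "true",
    "visualize_sample_trajectories": "false",
    "auxiliary_estimation_only": "false",
    "esmfold_chunk_size": "null",
    "trainer": "gpu",
}
command = build_sample_command(overrides)
# Print a shell-safe rendering of the command for inspection.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)`, which avoids a second layer of shell quoting.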
 
@@ -0,0 +1,18 @@
+_target_: flowdock.data.plinder_datamodule.PlinderDataModule
+data_dir: ${paths.data_dir}/PLINDER/
+batch_size: 16 # Needs to be divisible by the number of devices (e.g., if in a distributed setup)
+num_workers: 4
+pin_memory: True
+# overfitting arguments
+overfitting_example_name: null # NOTE: currently not used
+# model arguments
+n_protein_patches: 96
+n_lig_patches: 32
+epoch_frac: 1.0
+edge_crop_size: 400000
+esm_version: ${model.cfg.protein_encoder.esm_version}
+esm_repr_layer: ${model.cfg.protein_encoder.esm_repr_layer}
+# general dataset arguments
+plinder_offline: False
+min_protein_length: 50
+max_protein_length: 750
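
Note on `batch_size` in the new `configs/data/plinder.yaml`: the comment requires the value to be divisible by the number of devices. A minimal sketch of that check follows, assuming the datamodule splits the global batch evenly across devices (an assumption; the `PlinderDataModule` code is not shown in this diff). The function name `per_device_batch_size` is hypothetical.

```python
def per_device_batch_size(batch_size, num_devices):
    """Split a global batch size evenly across devices, or fail loudly.

    Hypothetical helper for illustration; not FlowDock's implementation.
    """
    if num_devices < 1:
        raise ValueError("num_devices must be at least 1")
    if batch_size % num_devices != 0:
        raise ValueError(
            f"batch_size ({batch_size}) is not divisible by "
            f"the number of devices ({num_devices})"
        )
    return batch_size // num_devices
```

With the default `batch_size: 16`, setups with 1, 2, 4, 8, or 16 devices pass this check, while 3 devices would not.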