To run the code, make sure you have Nextflow installed. We tested with Nextflow version 24.04.4.
To run the code on a small set and check that everything works:
make run_sample
This creates the nf_output_sample folder containing the results for mols_small.csv.
To run on the main data:
make run
The main way to run the workflow is with the following command:
nextflow run ./main.nf
You can customize the run by adding options, for example to also calculate variations such as the MCES or motif-based distances. Here are some common options:
- -resume: Resume a job.
- -c <config file>: Pass a config file.
- -w <work directory>: Define the Nextflow work directory.
- --batch_x <int value>: The number of rows when dividing the work into batches.
- --batch_y <int value>: The number of columns when dividing the work into batches.
- --calculate_mces <1 or 0>: If 1, the myopic-MCES distance is also calculated.
- --calculate_motif_based <1 or 0>: If 1, the motif-based edit distance is also calculated.
- --mols_csv <path to csv file>: The path to the CSV file containing the molecule data. The CSV must contain the columns 'Smiles' and 'INCHI'.
- --motifs_csv <path to csv file>: The path to the CSV file containing the SMARTS for the motifs. The CSV must contain the column 'smarts'.
- --output_dir <directory>: The directory for the output results.
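Since the molecule and motif CSV files must contain specific columns, it can help to check their headers before starting a long run. The following is a minimal sketch, not part of the repository: the function name missing_columns and the two constants are hypothetical, and it assumes the column names are matched exactly (case-sensitive) against the first row of the file.

```python
import csv

# Required column names as stated in the options list above.
REQUIRED_MOLS_COLUMNS = {"Smiles", "INCHI"}
REQUIRED_MOTIFS_COLUMNS = {"smarts"}


def missing_columns(csv_path, required):
    """Return the set of required column names absent from the CSV header.

    Hypothetical helper: assumes the first row is the header and that
    names must match exactly (case-sensitive).
    """
    # utf-8-sig drops a leading byte-order mark if the file has one.
    with open(csv_path, newline="", encoding="utf-8-sig") as handle:
        header = next(csv.reader(handle), [])
    return set(required) - set(header)
```

For instance, missing_columns('data/mols.csv', REQUIRED_MOLS_COLUMNS) returns an empty set when the file has both required columns, and the set of missing names otherwise.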
Example command combining several options:
nextflow run ./main.nf -resume -c nextflow.config --batch_x 10 --batch_y 10 \
    --output_dir nf_output_test --calculate_mces 1 --calculate_motif_based 1 --mols_csv 'data/mols.csv' --motifs_csv 'data/motifs.csv'
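To illustrate what the --batch_x and --batch_y options control, the sketch below shows one plausible way to divide the pairwise work into a grid of cells. It is an assumption for illustration, not the partitioning actually implemented in main.nf: the function names, the near-equal contiguous slices, and the row-major numbering of grid cells are all hypothetical.

```python
def split_range(n_items, n_parts):
    """Split range(n_items) into n_parts contiguous (start, stop) slices.

    Slice sizes differ by at most one; earlier slices take the extra items.
    """
    base, extra = divmod(n_items, n_parts)
    bounds = []
    start = 0
    for part in range(n_parts):
        stop = start + base + (1 if part < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def grid_cell(n_mols, batch_x, batch_y, index):
    """Return ((row_start, row_stop), (col_start, col_stop)) for one cell.

    Assumption: cells are numbered row by row (row-major), so index 0 is
    the top-left cell and batch_x * batch_y - 1 is the bottom-right one.
    """
    if not 0 <= index < batch_x * batch_y:
        raise IndexError(f"grid index {index} is outside the grid")
    row, col = divmod(index, batch_y)
    return split_range(n_mols, batch_x)[row], split_range(n_mols, batch_y)[col]
```

Under these assumptions, 100 molecules with --batch_x 10 and --batch_y 10 give a 10 by 10 grid, and grid_cell(100, 10, 10, 11) covers molecules 10 to 19 on both axes. Each cell can then be processed independently, which is what lets the work run in parallel.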
- main_edit_distance.py: The main script to calculate the edit distance. It receives all the data, the cached mols, how the data is divided into a grid, and the grid index, and outputs a CSV file with the pairwise edit distances for that grid cell.
- motif_base_edit_distance: Contains the functions for the motif-based edit distance.
- mol_utils.py: Utility functions used throughout the code to work with RDKit molecules.
- data/: Directory containing sample input data files.
- combine_csvs.py & combine_csvs2.py: Simple scripts to combine the CSV files generated by each grid cell process.
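The combining step can be pictured with the sketch below. It is not the code in combine_csvs.py or combine_csvs2.py; it only assumes that every per-cell CSV file has the same header row, and the function name and arguments are hypothetical.

```python
import csv


def combine_csvs(input_paths, output_path):
    """Concatenate CSV files that share a header; return the data row count.

    Hypothetical sketch: the header is written once, taken from the first
    non-empty input file, and a mismatching header raises ValueError.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle)
        for path in input_paths:
            with open(path, newline="") as in_handle:
                reader = csv.reader(in_handle)
                file_header = next(reader, None)
                if file_header is None:
                    continue  # skip completely empty files
                if header is None:
                    header = file_header
                    writer.writerow(header)
                elif file_header != header:
                    raise ValueError(f"{path} has a different header")
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written
```

Streaming the rows one file at a time keeps memory use low even when there are many grid cells, at the cost of not sorting or deduplicating the combined output.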