mshahneh/Edit_Distance_Workflow

Run

To run the code, make sure you have Nextflow installed. We tested the workflow with Nextflow version 24.04.4.

Sanity Test Run

To run the code on a small set and check that everything works:

make run_sample

This creates the nf_output_sample folder, which contains the results for mols_small.csv.

Main run

To run on the main data:

make run

How to Run the Code with Custom Options

The workflow itself is started with the following command:

nextflow run ./main.nf

To calculate additional distance variants, such as the MCES or motif-based distances, pass the corresponding options described below.

Options

You can customize the run by adding options. Here are some common options:

  • -resume: Resume a previous job.
  • -c <config file>: Pass a config file.
  • -w <work directory>: Set the Nextflow work directory.
  • --batch_x <int value>: The number of rows when dividing the work into batches.
  • --batch_y <int value>: The number of columns when dividing the work into batches.
  • --calculate_mces <1 or 0>: If 1, the myopic-MCES distance is also calculated.
  • --calculate_motif_based <1 or 0>: If 1, the motif-based edit distance is also calculated.
  • --mols_csv <path to csv file>: The path to the CSV file containing the molecule data. The CSV columns must include the keys 'Smiles' and 'INCHI'.
  • --motifs_csv <path to csv file>: The path to the CSV file containing the SMARTS for the motifs. The CSV columns must include the key 'smarts'.
  • --output_dir <directory>: The directory for the output results.
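
Because the input CSV files must carry specific column keys, it can help to check the headers before starting a long run. The following is a minimal sketch and not part of the repository: the helper name missing_columns and the sample data are hypothetical, and it assumes the keys are matched exactly and case-sensitively.

```python
import csv
import io


def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty input yields an empty header
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Hypothetical sample contents, for illustration only.
mols_sample = 'Smiles,INCHI\nCCO,"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"\n'
motifs_sample = "name,smarts\nhydroxyl,[OX2H]\n"

print(missing_columns(mols_sample, ["Smiles", "INCHI"]))
print(missing_columns(motifs_sample, ["smarts"]))
```

For a real file, the text can be read with open(path, newline="").read() and passed to the same helper. An empty list means every required key is present.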

Example:

nextflow run ./main.nf -resume -c nextflow.config --batch_x 10 --batch_y 10 \
    --output_dir nf_output_test --calculate_mces 1 --calculate_motif_based 1 --mols_csv 'data/mols.csv' --motifs_csv 'data/motifs.csv'
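
With --batch_x 10 and --batch_y 10, the pairwise work is divided into a 10 by 10 grid of cells. The sketch below illustrates one way such a division could be computed. It is an assumption for illustration, not the repository's actual indexing: the function names are hypothetical, cells are assumed to be numbered row by row, and blocks are assumed to be as equal in size as possible.

```python
def block_bounds(n_items, n_blocks, block_index):
    """Half-open [start, stop) range of one block when n_items are split
    into n_blocks near-equal contiguous blocks."""
    base, extra = divmod(n_items, n_blocks)
    # The first `extra` blocks each take one additional item.
    start = block_index * base + min(block_index, extra)
    stop = start + base + (1 if block_index < extra else 0)
    return start, stop


def grid_cell(n_mols, batch_x, batch_y, cell_index):
    """Row range and column range of a grid cell (assumed row-major numbering)."""
    row_block, col_block = divmod(cell_index, batch_y)
    rows = block_bounds(n_mols, batch_x, row_block)
    cols = block_bounds(n_mols, batch_y, col_block)
    return rows, cols


# 25 molecules split into a 10 x 10 grid; cell 12 is row block 1, column block 2.
print(grid_cell(25, 10, 10, 12))
```

Each cell then covers only the molecule pairs inside its row and column ranges, so the cells can be processed independently and their outputs combined afterwards.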

File Descriptions

  • main_edit_distance.py: The main script for calculating the edit distance. It receives all the data, the cached mols, the way the data is divided into a grid, and a grid index, and it outputs a CSV file with the pairwise edit distances for that grid cell.
  • motif_base_edit_distance: Contains the functions for the motif-based edit distance.
  • mol_utils.py: Utility functions used throughout the code for working with RDKit molecules.
  • data/: Directory containing sample input data files.
  • combine_csvs.py & combine_csvs2.py: Simple scripts that combine the CSV files generated by each grid cell process.
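
The combining step can be pictured as concatenating the per-cell CSV outputs while keeping a single header row. The following is a minimal sketch of that idea, not the code in combine_csvs.py. The function name combine_csv_texts and the column names in the example are hypothetical, and it assumes every per-cell file shares the same header.

```python
import csv
import io


def combine_csv_texts(csv_texts):
    """Concatenate CSV contents that share a header, writing the header once."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    header = None
    for text in csv_texts:
        rows = list(csv.reader(io.StringIO(text)))
        if not rows:
            continue  # skip empty outputs
        if header is None:
            header = rows[0]
            writer.writerow(header)
        elif rows[0] != header:
            raise ValueError("CSV headers do not match")
        writer.writerows(rows[1:])
    return out.getvalue()


# Hypothetical outputs from two grid cells.
cell_a = "mol1,mol2,distance\n0,1,3\n"
cell_b = "mol1,mol2,distance\n0,2,5\n"
print(combine_csv_texts([cell_a, cell_b]))
```

Raising an error on a header mismatch is a design choice made here so that mixing outputs from runs with different options is caught early rather than silently merged.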
