
Molecule

L. M. Riza Rizky edited this page Sep 13, 2025 · 9 revisions

Molecule Data Loading Module (molecule.py)

Module location: mmai25_hackathon/load_data/molecule.py

This module provides functions to load SMILES strings from a CSV/DataFrame and convert them to molecular graphs using PyTorch Geometric (PyG). Use it for molecule–protein interaction or other cheminformatics tasks.

Dataset Layout & Files

Expected data file: a CSV with at least a SMILES column, for example:

MMAI25Hackathon/molecule-protein-interaction/dataset.csv

First rows (sample):

SMILES | Protein (truncated) | Y | drug_cluster | target_cluster
CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1 | MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN… | 0.0 | 1904 | 1528
FC(F)OC(F)(F)C(F)Cl | MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF… | 1.0 | 9 | 353
[H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O | MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY… | 0.0 | 23 | 765

Note: Protein sequences are truncated for display.

Functional Overview

  • Read SMILES strings from a DataFrame or CSV and return a one‑column DataFrame named smiles.
  • Convert individual SMILES strings to PyG Data graphs with node/edge features.

Functions

load_smiles_from_dataframe(df, smiles_col, index_col=None, filter_rows=None) -> pd.DataFrame

  • Args:
    • df (pd.DataFrame | str): DataFrame or CSV filepath.
    • smiles_col (str): Column name containing SMILES strings (e.g., "SMILES").
    • index_col (str | None): Optional column to set as index.
    • filter_rows (dict[str, Sequence | pandas.Index] | None): Row filters as {column: allowed_values}. A filter is applied only when its column exists in the data. Default: None.
  • Behavior:
    • If df is a path, loads the CSV via read_tabular selecting smiles_col and index_col, applying filter_rows when provided.
    • If df is a DataFrame and filter_rows is provided, applies row filters to matching columns.
    • Validates smiles_col; optionally sets the index.
    • Returns a one‑column DataFrame named smiles (index preserved if set).
  • Returns: pd.DataFrame with a single column smiles.
  • Errors: Raises ValueError if smiles_col is not found. Propagates file and parse errors from read_tabular when df is a path that cannot be read.
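
The behavior listed above can be approximated in plain pandas. This is a minimal sketch, not the module's actual implementation. The name sketch_load_smiles is hypothetical, pd.read_csv stands in for read_tabular, and combining filters on different columns with AND is an assumption the description does not confirm.

```python
import pandas as pd


def sketch_load_smiles(df, smiles_col, index_col=None, filter_rows=None):
    """Hypothetical re-implementation of the documented behavior."""
    if isinstance(df, str):
        # The real module uses read_tabular; pd.read_csv stands in here.
        df = pd.read_csv(df)
    if smiles_col not in df.columns:
        raise ValueError(f"Column {smiles_col!r} not found")
    if filter_rows:
        for col, allowed in filter_rows.items():
            if col in df.columns:  # filters on absent columns are skipped
                df = df[df[col].isin(allowed)]
    if index_col is not None:
        df = df.set_index(index_col)
    # Keep only the SMILES column and give it the fixed name "smiles".
    return df[[smiles_col]].rename(columns={smiles_col: "smiles"})


# Small in-memory table; the mol_id column is made up for illustration.
demo = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
        "Y": [0.0, 1.0, 0.0],
        "mol_id": ["m1", "m2", "m3"],
    }
)
result = sketch_load_smiles(
    demo, "SMILES", index_col="mol_id", filter_rows={"Y": [0.0]}
)
print(result)
```

Here the filter keeps only rows with Y equal to 0.0, and the returned frame has one column, smiles, indexed by mol_id.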

smiles_to_graph(smiles, with_hydrogen=False, kekulize=False) -> torch_geometric.data.Data

  • Args:
    • smiles (str): SMILES string to convert.
    • with_hydrogen (bool): Include explicit hydrogens if True.
    • kekulize (bool): Kekulize aromatic bonds if True.
  • Returns: PyG Data graph with tensors such as x (node features), edge_index (connectivity), and edge_attr (edge features), plus derived attributes such as num_nodes and num_edges.
  • Errors: Propagates errors from the SMILES parser/converter if the string is invalid.
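
Because an invalid SMILES string raises instead of returning a sentinel, converting a whole column usually needs a guard. The sketch below shows one way to do that. convert_valid is a hypothetical helper, not part of the module, and it catches Exception broadly because this page does not say which exception type the parser raises. A stand-in converter is used so the snippet runs without the package installed.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, Tuple


def convert_valid(
    items: Iterable[Tuple[Hashable, str]],
    converter: Callable[[str], Any],
) -> Tuple[Dict[Hashable, Any], Dict[Hashable, str]]:
    """Convert (key, smiles) pairs, separating successes from failures."""
    graphs: Dict[Hashable, Any] = {}
    failures: Dict[Hashable, str] = {}
    for key, smi in items:
        try:
            graphs[key] = converter(smi)
        except Exception as exc:  # exact exception type is an assumption
            failures[key] = f"{type(exc).__name__}: {exc}"
    return graphs, failures


def fake_converter(smi: str) -> dict:
    """Stand-in for smiles_to_graph: rejects unbalanced parentheses."""
    if smi.count("(") != smi.count(")"):
        raise ValueError(f"unbalanced parentheses in {smi!r}")
    return {"smiles": smi}


graphs, failures = convert_valid(
    [("m1", "CCO"), ("m2", "CC(=O"), ("m3", "c1ccccc1")],
    fake_converter,
)
print(sorted(graphs), failures)
```

With the real module, pass smiles_to_graph as converter and smiles_df["smiles"].items() as items, so each graph stays keyed by the DataFrame index.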

Quick Example


Step 1: Set the data path

Point CSV_PATH to the CSV file that contains the SMILES column. CSV_PATH can be any absolute or relative path to your local copy of the dataset. The path below is an example from this repo — replace it with your own.

Note: In these examples, MMAI25Hackathon is the top‑level directory you downloaded from the hackathon Dropbox link (after unzipping). If your dataset lives elsewhere, adjust the path accordingly.

# Example only — replace with your path
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"

# A missing SMILES column raises ValueError; an unreadable path surfaces the underlying file/parse error.

Step 2: Load SMILES

from mmai25_hackathon.load_data.molecule import load_smiles_from_dataframe, smiles_to_graph

smiles_df = load_smiles_from_dataframe(CSV_PATH, smiles_col="SMILES")
print(smiles_df.head())

Step 3: Convert to molecular graphs

for smi in smiles_df["smiles"].head(3):
    g = smiles_to_graph(smi, with_hydrogen=False, kekulize=False)
    print(g)  # e.g., Data(x=[N, F], edge_index=[2, E], edge_attr=[E, A], smiles='...')

  • Notation:
    • N: number of nodes
    • F: number of node features
    • E: number of edges
    • A: number of edge features
    • edge_index shape [2, E]: 2 rows (source/destination), E columns (edges)
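
To make the [2, E] layout concrete, the following sketch builds an edge_index by hand for ethanol (CCO) using plain Python lists. It assumes the common PyG convention of storing each undirected bond as two directed edges. Whether smiles_to_graph follows that convention is not stated on this page, so treat the edge count as illustrative. bonds_to_edge_index is a hypothetical helper.

```python
from typing import List, Tuple


def bonds_to_edge_index(bonds: List[Tuple[int, int]]) -> List[List[int]]:
    """Return [sources, destinations] with each bond listed in both directions."""
    sources: List[int] = []
    destinations: List[int] = []
    for a, b in bonds:
        sources += [a, b]
        destinations += [b, a]
    return [sources, destinations]


# Ethanol, SMILES "CCO": atoms 0 (C), 1 (C), 2 (O); bonds 0-1 and 1-2.
edge_index = bonds_to_edge_index([(0, 1), (1, 2)])
num_edges = len(edge_index[0])  # E: one column per directed edge
print(edge_index, num_edges)
```

Row 0 holds the source atom of each directed edge and row 1 the destination, so two bonds yield E = 4 columns under this convention.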

CLI Usage

You can run the module directly to preview the SMILES and convert a few to graphs:

python -m mmai25_hackathon.load_data.molecule --data-path \
  MMAI25Hackathon/molecule-protein-interaction/dataset.csv

Notes

  • Ensure PyTorch Geometric is installed (torch_geometric).
  • Use index_col in load_smiles_from_dataframe to keep a stable ID alongside SMILES when needed.
