Molecule

Molecule Data Loading Module (`molecule.py`)

Module location: mmai25_hackathon/load_data/molecule.py

This module provides functions to load SMILES strings from a CSV/DataFrame and convert them to molecular graphs using PyTorch Geometric (PyG). Use it for molecule–protein interaction or other cheminformatics tasks.

Dataset Layout & Files

Expected data file: a CSV with at least a SMILES column, for example:

MMAI25Hackathon/molecule-protein-interaction/dataset.csv

First rows (sample):

| SMILES | Protein (truncated) | Y | drug_cluster | target_cluster |
| --- | --- | --- | --- | --- |
| CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1 | MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN… | 0.0 | 1904 | 1528 |
| FC(F)OC(F)(F)C(F)Cl | MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF… | 1.0 | 9 | 353 |
| [H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O | MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY… | 0.0 | 23 | 765 |

Note: Protein sequences are truncated for display.
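
Before loading, it can help to confirm that the file really contains the SMILES column. The sketch below does this with the standard library on an inline sample. The comma delimiter and the shortened protein strings are assumptions made for illustration; `has_column` is a hypothetical helper, not part of `molecule.py`.

```python
import csv
import io

# Inline stand-in for dataset.csv. The comma delimiter is an assumption, and
# the protein sequences are shortened placeholders, not full dataset values.
SAMPLE_CSV = """SMILES,Protein,Y,drug_cluster,target_cluster
CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1,MARSLLLPLQ,0.0,1904,1528
FC(F)OC(F)(F)C(F)Cl,MVSAKKVPAI,1.0,9,353
"""


def has_column(csv_text: str, column: str) -> bool:
    """Return True if the CSV header contains the given column name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return column in (reader.fieldnames or [])


# Parse the sample into one dict per row, keyed by header name.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(has_column(SAMPLE_CSV, "SMILES"), len(rows))
```

Column names are matched case-sensitively here, so `"SMILES"` and `"smiles"` are different columns.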

Functional Overview

- Read SMILES strings from a DataFrame or CSV and return a one‑column DataFrame named smiles.
- Convert individual SMILES strings to PyG Data graphs with node/edge features.

Functions

`load_smiles_from_dataframe(df, smiles_col, index_col=None, filter_rows=None) -> pd.DataFrame`

Args:
- df (pd.DataFrame | str): DataFrame or CSV filepath.
- smiles_col (str): Column name containing SMILES strings (e.g., "SMILES").
- index_col (str | None): Optional column to set as index.
- filter_rows (dict[str, Sequence | pandas.Index] | None): Row filters as {column: allowed_values}; applied where columns exist. Default: None.
Behavior:
- If df is a path, loads the CSV via read_tabular selecting smiles_col and index_col, applying filter_rows when provided.
- If df is a DataFrame and filter_rows is provided, applies row filters to matching columns.
- Validates smiles_col; optionally sets the index.
- Returns a one‑column DataFrame named smiles (index preserved if set).
Returns: pd.DataFrame with a single column smiles.
Errors: Raises ValueError if smiles_col is not found. Propagates file and parse errors from read_tabular when df is a path that is invalid or cannot be parsed.
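
The behavior listed above can be sketched for the DataFrame case as follows. This is a hypothetical re-creation written from the description, not the module's actual code: `sketch_load_smiles`, the toy data, and the `drug_id` column are all made up for illustration, and the CSV path branch (which goes through read_tabular) is left out.

```python
import pandas as pd


def sketch_load_smiles(df, smiles_col, index_col=None, filter_rows=None):
    """Hypothetical re-creation of the documented behavior for DataFrame input."""
    if filter_rows:
        for col, allowed in filter_rows.items():
            if col in df.columns:  # filters on absent columns are skipped
                df = df[df[col].isin(list(allowed))]
    if smiles_col not in df.columns:
        raise ValueError(f"Column '{smiles_col}' not found")
    if index_col is not None:
        df = df.set_index(index_col)
    # Keep only the SMILES column and give it the fixed output name.
    return df[[smiles_col]].rename(columns={smiles_col: "smiles"})


# Toy data; "drug_id" is an illustrative ID column, not from the dataset.
toy = pd.DataFrame(
    {
        "drug_id": ["d1", "d2", "d3"],
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
        "Y": [0.0, 1.0, 0.0],
    }
)
out = sketch_load_smiles(toy, "SMILES", index_col="drug_id", filter_rows={"Y": [0.0]})
print(out)
```

With the real function, the equivalent call would pass the same `smiles_col`, `index_col`, and `filter_rows` arguments; only the rows whose `Y` value is in the allowed list remain, indexed by the chosen ID column.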

`smiles_to_graph(smiles, with_hydrogen=False, kekulize=False) -> torch_geometric.data.Data`

Args:
- smiles (str): SMILES string to convert.
- with_hydrogen (bool): Include explicit hydrogens if True.
- kekulize (bool): Kekulize aromatic bonds if True.
Returns: PyG Data graph with keys like x, edge_index, edge_attr and attributes such as num_nodes, num_edges.
Errors: Propagates errors from the SMILES parser/converter if the string is invalid.
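
Because parse errors propagate, a single malformed SMILES string would stop a batch conversion. One way to handle this is sketched below, under assumptions: `convert_valid` is a hypothetical helper, and `fake_converter` is a stand-in so the example runs without RDKit or PyG. In practice you would pass `smiles_to_graph` as the converter. The exact exception type raised by the real parser is not documented here, so the sketch catches `Exception` broadly.

```python
from typing import Any, Callable, Iterable, List, Tuple


def convert_valid(
    smiles_list: Iterable[str], converter: Callable[[str], Any]
) -> Tuple[List[Any], List[Tuple[str, str]]]:
    """Convert each SMILES, collecting failures instead of stopping at the first."""
    graphs, failed = [], []
    for smi in smiles_list:
        try:
            graphs.append(converter(smi))
        except Exception as exc:  # exact exception type depends on the parser
            failed.append((smi, repr(exc)))
    return graphs, failed


# Stand-in converter for demonstration only; it rejects empty strings and
# strings containing spaces, which is NOT how a real SMILES parser validates.
def fake_converter(smi: str) -> dict:
    if not smi or " " in smi:
        raise ValueError(f"cannot parse {smi!r}")
    return {"smiles": smi}


graphs, failed = convert_valid(["CCO", "not a smiles"], fake_converter)
print(len(graphs), len(failed))
```

Keeping the failed strings alongside their error messages makes it easy to inspect or report the rows that were dropped.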

Quick Example

Expected data file: a CSV containing a SMILES column, for example: MMAI25Hackathon/molecule-protein-interaction/dataset.csv.

Step 1: Set the data path

Point CSV_PATH to the CSV file that contains the SMILES column. CSV_PATH can be any absolute or relative path to your local copy of the dataset. The path below is an example from this repo — replace it with your own.

Note: In these examples, MMAI25Hackathon is the top‑level directory you downloaded from the hackathon Dropbox link (after unzipping). If your dataset lives elsewhere, adjust the path accordingly.

# Example only — replace with your path
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"

# The loader validates inputs and will raise helpful errors if the CSV or required column is missing.
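
If you prefer to check the path yourself before calling the loader, a minimal sketch using only the standard library could look like this. `dataset_exists` is a hypothetical helper, not part of `molecule.py`.

```python
from pathlib import Path


def dataset_exists(csv_path: str) -> bool:
    """Return True if csv_path points to an existing regular file."""
    return Path(csv_path).expanduser().is_file()


# Example path from this page; replace with your own.
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"
if not dataset_exists(CSV_PATH):
    # Show the absolute location that was checked to make typos easy to spot.
    print(f"Dataset not found at {Path(CSV_PATH).resolve()}; adjust CSV_PATH.")
```

Relative paths are resolved against the current working directory, so the same `CSV_PATH` can succeed or fail depending on where the script is launched.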

Step 2: Load SMILES

from mmai25_hackathon.load_data.molecule import load_smiles_from_dataframe, smiles_to_graph

smiles_df = load_smiles_from_dataframe(CSV_PATH, smiles_col="SMILES")
print(smiles_df.head())

Step 3: Convert to molecular graphs

for smi in smiles_df["smiles"].head(3):
    g = smiles_to_graph(smi, with_hydrogen=False, kekulize=False)
    print(g)  # e.g., Data(x=[N, F], edge_index=[2, E], edge_attr=[E, A], smiles='...')

Notation:
- N: number of nodes
- F: number of node features
- E: number of edges
- A: number of edge features
- edge_index shape [2, E]: 2 rows (source/destination), E columns (edges)
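
To make the shapes concrete, the sketch below builds an edge_index by hand for ethanol (`CCO`) using plain Python lists. It assumes the common PyG convention of storing each undirected bond as two directed edges; whether `smiles_to_graph` does exactly this is an assumption, so compare against the printed Data object for your own molecules.

```python
# Ethanol, SMILES "CCO", without hydrogens: atoms 0=C, 1=C, 2=O.
bonds = [(0, 1), (1, 2)]  # undirected bonds as atom-index pairs

# Store each undirected bond as two directed edges (assumed convention).
sources, targets = [], []
for a, b in bonds:
    sources += [a, b]
    targets += [b, a]
edge_index = [sources, targets]  # 2 rows (source/destination), E columns

num_nodes = 3
num_edges = len(edge_index[0])
print(edge_index)
print(num_nodes, num_edges)
```

Under this convention E is twice the number of bonds, so the two bonds of ethanol give an edge_index of shape [2, 4].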

CLI Usage

You can run the module directly to preview the SMILES and convert a few to graphs:

python -m mmai25_hackathon.load_data.molecule --data-path \
  MMAI25Hackathon/molecule-protein-interaction/dataset.csv
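
For readers writing a similar entry point of their own, a `--data-path` flag can be parsed with `argparse` as sketched below. This is an illustrative parser only: `build_parser`, the description text, and `required=True` are assumptions, and the module's actual argument handling may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser exposing a single --data-path option."""
    parser = argparse.ArgumentParser(description="Preview SMILES from a CSV (illustrative).")
    parser.add_argument("--data-path", required=True, help="Path to the dataset CSV.")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is reproducible.
args = build_parser().parse_args(
    ["--data-path", "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"]
)
print(args.data_path)
```

`argparse` converts the dash in `--data-path` to an underscore, so the value is read as `args.data_path`.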

Notes

- Ensure PyTorch Geometric is installed (torch_geometric).
- Use index_col in load_smiles_from_dataframe to keep a stable ID alongside SMILES when needed.
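
A quick way to check the installation without importing the packages is sketched below. `is_installed` is a hypothetical helper; `torch` is checked as well because PyTorch Geometric depends on it.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


for name in ("torch", "torch_geometric"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

This only reports whether the module can be located; it does not verify that the installed versions are compatible with each other.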
