Molecule
Module location: mmai25_hackathon/load_data/molecule.py
This module provides functions to load SMILES strings from a CSV/DataFrame and convert them to molecular graphs using PyTorch Geometric (PyG). Use it for molecule–protein interaction or other cheminformatics tasks.
Expected data file: a CSV with at least a SMILES column, for example:
MMAI25Hackathon/molecule-protein-interaction/dataset.csv
First rows (sample):
| SMILES | Protein (truncated) | Y | drug_cluster | target_cluster |
|---|---|---|---|---|
| CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1 | MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN… | 0.0 | 1904 | 1528 |
| FC(F)OC(F)(F)C(F)Cl | MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF… | 1.0 | 9 | 353 |
| [H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O | MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY… | 0.0 | 23 | 765 |
Note: Protein sequences are truncated for display.
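Before loading, it can help to confirm that the file header really contains the SMILES column. The following is a minimal sketch using only the standard library. `has_column` is a hypothetical helper, not part of the module, and the in-memory CSV reuses two rows from the sample table with the Protein column omitted for brevity.

```python
import csv
import io

# Two rows from the sample table; the Protein column is omitted for brevity.
SAMPLE_CSV = """SMILES,Y,drug_cluster,target_cluster
CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1,0.0,1904,1528
FC(F)OC(F)(F)C(F)Cl,1.0,9,353
"""


def has_column(csv_file, column):
    """Return True if the CSV header contains `column` (hypothetical helper)."""
    reader = csv.DictReader(csv_file)
    # `fieldnames` reads the header row on first access; it is None for an empty file.
    return column in (reader.fieldnames or [])


print(has_column(io.StringIO(SAMPLE_CSV), "SMILES"))  # True
print(has_column(io.StringIO(SAMPLE_CSV), "smiles"))  # False: this check is case-sensitive
```

For a file on disk, the same helper can be called on a handle opened with `open(path, newline="")`.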
- Read SMILES strings from a DataFrame or CSV and return a one-column DataFrame named `smiles`.
- Convert individual SMILES strings to PyG `Data` graphs with node/edge features.
`load_smiles_from_dataframe`

- Args:
  - `df` (pd.DataFrame | str): DataFrame or CSV filepath.
  - `smiles_col` (str): Column name containing SMILES strings (e.g., `"SMILES"`).
  - `index_col` (str | None): Optional column to set as index.
  - `filter_rows` (dict[str, Sequence | pandas.Index] | None): Row filters as `{column: allowed_values}`; applied where columns exist. Default: `None`.
- Behavior:
  - If `df` is a path, loads the CSV via `read_tabular`, selecting `smiles_col` and `index_col` and applying `filter_rows` when provided.
  - If `df` is a DataFrame and `filter_rows` is provided, applies row filters to matching columns.
  - Validates `smiles_col`; optionally sets the index.
  - Returns a one-column DataFrame named `smiles` (index preserved if set).
- Returns: `pd.DataFrame` with a single column `smiles`.
- Errors: Raises `ValueError` if `smiles_col` is not found. Propagates file/parse errors from `read_tabular` when a path is invalid.
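The documented behavior for the DataFrame case can be approximated in plain pandas. This is an illustrative sketch only, not the module's implementation: `sketch_load_smiles`, the toy data, and the `mol_id` column are assumptions made for the example, and the CSV path handling via `read_tabular` is left out.

```python
import pandas as pd


def sketch_load_smiles(df, smiles_col, index_col=None, filter_rows=None):
    """Illustrative approximation of the documented behavior (not the module's code)."""
    # Apply each row filter only when its column exists in the frame.
    if filter_rows:
        for col, allowed in filter_rows.items():
            if col in df.columns:
                df = df[df[col].isin(list(allowed))]
    # Validate the SMILES column before selecting it.
    if smiles_col not in df.columns:
        raise ValueError(f"Column {smiles_col!r} not found in DataFrame.")
    # Optionally set the index, then keep one column renamed to "smiles".
    if index_col is not None:
        df = df.set_index(index_col)
    return df[[smiles_col]].rename(columns={smiles_col: "smiles"})


# Toy data; "mol_id" is a hypothetical ID column used to show index_col.
toy = pd.DataFrame(
    {
        "SMILES": ["CCO", "c1ccccc1", "CC(=O)O"],
        "Y": [0.0, 1.0, 0.0],
        "mol_id": ["m1", "m2", "m3"],
    }
)

result = sketch_load_smiles(toy, "SMILES", index_col="mol_id", filter_rows={"Y": [0.0]})
print(result)
```

Here the filter keeps only rows with `Y == 0.0`, and the `mol_id` values survive as the index of the returned one-column frame.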
`smiles_to_graph`

- Args:
  - `smiles` (str): SMILES string to convert.
  - `with_hydrogen` (bool): Include explicit hydrogens if `True`.
  - `kekulize` (bool): Kekulize aromatic bonds if `True`.
- Returns: PyG `Data` graph with keys like `x`, `edge_index`, `edge_attr` and attributes such as `num_nodes`, `num_edges`.
- Errors: Propagates errors from the SMILES parser/converter if the string is invalid.
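Because conversion errors propagate to the caller, one invalid string can stop a batch conversion. One possible way to handle this is to collect failures instead of aborting. The sketch below assumes a hypothetical helper `convert_all` and uses a stub converter so it runs without PyG; the exception type raised by the real parser is not specified here, so the sketch catches `Exception` broadly.

```python
def convert_all(smiles_list, converter):
    """Apply `converter` to each SMILES; collect successes and failures separately."""
    graphs, failures = [], []
    for smi in smiles_list:
        try:
            graphs.append(converter(smi))
        except Exception as exc:  # the exact exception type depends on the parser
            failures.append((smi, repr(exc)))
    return graphs, failures


def stub_converter(smi):
    """Stand-in for the real converter so the sketch runs without PyG."""
    if not smi or " " in smi:
        raise ValueError(f"invalid SMILES: {smi!r}")
    return {"smiles": smi}


graphs, failures = convert_all(
    ["CCO", "not a smiles", "FC(F)OC(F)(F)C(F)Cl"], stub_converter
)
print(len(graphs), len(failures))  # 2 1
```

With the real module, the converter could be `lambda s: smiles_to_graph(s, with_hydrogen=False, kekulize=False)`.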
Expected data file: a CSV containing a SMILES column, for example:
MMAI25Hackathon/molecule-protein-interaction/dataset.csv.
Point CSV_PATH to the CSV file that contains the SMILES column.
`CSV_PATH` can be any absolute or relative path to your local copy of the dataset. The path below is an example from this repo; replace it with your own.
Note: In these examples, `MMAI25Hackathon` is the top-level directory you downloaded from the hackathon Dropbox link (after unzipping). If your dataset lives elsewhere, adjust the path accordingly.
    # Example only: replace with your path
    CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"
    # The loader validates inputs and will raise helpful errors if the CSV or required column is missing.

    from mmai25_hackathon.load_data.molecule import load_smiles_from_dataframe, smiles_to_graph

    smiles_df = load_smiles_from_dataframe(CSV_PATH, smiles_col="SMILES")
    print(smiles_df.head())

    for smi in smiles_df["smiles"].head(3):
        g = smiles_to_graph(smi, with_hydrogen=False, kekulize=False)
        print(g)  # e.g., Data(x=[N, F], edge_index=[2, E], edge_attr=[E, A], smiles='...')

- Notation:
  - N: number of nodes
  - F: number of node features
  - E: number of edges
  - A: number of edge features
  - `edge_index` shape [2, E]: 2 rows (source/destination), E columns (edges)
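To make the shape notation concrete, here is a small sketch in plain Python lists (no PyG required) for ethanol, SMILES `CCO`, without hydrogens. It assumes each undirected bond is stored as two directed edges, which is a common PyG convention; whether this module follows it is not confirmed by this page.

```python
# Ethanol ("CCO") without hydrogens: atom 0 = C, atom 1 = C, atom 2 = O.
num_nodes = 3
bonds = [(0, 1), (1, 2)]  # undirected bonds as (atom_i, atom_j)

# Assumption: each undirected bond becomes two directed edges (i -> j and j -> i).
sources, destinations = [], []
for i, j in bonds:
    sources += [i, j]
    destinations += [j, i]

edge_index = [sources, destinations]  # 2 rows (source/destination), E columns
num_edges = len(sources)

print(edge_index)  # [[0, 1, 1, 2], [1, 0, 2, 1]]
print(num_edges)   # 4
```

Under this assumption E is twice the number of bonds, so a two-bond molecule gives `edge_index` a shape of [2, 4].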
You can run the module directly to preview the SMILES and convert a few to graphs:
    python -m mmai25_hackathon.load_data.molecule --data-path \
        MMAI25Hackathon/molecule-protein-interaction/dataset.csv

- Ensure PyTorch Geometric is installed (`torch_geometric`).
- Use `index_col` in `load_smiles_from_dataframe` to keep a stable ID alongside SMILES when needed.
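A quick way to check the first point without importing the package is to ask the import system whether it can locate `torch_geometric`. This is a minimal sketch; `is_installed` is a hypothetical helper, not part of the module.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (hypothetical helper)."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("torch_geometric"):
    print("torch_geometric not found; install PyTorch Geometric before converting SMILES.")
```

The check only confirms that the package can be found, not that it imports cleanly against your installed PyTorch version.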