L. M. Riza Rizky edited this page Sep 13, 2025 · 2 revisions

Labels Data Loading Module (labels.py)

Module location: mmai25_hackathon/load_data/labels.py

This module provides functions to fetch supervision labels from a CSV/DataFrame and to one‑hot encode categorical label columns for classification tasks. It supports single or multiple label columns.

Dataset Layout & Files

Expected data file: a CSV with a label column, for example:

MMAI25Hackathon/molecule-protein-interaction/dataset.csv

First rows (sample):

SMILES | Protein (truncated) | Y | drug_cluster | target_cluster
CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1 | MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN… | 0.0 | 1904 | 1528
FC(F)OC(F)(F)C(F)Cl | MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF… | 1.0 | 9 | 353
[H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O | MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY… | 0.0 | 23 | 765

Note: Protein sequences are truncated for display. In this dataset the label column is named Y.

Functional Overview

  • Read labels from a DataFrame or CSV (single or multiple columns).
  • Optionally set an index column on the returned DataFrame.
  • One‑hot encode specified categorical label columns using pandas.

Functions

load_labels_from_dataframe(df, label_col, index_col=None, filter_rows=None) -> pd.DataFrame

  • Args:
    • df (pd.DataFrame | str): DataFrame or CSV filepath.
    • label_col (str | Sequence[str]): Label column name or list of names.
    • index_col (str | None): Optional column to set as index.
    • filter_rows (dict[str, Sequence | pandas.Index] | None): Row filters as {column: allowed_values}; applied where columns exist. Default: None.
  • Behavior:
    • If df is a path, loads the CSV via read_tabular selecting label_col and index_col, applying filter_rows when provided.
    • If df is a DataFrame and filter_rows is provided, applies row filters to matching columns.
    • Validates presence of label_col.
    • For a single label column, returns a one‑column DataFrame whose column is renamed to label; for multiple label columns, returns a DataFrame that keeps the original column names.
  • Returns: pd.DataFrame of labels (shape depends on single vs multiple columns).
  • Errors: Raises ValueError if requested label column(s) are missing. File/parse errors propagate when a path is invalid.
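
The behavior listed above can be approximated with plain pandas. The sketch below is not the module's implementation: sketch_load_labels is a hypothetical stand-in, the sample_id column is invented for illustration, and CSV loading via read_tabular is left out so that only the DataFrame path is shown.

```python
from typing import Mapping, Optional, Sequence, Union

import pandas as pd


def sketch_load_labels(
    df: pd.DataFrame,
    label_col: Union[str, Sequence[str]],
    index_col: Optional[str] = None,
    filter_rows: Optional[Mapping[str, Sequence]] = None,
) -> pd.DataFrame:
    """Hypothetical stand-in that mimics the documented behavior (DataFrame input only)."""
    label_cols = [label_col] if isinstance(label_col, str) else list(label_col)

    # Apply row filters only for columns that are present in the frame.
    if filter_rows:
        for column, allowed in filter_rows.items():
            if column in df.columns:
                df = df[df[column].isin(allowed)]

    # Validate that every requested label column exists.
    missing = [c for c in label_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing label column(s): {missing}")

    if index_col is not None:
        df = df.set_index(index_col)

    out = df[label_cols]
    # A single label column is renamed to 'label'; multiple columns keep their names.
    if len(label_cols) == 1:
        out = out.rename(columns={label_cols[0]: "label"})
    return out


# Toy frame; 'sample_id' is a made-up column used only to demonstrate index_col.
toy = pd.DataFrame(
    {
        "sample_id": ["a", "b", "c"],
        "Y": [0.0, 1.0, 0.0],
        "drug_cluster": [1904, 9, 23],
    }
)

single = sketch_load_labels(toy, "Y", index_col="sample_id")
multi = sketch_load_labels(
    toy, ["Y", "drug_cluster"], filter_rows={"drug_cluster": [9, 23]}
)
print(single)
print(multi)
```

Here single has one column named label indexed by sample_id, while multi keeps the Y and drug_cluster names and contains only the rows that pass the filter.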

one_hot_encode_labels(labels, columns="label") -> pd.DataFrame

  • Args:
    • labels (pd.DataFrame): DataFrame with label column(s).
    • columns (str | Sequence[str]): Column(s) to one‑hot encode; defaults to label.
  • Behavior: Uses pandas.get_dummies with dtype=float32 to one‑hot encode the specified columns.
  • Returns: pd.DataFrame of one‑hot encoded labels with float32 dtypes.
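
As a rough illustration of what this encoding produces, the snippet below calls pandas.get_dummies directly on a toy frame shaped like the loader's single-column output. It assumes the wrapper adds nothing beyond the documented dtype=float32; the toy values are made up.

```python
import pandas as pd

# Toy labels shaped like the loader's single-column output (column named 'label').
labels = pd.DataFrame({"label": [0.0, 1.0, 0.0]})

# One indicator column per distinct label value, stored as float32.
one_hot = pd.get_dummies(labels, columns=["label"], dtype="float32")

print(one_hot.columns.tolist())
print(one_hot.dtypes)
```

Each row of one_hot contains exactly one 1.0, in the column that corresponds to that row's original label value.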

Quick Example

Expected data file: a CSV containing a label column, for example: MMAI25Hackathon/molecule-protein-interaction/dataset.csv.

Step 1: Set the data path

Point CSV_PATH to the CSV file that contains your labels; any absolute or relative path to your local copy of the dataset works. The path below is only an example from this repo, so replace it with your own.

Note: In these examples, MMAI25Hackathon is the top‑level directory you downloaded from the hackathon Dropbox link (after unzipping). If your dataset lives elsewhere, adjust the path accordingly.

# Example only — replace with your path
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"

# The loader validates inputs and will raise helpful errors if the CSV or required column is missing.

Step 2: Fetch labels

from mmai25_hackathon.load_data.labels import (
    load_labels_from_dataframe,
    one_hot_encode_labels,
)

labels_df = load_labels_from_dataframe(CSV_PATH, label_col="Y")
print(labels_df.head())  # single column named 'label'

Step 3: One‑hot encode (for categorical labels)

one_hot_df = one_hot_encode_labels(labels_df, columns="label")
print(one_hot_df.head())

CLI Usage

You can run the module directly to preview labels and (optionally) one‑hot encodings:

python -m mmai25_hackathon.load_data.labels --data-path \
  MMAI25Hackathon/molecule-protein-interaction/dataset.csv

Notes

  • For regression tasks, you likely do not want one‑hot encoding; use the numeric label column as is.
  • For multi‑label or multi‑target tasks, pass a list to label_col and one‑hot encode whichever subset is categorical.
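
To make the multi‑target note concrete, here is a small pandas-only sketch that one‑hot encodes a subset of columns and leaves the rest numeric. Treating drug_cluster as the categorical target is an assumption made purely for illustration, and the values are toy data.

```python
import pandas as pd

# Toy multi-target labels: Y stays numeric, drug_cluster is treated as categorical
# (an assumption for this example only).
labels = pd.DataFrame({"Y": [0.0, 1.0, 0.0], "drug_cluster": [1904, 9, 23]})

# Only the listed column is expanded; every other column passes through unchanged.
encoded = pd.get_dummies(labels, columns=["drug_cluster"], dtype="float32")

print(encoded.columns.tolist())
```

The same pattern applies to a regression target: simply leave its column out of the columns argument.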
