Labels

Labels Data Loading Module (`labels.py`)

Module location: mmai25_hackathon/load_data/labels.py

This module provides functions to fetch supervision labels from a CSV/DataFrame and to one‑hot encode categorical label columns for classification tasks. It supports single or multiple label columns.

Dataset Layout & Files

Expected data file: a CSV with a label column, for example:

MMAI25Hackathon/molecule-protein-interaction/dataset.csv

First rows (sample):

| SMILES | Protein (truncated) | Y | drug_cluster | target_cluster |
| --- | --- | --- | --- | --- |
| `CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1` | `MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN…` | 0.0 | 1904 | 1528 |
| `FC(F)OC(F)(F)C(F)Cl` | `MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF…` | 1.0 | 9 | 353 |
| `[H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O` | `MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY…` | 0.0 | 23 | 765 |

Note: Protein sequences are truncated for display. In this dataset the label column is named Y.

Functional Overview

- Read labels from a DataFrame or CSV (single or multiple columns).
- Optionally set an index column on the returned DataFrame.
- One‑hot encode specified categorical label columns using pandas.

Functions

`load_labels_from_dataframe(df, label_col, index_col=None, filter_rows=None) -> pd.DataFrame`

Args:
- df (pd.DataFrame | str): DataFrame or CSV filepath.
- label_col (str | Sequence[str]): Label column name or list of names.
- index_col (str | None): Optional column to set as index.
- filter_rows (dict[str, Sequence | pandas.Index] | None): Row filters as {column: allowed_values}; applied where columns exist. Default: None.
Behavior:
- If df is a path, loads the CSV via read_tabular selecting label_col and index_col, applying filter_rows when provided.
- If df is a DataFrame and filter_rows is provided, applies row filters to matching columns.
- Validates presence of label_col.
- Returns a one‑column DataFrame whose column is renamed to `label` when a single label column is requested; returns the requested columns under their original names when multiple columns are requested.
Returns: pd.DataFrame of labels (shape depends on single vs multiple columns).
Errors: Raises ValueError if requested label column(s) are missing. File/parse errors propagate when a path is invalid.
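
The behavior listed above can be approximated with plain pandas. The following is a minimal sketch, not the module's actual implementation: `sketch_load_labels` is a hypothetical name, it accepts only a DataFrame (no CSV path and no `read_tabular`), and the wording of the error message is an assumption.

```python
from typing import Optional, Sequence, Union

import pandas as pd


def sketch_load_labels(
    df: pd.DataFrame,
    label_col: Union[str, Sequence[str]],
    index_col: Optional[str] = None,
    filter_rows: Optional[dict] = None,
) -> pd.DataFrame:
    """Hypothetical re-creation of the documented behavior (DataFrame input only)."""
    label_cols = [label_col] if isinstance(label_col, str) else list(label_col)

    # Apply row filters only to columns that exist in the frame.
    for column, allowed in (filter_rows or {}).items():
        if column in df.columns:
            df = df[df[column].isin(list(allowed))]

    # Validate that every requested label column is present.
    missing = [c for c in label_cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing label column(s): {missing}")

    if index_col is not None:
        df = df.set_index(index_col)

    labels = df[label_cols]
    # A single label column is renamed to 'label'; multiple columns keep their names.
    if len(label_cols) == 1:
        labels = labels.rename(columns={label_cols[0]: "label"})
    return labels


# Toy rows mirroring the sample dataset's label-related columns.
toy = pd.DataFrame(
    {"Y": [0.0, 1.0, 0.0], "drug_cluster": [1904, 9, 23], "target_cluster": [1528, 353, 765]}
)
single = sketch_load_labels(toy, "Y", filter_rows={"drug_cluster": [9, 23]})
multi = sketch_load_labels(toy, ["Y", "drug_cluster"])
print(single.columns.tolist(), len(single))
print(multi.columns.tolist())
```

In this sketch the filter runs before the label columns are selected, so `filter_rows` can refer to columns that are not themselves returned as labels.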

`one_hot_encode_labels(labels, columns="label") -> pd.DataFrame`

Args:
- labels (pd.DataFrame): DataFrame with label column(s).
- columns (str | Sequence[str]): Column(s) to one‑hot encode; defaults to label.
Behavior: Uses pandas.get_dummies with dtype=float32 to one‑hot encode the specified columns.
Returns: pd.DataFrame of one‑hot encoded labels with float32 dtypes.
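
To make the encoding concrete, here is a small sketch of the underlying `pandas.get_dummies` call on a toy label column. The toy values are made up for illustration, and the module's own wrapper may handle details differently.

```python
import pandas as pd

# Toy single-column labels, shaped like the loader's single-column output.
labels = pd.DataFrame({"label": [0.0, 1.0, 0.0]})

# One output column per distinct value, named '<column>_<value>'.
one_hot = pd.get_dummies(labels, columns=["label"], dtype="float32")
print(one_hot.columns.tolist())
print(one_hot.dtypes.unique())
```

Each row of the result contains a single 1.0 in the column that matches its original label value.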

Quick Example

Expected data file: a CSV containing a label column, for example: MMAI25Hackathon/molecule-protein-interaction/dataset.csv.

Step 1: Set the data path

Point CSV_PATH to the CSV file that contains your labels. CSV_PATH can be any absolute or relative path to your local copy of the dataset. The path below is an example from this repo — replace it with your own.

Note: In these examples, MMAI25Hackathon is the top‑level directory you downloaded from the hackathon Dropbox link (after unzipping). If your dataset lives elsewhere, adjust the path accordingly.

```python
# Example only; replace with your path
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"

# The loader validates inputs and will raise helpful errors if the CSV or required column is missing.
```

Step 2: Fetch labels

```python
from mmai25_hackathon.load_data.labels import (
    load_labels_from_dataframe,
    one_hot_encode_labels,
)

labels_df = load_labels_from_dataframe(CSV_PATH, label_col="Y")
print(labels_df.head())  # single column named 'label'
```

Step 3: One‑hot encode (for categorical labels)

```python
one_hot_df = one_hot_encode_labels(labels_df, columns="label")
print(one_hot_df.head())
```

CLI Usage

You can run the module directly to preview labels and (optionally) one‑hot encodings:

```bash
python -m mmai25_hackathon.load_data.labels --data-path \
  MMAI25Hackathon/molecule-protein-interaction/dataset.csv
```

Notes

- For regression tasks, you likely do not want one‑hot encoding; use the numeric label column as is.
- For multi‑label or multi‑target tasks, pass a list to label_col and one‑hot encode whichever subset is categorical.
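
As a sketch of the multi‑target case, the snippet below one‑hot encodes only the categorical column and leaves the numeric target untouched. It uses a toy DataFrame and calls `pandas.get_dummies` directly, so it approximates rather than reproduces what `one_hot_encode_labels` does. Treating `drug_cluster` as a categorical label is an assumption made purely for illustration.

```python
import pandas as pd

# Toy stand-in for labels loaded with label_col=["Y", "drug_cluster"].
labels = pd.DataFrame({"Y": [0.0, 1.0, 0.0], "drug_cluster": [1904, 9, 23]})

# Encode only the categorical column; 'Y' passes through unchanged.
encoded = pd.get_dummies(labels, columns=["drug_cluster"], dtype="float32")
print(encoded.columns.tolist())
```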
