Labels
Module location: mmai25_hackathon/load_data/labels.py
This module provides functions to fetch supervision labels from a CSV/DataFrame and to one‑hot encode categorical label columns for classification tasks. It supports single or multiple label columns.
Expected data file: a CSV with a label column, for example:
MMAI25Hackathon/molecule-protein-interaction/dataset.csv
First rows (sample):
| SMILES | Protein (truncated) | Y | drug_cluster | target_cluster |
|---|---|---|---|---|
| CC1=CN=C2N1C=CN=C2NCC1=CC=NC=C1 | MARSLLLPLQILLLSLALETAGEEAQGDKIIDGAPCARGSHPWQVALLSGNQLHCGGVLVN… | 0.0 | 1904 | 1528 |
| FC(F)OC(F)(F)C(F)Cl | MVSAKKVPAIALSAGVSFALLRFLCLAVCLNESPGQNQKEEKLCTENFTRILDSLLDGYDNRLRPGF… | 1.0 | 9 | 353 |
| [H][C@@]12C[C@@]3([H])C(=C(O)[C@]1(O)C(=O)C(C(N)=O)=C(O)[C@H]2N(C)C)C(=O)C1=C(O)C=CC=C1[C@@]3(C)O | MVDPVGFAEAWKAQFPDSEPPRMELRSVGDIEQELERCKASIRRLEQEVNQERFRMIYLQTLLAKEKKSY… | 0.0 | 23 | 765 |
Note: Protein sequences are truncated for display. In this dataset the label column is named Y.
- Read labels from a DataFrame or CSV (single or multiple columns).
- Optionally set an index column on the returned DataFrame.
- One‑hot encode specified categorical label columns using pandas.
`load_labels_from_dataframe`

- Args:
  - `df` (`pd.DataFrame | str`): DataFrame or CSV filepath.
  - `label_col` (`str | Sequence[str]`): Label column name or list of names.
  - `index_col` (`str | None`): Optional column to set as index.
  - `filter_rows` (`dict[str, Sequence | pandas.Index] | None`): Row filters as `{column: allowed_values}`; applied where columns exist. Default: `None`.
- Behavior:
  - If `df` is a path, loads the CSV via `read_tabular`, selecting `label_col` and `index_col` and applying `filter_rows` when provided.
  - If `df` is a DataFrame and `filter_rows` is provided, applies row filters to matching columns.
  - Validates presence of `label_col`.
  - Returns a one-column DataFrame named `label` for a single column; returns the DataFrame with original column names for multiple columns.
- Returns: `pd.DataFrame` of labels (shape depends on single vs multiple columns).
- Errors: Raises `ValueError` if requested label column(s) are missing. File/parse errors propagate when a path is invalid.
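
The documented steps can be approximated with plain pandas. The sketch below is not the module's implementation: `select_labels` is a hypothetical helper, and the toy values are taken from the sample rows above. It only mirrors the described behavior (filter rows on columns that exist, validate the label columns, rename a single column to `label`) on an in-memory DataFrame.

```python
import pandas as pd


def select_labels(df, label_col, index_col=None, filter_rows=None):
    """Hypothetical stand-in that mimics the documented behavior."""
    # Apply each filter only when its column is present in the frame.
    for column, allowed in (filter_rows or {}).items():
        if column in df.columns:
            df = df[df[column].isin(allowed)]

    # Accept either one column name or a sequence of names.
    cols = [label_col] if isinstance(label_col, str) else list(label_col)
    missing = [c for c in cols if c not in df.columns]
    if missing:
        raise ValueError(f"Missing label column(s): {missing}")

    if index_col is not None:
        df = df.set_index(index_col)

    labels = df[cols]
    # A single label column is returned under the name 'label'.
    if len(cols) == 1:
        labels = labels.rename(columns={cols[0]: "label"})
    return labels


toy = pd.DataFrame(
    {
        "Y": [0.0, 1.0, 0.0],
        "drug_cluster": [1904, 9, 23],
        "target_cluster": [1528, 353, 765],
    }
)

# Single label column, keeping only rows whose drug_cluster is 9 or 23.
single = select_labels(toy, "Y", filter_rows={"drug_cluster": [9, 23]})
# Multiple label columns keep their original names.
multi = select_labels(toy, ["Y", "drug_cluster"])

print(single)
print(multi)
```

A filter on a column that is not in the frame is silently skipped here, which is one reading of "applied where columns exist"; the real loader may behave differently.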
`one_hot_encode_labels`

- Args:
  - `labels` (`pd.DataFrame`): DataFrame with label column(s).
  - `columns` (`str | Sequence[str]`): Column(s) to one-hot encode; defaults to `label`.
- Behavior: Uses `pandas.get_dummies` with `dtype=float32` to one-hot encode the specified columns.
- Returns: `pd.DataFrame` of one-hot encoded labels with float32 dtypes.
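
To see what this encoding produces, the following minimal sketch calls `pandas.get_dummies` directly on a toy single-column frame. It illustrates the underlying pandas call only; the toy values are assumptions, and the module's wrapper may add validation not shown here.

```python
import pandas as pd

# Toy labels shaped like the single-column output described above.
labels = pd.DataFrame({"label": [0.0, 1.0, 0.0]})

# One indicator column per distinct label value, stored as float32.
one_hot = pd.get_dummies(labels, columns=["label"], dtype="float32")

print(one_hot)
print(one_hot.dtypes)
```

Each row contains exactly one 1.0, in the column that corresponds to its original label value.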
Expected data file: a CSV containing a label column, for example:
MMAI25Hackathon/molecule-protein-interaction/dataset.csv.
Point CSV_PATH to the CSV file that contains your labels.
CSV_PATH can be any absolute or relative path to your local copy of the dataset. The path below is an example from
this repo — replace it with your own.
Note: In these examples, MMAI25Hackathon is the top‑level directory you downloaded from the hackathon Dropbox link
(after unzipping). If your dataset lives elsewhere, adjust the path accordingly.
```python
# Example only: replace with your path
CSV_PATH = "MMAI25Hackathon/molecule-protein-interaction/dataset.csv"
```

```python
# The loader validates inputs and will raise helpful errors if the CSV or required column is missing.
from mmai25_hackathon.load_data.labels import (
    load_labels_from_dataframe,
    one_hot_encode_labels,
)

labels_df = load_labels_from_dataframe(CSV_PATH, label_col="Y")
print(labels_df.head())  # single column named 'label'

one_hot_df = one_hot_encode_labels(labels_df, columns="label")
print(one_hot_df.head())
```

You can run the module directly to preview labels and (optionally) one-hot encodings:
```shell
python -m mmai25_hackathon.load_data.labels --data-path \
  MMAI25Hackathon/molecule-protein-interaction/dataset.csv
```

- For regression tasks, you likely do not want one-hot encoding; use the numeric `label` column as is.
- For multi-label or multi-target tasks, pass a list to `label_col` and one-hot encode whichever subset is categorical.
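
As a rough illustration of the multi-target case, the sketch below uses plain pandas on a toy frame rather than the module's functions. The choice to treat `Y` as numeric and `drug_cluster` as categorical is an assumption made for the example, not something the dataset documentation states.

```python
import pandas as pd

# Toy multi-column labels; the original column names are kept.
labels = pd.DataFrame(
    {"Y": [0.0, 1.0, 0.0], "drug_cluster": [1904, 9, 23]}
)

# Encode only the column assumed to be categorical; 'Y' passes through unchanged.
encoded = pd.get_dummies(labels, columns=["drug_cluster"], dtype="float32")

print(encoded.columns.tolist())
print(encoded.dtypes)
```

Columns that are not listed in `columns` keep their original dtype, so a numeric target can sit next to the one-hot indicator columns in the same frame.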