Data Loading Modules

L. M. Riza Rizky edited this page Sep 13, 2025 · 6 revisions

All dataset loaders live under mmai25_hackathon/load_data. Each module has a dedicated guide with dataset layout, function docs, quick examples, and CLI usage. This page indexes those modules at a glance.

At‑a‑Glance

| Modality | Python module | Doc | Expected root | Outputs |
| --- | --- | --- | --- | --- |
| Chest X‑ray (CXR) | load_data/cxr.py | Chest‑X‑Ray | CXR root with files/ | pd.DataFrame with cxr_path; PIL images |
| Echocardiogram | load_data/echo.py | Echocardiogram | ECHO root with files/, echo-record-list.csv | pd.DataFrame; (frames, metadata) |
| Electrocardiogram | load_data/ecg.py | Electrocardiogram | ECG root with files/, record_list.csv | pd.DataFrame; (signals, fields) |
| Clinical Notes | load_data/text.py | Text | Note v2.2 .../note/ with CSVs | pd.DataFrame; text extraction |
| Electronic Health Record | load_data/ehr.py | Electronic‑Health‑Record | mimic-iv-3.1/ with hosp/, icu/ | merged pd.DataFrame or dict |
| Molecule (SMILES) | load_data/molecule.py | Molecule | CSV with SMILES | pd.DataFrame; PyG graphs |
| Protein Sequence | load_data/protein.py | Protein‑Sequence | CSV with Protein | pd.DataFrame; integer encodings |
| Labels | load_data/labels.py | Labels | CSV with label column(s) | pd.DataFrame; one‑hot labels |
| Tabular Utilities | load_data/tabular.py | Tabular | Any CSVs | pd.DataFrame; merged components |

Chest X‑Rays (CXR) — cxr.py

Module: mmai25_hackathon/load_data/cxr.py · Doc: Chest‑X‑Ray

Maps metadata DICOM IDs to JPGs stored under files/ and returns a DataFrame with absolute image paths (cxr_path). Includes an image loader to open radiographs as grayscale or RGB PIL images. See the doc’s CLI Usage to preview data.
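
The ID-to-path mapping described above can be sketched roughly as follows. This is not the module's code: `attach_cxr_paths` and the `dicom_id` column name are hypothetical, and the sketch assumes each JPG's file stem equals its DICOM ID.

```python
from pathlib import Path

import pandas as pd


def attach_cxr_paths(metadata: pd.DataFrame, cxr_root: str, id_col: str = "dicom_id") -> pd.DataFrame:
    """Hypothetical helper: add an absolute cxr_path for each row whose JPG exists."""
    files_dir = Path(cxr_root) / "files"
    # Index every JPG under files/ by its stem (assumed to equal the DICOM ID).
    stem_to_path = {p.stem: str(p.resolve()) for p in files_dir.rglob("*.jpg")}
    out = metadata.copy()
    out["cxr_path"] = out[id_col].astype(str).map(stem_to_path)
    # Rows without a matching image get NaN from map(); drop them.
    return out.dropna(subset=["cxr_path"]).reset_index(drop=True)
```

Building the stem index once keeps the lookup to a single directory walk instead of one filesystem check per metadata row.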

Echocardiogram (ECHO) — echo.py

Module: mmai25_hackathon/load_data/echo.py · Doc: Echocardiogram

Resolves .dcm paths from echo-record-list.csv, filters to existing files, and loads cine sequences as (T, H, W) NumPy arrays with metadata. See the doc’s CLI Usage for a quick run command.

Electrocardiogram (ECG) — ecg.py

Module: mmai25_hackathon/load_data/ecg.py · Doc: Electrocardiogram

Builds .hea/.dat paths from record_list.csv, ensures pairs exist, and loads signals via WFDB, returning (signals, fields). See the doc’s CLI Usage for a quick run command.
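
A minimal sketch of the pair check, under assumptions: `attach_ecg_paths` is a hypothetical name, and the record list is assumed to hold extension-less relative paths in a column called `path`. The module's actual function names and columns may differ.

```python
from pathlib import Path

import pandas as pd


def attach_ecg_paths(records: pd.DataFrame, ecg_root: str, path_col: str = "path") -> pd.DataFrame:
    """Hypothetical helper: keep only records whose .hea and .dat files both exist."""
    root = Path(ecg_root)
    out = records.copy()
    # Append the suffix as text, since a record name may itself contain dots.
    out["hea_path"] = [str(root / f"{rel}.hea") for rel in out[path_col]]
    out["dat_path"] = [str(root / f"{rel}.dat") for rel in out[path_col]]
    has_pair = [
        Path(hea).is_file() and Path(dat).is_file()
        for hea, dat in zip(out["hea_path"], out["dat_path"])
    ]
    return out.loc[has_pair].reset_index(drop=True)
```

With the pairs verified, a record can be read by passing the path without its extension to `wfdb.rdsamp`, which returns the `(signals, fields)` tuple mentioned above.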

Clinical Notes — text.py

Module: mmai25_hackathon/load_data/text.py · Doc: Text

Loads radiology or discharge notes from MIMIC‑IV Note v2.2. Optionally merges <subset>_detail.csv, trims/drops empty text, and provides helpers to extract the note text (plus metadata). See the doc’s CLI Usage for an example.
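
The trim-and-drop step could look something like the sketch below. `clean_notes` and the `text` column name are assumptions made for illustration, not the module's API.

```python
import pandas as pd


def clean_notes(notes: pd.DataFrame, text_col: str = "text") -> pd.DataFrame:
    """Hypothetical helper: strip whitespace and drop rows with no note text."""
    out = notes.copy()
    # Treat missing values as empty strings so they are dropped with the blanks.
    out[text_col] = out[text_col].fillna("").astype(str).str.strip()
    return out[out[text_col] != ""].reset_index(drop=True)
```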

Electronic Health Record (EHR) — ehr.py

Module: mmai25_hackathon/load_data/ehr.py · Doc: Electronic‑Health‑Record

Discovers and loads tables from hosp/ and/or icu/, with optional per‑table column selection and row filters. Can merge tables on shared keys into a single DataFrame, or return a dict when merge=False.
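
One way to picture the merge-or-dict behaviour is the sketch below. It is illustrative only: `combine_tables` is a hypothetical name, and the outer join on all shared columns is an assumption, since this page does not state the module's join strategy.

```python
from functools import reduce

import pandas as pd


def combine_tables(tables: dict[str, pd.DataFrame], merge: bool = True, how: str = "outer"):
    """Hypothetical helper: merge tables on their shared columns, or return them as-is."""
    if not merge:
        return tables

    def merge_pair(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
        # Join on every column the two tables have in common.
        keys = [col for col in left.columns if col in right.columns]
        if not keys:
            raise ValueError("tables share no key columns")
        return left.merge(right, on=keys, how=how)

    return reduce(merge_pair, tables.values())
```

Returning the untouched dict when `merge=False` leaves the joining decision to the caller, which helps when tables have different granularity (for example, per patient versus per admission).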

Molecule (SMILES) — molecule.py

Module: mmai25_hackathon/load_data/molecule.py · Doc: Molecule

Fetches SMILES strings from a CSV/DataFrame and converts them to PyTorch Geometric graphs. See the doc for examples and CLI Usage.

Protein Sequence — protein.py

Module: mmai25_hackathon/load_data/protein.py · Doc: Protein‑Sequence

Reads protein sequences from a CSV/DataFrame and encodes each into a fixed‑length integer array (A–Z, excluding J; 0=pad/unknown). See the doc for examples and CLI Usage.
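
The encoding scheme can be illustrated with a short sketch. The alphabetical letter-to-integer assignment and the default `max_len` here are assumptions; the module may order its character set differently, so see the Protein‑Sequence doc for the real mapping.

```python
import numpy as np

# 25 letters: A-Z without J. Index 0 is reserved for padding and unknown characters.
# Alphabetical ordering is an assumption made for this sketch.
ALPHABET = "ABCDEFGHIKLMNOPQRSTUVWXYZ"
CHAR_TO_INT = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_sequence(seq: str, max_len: int = 1200) -> np.ndarray:
    """Hypothetical helper: truncate or zero-pad a sequence to max_len integers."""
    encoded = np.zeros(max_len, dtype=np.int64)
    for i, ch in enumerate(seq.upper()[:max_len]):
        encoded[i] = CHAR_TO_INT.get(ch, 0)  # unknown characters map to 0
    return encoded
```

Because every sequence comes out with the same length, the encoded arrays can be stacked directly into a batch.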

Labels — labels.py

Module: mmai25_hackathon/load_data/labels.py · Doc: Labels

Loads one or more label columns from a CSV/DataFrame and provides one‑hot encoding helpers for categorical labels.
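
As a rough illustration of one‑hot encoding a categorical label column with plain pandas (the `one_hot` helper below is hypothetical, not the module's function):

```python
import pandas as pd


def one_hot(labels, categories=None) -> pd.DataFrame:
    """Hypothetical helper: one column per category, 1 where the label matches."""
    # Passing explicit categories fixes the column order and keeps unseen classes.
    cat = pd.Series(pd.Categorical(labels, categories=categories))
    return pd.get_dummies(cat, dtype=int)
```

Fixing the category list up front keeps the column layout identical across train and test splits, even when a split is missing a class.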

Tabular Utilities — tabular.py

Module: mmai25_hackathon/load_data/tabular.py · Doc: Tabular

Thin wrapper over pd.read_csv with column selection and row filtering. Includes a merging helper that forms connected components by overlapping key columns.
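
The idea of grouping tables into connected components by overlapping key columns can be sketched with a small union-find. This shows the general technique, not the module's implementation; `group_by_shared_keys` is a hypothetical name.

```python
import pandas as pd


def group_by_shared_keys(tables: dict[str, pd.DataFrame], keys: list[str]) -> list[list[str]]:
    """Hypothetical helper: group table names that are linked through shared key columns."""
    names = list(tables)
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Walk to the root of the component, compressing the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    first_owner: dict[str, str] = {}  # key column -> first table seen with that key
    for name in names:
        for key in keys:
            if key in tables[name].columns:
                if key in first_owner:
                    parent[find(name)] = find(first_owner[key])
                else:
                    first_owner[key] = name

    groups: dict[str, list[str]] = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())
```

Tables that share a key only indirectly (A and B share one key, B and C another) still land in the same component, so each group can be merged into a single DataFrame while unrelated tables stay separate.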

Tips

  • Paths shown in examples are illustrative; point to your local dataset roots. In docs, MMAI25Hackathon refers to the unzipped Dropbox folder.
  • Loaders validate required folders/files and raise helpful errors, so you don’t need to write separate existence checks before calling them.
  • Use each module’s “CLI Usage” section to quickly preview data and sanity‑check paths.
