Conversation
- remove typing on var definition to reduce footprint
- remove graph conversion
- include subset_cols in demo runtime
- extend read_tabular functionality
- use param validation for all functions defined

- refactor all duplicate code
- reuse tabular.py functionalities
- merge_tables and get_tabular_mimic are merged into a single fetch_mimic_iv_ehr
- use pykale-style pydocs for consistency
- loosen the coupling for the base data path
- include param_validation for functions
- use argparse to run the demo on the module

- improve function pydocs to mimic pykale
- decouple base path dependency
- use sklearn param_validation to validate args
- include examples for each function
- reuse tabular io functionality

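sklearn's `validate_params` helper lives in a private module, so as a rough illustration of the pattern these commits adopt, here is a minimal hand-rolled analogue (the decorator, constraint dict, and `read_tabular` signature below are all hypothetical, not the repository's actual code):

```python
from functools import wraps


def validate_params(constraints):
    """Minimal sketch of sklearn-style parameter validation:
    each parameter name maps to a tuple of accepted types."""
    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            for name, value in kwargs.items():
                allowed = constraints.get(name)
                if allowed is not None and not isinstance(value, allowed):
                    raise TypeError(
                        f"{name} must be one of {allowed}, got {type(value).__name__}"
                    )
            return func(**kwargs)
        return wrapper
    return decorator


# Hypothetical function, just to show the decorator in use.
@validate_params({"path": (str,), "subset_cols": (list, tuple, type(None))})
def read_tabular(*, path, subset_cols=None):
    return path, subset_cols
```

Passing `path=123` then raises a `TypeError` before the function body runs, which is the fail-fast behaviour the commits are after.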
- include module level pydoc
- include logging
- provide more explanations for the preview CLI output

- include module level pydoc
- major refactor
- include param_validation
- loosen the coupling for data path finding
- use pykale-style pydoc for functions
- include argparse for executable demo
- use logging to replace print

- include module level pydoc
- reuse tabular io
- use param_validation
- use pykale-style pydoc for functions
- simplify metadata extraction for loading DICOM
- use argparse for executable demo

- include module level pydoc
- rename function from fetch to load
- use tuple over list for available tables in MIMIC-IV
- provide logging to replace print
- print used only in executable demo

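The split these commits describe — an immutable tuple of table names, `logging` in library code, `print` only in the executable demo — can be sketched as follows (the table names and the `load_mimic_iv_ehr` body are illustrative, not the repository's actual implementation):

```python
import logging

# Tuple rather than list: the set of available tables is fixed and
# should not be mutated at runtime. Names here are illustrative.
AVAILABLE_TABLES = ("admissions", "patients", "labevents")

logger = logging.getLogger(__name__)


def load_mimic_iv_ehr(table: str) -> str:
    """Library code logs instead of printing."""
    if table not in AVAILABLE_TABLES:
        raise ValueError(f"Unknown table {table!r}; expected one of {AVAILABLE_TABLES}")
    logger.info("Loading MIMIC-IV table %s", table)
    return table


if __name__ == "__main__":
    # Executable demo: print is acceptable here.
    print(load_mimic_iv_ehr("patients"))
```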
- update module level pydoc for consistency
- include logging and param_validation
- reuse read_tabular from the tabular io module
- parser argument changed from keyword to positional for consistency

- update module level pydoc to be consistent
- include param_validation and logging
- use positional args over kwargs for argparse to be consistent

- update module pydoc to keep consistency
- use param_validation and reuse the tabular io module
- update naming for argparser args from csv_path to data_path

- update module pydoc
- use logging
- separate variable assignment and return in merge_multiple_dataframes to allow logging at each step
- rename argparser args to be consistent with other modules

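The "separate assignment and return" point can be illustrated with a small sketch (the function name matches the commit, but the body and parameters are assumptions, not the repository's code): assigning each intermediate merge to a variable gives the loop a place to log, which a single-expression `return` would not.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def merge_multiple_dataframes(dfs, on):
    """Merge a sequence of DataFrames on a shared key, logging each step."""
    merged = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        # Intermediate assignment (rather than merging inside the return
        # statement) is what makes per-step logging possible.
        merged = merged.merge(df, on=on)
        logger.info("Merged frame %d of %d; shape is now %s", i, len(dfs) - 1, merged.shape)
    return merged
```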
- include module level pydoc consistent with other io modules
- use param_validation and logging
- text extraction now takes a pd.Series rather than a dataframe and note_id
- use argparse over a fixed dataset path

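The Series-over-DataFrame change can be sketched like this (function name, column names, and body are hypothetical): instead of receiving a whole notes DataFrame plus a `note_id` to look up, the extractor receives one row.

```python
import pandas as pd


def extract_text_from_note(note: pd.Series, text_col: str = "text") -> str:
    """Operate on a single note row (a pd.Series) rather than
    looking the note up in a full DataFrame by note_id."""
    return str(note[text_col]).strip()


notes = pd.DataFrame({"note_id": ["n1"], "text": ["  Patient stable.  "]})
row = notes.iloc[0]  # selecting one row yields a pd.Series
```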
- include check if subset_cols is None during read_tabular
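A minimal sketch of that None-guard, assuming a pandas-backed `read_tabular` (the signature is illustrative): subsetting only happens when `subset_cols` is actually given, so the default call returns all columns.

```python
import io

import pandas as pd


def read_tabular(path_or_buffer, subset_cols=None):
    df = pd.read_csv(path_or_buffer)
    if subset_cols is not None:
        # Only subset when columns were requested; None means "all columns".
        df = df[list(subset_cols)]
    return df
```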
- fix typing on return for extract_text_from_note
- add type: ignore on df_by_subset to bypass mypy
- include BaseSampler for additional options
- update module pydoc
- use DataLoader instead of the PyG DataLoader
- simplify method pydoc and reduce the verbosity of the initial draft
- add typing to dataloader and sampler

- use NotImplementedError for the prepare_data method in BaseDataset
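The pattern this commit adopts can be sketched in a few lines (class bodies are hypothetical): the base class declares the hook and raises `NotImplementedError`, so any subclass that forgets to override it fails loudly at call time.

```python
class BaseDataset:
    """Base class declares the hook; subclasses must override it."""

    def prepare_data(self):
        raise NotImplementedError("Subclasses must implement prepare_data().")


class EHRDataset(BaseDataset):
    def prepare_data(self):
        return "prepared"
```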
- standardize the runnable demo with default kwargs in argparse
- the assumed default base data path is ./MMAI25Hackathon
- reformat the load_mimic_iv_notes function
- include pydoc for load_mimic_iv_notes
- include high-level steps on how the function works, for all functions

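A sketch of that argparse convention (option names are assumptions): every option carries a default, so the demo runs with no arguments, and the base data path defaults to `./MMAI25Hackathon` as the commit assumes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Runnable demo loader")
    # Defaults let the demo run with zero CLI arguments.
    parser.add_argument("--data-path", default="./MMAI25Hackathon",
                        help="Base directory containing the hackathon data")
    parser.add_argument("--subset-cols", nargs="*", default=None,
                        help="Optional column subset to load")
    return parser


# Parsing an empty argument list exercises the defaults.
args = build_parser().parse_args([])
```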
- include module pydoc
- does not check the runnable demo
- assumes MMAI25Hackathon is in the cwd when integration-testing with real data

- reorganize toml
- reduce dependencies and remove pykale

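The dependency separation this commit and the summary below describe usually looks something like the following `pyproject.toml` fragment — the package names here are illustrative placeholders, not the repository's actual dependency list:

```toml
[project]
name = "mmai25-hackathon"
dependencies = [
  "pandas",
  "scikit-learn",
]

[project.optional-dependencies]
# Dev-only tools stay out of the runtime install: pip install ".[dev]"
dev = [
  "pytest",
  "pytest-cov",
]
```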
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
Pull Request Overview
This PR implements comprehensive API standardization across the hackathon codebase. The purpose is to consolidate data loading utilities, enhance test coverage, and establish a unified interface for handling multimodal datasets (EHR, images, text, molecules, proteins, etc.).
- Standardized data loading APIs with consistent parameter validation and error handling
- Added comprehensive test suite covering all data loaders with both unit and integration tests
- Unified configuration management through pyproject.toml with proper dependency separation
Reviewed Changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyproject.toml | Restructured dependencies, added dev extras, updated tooling configuration |
| mmai25_hackathon/load_data/*.py | Standardized APIs with sklearn validation, consistent logging, unified error handling |
| tests/load_data/test_*.py | Comprehensive test coverage for all data loading modules |
| tests/test_dataset.py | Tests for base dataset/dataloader/sampler utilities |
| tests/dropbox_download.py | CI integration test data download utility |
| mmai25_hackathon/dataset.py | Enhanced base classes with better documentation and prepare_data method |
```diff
         return df[list(label_col)]

-        return df[label_col].to_frame("label")
+        return df[label_col].to_frame("label").reset_index(drop=index_col is None)
```
Intentional, to ensure the index is turned back into a column.
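The behaviour discussed in this thread can be illustrated with a small pandas sketch (column and index names are hypothetical): `reset_index(drop=index_col is None)` keeps the index as a column when an `index_col` was supplied, and discards the index when none was.

```python
import pandas as pd

df = pd.DataFrame({"subject_id": [10, 11], "label": [0, 1]}).set_index("subject_id")

# index_col given -> drop=False: the index comes back as a column.
with_col = df["label"].to_frame("label").reset_index(drop=False)

# index_col is None -> drop=True: the index is discarded entirely.
without = df["label"].to_frame("label").reset_index(drop=True)
```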
- check for list/tuple instead of Sequence

Co-authored-by: Copilot <[email protected]>
- remove the sequence-is-string check, since inputs are now validated as list or tuple
- break once the metadata_path is found
- use high-level steps for load_mimic_iv_notes
Code Health Improved
(2 files improved in Code Health)
Gates Failed
Enforce critical code health rules
(1 file with Brain Method, Deep, Nested Complexity)
Enforce advisory code health rules
(5 files with Complex Method, Complex Conditional)
Gates Passed
2 Quality Gates Passed
See analysis details in CodeScene
Reason for failure

| Enforce critical code health rules | Violations | Code Health Impact |
|---|---|---|
| ehr.py | 2 critical rules | 9.54 → 7.59 |

| Enforce advisory code health rules | Violations | Code Health Impact |
|---|---|---|
| ehr.py | 1 advisory rule | 9.54 → 7.59 |
| tabular.py | 2 advisory rules | 8.61 → 7.78 |
| molecule.py | 1 advisory rule | 10.00 → 9.69 |
| protein.py | 1 advisory rule | 10.00 → 9.69 |
| supervised_labels.py | 1 advisory rule | 10.00 → 9.69 |
View Improvements
| File | Code Health Impact | Categories Improved |
|---|---|---|
| tabular.py | 8.61 → 7.78 | Complex Method |
| ehr.py | 9.54 → 7.59 | Complex Method, Bumpy Road Ahead |
| cxr.py | 9.46 → 9.64 | Complex Method, Bumpy Road Ahead |
| text.py | 9.33 → 9.53 | Complex Method, Bumpy Road Ahead |
Quality Gate Profile: Clean Code Collective
```python
def fetch_smiles_from_dataframe(
    df: Union[pd.DataFrame, str],
    smiles_col: str,
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
) -> pd.DataFrame:
```
❌ New issue: Complex Method
fetch_smiles_from_dataframe has a cyclomatic complexity of 9, threshold = 9
```python
    df: Union[pd.DataFrame, str],
    prot_seq_col: str,
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
```
❌ New issue: Complex Method
fetch_protein_sequences_from_dataframe has a cyclomatic complexity of 9, threshold = 9
```python
    df: Union[pd.DataFrame, str],
    label_col: Union[str, Sequence[str]],
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
```
❌ New issue: Complex Method
fetch_supervised_labels_from_dataframe has a cyclomatic complexity of 11, threshold = 9
Refer to the Copilot post below 😉.
#4 (review)