Conversation
- remove typing on var definition to reduce footprint
- remove graph conversion
- include subset_cols in demo runtime
- extend read_tabular functionality
- use param validation for all functions defined

- refactor all duplicate code
- reuse tabular.py functionalities
- merge_tables and get_tabular_mimic are merged into a single fetch_mimic_iv_ehr
- use pykale-style pydocs for consistency
- loosen the coupling for the base data path
- include param_validation for functions
- use argparse to run the demo on the module

- improve function pydocs to mimic pykale
- decouple base path dependency
- use sklearn param_validation to validate args
- include examples for each function
- reuse tabular io functionality

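sklearn's `validate_params` helper lives in a private module, so as a rough illustration of the pattern these commits adopt, here is a minimal hand-rolled analogue (the decorator, constraint dict, and `read_tabular` signature below are all hypothetical, not the repository's actual code):

```python
from functools import wraps


def validate_params(constraints):
    """Minimal sketch of sklearn-style parameter validation:
    each parameter name maps to a tuple of accepted types."""
    def decorator(func):
        @wraps(func)
        def wrapper(**kwargs):
            for name, value in kwargs.items():
                allowed = constraints.get(name)
                if allowed is not None and not isinstance(value, allowed):
                    raise TypeError(
                        f"{name} must be one of {allowed}, got {type(value).__name__}"
                    )
            return func(**kwargs)
        return wrapper
    return decorator


# Hypothetical function, just to show the decorator in use.
@validate_params({"path": (str,), "subset_cols": (list, tuple, type(None))})
def read_tabular(*, path, subset_cols=None):
    return path, subset_cols
```

Passing `path=123` then raises a `TypeError` before the function body runs, which is the fail-fast behaviour the commits are after.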
- include module level pydoc
- include logging
- provide more explanations for the preview CLI output

- include module level pydoc
- major refactor
- include param_validation
- loosen the coupling for data path finding
- use pykale-style pydoc for functions
- include argparse for executable demo
- use logging to replace print

- include module level pydoc
- reuse tabular io
- use param_validation
- use pykale-style pydoc for functions
- simplify metadata extraction for loading DICOM
- use argparse for executable demo

- include module level pydoc
- rename function from fetch to load
- use tuple over list for available tables in MIMIC-IV
- provide logging to replace print
- print used only in executable demo

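The split these commits describe — an immutable tuple of table names, `logging` in library code, `print` only in the executable demo — can be sketched as follows (the table names and the `load_mimic_iv_ehr` body are illustrative, not the repository's actual implementation):

```python
import logging

# Tuple rather than list: the set of available tables is fixed and
# should not be mutated at runtime. Names here are illustrative.
AVAILABLE_TABLES = ("admissions", "patients", "labevents")

logger = logging.getLogger(__name__)


def load_mimic_iv_ehr(table: str) -> str:
    """Library code logs instead of printing."""
    if table not in AVAILABLE_TABLES:
        raise ValueError(f"Unknown table {table!r}; expected one of {AVAILABLE_TABLES}")
    logger.info("Loading MIMIC-IV table %s", table)
    return table


if __name__ == "__main__":
    # Executable demo: print is acceptable here.
    print(load_mimic_iv_ehr("patients"))
```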
- update module level pydoc for consistency
- include logging and param_validation
- reuse read_tabular from the tabular io module
- parser argument changed from keyword to positional for consistency

- update module level pydoc to be consistent
- include param_validation and logging
- use positional args over kwargs for argparse to be consistent

- update module pydoc to keep consistency
- use param_validation and reuse the tabular io module
- update naming for argparser args from csv_path to data_path

- update module pydoc
- use logging
- separate variable assignment and return in merge_multiple_dataframes to allow logging at each step
- rename argparser args to be consistent with other modules

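The "separate assignment and return" point can be illustrated with a small sketch (the function name matches the commit, but the body and parameters are assumptions, not the repository's code): assigning each intermediate merge to a variable gives the loop a place to log, which a single-expression `return` would not.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def merge_multiple_dataframes(dfs, on):
    """Merge a sequence of DataFrames on a shared key, logging each step."""
    merged = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        # Intermediate assignment (rather than merging inside the return
        # statement) is what makes per-step logging possible.
        merged = merged.merge(df, on=on)
        logger.info("Merged frame %d of %d; shape is now %s", i, len(dfs) - 1, merged.shape)
    return merged
```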
- include module level pydoc consistent with other io modules
- use param_validation and logging
- text extraction now takes a pd.Series rather than a dataframe and note_id
- use argparse over a fixed dataset path

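The Series-over-DataFrame change can be sketched like this (function name, column names, and body are hypothetical): instead of receiving a whole notes DataFrame plus a `note_id` to look up, the extractor receives one row.

```python
import pandas as pd


def extract_text_from_note(note: pd.Series, text_col: str = "text") -> str:
    """Operate on a single note row (a pd.Series) rather than
    looking the note up in a full DataFrame by note_id."""
    return str(note[text_col]).strip()


notes = pd.DataFrame({"note_id": ["n1"], "text": ["  Patient stable.  "]})
row = notes.iloc[0]  # selecting one row yields a pd.Series
```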
- include check if subset_cols is None during read_tabular
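A minimal sketch of that None-guard, assuming a pandas-backed `read_tabular` (the signature is illustrative): subsetting only happens when `subset_cols` is actually given, so the default call returns all columns.

```python
import io

import pandas as pd


def read_tabular(path_or_buffer, subset_cols=None):
    df = pd.read_csv(path_or_buffer)
    if subset_cols is not None:
        # Only subset when columns were requested; None means "all columns".
        df = df[list(subset_cols)]
    return df
```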
- fix typing on return for extract_text_from_note
- add type: ignore on df_by_subset to bypass mypy
- include BaseSampler for additional options
- update module pydoc
- use DataLoader instead of the PyG DataLoader
- simplify method pydoc and reduce the verbosity of the initial draft
- add typing to dataloader and sampler

- use NotImplementedError for the prepare_data method in BaseDataset
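The pattern this commit adopts can be sketched in a few lines (class bodies are hypothetical): the base class declares the hook and raises `NotImplementedError`, so any subclass that forgets to override it fails loudly at call time.

```python
class BaseDataset:
    """Base class declares the hook; subclasses must override it."""

    def prepare_data(self):
        raise NotImplementedError("Subclasses must implement prepare_data().")


class EHRDataset(BaseDataset):
    def prepare_data(self):
        return "prepared"
```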
- standardize the runnable demo with default kwargs in argparse
- the assumed default base data path is ./MMAI25Hackathon
- reformat the load_mimic_iv_notes function
- include pydoc for load_mimic_iv_notes
- include high-level steps on how the function works, for all functions

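A sketch of that argparse convention (option names are assumptions): every option carries a default, so the demo runs with no arguments, and the base data path defaults to `./MMAI25Hackathon` as the commit assumes.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Runnable demo loader")
    # Defaults let the demo run with zero CLI arguments.
    parser.add_argument("--data-path", default="./MMAI25Hackathon",
                        help="Base directory containing the hackathon data")
    parser.add_argument("--subset-cols", nargs="*", default=None,
                        help="Optional column subset to load")
    return parser


# Parsing an empty argument list exercises the defaults.
args = build_parser().parse_args([])
```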
- include module pydoc
- does not check the runnable demo
- assumes MMAI25Hackathon is in the cwd when integration-testing with real data

- reorganize toml
- reduce dependencies and remove pykale

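The dependency separation this commit and the summary below describe usually looks something like the following `pyproject.toml` fragment — the package names here are illustrative placeholders, not the repository's actual dependency list:

```toml
[project]
name = "mmai25-hackathon"
dependencies = [
  "pandas",
  "scikit-learn",
]

[project.optional-dependencies]
# Dev-only tools stay out of the runtime install: pip install ".[dev]"
dev = [
  "pytest",
  "pytest-cov",
]
```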
Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️
Pull Request Overview
This PR implements comprehensive API standardization across the hackathon codebase. The purpose is to consolidate data loading utilities, enhance test coverage, and establish a unified interface for handling multimodal datasets (EHR, images, text, molecules, proteins, etc.).
- Standardized data loading APIs with consistent parameter validation and error handling
- Added comprehensive test suite covering all data loaders with both unit and integration tests
- Unified configuration management through pyproject.toml with proper dependency separation
Reviewed Changes
Copilot reviewed 29 out of 30 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| pyproject.toml | Restructured dependencies, added dev extras, updated tooling configuration |
| mmai25_hackathon/load_data/*.py | Standardized APIs with sklearn validation, consistent logging, unified error handling |
| tests/load_data/test_*.py | Comprehensive test coverage for all data loading modules |
| tests/test_dataset.py | Tests for base dataset/dataloader/sampler utilities |
| tests/dropbox_download.py | CI integration test data download utility |
| mmai25_hackathon/dataset.py | Enhanced base classes with better documentation and prepare_data method |
```diff
         return df[list(label_col)]

-        return df[label_col].to_frame("label")
+        return df[label_col].to_frame("label").reset_index(drop=index_col is None)
```
Intentional, to ensure the index is turned back into a column.
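The behaviour discussed in this thread can be illustrated with a small pandas sketch (column and index names are hypothetical): `reset_index(drop=index_col is None)` keeps the index as a column when an `index_col` was supplied, and discards the index when none was.

```python
import pandas as pd

df = pd.DataFrame({"subject_id": [10, 11], "label": [0, 1]}).set_index("subject_id")

# index_col given -> drop=False: the index comes back as a column.
with_col = df["label"].to_frame("label").reset_index(drop=False)

# index_col is None -> drop=True: the index is discarded entirely.
without = df["label"].to_frame("label").reset_index(drop=True)
```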
- check for list/tuple instead of Sequence

Co-authored-by: Copilot <[email protected]>
- remove the sequence-is-string check, since inputs are now validated as list or tuple
- break once the metadata_path is found
- use high-level steps for load_mimic_iv_notes
Code Health Improved
(2 files improved in Code Health)
Gates Failed
Enforce critical code health rules
(1 file with Brain Method, Deep, Nested Complexity)
Enforce advisory code health rules
(5 files with Complex Method, Complex Conditional)
Gates Passed
2 Quality Gates Passed
See analysis details in CodeScene
Reason for failure

| Enforce critical code health rules | Violations | Code Health Impact |
|---|---|---|
| ehr.py | 2 critical rules | 9.54 → 7.59 |

| Enforce advisory code health rules | Violations | Code Health Impact |
|---|---|---|
| ehr.py | 1 advisory rule | 9.54 → 7.59 |
| tabular.py | 2 advisory rules | 8.61 → 7.78 |
| molecule.py | 1 advisory rule | 10.00 → 9.69 |
| protein.py | 1 advisory rule | 10.00 → 9.69 |
| supervised_labels.py | 1 advisory rule | 10.00 → 9.69 |
View Improvements
| File | Code Health Impact | Categories Improved |
|---|---|---|
| tabular.py | 8.61 → 7.78 | Complex Method |
| ehr.py | 9.54 → 7.59 | Complex Method, Bumpy Road Ahead |
| cxr.py | 9.46 → 9.64 | Complex Method, Bumpy Road Ahead |
| text.py | 9.33 → 9.53 | Complex Method, Bumpy Road Ahead |
Quality Gate Profile: Clean Code Collective
```python
def fetch_smiles_from_dataframe(
    df: Union[pd.DataFrame, str],
    smiles_col: str,
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
) -> pd.DataFrame:
```
❌ New issue: Complex Method
fetch_smiles_from_dataframe has a cyclomatic complexity of 9, threshold = 9
```python
    df: Union[pd.DataFrame, str],
    prot_seq_col: str,
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
```
❌ New issue: Complex Method
fetch_protein_sequences_from_dataframe has a cyclomatic complexity of 9, threshold = 9
```python
    df: Union[pd.DataFrame, str],
    label_col: Union[str, Sequence[str]],
    index_col: str = None,
    filter_rows: Optional[Dict[str, Union[Sequence, pd.Index]]] = None,
```
❌ New issue: Complex Method
fetch_supervised_labels_from_dataframe has a cyclomatic complexity of 11, threshold = 9
Refer to the Copilot post below 😉.
#4 (review)