feat: add generic copairs runner for flexible analysis configurations #96
Conversation
- Create phenotypic_analysis.py that combines activity and consistency analyses
- Add run_phenotypic_analysis.sh for easy execution with auto-download
- Update README.md to document the new CLI script

The script parameterizes the common workflow between the two notebooks, allowing users to run either analysis mode with a simple command.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
… runner

- Add generic CopairsRunner class supporting YAML/JSON configuration
- Create example configs matching notebook parameters for activity and consistency analyses
- Add runner script with automatic data download using wget
- Support multilabel preprocessing for pipe-separated target values
- Update .gitignore for example outputs and local settings

The new runner provides more flexibility than the previous unified script while maintaining the same analysis capabilities through configuration files.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
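For context, the multilabel preprocessing mentioned above boils down to splitting the pipe-separated target column and expanding to one row per target. A minimal pandas sketch, with hypothetical column names:

```python
import pandas as pd

# Toy profiles with a pipe-separated target column (column names are hypothetical).
df = pd.DataFrame({
    "Metadata_broad_sample": ["cpd_A", "cpd_B"],
    "Metadata_target": ["EGFR|ERBB2", "BRAF"],
    "feature_1": [0.1, -0.3],
})

# Split the multilabel column and explode so each target gets its own row.
df["Metadata_target"] = df["Metadata_target"].str.split("|")
df = df.explode("Metadata_target").reset_index(drop=True)
print(df)
```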
- Add filter_by_external_csv preprocessing step to filter data based on values from another CSV
- Update consistency config to filter by active compounds from activity analysis results
- Fix relative paths in configs to use ../data instead of ../../data
- Maintain simplicity by treating external CSV path like any other path dependency

This allows the consistency analysis to match the notebook workflow exactly by filtering to only phenotypically active compounds before analysis.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
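A rough sketch of the idea behind filter_by_external_csv (the function signature, path, and column names here are illustrative, not the exact implementation):

```python
import pandas as pd

def filter_by_external_csv(df, csv_path, column, external_column=None):
    """Keep rows of df whose `column` value appears in the external CSV."""
    external = pd.read_csv(csv_path)
    allowed = external[external_column or column].unique()
    return df[df[column].isin(allowed)].copy()

# e.g. restrict the consistency analysis to compounds flagged active upstream
# (path and column names are hypothetical):
# df = filter_by_external_csv(df, "../data/activity_results.csv", "Metadata_broad_sample")
```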
- Add aggregate_replicates preprocessing step to match notebook workflow
- Aggregates 6 replicates per compound to single consensus profile using median
- Keeps only groupby columns and features, dropping other metadata as in notebook
- Fixes issue where all targets appeared significant due to analyzing individual replicates

This ensures the consistency analysis results (4/26 significant targets) match the notebook exactly.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
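The replicate aggregation amounts to a groupby-median over the feature columns, keeping only the groupby keys. A minimal pandas sketch, assuming the usual Metadata_ prefix convention for non-feature columns:

```python
import pandas as pd

def aggregate_replicates(df, groupby_cols, operation="median"):
    """Collapse replicate rows into one consensus profile per group.

    Keeps only the groupby columns plus feature columns (here: anything
    not prefixed with "Metadata_"), mirroring the notebook behaviour.
    """
    feature_cols = [c for c in df.columns if not c.startswith("Metadata_")]
    return (
        df.groupby(groupby_cols, as_index=False, observed=True)[feature_cols]
          .agg(operation)
    )

# e.g. 6 replicates per compound -> 1 consensus row per compound (columns are hypothetical):
# consensus = aggregate_replicates(df, ["Metadata_broad_sample", "Metadata_target"])
```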
- Add consistent params nesting for both average_precision and mean_average_precision
- Fix validation to check correct mean_average_precision parameters
- Add metadata_regex documentation to class docstring
- Update paths to work from utils directory
- Clean up code symmetry between run methods

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
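To illustrate the nesting, here is a hedged sketch of how such a config block might look once loaded and unpacked. Parameter names and values are illustrative; the authoritative schema is in the YAML files under utils/configs.

```python
import yaml

config = yaml.safe_load("""
average_precision:
  params:
    pos_sameby: [Metadata_broad_sample]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: [Metadata_negcon]
mean_average_precision:
  params:
    sameby: [Metadata_broad_sample]
    null_size: 10000
    threshold: 0.05
    seed: 0
""")

# With both sections nested under `params`, the two run methods stay symmetric,
# e.g. (assuming the copairs functions accept these keyword arguments):
# from copairs.map import average_precision, mean_average_precision
# ap = average_precision(meta, feats, **config["average_precision"]["params"])
# maps = mean_average_precision(ap, **config["mean_average_precision"]["params"])
```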
- Add plotting configuration section to both activity and consistency YAML configs
- Implement plot_map_results() method to generate scatter plots of mAP vs -log10(p-value)
- Add matplotlib dependency to script requirements
- Create output directory in .gitignore for generated plots
- Integrate plotting into main pipeline after mAP calculation

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
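The plot itself is a simple scatter of mAP against -log10(p-value). A minimal matplotlib sketch, assuming result columns named mean_average_precision and corrected_p_value (not necessarily the exact names used by the runner):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_map_results(results: pd.DataFrame, output_path: str, threshold: float = 0.05):
    """Scatter mAP vs -log10(p-value) and mark the significance threshold."""
    neglog_p = -np.log10(results["corrected_p_value"])
    plt.figure(figsize=(6, 4))
    plt.scatter(results["mean_average_precision"], neglog_p, s=12)
    plt.axhline(-np.log10(threshold), linestyle="--", color="red", label=f"p = {threshold}")
    plt.xlabel("mAP")
    plt.ylabel("-log10(p-value)")
    plt.legend()
    plt.tight_layout()
    plt.savefig(output_path)
    plt.close()
```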
…l params

- Explain that runner validates required params but passes ALL config params
- Document optional parameters available for each copairs function
- Add example showing how to specify optional parameters in config
- List available optional params: batch_size, distance, progress_bar, max_workers, cache_dir
- Add references to copairs function signatures for complete parameter details

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
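Concretely, the validate-then-forward behaviour described above could look roughly like the following sketch. The constant and helper names are hypothetical; see the copairs function signatures for the full set of optional parameters.

```python
# Hypothetical names, for illustration only.
REQUIRED_AP_PARAMS = {"pos_sameby", "pos_diffby", "neg_sameby", "neg_diffby"}

def build_ap_kwargs(params: dict) -> dict:
    """Check that required keys are present, then forward everything.

    Optional knobs such as batch_size simply ride along untouched and are
    passed to the copairs call as keyword arguments.
    """
    missing = REQUIRED_AP_PARAMS - params.keys()
    if missing:
        raise ValueError(f"Missing required average_precision params: {sorted(missing)}")
    return dict(params)

# ap_scores = average_precision(meta, feats, **build_ap_kwargs(config["average_precision"]["params"]))
```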
I'll give it a try with the JUMP compound data to check the results.
I ran the […]. I also tried using:

```python
meta["Metadata_negcon"] = meta.Metadata_pert_type == "negcon"
```

Maybe we can simply add this as a preprocessing step? Possibly something like this in preprocess_data():

```python
elif step_type == "assign_column_mapping":
    col_name = step["column_name"]
    condition = step["condition"]
    df[col_name] = False
    if isinstance(condition, str):
        condition = df.query(condition).index
    df.loc[condition, col_name] = True
    logger.info(f"Identifying binary column belonging to {col_name} based on query {condition}")
```

```yaml
- type: assign_column_mapping
  params:
    column_name: "Metadata_negcon"
    condition: "Metadata_pert_type == 'negcon'"
```
…ng docs

- Add new preprocessing step 'add_column_from_query' to create boolean columns from pandas queries
- Rename 'assign_reference' to 'apply_assign_reference' for clarity on external function usage
- Add comprehensive docstring documenting all available preprocessing steps
- Include example usage in activity_analysis.yaml config

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
Ready for you to test. Note the rename of […].
Oh wait - there's an error.
- Add optional fill_value parameter to handle NaN values in query results
- Simplify logging to a consistent format for all data types
- Update documentation and example config with fill_value usage
- Remove overly complex boolean-specific logging

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
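A hedged sketch of how add_column_from_query with fill_value might behave; the real implementation lives in utils/copairs_runner.py, and this only illustrates the intent:

```python
import pandas as pd

def add_column_from_query(df, column_name, query, fill_value=False):
    """Create a column that is True for rows matching the pandas query.

    Non-matching rows (including rows where the queried column is NaN)
    get `fill_value` instead of being left as NaN.
    """
    df = df.copy()
    df[column_name] = fill_value
    df.loc[df.query(query).index, column_name] = True
    return df

# e.g. flag negative controls, leaving everything else False:
# df = add_column_from_query(df, "Metadata_negcon", "Metadata_pert_type == 'negcon'")
```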
@jfredinh actually ready now
- Add seaborn dependency for better plot styling
- Implement cleaner visualization with Tufte-inspired principles:
- Use range frames (despine) instead of full box
- Better color scheme: blue for significant, gray for non-significant
- Remove annotation box, use direct labeling
- Add subtle grid with proper layering
- Move annotation to top-left for better visibility
- Set fixed x-axis range (0-1.05) for mAP values
- Set y-axis range based on null_size for proper scaling
- Increase output DPI to 300 for higher quality plots
- Improve threshold line visibility with better color

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
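Putting those styling choices together, a sketch of what the updated plotting could look like; column names and the exact styling values are assumptions based on the commit message, not the code itself:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

def plot_map_results_styled(results, output_path, threshold=0.05, null_size=10000):
    """Styled mAP plot: range frames via despine, blue/gray significance coding."""
    neglog_p = -np.log10(results["corrected_p_value"])
    significant = results["corrected_p_value"] < threshold

    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(results.loc[significant, "mean_average_precision"], neglog_p[significant],
               s=14, color="#1f77b4", label="significant", zorder=3)
    ax.scatter(results.loc[~significant, "mean_average_precision"], neglog_p[~significant],
               s=14, color="lightgray", label="not significant", zorder=3)
    ax.axhline(-np.log10(threshold), color="dimgray", linestyle="--", linewidth=1, zorder=2)

    ax.set_xlim(0, 1.05)                        # fixed mAP range
    ax.set_ylim(0, np.log10(null_size) * 1.05)  # y range tied to the null distribution size
    ax.set_xlabel("mAP")
    ax.set_ylabel("-log10(p-value)")
    ax.grid(alpha=0.3, linewidth=0.5, zorder=1)
    sns.despine(ax=ax)                          # range frames instead of a full box
    ax.legend(loc="upper left", frameon=False)

    fig.savefig(output_path, dpi=300)
    plt.close(fig)
```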
Major simplifications to CopairsRunner (~120 lines removed):

- Remove config validation and trust user-provided configs
- Replace custom file I/O with direct pandas operations
- Drop JSON support (YAML only)
- Inline data loading logic
- Use pandas operations instead of SQL/DuckDB for preprocessing
- Simplify filter operations and remove redundant preprocessing steps

Key changes:

- utils/copairs_runner.py: Reduced from ~600 to ~480 lines
  - Removed load_config(), validate_config(), and file I/O helpers
  - Replaced SQL-based operations with pandas idioms
  - Consolidated filter operations (filter_active combines previous filters)
  - Simplified preprocessing dispatch using getattr()
- utils/configs/activity_analysis.yaml: Updated to use new filter_active verb
- utils/configs/consistency_analysis.yaml: Updated filter syntax for new pandas-based operations

The refactored code maintains all functionality while being more readable and maintainable, using familiar pandas operations throughout.

🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
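For readers skimming the diff, the getattr-based dispatch amounts to something like the following sketch; method and column names are illustrative, not the exact code:

```python
class CopairsRunner:
    """Minimal sketch of name-based dispatch for preprocessing steps."""

    def preprocess_data(self, df, steps):
        for step in steps:
            # Look up a method named after the step type, e.g. "filter_active"
            # -> self.preprocess_filter_active, and pass its params as kwargs.
            handler = getattr(self, f"preprocess_{step['type']}", None)
            if handler is None:
                raise ValueError(f"Unknown preprocessing step: {step['type']}")
            df = handler(df, **step.get("params", {}))
        return df

    def preprocess_filter_active(self, df, column="below_corrected_p"):
        # Hypothetical example step: keep only rows flagged as phenotypically active.
        return df[df[column]].copy()
```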
Moved over to broadinstitute/monorepo#91
Summary

Adds a generic copairs runner (utils/copairs_runner.py) that enables flexible analysis through YAML/JSON configuration.

Changes

New Generic Runner (utils/copairs_runner.py)

Example Configurations

- utils/configs/activity_analysis.yaml: Phenotypic activity assessment
- utils/configs/consistency_analysis.yaml: Target-based consistency analysis

Additional Files

- utils/run_examples.sh: Script to download data and run both example analyses
- .gitignore to exclude data and output directories

Motivation

The runner provides a reusable, configuration-driven approach to copairs analyses, making it easier to:

Test plan

- utils/run_examples.sh to test both example analyses

🤖 Generated with Claude Code