
Conversation

@shntnu (Member) commented Jul 3, 2025

Summary

  • Adds a generic copairs runner (utils/copairs_runner.py) that enables flexible analysis through YAML/JSON configuration
  • Includes example configurations for phenotypic activity and consistency analyses
  • Supports matplotlib plotting for mAP vs p-value visualization

Changes

New Generic Runner (utils/copairs_runner.py)

  • Configuration-driven analysis pipeline supporting:
    • Data loading from CSV/Parquet files
    • Flexible preprocessing (filtering, aggregation, reference assignment)
    • Average precision and mean average precision calculations
    • Multilabel support for compound-target analyses
    • Matplotlib plotting with customizable scatter plots
    • Parameter passthrough to underlying copairs functions
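A hypothetical configuration exercising these options might look like the following sketch (the key names are illustrative assumptions; the actual schema is defined in `utils/copairs_runner.py`, though `pos_sameby`/`neg_diffby`, `null_size`, and `threshold` are genuine copairs parameters):

```yaml
# Illustrative config sketch; key names are assumptions, not the runner's
# authoritative schema
data:
  path: data/profiles.parquet
preprocessing:
  - type: query
    params:
      expr: "Metadata_pert_type != 'poscon'"
average_precision:
  params:
    pos_sameby: [Metadata_compound]
    neg_diffby: [Metadata_compound]
mean_average_precision:
  params:
    sameby: [Metadata_compound]
    null_size: 10000
    threshold: 0.05
plot:
  enabled: true
  output: outputs/activity_map.png
```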

Example Configurations

  • utils/configs/activity_analysis.yaml: Phenotypic activity assessment
  • utils/configs/consistency_analysis.yaml: Target-based consistency analysis
  • Both replicate the original Jupyter notebook examples

Additional Files

  • utils/run_examples.sh: Script to download data and run both example analyses
  • Updated .gitignore to exclude data and output directories

Motivation

The runner provides a reusable, configuration-driven approach to copairs analyses, making it easier to:

  • Run standard analyses without writing custom code
  • Reproduce analyses with different parameters
  • Integrate copairs into automated pipelines
  • Share analysis configurations

Test plan

  • Run utils/run_examples.sh to test both example analyses
  • Verify output CSV files match expected results from notebooks
  • Check that plots are generated when enabled in config
  • Test with custom configurations

🤖 Generated with Claude Code

shntnu and others added 10 commits July 3, 2025 08:29
- Create phenotypic_analysis.py that combines activity and consistency analyses
- Add run_phenotypic_analysis.sh for easy execution with auto-download
- Update README.md to document the new CLI script

The script parameterizes the common workflow between the two notebooks,
allowing users to run either analysis mode with a simple command.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… runner

- Add generic CopairsRunner class supporting YAML/JSON configuration
- Create example configs matching notebook parameters for activity and consistency analyses
- Add runner script with automatic data download using wget
- Support multilabel preprocessing for pipe-separated target values
- Update .gitignore for example outputs and local settings

The new runner provides more flexibility than the previous unified script while maintaining
the same analysis capabilities through configuration files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
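The multilabel preprocessing for pipe-separated target values mentioned above can be sketched in pandas (the data and column name here are illustrative):

```python
import pandas as pd

# Hypothetical profiles with pipe-separated target annotations
df = pd.DataFrame({"Metadata_target": ["GENE1|GENE2", "GENE3"]})

# Expand each pipe-separated string into a list of labels, the shape that
# multilabel average-precision helpers expect (one label list per profile)
df["Metadata_target"] = df["Metadata_target"].str.split("|")
print(df["Metadata_target"].tolist())  # [['GENE1', 'GENE2'], ['GENE3']]
```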
- Add filter_by_external_csv preprocessing step to filter data based on values from another CSV
- Update consistency config to filter by active compounds from activity analysis results
- Fix relative paths in configs to use ../data instead of ../../data
- Maintain simplicity by treating external CSV path like any other path dependency

This allows the consistency analysis to match the notebook workflow exactly by filtering
to only phenotypically active compounds before analysis.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
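The `filter_by_external_csv` step described above can be approximated with an `isin` lookup; in this sketch the in-memory frame stands in for `pd.read_csv(<path from the config>)`, and the column names are illustrative:

```python
import pandas as pd

# Profiles to filter (illustrative data)
df = pd.DataFrame({"Metadata_compound": ["A", "B", "C"],
                   "feat_1": [0.1, 0.2, 0.3]})

# Stand-in for the activity-analysis results CSV
active = pd.DataFrame({"Metadata_compound": ["A", "C"]})

# Keep only rows whose compound appears in the external table
df = df[df["Metadata_compound"].isin(active["Metadata_compound"])]
print(df["Metadata_compound"].tolist())  # ['A', 'C']
```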
- Add aggregate_replicates preprocessing step to match notebook workflow
- Aggregates 6 replicates per compound to a single consensus profile using the median
- Keeps only groupby columns and features, dropping other metadata as in notebook
- Fixes issue where all targets appeared significant due to analyzing individual replicates

This ensures the consistency analysis results (4/26 significant targets) match the notebook exactly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
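The `aggregate_replicates` step amounts to a groupby-median over the feature columns, keeping only the groupby columns and features (data here is illustrative, with three replicates per compound rather than six):

```python
import pandas as pd

# Replicate profiles collapse to one consensus profile per compound
df = pd.DataFrame({
    "Metadata_compound": ["cmpd1"] * 3 + ["cmpd2"] * 3,
    "Metadata_well": list("ABCDEF"),   # other metadata is dropped
    "feat_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

features = ["feat_1"]
# Keep only the groupby columns and features, mirroring the notebook's
# consensus step
consensus = df.groupby("Metadata_compound", as_index=False)[features].median()
print(consensus["feat_1"].tolist())  # [2.0, 5.0]
```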
- Add consistent params nesting for both average_precision and mean_average_precision
- Fix validation to check correct mean_average_precision parameters
- Add metadata_regex documentation to class docstring
- Update paths to work from utils directory
- Clean up code symmetry between run methods

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add plotting configuration section to both activity and consistency YAML configs
- Implement plot_map_results() method to generate scatter plots of mAP vs -log10(p-value)
- Add matplotlib dependency to script requirements
- Create output directory in .gitignore for generated plots
- Integrate plotting into main pipeline after mAP calculation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
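A minimal sketch of the mAP vs -log10(p-value) scatter described above; the example scores are fabricated for illustration, and the styling details (colors, threshold line) only approximate what `plot_map_results()` produces:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

# Illustrative mAP scores and corrected p-values
map_scores = np.array([0.2, 0.6, 0.9])
p_values = np.array([0.3, 0.01, 0.001])
neg_log_p = -np.log10(p_values)
significant = p_values < 0.05

fig, ax = plt.subplots()
ax.scatter(map_scores[significant], neg_log_p[significant],
           label="significant")
ax.scatter(map_scores[~significant], neg_log_p[~significant],
           label="not significant")
ax.axhline(-np.log10(0.05), linestyle="--")  # significance threshold
ax.set_xlabel("mAP")
ax.set_ylabel("-log10(p-value)")
ax.legend()
fig.savefig("map_scatter.png", dpi=300)
```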
…l params

- Explain that runner validates required params but passes ALL config params
- Document optional parameters available for each copairs function
- Add example showing how to specify optional parameters in config
- List available optional params: batch_size, distance, progress_bar, max_workers, cache_dir
- Add references to copairs function signatures for complete parameter details

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
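The optional-parameter passthrough can be illustrated with a config fragment. The optional names below (`batch_size`, `distance`, `max_workers`) come from the commit above; the surrounding keys are assumptions about the config schema:

```yaml
average_precision:
  params:
    pos_sameby: [Metadata_compound]
    neg_diffby: [Metadata_compound]
    batch_size: 20000      # optional, passed through to copairs
    distance: cosine       # optional
mean_average_precision:
  params:
    sameby: [Metadata_compound]
    null_size: 10000
    threshold: 0.05
    max_workers: 8         # optional
```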
@jfredinh (Collaborator) commented Jul 3, 2025

I'll give it a try with the JUMP compound data to check the results.

@shntnu shntnu requested a review from jfredinh July 3, 2025 15:42
@jfredinh (Collaborator) commented Jul 3, 2025

I ran run_examples.sh without any problems. It seems to output what is expected.

I also tried using copairs_runner.py with a custom YAML config for JUMP. However, a small additional function probably has to be added to map negative controls to a particular negcon column, as done here:

```python
meta["Metadata_negcon"] = meta.Metadata_pert_type == "negcon"
```

https://github.com/jfredinh/jump-production/blob/6df9b448f41f4838c1e6aca9537a9699d3506817/04.compare-profiles/utils.py#L114

Maybe we can simply add this as a preprocessing step?

Possibly something like this in `preprocess_data()`:

```python
elif step_type == "assign_column_mapping":
    col_name = step["column_name"]
    condition = step["condition"]
    df[col_name] = False
    if isinstance(condition, str):
        condition = df.query(condition).index
    df.loc[condition, col_name] = True
    logger.info(f"Identifying binary column belonging to {col_name} based on query {condition}")
```

with a config entry like:

```yaml
  - type: assign_column_mapping
    params:
      column_name: "Metadata_negcon"
      condition: "Metadata_pert_type == 'negcon'"
```

…ng docs

- Add new preprocessing step 'add_column_from_query' to create boolean columns from pandas queries
- Rename 'assign_reference' to 'apply_assign_reference' for clarity on external function usage
- Add comprehensive docstring documenting all available preprocessing steps
- Include example usage in activity_analysis.yaml config

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
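The new `add_column_from_query` step can be sketched as deriving a boolean column from a pandas query expression (data and column names here are illustrative):

```python
import pandas as pd

# Illustrative profiles with a perturbation-type column
df = pd.DataFrame({"Metadata_pert_type": ["negcon", "trt", "trt"]})

# eval() with a comparison expression yields the boolean column directly
df["Metadata_negcon"] = df.eval("Metadata_pert_type == 'negcon'")
print(df["Metadata_negcon"].tolist())  # [True, False, False]
```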
@shntnu (Member, Author) commented Jul 3, 2025

Maybe we can simply add this as a preprocessing step?

Ready for you to test

Note the rename of apply_assign_reference

@shntnu (Member, Author) commented Jul 3, 2025

Ready for you to test

Oh wait - there's an error

- Add optional fill_value parameter to handle NaN values in query results
- Simplify logging to a consistent format for all data types
- Update documentation and example config with fill_value usage
- Remove overly complex boolean-specific logging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Member, Author) commented Jul 3, 2025

@jfredinh actually ready now

shntnu and others added 2 commits July 3, 2025 15:48
- Add seaborn dependency for better plot styling
- Implement cleaner visualization with Tufte-inspired principles:
  - Use range frames (despine) instead of full box
  - Better color scheme: blue for significant, gray for non-significant
  - Remove annotation box, use direct labeling
  - Add subtle grid with proper layering
  - Move annotation to top-left for better visibility
- Set fixed x-axis range (0-1.05) for mAP values
- Set y-axis range based on null_size for proper scaling
- Increase output DPI to 300 for higher quality plots
- Improve threshold line visibility with better color

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Major simplifications to CopairsRunner (~120 lines removed):
- Remove config validation and trust user-provided configs
- Replace custom file I/O with direct pandas operations
- Drop JSON support (YAML only)
- Inline data loading logic
- Use pandas operations instead of SQL/DuckDB for preprocessing
- Simplify filter operations and remove redundant preprocessing steps

Key changes:
- utils/copairs_runner.py: Reduced from ~600 to ~480 lines
  - Removed load_config(), validate_config(), and file I/O helpers
  - Replaced SQL-based operations with pandas idioms
  - Consolidated filter operations (filter_active combines previous filters)
  - Simplified preprocessing dispatch using getattr()

- utils/configs/activity_analysis.yaml: Updated to use new filter_active verb
- utils/configs/consistency_analysis.yaml: Updated filter syntax for new pandas-based operations

The refactored code maintains all functionality while being more readable
and maintainable, using familiar pandas operations throughout.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
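The getattr()-based preprocessing dispatch mentioned above can be sketched as follows; the `Runner` class and step names here are illustrative, not the actual `CopairsRunner` API:

```python
import pandas as pd

class Runner:
    """Dispatch each config step to a method named step_<type>."""

    def preprocess(self, df, steps):
        for step in steps:
            # Look up the handler by the step's declared type
            handler = getattr(self, f"step_{step['type']}")
            df = handler(df, **step.get("params", {}))
        return df

    def step_query(self, df, expr):
        # One concrete step: filter rows with a pandas query expression
        return df.query(expr)

runner = Runner()
df = pd.DataFrame({"x": [1, 2, 3]})
out = runner.preprocess(df, [{"type": "query", "params": {"expr": "x > 1"}}])
print(out["x"].tolist())  # [2, 3]
```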
@shntnu (Member, Author) commented Jul 9, 2025

Moved over to broadinstitute/monorepo#91

@shntnu shntnu closed this Jul 9, 2025
@shntnu shntnu deleted the runner branch July 9, 2025 13:29
@shntnu shntnu restored the runner branch July 9, 2025 16:34