
Conversation

@shntnu (Member) commented Jul 3, 2025

Summary

  • Adds a generic copairs runner (utils/copairs_runner.py) that enables flexible analysis through YAML/JSON configuration
  • Includes example configurations for phenotypic activity and consistency analyses
  • Supports matplotlib plotting for mAP vs p-value visualization

Changes

New Generic Runner (utils/copairs_runner.py)

  • Configuration-driven analysis pipeline supporting:
    • Data loading from CSV/Parquet files
    • Flexible preprocessing (filtering, aggregation, reference assignment)
    • Average precision and mean average precision calculations
    • Multilabel support for compound-target analyses
    • Matplotlib plotting with customizable scatter plots
    • Parameter passthrough to underlying copairs functions
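A hypothetical configuration exercising these options might look like the following sketch (the key names are illustrative assumptions; the actual schema is defined in `utils/copairs_runner.py`, though `pos_sameby`/`neg_diffby`, `null_size`, and `threshold` are genuine copairs parameters):

```yaml
# Illustrative config sketch; key names are assumptions, not the runner's
# authoritative schema
data:
  path: data/profiles.parquet
preprocessing:
  - type: query
    params:
      expr: "Metadata_pert_type != 'poscon'"
average_precision:
  params:
    pos_sameby: [Metadata_compound]
    neg_diffby: [Metadata_compound]
mean_average_precision:
  params:
    sameby: [Metadata_compound]
    null_size: 10000
    threshold: 0.05
plot:
  enabled: true
  output: outputs/activity_map.png
```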

Example Configurations

  • utils/configs/activity_analysis.yaml: Phenotypic activity assessment
  • utils/configs/consistency_analysis.yaml: Target-based consistency analysis
  • Both replicate the original Jupyter notebook examples

Additional Files

  • utils/run_examples.sh: Script to download data and run both example analyses
  • Updated .gitignore to exclude data and output directories

Motivation

The runner provides a reusable, configuration-driven approach to copairs analyses, making it easier to:

  • Run standard analyses without writing custom code
  • Reproduce analyses with different parameters
  • Integrate copairs into automated pipelines
  • Share analysis configurations

Test plan

  • Run utils/run_examples.sh to test both example analyses
  • Verify output CSV files match expected results from notebooks
  • Check that plots are generated when enabled in config
  • Test with custom configurations

🤖 Generated with Claude Code

shntnu and others added 10 commits July 3, 2025 08:29
- Create phenotypic_analysis.py that combines activity and consistency analyses
- Add run_phenotypic_analysis.sh for easy execution with auto-download
- Update README.md to document the new CLI script

The script parameterizes the common workflow between the two notebooks,
allowing users to run either analysis mode with a simple command.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
… runner

- Add generic CopairsRunner class supporting YAML/JSON configuration
- Create example configs matching notebook parameters for activity and consistency analyses
- Add runner script with automatic data download using wget
- Support multilabel preprocessing for pipe-separated target values
- Update .gitignore for example outputs and local settings

The new runner provides more flexibility than the previous unified script while maintaining
the same analysis capabilities through configuration files.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
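The multilabel preprocessing for pipe-separated target values mentioned above can be sketched in pandas (the data and column name here are illustrative):

```python
import pandas as pd

# Hypothetical profiles with pipe-separated target annotations
df = pd.DataFrame({"Metadata_target": ["GENE1|GENE2", "GENE3"]})

# Expand each pipe-separated string into a list of labels, the shape that
# multilabel average-precision helpers expect (one label list per profile)
df["Metadata_target"] = df["Metadata_target"].str.split("|")
print(df["Metadata_target"].tolist())  # [['GENE1', 'GENE2'], ['GENE3']]
```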
- Add filter_by_external_csv preprocessing step to filter data based on values from another CSV
- Update consistency config to filter by active compounds from activity analysis results
- Fix relative paths in configs to use ../data instead of ../../data
- Maintain simplicity by treating external CSV path like any other path dependency

This allows the consistency analysis to match the notebook workflow exactly by filtering
to only phenotypically active compounds before analysis.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
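The `filter_by_external_csv` step described above can be approximated with an `isin` lookup; in this sketch the in-memory frame stands in for `pd.read_csv(<path from the config>)`, and the column names are illustrative:

```python
import pandas as pd

# Profiles to filter (illustrative data)
df = pd.DataFrame({"Metadata_compound": ["A", "B", "C"],
                   "feat_1": [0.1, 0.2, 0.3]})

# Stand-in for the activity-analysis results CSV
active = pd.DataFrame({"Metadata_compound": ["A", "C"]})

# Keep only rows whose compound appears in the external table
df = df[df["Metadata_compound"].isin(active["Metadata_compound"])]
print(df["Metadata_compound"].tolist())  # ['A', 'C']
```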
- Add aggregate_replicates preprocessing step to match notebook workflow
- Aggregates 6 replicates per compound to a single consensus profile using the median
- Keeps only groupby columns and features, dropping other metadata as in notebook
- Fixes issue where all targets appeared significant due to analyzing individual replicates

This ensures the consistency analysis results (4/26 significant targets) match the notebook exactly.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
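The `aggregate_replicates` step amounts to a groupby-median over the feature columns, keeping only the groupby columns and features (data here is illustrative, with three replicates per compound rather than six):

```python
import pandas as pd

# Replicate profiles collapse to one consensus profile per compound
df = pd.DataFrame({
    "Metadata_compound": ["cmpd1"] * 3 + ["cmpd2"] * 3,
    "Metadata_well": list("ABCDEF"),   # other metadata is dropped
    "feat_1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

features = ["feat_1"]
# Keep only the groupby columns and features, mirroring the notebook's
# consensus step
consensus = df.groupby("Metadata_compound", as_index=False)[features].median()
print(consensus["feat_1"].tolist())  # [2.0, 5.0]
```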
- Add consistent params nesting for both average_precision and mean_average_precision
- Fix validation to check correct mean_average_precision parameters
- Add metadata_regex documentation to class docstring
- Update paths to work from utils directory
- Clean up code symmetry between run methods

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
- Add plotting configuration section to both activity and consistency YAML configs
- Implement plot_map_results() method to generate scatter plots of mAP vs -log10(p-value)
- Add matplotlib dependency to script requirements
- Create output directory in .gitignore for generated plots
- Integrate plotting into main pipeline after mAP calculation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
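A minimal sketch of the mAP vs -log10(p-value) scatter described above; the example scores are fabricated for illustration, and the styling details (colors, threshold line) only approximate what `plot_map_results()` produces:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted use
import matplotlib.pyplot as plt

# Illustrative mAP scores and corrected p-values
map_scores = np.array([0.2, 0.6, 0.9])
p_values = np.array([0.3, 0.01, 0.001])
neg_log_p = -np.log10(p_values)
significant = p_values < 0.05

fig, ax = plt.subplots()
ax.scatter(map_scores[significant], neg_log_p[significant],
           label="significant")
ax.scatter(map_scores[~significant], neg_log_p[~significant],
           label="not significant")
ax.axhline(-np.log10(0.05), linestyle="--")  # significance threshold
ax.set_xlabel("mAP")
ax.set_ylabel("-log10(p-value)")
ax.legend()
fig.savefig("map_scatter.png", dpi=300)
```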
…l params

- Explain that runner validates required params but passes ALL config params
- Document optional parameters available for each copairs function
- Add example showing how to specify optional parameters in config
- List available optional params: batch_size, distance, progress_bar, max_workers, cache_dir
- Add references to copairs function signatures for complete parameter details

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
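The optional-parameter passthrough can be illustrated with a config fragment. The optional names below (`batch_size`, `distance`, `max_workers`) come from the commit above; the surrounding keys are assumptions about the config schema:

```yaml
average_precision:
  params:
    pos_sameby: [Metadata_compound]
    neg_diffby: [Metadata_compound]
    batch_size: 20000      # optional, passed through to copairs
    distance: cosine       # optional
mean_average_precision:
  params:
    sameby: [Metadata_compound]
    null_size: 10000
    threshold: 0.05
    max_workers: 8         # optional
```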
@jfredinh (Collaborator) commented Jul 3, 2025

I'll give it a try with the JUMP compound data to check the results.

@shntnu shntnu requested a review from jfredinh July 3, 2025 15:42
@jfredinh (Collaborator) commented Jul 3, 2025

I ran run_examples.sh without any problems. It seems to output what is expected.

I also tried using copairs_runner.py with a custom YAML config for JUMP. However, a small additional function probably has to be added to map negative controls to a particular negcon column, as done here:

```python
meta["Metadata_negcon"] = meta.Metadata_pert_type == "negcon"
```

https://github.com/jfredinh/jump-production/blob/6df9b448f41f4838c1e6aca9537a9699d3506817/04.compare-profiles/utils.py#L114

Maybe we can simply add this as a preprocessing step?

Possibly something like this in `preprocess_data()`:

```python
elif step_type == "assign_column_mapping":
    col_name = step["column_name"]
    condition = step["condition"]
    df[col_name] = False
    if isinstance(condition, str):
        condition = df.query(condition).index
    df.loc[condition, col_name] = True
    logger.info(f"Identifying binary column belonging to {col_name} based on query {condition}")
```

with a config entry like:

```yaml
  - type: assign_column_mapping
    params:
      column_name: "Metadata_negcon"
      condition: "Metadata_pert_type == 'negcon'"
```

…ng docs

- Add new preprocessing step 'add_column_from_query' to create boolean columns from pandas queries
- Rename 'assign_reference' to 'apply_assign_reference' for clarity on external function usage
- Add comprehensive docstring documenting all available preprocessing steps
- Include example usage in activity_analysis.yaml config

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
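The new `add_column_from_query` step can be sketched as deriving a boolean column from a pandas query expression (data and column names here are illustrative):

```python
import pandas as pd

# Illustrative profiles with a perturbation-type column
df = pd.DataFrame({"Metadata_pert_type": ["negcon", "trt", "trt"]})

# eval() with a comparison expression yields the boolean column directly
df["Metadata_negcon"] = df.eval("Metadata_pert_type == 'negcon'")
print(df["Metadata_negcon"].tolist())  # [True, False, False]
```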
@shntnu (Member, Author) commented Jul 3, 2025

Maybe we can simply add this as a preprocessing step?

Ready for you to test

Note the rename of apply_assign_reference

@shntnu (Member, Author) commented Jul 3, 2025

Ready for you to test

Oh wait - there's an error

- Add optional fill_value parameter to handle NaN values in query results
- Simplify logging to a consistent format for all data types
- Update documentation and example config with fill_value usage
- Remove overly complex boolean-specific logging

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
@shntnu (Member, Author) commented Jul 3, 2025

@jfredinh actually ready now

shntnu and others added 2 commits July 3, 2025 15:48
- Add seaborn dependency for better plot styling
- Implement cleaner visualization with Tufte-inspired principles:
  - Use range frames (despine) instead of full box
  - Better color scheme: blue for significant, gray for non-significant
  - Remove annotation box, use direct labeling
  - Add subtle grid with proper layering
  - Move annotation to top-left for better visibility
- Set fixed x-axis range (0-1.05) for mAP values
- Set y-axis range based on null_size for proper scaling
- Increase output DPI to 300 for higher quality plots
- Improve threshold line visibility with better color

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
Major simplifications to CopairsRunner (~120 lines removed):
- Remove config validation and trust user-provided configs
- Replace custom file I/O with direct pandas operations
- Drop JSON support (YAML only)
- Inline data loading logic
- Use pandas operations instead of SQL/DuckDB for preprocessing
- Simplify filter operations and remove redundant preprocessing steps

Key changes:
- utils/copairs_runner.py: Reduced from ~600 to ~480 lines
  - Removed load_config(), validate_config(), and file I/O helpers
  - Replaced SQL-based operations with pandas idioms
  - Consolidated filter operations (filter_active combines previous filters)
  - Simplified preprocessing dispatch using getattr()

- utils/configs/activity_analysis.yaml: Updated to use new filter_active verb
- utils/configs/consistency_analysis.yaml: Updated filter syntax for new pandas-based operations

The refactored code maintains all functionality while being more readable
and maintainable, using familiar pandas operations throughout.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <[email protected]>
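The getattr()-based preprocessing dispatch mentioned above can be sketched as follows; the `Runner` class and step names here are illustrative, not the actual `CopairsRunner` API:

```python
import pandas as pd

class Runner:
    """Dispatch each config step to a method named step_<type>."""

    def preprocess(self, df, steps):
        for step in steps:
            # Look up the handler by the step's declared type
            handler = getattr(self, f"step_{step['type']}")
            df = handler(df, **step.get("params", {}))
        return df

    def step_query(self, df, expr):
        # One concrete step: filter rows with a pandas query expression
        return df.query(expr)

runner = Runner()
df = pd.DataFrame({"x": [1, 2, 3]})
out = runner.preprocess(df, [{"type": "query", "params": {"expr": "x > 1"}}])
print(out["x"].tolist())  # [2, 3]
```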
@shntnu (Member, Author) commented Jul 9, 2025

Moved over to broadinstitute/monorepo#91

@shntnu shntnu closed this Jul 9, 2025
@shntnu shntnu deleted the runner branch July 9, 2025 13:29
@shntnu shntnu restored the runner branch July 9, 2025 16:34