49 commits
- `328d232` feat(copairs_runner): add configurable YAML-driven runner for copairs… (shntnu, Jul 9, 2025)
- `ec785de` Update README.md (shntnu, Jul 9, 2025)
- `623b8a0` feat(copairs_runner): add preprocessing step and config-relative path… (shntnu, Jul 9, 2025)
- `67654dc` feat(copairs_runner): add lazy loading and standardize config naming (shntnu, Jul 9, 2025)
- `e05de00` fix: format (shntnu, Jul 9, 2025)
- `4f31ff1` docs(copairs_runner): update CLAUDE.md and README.md with current con… (shntnu, Jul 9, 2025)
- `8a22d4b` docs(copairs_runner): improve README configuration example (shntnu, Jul 9, 2025)
- `5d561f9` docs(copairs_runner): clarify lazy vs preprocessing filtering (shntnu, Jul 9, 2025)
- `8519b69` docs(copairs_runner): add CONTRIBUTING.md with preprocessing guidelines (shntnu, Jul 9, 2025)
- `1509b5e` docs(copairs_runner): emphasize copairs as end-stage analysis (shntnu, Jul 9, 2025)
- `12e3b81` Update libs/copairs_runner/run_examples.sh (shntnu, Jul 9, 2025)
- `9e793ea` fix(copairs_runner): address PR review comments (shntnu, Jul 9, 2025)
- `7537e9b` refactor(copairs_runner): switch to CWD-relative path resolution (shntnu, Jul 10, 2025)
- `050d3d2` refactor(copairs_runner): migrate from PyYAML to OmegaConf (shntnu, Jul 10, 2025)
- `0aeb68b` feat(copairs_runner): migrate from argparse to Hydra (shntnu, Jul 10, 2025)
- `e0960ab` feat(copairs_runner): configure Hydra to use existing output directory (shntnu, Jul 10, 2025)
- `6e10d9e` docs(copairs_runner): update documentation for Hydra migration (shntnu, Jul 10, 2025)
- `ae16e82` fix(copairs_runner): cleanup (shntnu, Jul 10, 2025)
- `3c8d354` fix(copairs_runner): typo (shntnu, Jul 10, 2025)
- `69caf31` refactor(copairs_runner): simplify code and improve documentation (shntnu, Jul 10, 2025)
- `d7ed157` feat(copairs_runner): implement Hydra best practices for path handling (shntnu, Jul 10, 2025)
- `832edaf` refactor(copairs_runner): implement unified output handling with fixe… (shntnu, Jul 10, 2025)
- `ead89f4` refactor(copairs_runner): rename 'data' config section to 'input' for… (shntnu, Jul 10, 2025)
- `16abb52` docs(copairs_runner): add design note on preprocessing list configura… (shntnu, Jul 10, 2025)
- `303a982` fix(copairs_runner): prevent Hydra runtime file overwrites in shared … (shntnu, Jul 10, 2025)
- `2d6d2b0` fix(copairs_runner): add observed=True to groupby and improve logging (shntnu, Jul 10, 2025)
- `992cdf7` feat(copairs_runner): add JUMP CPCNN example configs demonstrating sh… (shntnu, Jul 10, 2025)
- `61bd877` docs(copairs_runner): add ROADMAP.md with AI-assisted Hydra feature p… (shntnu, Jul 10, 2025)
- `e6e36a1` feat(copairs_runner): convert to installable package while keeping st… (shntnu, Aug 4, 2025)
- `e0c34c7` fix(copairs_runner): add missing __init__.py and fix wheel build config (shntnu, Aug 4, 2025)
- `fd1071e` fix: add quotes (shntnu, Aug 4, 2025)
- `a8cf8e0` refactor(copairs_runner): use dynamic versioning from __init__.py (shntnu, Aug 4, 2025)
- `7054308` fix(copairs_runner): remove hardcoded config path for package compati… (shntnu, Aug 4, 2025)
- `ffe737c` docs(copairs_runner): add concise help message for CLI usage (shntnu, Aug 4, 2025)
- `92f0b91` refactor(copairs_runner): finalize package structure and update docum… (shntnu, Aug 5, 2025)
- `b0193fb` fix(copairs_runner): use explicit .loc accessor to avoid pandas Setti… (shntnu, Aug 5, 2025)
- `80bf20d` fix(copairs_runner): add .copy() to prevent SettingWithCopyWarning wh… (shntnu, Aug 5, 2025)
- `b848ef1` feat(copairs_runner): add DuckDB support to merge_metadata preprocess… (shntnu, Aug 13, 2025)
- `7f5eb86` fix: empty string is None (shntnu, Aug 14, 2025)
- `f39f010` refactor(copairs_runner): simplify path handling using Hydra best pra… (shntnu, Aug 15, 2025)
- `70c7f29` feat(copairs_runner): add Parquet output format support (shntnu, Aug 26, 2025)
- `b5ec577` test: turn the run_examples.sh script into a minimal integration test… (shntnu, Aug 26, 2025)
- `e98c3a8` fix: drop claim about test passing because it is fairly loose (does n… (shntnu, Aug 26, 2025)
- `108863d` fix(copairs_runner): remove hardcoded threshold and add Parquet suppo… (shntnu, Aug 27, 2025)
- `6417bae` feat(copairs_runner): add support for modular preprocessing sections (shntnu, Aug 27, 2025)
- `8523bf0` Set read_only=true to allow concurrent reads (shntnu, Aug 27, 2025)
- `b93522b` feat(copairs_runner): add support for normalized mAP visualization (shntnu, Sep 7, 2025)
- `e322e23` fix(copairs_runner): fix font (shntnu, Sep 8, 2025)
- `48f744c` fix(copairs_runner): clarify (shntnu, Sep 8, 2025)
9 changes: 9 additions & 0 deletions libs/copairs_runner/.gitignore
@@ -0,0 +1,9 @@
input/
output/
.claude/settings.local.json
dist/
*.egg-info/
__pycache__/
.pytest_cache/
.ruff_cache/
.DS_Store
105 changes: 105 additions & 0 deletions libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,105 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

copairs_runner is a configurable Python script for running copairs analyses on cell painting data. It's part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.

**For usage, configuration, and examples, see [README.md](README.md).**

## Development Context

### Key Development Commands
```bash
# Lint and format (monorepo standard)
uvx ruff check src/copairs_runner/copairs_runner.py --fix
uvx ruff format src/copairs_runner/copairs_runner.py

# Run tests (when implemented)
pytest tests/

# Test changes
export COPAIRS_DATA=. COPAIRS_OUTPUT=.
bash run_examples.sh
```

## Architecture Decisions

### Package Design
- **src/copairs_runner/copairs_runner.py** maintains single-file logic with inline dependencies (PEP 723)
- Now packaged for easy installation via `uv add`
- Supports both standalone script execution and installed package usage
- Hydra-based configuration for flexibility without code changes

### Key Design Patterns
1. **Fixed Output Pattern**: Always saves 3 files per analysis (ap_scores, map_results, map_plot)
2. **Dictionary-based Results**: `save_results()` takes a dict for easy extension
3. **Preprocessing Pipeline**: Each step is a method `_preprocess_{type}` for consistency
4. **Path Resolution**: Handles local files, URLs, and S3 uniformly via `resolve_path()`
5. **Nested Subdirectory Pattern**: For dependent workflows (e.g., LINCS), each analysis uses its own subdirectory within a shared parent to prevent Hydra runtime file overwrites while maintaining predictable relative paths
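The `_preprocess_{type}` convention in pattern 3 can be sketched as a small dispatch loop. This is a minimal illustration with assumed names (`PreprocessSketch` is not the runner's actual class), not the real implementation:

```python
import pandas as pd


class PreprocessSketch:
    """Minimal sketch of the `_preprocess_{type}` dispatch convention."""

    def _preprocess_filter(self, df: pd.DataFrame, params: dict) -> pd.DataFrame:
        # Pandas query syntax, applied after data is loaded
        return df.query(params["query"])

    def preprocess_data(self, df: pd.DataFrame, steps: list) -> pd.DataFrame:
        # Each step's 'type' selects the matching _preprocess_* method
        for step in steps:
            method = getattr(self, f"_preprocess_{step['type']}")
            df = method(df, step.get("params", {}))
        return df


runner = PreprocessSketch()
df = pd.DataFrame({"Metadata_dose": [0.05, 0.5, 1.0]})
result = runner.preprocess_data(
    df, [{"type": "filter", "params": {"query": "Metadata_dose > 0.1"}}]
)
```

The `getattr` lookup is what makes each step a method named `_preprocess_{type}`: adding a new step type only requires adding a method following the naming convention.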

## Monorepo Context

This project follows monorepo standards:
- **uv** for package management (not Poetry)
- **ruff** for formatting/linting (run via `uvx`)
- **pytest** for testing (target >90% coverage)
- **numpy** documentation style
- Conventional commits

### Current Limitations
- No test suite yet (priority for future work)
- Single-file design may need refactoring if complexity grows
- Fixed 3-file output pattern (by design for simplicity)

## Implementation Guidelines

### Adding New Preprocessing Steps
```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description.

    Must follow this exact signature. Access params via dict lookup.
    Log the operation with: logger.info(f"Message: {result}")
    """
    # Implementation
    return df
```

Then update the docstring in `preprocess_data()` to document the new step.

### Important Implementation Details

1. **Lazy Loading vs Preprocessing**:
- `input.filter_query` uses SQL syntax (polars) - happens BEFORE loading
- `preprocessing.filter` uses pandas query syntax - happens AFTER loading
- This distinction is critical for large datasets

2. **Error Handling**:
- Config validation happens in `_validate_config()`
- Missing required params raise `ValueError` with clear messages
- Use `params.get("key", default)` for optional parameters

3. **Logging Patterns**:
- Always log row counts after filtering operations
- Log first 5 columns when loading data for verification
- Use `logger.info()` not print()

### Testing Approach (when implemented)
- Unit test each preprocessing step independently
- Integration test full pipeline with small test data
- Test config validation edge cases
- Mock external data sources (URLs, S3)

## Key Gotchas

1. **Environment Variables**: Must be set before running if configs use `${oc.env:VAR}`
2. **Memory Usage**: Use lazy loading for large parquet files to avoid OOM
3. **Path Resolution**: All paths are relative to where you run the script, not the config file location
4. **Shared Directory Structure**: When creating dependent analyses (where one reads another's output), always use nested subdirectories (e.g., `shared/activity/` and `shared/consistency/`) to prevent Hydra runtime files from being overwritten between runs

## Collaboration Guidelines

- Never coauthor commits with Claude
- Always use `uv run python` instead of `python3`
57 changes: 57 additions & 0 deletions libs/copairs_runner/CONTRIBUTING.md
@@ -0,0 +1,57 @@
# Contributing to copairs_runner

## Preprocessing Steps

The preprocessing pipeline intentionally provides a minimal DSL to avoid recreating pandas/SQL in YAML. Before adding new steps, consider whether users should handle the transformation externally.

**Important context**: Copairs analysis typically happens at the end of a morphological profiling pipeline. By this stage, your data should already be:
- Quality-controlled and normalized
- Aggregated to appropriate levels
- Filtered for relevant samples
- Properly annotated with metadata

If you find yourself needing extensive preprocessing here, it likely indicates issues with your upstream pipeline.

### Alternatives to New Steps

1. **Lazy filtering** - For large parquet files, use polars' SQL syntax before loading:
```yaml
input:
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"
```

2. **External preprocessing** - Complex transformations belong in Python/SQL scripts, not YAML configs

3. **Composition** - Combine existing steps rather than creating specialized ones
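For example, a filter-then-clean transformation composes from existing steps rather than needing a bespoke one (a sketch; check the `preprocess_data()` docstring for exact parameter names):

```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"
    - type: dropna
```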

### When to Add a Step

Add a step only if it:
- Integrates with copairs-specific functionality (e.g., `apply_assign_reference`)
- Handles last-mile transformations specific to copairs analysis
- Requires runner context (resolved paths, metadata patterns)
- Has been requested by multiple users

Remember: needing complex preprocessing at this stage often indicates upstream processing gaps.

### Implementation

```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description."""
    # Implementation
    logger.info("Describe what the step did")
    return df
```

Update the `preprocess_data()` docstring with parameters and add a usage example.

### Design Constraints

- Keep implementations under ~10 lines
- Single responsibility per step
- Clear parameter validation
- Informative error messages

The goal is providing just enough convenience without creating a parallel data manipulation framework. Most preprocessing should happen before data reaches this runner.
181 changes: 181 additions & 0 deletions libs/copairs_runner/README.md
@@ -0,0 +1,181 @@
# Copairs Runner

YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).

## Installation

```bash
# Install as package
uv add "git+https://github.com/broadinstitute/monorepo.git@copairs-runner#subdirectory=libs/copairs_runner"
```

## Usage

```bash
# Set environment variables if used in config
export COPAIRS_DATA=. COPAIRS_OUTPUT=.

# As installed package
uv run copairs-runner --config-dir configs --config-name example_activity_lincs

# Or run standalone script directly from GitHub (includes inline dependencies)
SCRIPT_URL="https://raw.githubusercontent.com/broadinstitute/monorepo/copairs-runner/libs/copairs_runner/src/copairs_runner/copairs_runner.py"
uv run "$SCRIPT_URL" --config-name example_activity_lincs

# Override parameters
uv run copairs-runner --config-dir configs --config-name example_activity_lincs mean_average_precision.params.null_size=50000
```

### Output Files

Each analysis run generates exactly three files:
- `{name}_ap_scores.csv` - Individual average precision scores
- `{name}_map_results.csv` - Mean average precision with p-values
- `{name}_map_plot.png` - Scatter plot of mAP vs -log10(p-value)

## Configuration

### Path Resolution

The runner uses Hydra's best practices for path handling:

- **Input paths** are resolved using Hydra utilities, relative to the original working directory
- **Output paths** should use `${hydra:runtime.output_dir}` to save in Hydra's organized structure
- **URLs and S3 paths** are supported for data loading and metadata merging

```yaml
# Example path configuration
input:
  # Local file - relative to COPAIRS_DATA (defaults to current directory)
  path: "${oc.env:COPAIRS_DATA,.}/input/data.csv"

  # URL alternative - works unchanged (only one `path` key may be active;
  # duplicate keys in a YAML mapping are invalid):
  # path: "https://example.com/data.parquet"

output:
  # Output directory and base name for all files
  directory: "${hydra:runtime.output_dir}"
  name: "activity"  # Creates: activity_ap_scores.csv, activity_map_results.csv, activity_map_plot.png
```

### Hydra Output Directory

The example configs demonstrate a project-based organization:

1. **LINCS analyses** (dependent workflow):
```yaml
# Activity analysis
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/lincs/shared/activity

# Consistency analysis
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/lincs/shared/consistency
```
- **Important**: Each analysis uses a nested subdirectory (`activity/` and `consistency/`)
- This prevents Hydra runtime files from being overwritten between runs
- Consistency analysis can still reference activity results via `../activity/activity_map_results.csv`
- The shared parent directory maintains the dependency relationship

2. **JUMP analyses** (independent runs):
```yaml
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/jump-target2/${now:%Y-%m-%d}/${now:%H-%M-%S}
```
- Used by `example_activity_jump_target2.yaml`
- Timestamped subdirectories preserve results from each run
- Better for experiments and parameter sweeps

This creates a clean structure:
```
output/
├── lincs/
│   └── shared/                    # LINCS workflow parent directory
│       ├── activity/              # Activity analysis outputs
│       │   ├── .hydra/            # Hydra runtime files preserved
│       │   ├── activity_ap_scores.csv
│       │   ├── activity_map_results.csv
│       │   └── activity_map_plot.png
│       └── consistency/           # Consistency analysis outputs
│           ├── .hydra/            # Separate Hydra runtime files
│           ├── consistency_ap_scores.csv
│           ├── consistency_map_results.csv
│           └── consistency_map_plot.png
└── jump-target2/
    ├── 2024-01-10/                # JUMP experiment runs
    │   └── 14-23-45/
    └── 2024-01-11/
        └── 09-15-30/
```

All configs use `chdir: false` to stay in the original directory for easier debugging.

```yaml
# Required sections
input:
  path: "data.csv"  # or .parquet, URLs, S3 paths

  # For large parquet files - filter BEFORE loading into memory:
  # use_lazy_filter: true
  # filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  # columns: ["Metadata_col1", "feature_1", "feature_2"]  # optional

# Optional sections
preprocessing:
  steps:
    # Standard filtering - happens AFTER data is loaded:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax

average_precision:
  params:
    pos_sameby: ["Metadata_compound"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_compound"]

output:
  directory: "${hydra:runtime.output_dir}"
  name: "analysis"  # Base name for outputs

mean_average_precision:
  params:
    sameby: ["Metadata_compound"]
    null_size: 10000  # Typically 10000-100000
    threshold: 0.05
    seed: 0
```

## Preprocessing Steps

- `filter`: Filter rows with pandas query
- `dropna`: Remove rows with NaN
- `aggregate_replicates`: Median aggregation by group
- `merge_metadata`: Join external CSV
- `split_multilabel`: Split pipe-separated values
- See the `copairs_runner.py` docstring for the complete list
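A sketch of how several of these steps chain in a config. Parameter names such as `group_by` and `column` below are illustrative assumptions; the `copairs_runner.py` docstring is the authoritative reference:

```yaml
preprocessing:
  steps:
    - type: dropna
    - type: aggregate_replicates
      params:
        group_by: ["Metadata_compound", "Metadata_dose"]  # median per group
    - type: split_multilabel
      params:
        column: "Metadata_target"  # pipe-separated values -> one row per value
```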

## Examples

- `configs/example_activity_lincs.yaml`: Phenotypic activity
- `configs/example_consistency_lincs.yaml`: Target consistency

Run all examples: `./run_examples.sh`

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on adding preprocessing steps.

### Example Output

The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:

**Phenotypic Activity Assessment:**
![Activity Plot](examples/example_activity_plot.png)

**Phenotypic Consistency (Target-based):**
![Consistency Plot](examples/example_consistency_plot.png)