49 commits
- `328d232` feat(copairs_runner): add configurable YAML-driven runner for copairs… (shntnu, Jul 9, 2025)
- `ec785de` Update README.md (shntnu, Jul 9, 2025)
- `623b8a0` feat(copairs_runner): add preprocessing step and config-relative path… (shntnu, Jul 9, 2025)
- `67654dc` feat(copairs_runner): add lazy loading and standardize config naming (shntnu, Jul 9, 2025)
- `e05de00` fix: format (shntnu, Jul 9, 2025)
- `4f31ff1` docs(copairs_runner): update CLAUDE.md and README.md with current con… (shntnu, Jul 9, 2025)
- `8a22d4b` docs(copairs_runner): improve README configuration example (shntnu, Jul 9, 2025)
- `5d561f9` docs(copairs_runner): clarify lazy vs preprocessing filtering (shntnu, Jul 9, 2025)
- `8519b69` docs(copairs_runner): add CONTRIBUTING.md with preprocessing guidelines (shntnu, Jul 9, 2025)
- `1509b5e` docs(copairs_runner): emphasize copairs as end-stage analysis (shntnu, Jul 9, 2025)
- `12e3b81` Update libs/copairs_runner/run_examples.sh (shntnu, Jul 9, 2025)
- `9e793ea` fix(copairs_runner): address PR review comments (shntnu, Jul 9, 2025)
- `7537e9b` refactor(copairs_runner): switch to CWD-relative path resolution (shntnu, Jul 10, 2025)
- `050d3d2` refactor(copairs_runner): migrate from PyYAML to OmegaConf (shntnu, Jul 10, 2025)
- `0aeb68b` feat(copairs_runner): migrate from argparse to Hydra (shntnu, Jul 10, 2025)
- `e0960ab` feat(copairs_runner): configure Hydra to use existing output directory (shntnu, Jul 10, 2025)
- `6e10d9e` docs(copairs_runner): update documentation for Hydra migration (shntnu, Jul 10, 2025)
- `ae16e82` fix(copairs_runner): cleanup (shntnu, Jul 10, 2025)
- `3c8d354` fix(copairs_runner): typo (shntnu, Jul 10, 2025)
- `69caf31` refactor(copairs_runner): simplify code and improve documentation (shntnu, Jul 10, 2025)
- `d7ed157` feat(copairs_runner): implement Hydra best practices for path handling (shntnu, Jul 10, 2025)
- `832edaf` refactor(copairs_runner): implement unified output handling with fixe… (shntnu, Jul 10, 2025)
- `ead89f4` refactor(copairs_runner): rename 'data' config section to 'input' for… (shntnu, Jul 10, 2025)
- `16abb52` docs(copairs_runner): add design note on preprocessing list configura… (shntnu, Jul 10, 2025)
- `303a982` fix(copairs_runner): prevent Hydra runtime file overwrites in shared … (shntnu, Jul 10, 2025)
- `2d6d2b0` fix(copairs_runner): add observed=True to groupby and improve logging (shntnu, Jul 10, 2025)
- `992cdf7` feat(copairs_runner): add JUMP CPCNN example configs demonstrating sh… (shntnu, Jul 10, 2025)
- `61bd877` docs(copairs_runner): add ROADMAP.md with AI-assisted Hydra feature p… (shntnu, Jul 10, 2025)
- `e6e36a1` feat(copairs_runner): convert to installable package while keeping st… (shntnu, Aug 4, 2025)
- `e0c34c7` fix(copairs_runner): add missing __init__.py and fix wheel build config (shntnu, Aug 4, 2025)
- `fd1071e` fix: add quotes (shntnu, Aug 4, 2025)
- `a8cf8e0` refactor(copairs_runner): use dynamic versioning from __init__.py (shntnu, Aug 4, 2025)
- `7054308` fix(copairs_runner): remove hardcoded config path for package compati… (shntnu, Aug 4, 2025)
- `ffe737c` docs(copairs_runner): add concise help message for CLI usage (shntnu, Aug 4, 2025)
- `92f0b91` refactor(copairs_runner): finalize package structure and update docum… (shntnu, Aug 5, 2025)
- `b0193fb` fix(copairs_runner): use explicit .loc accessor to avoid pandas Setti… (shntnu, Aug 5, 2025)
- `80bf20d` fix(copairs_runner): add .copy() to prevent SettingWithCopyWarning wh… (shntnu, Aug 5, 2025)
- `b848ef1` feat(copairs_runner): add DuckDB support to merge_metadata preprocess… (shntnu, Aug 13, 2025)
- `7f5eb86` fix: empty string is None (shntnu, Aug 14, 2025)
- `f39f010` refactor(copairs_runner): simplify path handling using Hydra best pra… (shntnu, Aug 15, 2025)
- `70c7f29` feat(copairs_runner): add Parquet output format support (shntnu, Aug 26, 2025)
- `b5ec577` test: turn the run_examples.sh script into a minimal integration test… (shntnu, Aug 26, 2025)
- `e98c3a8` fix: drop claim about test passing because it is fairly loose (does n… (shntnu, Aug 26, 2025)
- `108863d` fix(copairs_runner): remove hardcoded threshold and add Parquet suppo… (shntnu, Aug 27, 2025)
- `6417bae` feat(copairs_runner): add support for modular preprocessing sections (shntnu, Aug 27, 2025)
- `8523bf0` Set read_only=true to allow concurrent reads (shntnu, Aug 27, 2025)
- `b93522b` feat(copairs_runner): add support for normalized mAP visualization (shntnu, Sep 7, 2025)
- `e322e23` fix(copairs_runner): fix font (shntnu, Sep 8, 2025)
- `48f744c` fix(copairs_runner): clarify (shntnu, Sep 8, 2025)
9 changes: 9 additions & 0 deletions libs/copairs_runner/.gitignore
@@ -0,0 +1,9 @@
input/
output/
.claude/settings.local.json
dist/
*.egg-info/
__pycache__/
.pytest_cache/
.ruff_cache/
.DS_Store
105 changes: 105 additions & 0 deletions libs/copairs_runner/CLAUDE.md
@@ -0,0 +1,105 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Overview

copairs_runner is a configurable Python script for running copairs analyses on cell painting data. It's part of a larger monorepo focused on morphological profiling and drug discovery through cellular imaging.

**For usage, configuration, and examples, see [README.md](README.md).**

## Development Context

### Key Development Commands
```bash
# Lint and format (monorepo standard)
uvx ruff check src/copairs_runner/copairs_runner.py --fix
uvx ruff format src/copairs_runner/copairs_runner.py

# Run tests (when implemented)
pytest tests/

# Test changes
export COPAIRS_DATA=. COPAIRS_OUTPUT=.
bash run_examples.sh
```

## Architecture Decisions

### Package Design
- **src/copairs_runner/copairs_runner.py** maintains single-file logic with inline dependencies (PEP 723)
- Now packaged for easy installation via `uv add`
- Supports both standalone script execution and installed package usage
- Hydra-based configuration for flexibility without code changes

### Key Design Patterns
1. **Fixed Output Pattern**: Always saves 3 files per analysis (ap_scores, map_results, map_plot)
2. **Dictionary-based Results**: `save_results()` takes a dict for easy extension
3. **Preprocessing Pipeline**: Each step is a method `_preprocess_{type}` for consistency
4. **Path Resolution**: Handles local files, URLs, and S3 uniformly via `resolve_path()`
5. **Nested Subdirectory Pattern**: For dependent workflows (e.g., LINCS), each analysis uses its own subdirectory within a shared parent to prevent Hydra runtime file overwrites while maintaining predictable relative paths
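The `_preprocess_{type}` convention in pattern 3 can be sketched as a small dispatch loop. This is a minimal illustration with assumed names (`PreprocessSketch` is not the runner's actual class), not the real implementation:

```python
import pandas as pd


class PreprocessSketch:
    """Minimal sketch of the `_preprocess_{type}` dispatch convention."""

    def _preprocess_filter(self, df: pd.DataFrame, params: dict) -> pd.DataFrame:
        # Pandas query syntax, applied after data is loaded
        return df.query(params["query"])

    def preprocess_data(self, df: pd.DataFrame, steps: list) -> pd.DataFrame:
        # Each step's 'type' selects the matching _preprocess_* method
        for step in steps:
            method = getattr(self, f"_preprocess_{step['type']}")
            df = method(df, step.get("params", {}))
        return df


runner = PreprocessSketch()
df = pd.DataFrame({"Metadata_dose": [0.05, 0.5, 1.0]})
result = runner.preprocess_data(
    df, [{"type": "filter", "params": {"query": "Metadata_dose > 0.1"}}]
)
```

The `getattr` lookup is what makes each step a method named `_preprocess_{type}`: adding a new step type only requires adding a method following the naming convention.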

## Monorepo Context

This project follows monorepo standards:
- **uv** for package management (not Poetry)
- **ruff** for formatting/linting (run via `uvx`)
- **pytest** for testing (target >90% coverage)
- **numpy** documentation style
- Conventional commits

### Current Limitations
- No test suite yet (priority for future work)
- Single-file design may need refactoring if complexity grows
- Fixed 3-file output pattern (by design for simplicity)

## Implementation Guidelines

### Adding New Preprocessing Steps
```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description.

    Must follow this exact signature. Access params via dict lookup.
    Log the operation with: logger.info(f"Message: {result}")
    """
    # Implementation
    return df
```

Then update the docstring in `preprocess_data()` to document the new step.

### Important Implementation Details

1. **Lazy Loading vs Preprocessing**:
- `input.filter_query` uses SQL syntax (polars) - happens BEFORE loading
- `preprocessing.filter` uses pandas query syntax - happens AFTER loading
- This distinction is critical for large datasets

2. **Error Handling**:
- Config validation happens in `_validate_config()`
- Missing required params raise `ValueError` with clear messages
- Use `params.get("key", default)` for optional parameters

3. **Logging Patterns**:
- Always log row counts after filtering operations
- Log first 5 columns when loading data for verification
- Use `logger.info()` not print()

### Testing Approach (when implemented)
- Unit test each preprocessing step independently
- Integration test full pipeline with small test data
- Test config validation edge cases
- Mock external data sources (URLs, S3)

## Key Gotchas

1. **Environment Variables**: Must be set before running if configs use `${oc.env:VAR}`
2. **Memory Usage**: Use lazy loading for large parquet files to avoid OOM
3. **Path Resolution**: All paths are relative to where you run the script, not the config file location
4. **Shared Directory Structure**: When creating dependent analyses (where one reads another's output), always use nested subdirectories (e.g., `shared/activity/` and `shared/consistency/`) to prevent Hydra runtime files from being overwritten between runs

## Collaboration Guidelines

- Never coauthor commits with Claude
- Always use `uv run python` instead of `python3`
57 changes: 57 additions & 0 deletions libs/copairs_runner/CONTRIBUTING.md
@@ -0,0 +1,57 @@
# Contributing to copairs_runner

## Preprocessing Steps

The preprocessing pipeline intentionally provides a minimal DSL to avoid recreating pandas/SQL in YAML. Before adding new steps, consider whether users should handle the transformation externally.

**Important context**: Copairs analysis typically happens at the end of a morphological profiling pipeline. By this stage, your data should already be:
- Quality-controlled and normalized
- Aggregated to appropriate levels
- Filtered for relevant samples
- Properly annotated with metadata

If you find yourself needing extensive preprocessing here, it likely indicates issues with your upstream pipeline.

### Alternatives to New Steps

1. **Lazy filtering** - For large parquet files, use polars' SQL syntax before loading:
```yaml
input:
  use_lazy_filter: true
  filter_query: "Metadata_PlateType == 'TARGET2'"
```

2. **External preprocessing** - Complex transformations belong in Python/SQL scripts, not YAML configs

3. **Composition** - Combine existing steps rather than creating specialized ones
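For example, a filter-then-clean transformation composes from existing steps rather than needing a bespoke one (a sketch; check the `preprocess_data()` docstring for exact parameter names):

```yaml
preprocessing:
  steps:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"
    - type: dropna
```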

### When to Add a Step

Add a step only if it:
- Integrates with copairs-specific functionality (e.g., `apply_assign_reference`)
- Handles last-mile transformations specific to copairs analysis
- Requires runner context (resolved paths, metadata patterns)
- Has been requested by multiple users

Remember: needing complex preprocessing at this stage often indicates upstream processing gaps.

### Implementation

```python
def _preprocess_<step_name>(self, df: pd.DataFrame, params: Dict[str, Any]) -> pd.DataFrame:
    """One-line description."""
    # Implementation
    logger.info("Describe what the step did")
    return df
```

Update the `preprocess_data()` docstring with parameters and add a usage example.

### Design Constraints

- Keep implementations under ~10 lines
- Single responsibility per step
- Clear parameter validation
- Informative error messages

The goal is providing just enough convenience without creating a parallel data manipulation framework. Most preprocessing should happen before data reaches this runner.
181 changes: 181 additions & 0 deletions libs/copairs_runner/README.md
@@ -0,0 +1,181 @@
# Copairs Runner

YAML-driven runner for [copairs](https://github.com/broadinstitute/copairs).

## Installation

```bash
# Install as package
uv add "git+https://github.com/broadinstitute/monorepo.git@copairs-runner#subdirectory=libs/copairs_runner"
```

## Usage

```bash
# Set environment variables if used in config
export COPAIRS_DATA=. COPAIRS_OUTPUT=.

# As installed package
uv run copairs-runner --config-dir configs --config-name example_activity_lincs

# Or run standalone script directly from GitHub (includes inline dependencies)
SCRIPT_URL="https://raw.githubusercontent.com/broadinstitute/monorepo/copairs-runner/libs/copairs_runner/src/copairs_runner/copairs_runner.py"
uv run "$SCRIPT_URL" --config-name example_activity_lincs

# Override parameters
uv run copairs-runner --config-dir configs --config-name example_activity_lincs mean_average_precision.params.null_size=50000
```

### Output Files

Each analysis run generates exactly three files:
- `{name}_ap_scores.csv` - Individual average precision scores
- `{name}_map_results.csv` - Mean average precision with p-values
- `{name}_map_plot.png` - Scatter plot of mAP vs -log10(p-value)

## Configuration

### Path Resolution

The runner uses Hydra's best practices for path handling:

- **Input paths** are resolved using Hydra utilities, relative to the original working directory
- **Output paths** should use `${hydra:runtime.output_dir}` to save in Hydra's organized structure
- **URLs and S3 paths** are supported for data loading and metadata merging

```yaml
# Example path configuration
input:
  # Local file - relative to COPAIRS_DATA (defaults to current directory)
  path: "${oc.env:COPAIRS_DATA,.}/input/data.csv"

  # URL alternative - works unchanged (only one `path` key may be active;
  # duplicate keys in a YAML mapping are invalid):
  # path: "https://example.com/data.parquet"

output:
  # Output directory and base name for all files
  directory: "${hydra:runtime.output_dir}"
  name: "activity"  # Creates: activity_ap_scores.csv, activity_map_results.csv, activity_map_plot.png
```

### Hydra Output Directory

The example configs demonstrate a project-based organization:

1. **LINCS analyses** (dependent workflow):
```yaml
# Activity analysis
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/lincs/shared/activity

# Consistency analysis
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/lincs/shared/consistency
```
- **Important**: Each analysis uses a nested subdirectory (`activity/` and `consistency/`)
- This prevents Hydra runtime files from being overwritten between runs
- Consistency analysis can still reference activity results via `../activity/activity_map_results.csv`
- The shared parent directory maintains the dependency relationship

2. **JUMP analyses** (independent runs):
```yaml
hydra:
  run:
    dir: ${oc.env:COPAIRS_OUTPUT}/output/jump-target2/${now:%Y-%m-%d}/${now:%H-%M-%S}
```
- Used by `example_activity_jump_target2.yaml`
- Timestamped subdirectories preserve results from each run
- Better for experiments and parameter sweeps

This creates a clean structure:
```
output/
├── lincs/
│   └── shared/                    # LINCS workflow parent directory
│       ├── activity/              # Activity analysis outputs
│       │   ├── .hydra/            # Hydra runtime files preserved
│       │   ├── activity_ap_scores.csv
│       │   ├── activity_map_results.csv
│       │   └── activity_map_plot.png
│       └── consistency/           # Consistency analysis outputs
│           ├── .hydra/            # Separate Hydra runtime files
│           ├── consistency_ap_scores.csv
│           ├── consistency_map_results.csv
│           └── consistency_map_plot.png
└── jump-target2/
    ├── 2024-01-10/                # JUMP experiment runs
    │   └── 14-23-45/
    └── 2024-01-11/
        └── 09-15-30/
```

All configs use `chdir: false` to stay in the original directory for easier debugging.

```yaml
# Required sections
input:
  path: "data.csv"  # or .parquet, URLs, S3 paths

  # For large parquet files - filter BEFORE loading into memory:
  # use_lazy_filter: true
  # filter_query: "Metadata_PlateType == 'TARGET2'"  # SQL syntax
  # columns: ["Metadata_col1", "feature_1", "feature_2"]  # optional

# Optional sections
preprocessing:
  steps:
    # Standard filtering - happens AFTER data is loaded:
    - type: filter
      params:
        query: "Metadata_dose > 0.1"  # pandas query syntax

average_precision:
  params:
    pos_sameby: ["Metadata_compound"]
    pos_diffby: []
    neg_sameby: []
    neg_diffby: ["Metadata_compound"]

output:
  directory: "${hydra:runtime.output_dir}"
  name: "analysis"  # Base name for outputs

mean_average_precision:
  params:
    sameby: ["Metadata_compound"]
    null_size: 10000  # Typically 10000-100000
    threshold: 0.05
    seed: 0
```

## Preprocessing Steps

- `filter`: Filter rows with pandas query
- `dropna`: Remove rows with NaN
- `aggregate_replicates`: Median aggregation by group
- `merge_metadata`: Join external CSV
- `split_multilabel`: Split pipe-separated values
- See the `copairs_runner.py` docstring for the complete list
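A sketch of how several of these steps chain in a config. Parameter names such as `group_by` and `column` below are illustrative assumptions; the `copairs_runner.py` docstring is the authoritative reference:

```yaml
preprocessing:
  steps:
    - type: dropna
    - type: aggregate_replicates
      params:
        group_by: ["Metadata_compound", "Metadata_dose"]  # median per group
    - type: split_multilabel
      params:
        column: "Metadata_target"  # pipe-separated values -> one row per value
```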

## Examples

- `configs/example_activity_lincs.yaml`: Phenotypic activity
- `configs/example_consistency_lincs.yaml`: Target consistency

Run all examples: `./run_examples.sh`

## Contributing

See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on adding preprocessing steps.

### Example Output

The runner generates scatter plots showing mean average precision (mAP) vs statistical significance:

**Phenotypic Activity Assessment:**
![Activity Plot](examples/example_activity_plot.png)

**Phenotypic Consistency (Target-based):**
![Consistency Plot](examples/example_consistency_plot.png)