@shailja-thakur

Summary

This PR adds the ability to test the robustness of Mellea m-programs by integrating with BenchDrift's semantic variation generation and evaluation pipeline. Users can now systematically evaluate how consistently their m-programs answer semantically equivalent variations of a problem.

What This Enables

  • Generate semantic variations of a problem (different phrasings, same meaning)
  • Execute m-programs on all variations to measure consistency
  • Measure pass rates, drift patterns, and identify failure modes
  • Understand where m-programs break and where they perform well

Key Components

  • run_benchdrift_pipeline(): Orchestrates BenchDrift's 3-stage pipeline (generate variations → execute m-program → evaluate)
  • MelleaModelClientAdapter: Bridges Mellea m-programs to BenchDrift's test framework
  • analyze_robustness_from_probes(): Computes robustness metrics from test results
  • Configurable variation strategies (generic, cluster-based, persona-based, long-context)
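
As a rough illustration of the kind of metrics listed above, the sketch below computes a pass rate and two drift rates from a list of probe results. The `Probe` record, its field names, and `summarize_probes` are hypothetical and are not the actual `analyze_robustness_from_probes()` implementation. "Drift" is assumed here to mean that the outcome on a variation differs from the outcome on the original problem.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Probe:
    """One semantic variation and how the m-program fared on it.

    Hypothetical record for illustration; BenchDrift's real output format may differ.
    """
    variation_type: str    # e.g. "generic", "persona"
    baseline_passed: bool  # result on the original problem
    variant_passed: bool   # result on this variation


def summarize_probes(probes: List[Probe]) -> Dict[str, float]:
    """Compute simple robustness metrics over a non-empty list of probes."""
    if not probes:
        raise ValueError("at least one probe is required")
    total = len(probes)
    passed = sum(p.variant_passed for p in probes)
    # Negative drift: the baseline was right, the variation broke it.
    negative = sum(p.baseline_passed and not p.variant_passed for p in probes)
    # Positive drift: the baseline was wrong, the variation fixed it.
    positive = sum(p.variant_passed and not p.baseline_passed for p in probes)
    return {
        "pass_rate": passed / total,
        "negative_drift_rate": negative / total,
        "positive_drift_rate": positive / total,
    }


probes = [
    Probe("generic", True, True),
    Probe("generic", True, False),
    Probe("persona", True, True),
    Probe("long_context", False, True),
]
metrics = summarize_probes(probes)
```

Separating negative from positive drift is a design choice in this sketch: a single pass rate would hide cases where rephrasing helps on some problems and hurts on others.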

…ting

- Add variation_types parameter to run_benchdrift_pipeline() to allow users to customize which semantic variation types to generate (generic, cluster_variations, persona, long_context)
- Update test/1_test_robustness_testing.py to demonstrate variation_types usage
- Add docs/ROBUSTNESS_TESTING.md with comprehensive documentation for robustness testing workflow
- Enables fine-grained control over robustness testing configurations

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <[email protected]>
Collaborator

@delucs21 delucs21 left a comment

Reviewed with some necessary changes before merging

### Step 1: Install BenchDrift
Install BenchDrift from source (required for robustness testing pipeline):
```bash
git clone https://github.com/ritterinvest/BenchDrift.git
```
Collaborator

This repo returns 404. I proceeded with testing using the internal repo, but BenchDrift needs to be in a publicly accessible repo.

Author

Updated to IBM internal repo (https://github.ibm.com/Granite-debug/BenchDrift) with access note. BenchDrift is in approval process for public release - will update URL once available.

Contributor

Converting this PR to draft. Please go ahead and mark it as ready for review once BenchDrift's OSS process is approved.

We should also remove references to other IBM infrastructure (e.g., RITS) before merging this PR.

  - Rename test file: 1_test_robustness_testing.py → test_mprogram_robustness.py
  - Rename adapter: mellea_model_client_adapter.py → benchdrift_model_client_adapter.py
  - Add import os to benchdrift_runner.py
  - Remove hardcoded max_workers parameter
  - Refactor config: users load from YAML in test script
  - Add config/ directory with benchdrift_config.yaml
  - Update BenchDrift repo URLs to IBM internal
  - Add ROBUSTNESS_INSTALLATION_GUIDE.md
  - Update all documentation references
@shailja-thakur
Author

  • Renamed 1_test_robustness_testing.py → test_mprogram_robustness.py to clarify that it tests m-programs, not BenchDrift
  • Renamed mellea_model_client_adapter.py → benchdrift_model_client_adapter.py
  • Updated all imports and docs accordingly

Configuration changes:

  • Removed hardcoded parameters (max_workers, model names, etc.) from run_benchdrift_pipeline()
  • Moved all config to config/benchdrift_config.yaml with inline documentation for 19 parameters
  • Test script now loads the YAML and passes it via config_overrides, which makes the config user-editable
  • Runner validates config is provided rather than using defaults
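
The load-then-validate pattern described above could look roughly like the following sketch. The helper names `load_config` and `validate_config` and the `REQUIRED_KEYS` tuple are hypothetical, and treating `variation_types` and `max_workers` as YAML keys is an assumption; the actual runner and config schema may differ.

```python
from typing import Any, Dict, Optional

# Hypothetical subset of required keys; the real config file documents 19 parameters.
REQUIRED_KEYS = ("variation_types", "max_workers")


def load_config(path: str) -> Dict[str, Any]:
    """Load a YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so validate_config has no third-party dependency

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh) or {}


def validate_config(config_overrides: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Fail fast instead of silently falling back to built-in defaults."""
    if not config_overrides:
        raise ValueError("config_overrides is required; load it from YAML first")
    missing = [key for key in REQUIRED_KEYS if key not in config_overrides]
    if missing:
        raise ValueError(f"missing config keys: {', '.join(missing)}")
    return dict(config_overrides)


# A test script would call load_config("config/benchdrift_config.yaml");
# an inline dict stands in for the file here.
config = validate_config({"variation_types": ["generic", "persona"], "max_workers": 4})
```

Raising on a missing config, rather than substituting defaults, keeps every run traceable to an explicit, user-editable file.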

Config files:

  • Added config/benchdrift_config.yaml - all BenchDrift parameters with comments
  • Added config/model_config.yaml - model mappings

Documentation:

  • Updated ROBUSTNESS_TESTING.md with config loading pattern and examples
  • Added explanation for Callable[[str, Dict[str, Any]], Any] type hint (second param for future extensibility)
  • Created INSTALLATION_GUIDE.md with setup steps
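
To make the `Callable[[str, Dict[str, Any]], Any]` type hint concrete, here is a minimal sketch of a function that satisfies it. The alias `MProgram`, the toy program, and `run_on_variations` are hypothetical names; the second parameter is shown as an unused options dict, in line with the note that it exists for future extensibility.

```python
from typing import Any, Callable, Dict, List

# The signature from the type hint: (problem text, extra options) -> answer.
MProgram = Callable[[str, Dict[str, Any]], Any]


def word_count_program(problem: str, options: Dict[str, Any]) -> Any:
    """Toy stand-in for an m-program; `options` is accepted but unused."""
    return len(problem.split())


def run_on_variations(program: MProgram, variations: List[str]) -> List[Any]:
    """Call the program once per variation, passing an empty options dict."""
    return [program(text, {}) for text in variations]


answers = run_on_variations(
    word_count_program,
    ["What is 2 + 2?", "Compute the sum of 2 and 2."],
)
```

Any callable with this two-argument shape can be passed in, so new options can later be threaded through the dict without changing the signature.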

@nrfulton nrfulton marked this pull request as draft January 9, 2026 14:19