Add robustness testing integration with BenchDrift for m-programs #15

shailja-thakur · 2025-12-13T07:56:18Z

Summary

This PR adds the ability to test the robustness of Mellea m-programs by integrating with BenchDrift, a semantic variation generation and evaluation pipeline. Users can now systematically evaluate how consistently their m-programs answer semantically equivalent variations of a problem.

What This Enables

Generate semantic variations of a problem (different phrasings, same meaning)
Execute m-programs on all variations to measure consistency
Measure pass rates, drift patterns, and identify failure modes
Understand where m-programs break and where they perform well
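As a rough illustration of the kind of measurement listed above, the sketch below computes a pass rate over variations and flags "drift" cases. The `ProbeResult` record, the `summarize` helper, and the definition of drift used here (the original problem passes but a variation fails) are assumptions made for illustration; they are not taken from BenchDrift or from this PR's code.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Hypothetical record for one m-program run (not BenchDrift's schema)."""
    variation: str     # prompt text sent to the m-program
    is_baseline: bool  # True for the original, unmodified problem
    passed: bool       # whether the answer matched the expected answer


def summarize(results: list[ProbeResult]) -> dict[str, object]:
    """Compute a pass rate over variations and list drifted variations."""
    baseline = [r for r in results if r.is_baseline]
    variants = [r for r in results if not r.is_baseline]

    # The baseline counts as passing only if every baseline run passed.
    baseline_passed = bool(baseline) and all(r.passed for r in baseline)
    pass_rate = sum(r.passed for r in variants) / len(variants) if variants else 0.0

    # Assumed definition of drift: baseline passes, but this variation fails.
    drifted = [r.variation for r in variants if baseline_passed and not r.passed]
    return {
        "baseline_passed": baseline_passed,
        "pass_rate": pass_rate,
        "drifted": drifted,
    }


results = [
    ProbeResult("What is 2 + 2?", is_baseline=True, passed=True),
    ProbeResult("Compute the sum of two and two.", is_baseline=False, passed=True),
    ProbeResult("If I have 2 apples and get 2 more, how many?", is_baseline=False, passed=False),
]
summary = summarize(results)
print(summary)
```

Separating the baseline from the variations keeps "the m-program was wrong to begin with" distinct from "the m-program is sensitive to phrasing".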

Key Components

run_benchdrift_pipeline(): Orchestrates BenchDrift's 3-stage pipeline (generate variations → execute m-program → evaluate)
MelleaModelClientAdapter: Bridges Mellea m-programs to BenchDrift's test framework
analyze_robustness_from_probes(): Computes robustness metrics from test results
Configurable variation strategies (generic, cluster-based, persona-based, long-context)
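The three-stage flow named above (generate variations → execute m-program → evaluate) can be sketched as follows. This is a toy stand-in, not the actual `run_benchdrift_pipeline()` implementation: `generate_variations` merely tags the prompt instead of paraphrasing it, exact string match stands in for evaluation, and the m-program is modeled as a plain one-argument function for brevity. All function names in the block are hypothetical.

```python
from typing import Callable, Dict, List


def generate_variations(problem: str, variation_types: List[str]) -> List[str]:
    # Stage 1 stand-in: a real generator would paraphrase the problem;
    # here each variation type simply tags the original text.
    return [f"[{vt}] {problem}" for vt in variation_types]


def run_pipeline_sketch(
    problem: str,
    expected: str,
    m_program: Callable[[str], str],
    variation_types: List[str],
) -> List[Dict[str, object]]:
    # Stage 1: produce one variation per requested type.
    variations = generate_variations(problem, variation_types)
    # Stage 2: execute the m-program on every variation.
    answers = [m_program(v) for v in variations]
    # Stage 3: evaluate each answer (exact match, as a simplification).
    return [
        {"variation": v, "answer": a, "passed": a.strip() == expected}
        for v, a in zip(variations, answers)
    ]


def toy_m_program(prompt: str) -> str:
    # Toy m-program that answers correctly unless the persona tag is present.
    return "5" if "[persona]" in prompt else "4"


probes = run_pipeline_sketch(
    "What is 2 + 2?", "4", toy_m_program, ["generic", "persona", "long_context"]
)
for probe in probes:
    print(probe)
```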

…ting

- Add variation_types parameter to run_benchdrift_pipeline() to allow users to customize which semantic variation types to generate (generic, cluster_variations, persona, long_context)
- Update test/1_test_robustness_testing.py to demonstrate variation_types usage
- Add docs/ROBUSTNESS_TESTING.md with comprehensive documentation for robustness testing workflow
- Enables fine-grained control over robustness testing configurations

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <[email protected]>

delucs21

Reviewed; some changes are necessary before merging.

delucs21 · 2025-12-19T18:13:01Z

docs/ROBUSTNESS_TESTING.md

+### Step 1: Install BenchDrift
+Install BenchDrift from source (required for robustness testing pipeline):
+```bash
+git clone https://github.com/ritterinvest/BenchDrift.git


This repo returns 404. I proceeded with testing using the internal repo, but BenchDrift needs to be in a publicly accessible repo.

Updated to the IBM internal repo (https://github.ibm.com/Granite-debug/BenchDrift) with an access note. BenchDrift is in the approval process for public release; I will update the URL once it is available.

Converting this PR to draft. Please go ahead and mark it as ready for review whenever BenchDrift's OSS process is approved.

We should also remove references to other IBM infrastructure (e.g., RITS) before merging this PR.

test/test_mprogram_robustness.py

mellea_contribs/tools/benchdrift_model_client_adapter.py

mellea_contribs/tools/benchdrift_runner.py

- Rename test file: 1_test_robustness_testing.py → test_mprogram_robustness.py
- Rename adapter: mellea_model_client_adapter.py → benchdrift_model_client_adapter.py
- Add import os to benchdrift_runner.py
- Remove hardcoded max_workers parameter
- Refactor config: users load from YAML in test script
- Add config/ directory with benchdrift_config.yaml
- Update BenchDrift repo URLs to IBM internal
- Add ROBUSTNESS_INSTALLATION_GUIDE.md
- Update all documentation references

shailja-thakur · 2026-01-01T10:25:50Z

Renamed 1_test_robustness_testing.py → test_mprogram_robustness.py to clarify that it tests m-programs, not BenchDrift
Renamed mellea_model_client_adapter.py → benchdrift_model_client_adapter.py
Updated all imports and docs accordingly

Configuration changes:

Removed hardcoded parameters (max_workers, model names, etc.) from run_benchdrift_pipeline()
Moved all config to config/benchdrift_config.yaml with inline documentation for 19 parameters
Test script now loads the YAML and passes it via config_overrides, which makes the config user-editable
Runner validates config is provided rather than using defaults
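A minimal sketch of the "validate rather than default" behavior described above. The `validate_config` helper and the `REQUIRED_KEYS` tuple are hypothetical; the key names are borrowed from parameters mentioned in this thread (`variation_types`, `max_workers`), and the real list of 19 parameters lives in config/benchdrift_config.yaml. In the actual workflow the dictionary would come from loading that YAML file in the test script; a literal dict is used here to keep the example dependency-free.

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical subset of required keys, for illustration only.
REQUIRED_KEYS = ("variation_types", "max_workers")


def validate_config(config_overrides: Optional[Mapping[str, Any]]) -> Dict[str, Any]:
    """Fail loudly when config is missing instead of silently using defaults."""
    if not config_overrides:
        raise ValueError(
            "config_overrides is required; load it from your YAML config first"
        )
    missing = [key for key in REQUIRED_KEYS if key not in config_overrides]
    if missing:
        raise ValueError(f"missing required config keys: {', '.join(missing)}")
    # Return a plain dict copy so later mutation does not affect the caller.
    return dict(config_overrides)


config = validate_config({"variation_types": ["generic", "persona"], "max_workers": 4})
print(config)

try:
    validate_config(None)
except ValueError as err:
    print(f"rejected: {err}")
```

Rejecting a missing config up front means a forgotten YAML file surfaces as an immediate error rather than as a run with unintended settings.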

Config files:

Added config/benchdrift_config.yaml - all BenchDrift parameters with comments
Added config/model_config.yaml - model mappings

Documentation:

Updated ROBUSTNESS_TESTING.md with config loading pattern and examples
Added explanation for Callable[[str, Dict[str, Any]], Any] type hint (second param for future extensibility)
Created INSTALLATION_GUIDE.md with setup steps
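To illustrate the `Callable[[str, Dict[str, Any]], Any]` type hint mentioned above, here is a minimal sketch of an adapter that wraps such a callable. The class name `AdapterSketch`, the `get_response` method, and the reading of the second argument as a free-form options dictionary are all assumptions; the actual interface of the BenchDrift adapter in this PR may differ.

```python
from typing import Any, Callable, Dict, Optional

# The m-program signature from the docs: a prompt plus a dict reserved
# for future extensibility (its contents are an assumption here).
MProgram = Callable[[str, Dict[str, Any]], Any]


class AdapterSketch:
    """Hypothetical bridge exposing an m-program as a prompt-in, text-out client."""

    def __init__(self, m_program: MProgram, params: Optional[Dict[str, Any]] = None):
        self._m_program = m_program
        self._params = dict(params or {})

    def get_response(self, prompt: str) -> str:
        # Pass a copy of the params so the m-program cannot mutate adapter state.
        result = self._m_program(prompt, dict(self._params))
        # Normalize whatever the m-program returns to a string for evaluation.
        return str(result)


def shout(prompt: str, params: Dict[str, Any]) -> str:
    # Toy m-program: uppercases the prompt and appends an optional suffix.
    return prompt.upper() + params.get("suffix", "")


adapter = AdapterSketch(shout, {"suffix": "!"})
print(adapter.get_response("hello"))
```

Keeping the second parameter in the signature now lets later options be threaded through without changing every m-program's call site.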

delucs21 requested changes Dec 19, 2025

shailja-thakur requested a review from delucs21 January 1, 2026 10:25

nrfulton marked this pull request as draft January 9, 2026 14:19
