Toolkit for running LLM-as-a-Judge evaluations on ITBench root-cause analysis (RCA) agent outputs. Supports the SRE (incident RCA) and FinOps (cost anomaly RCA) evaluation domains, plus a CISO domain that scores Kubernetes compliance solutions with deterministic OPA checks instead of a judge model. For the judge-based domains, the CLI loads ground-truth scenarios, reads agent trial outputs, repairs malformed JSON when possible, scores multiple metrics with a judge model, and aggregates macro statistics.

| Path | Description |
|---|---|
| `itbench_evaluations/__main__.py` | CLI entrypoint (`python -m itbench_evaluations`). |
| `itbench_evaluations/agent.py` | Judge workflow and evaluation logic. |
| `itbench_evaluations/client.py` | OpenAI-compatible judge client and model selection. |
| `itbench_evaluations/loader.py` | Ground-truth and agent output loading. |
| `itbench_evaluations/aggregator.py` | Computes per-incident and overall statistics. |
| `itbench_evaluations/json_fixer.py` | JSON repair utilities for malformed model outputs. |
| `itbench_evaluations/namespace_filter.py` | Optional post-processing helpers for filtering and recalculating metrics. |
| `itbench_evaluations/prompts/main.py` | SRE system prompt and evaluation template. |
| `itbench_evaluations/prompts/finops_main.py` | FinOps system prompt, evaluation template, and parse template. |
| `itbench_evaluations/prompts/` | Per-metric criterion templates (entity, reasoning, proximity, etc.). |
| `itbench_evaluations/data/` | Example incident output hierarchy. |
| `.env.tmpl` | Environment configuration template (copy to `.env`). |
| `pyproject.toml` | Python package configuration and dependencies. |

- Install prerequisites

      uv sync
      # Install huggingface_hub for downloading datasets
      uv pip install huggingface_hub

- Configure environment variables

      cp .env.tmpl .env
      # Edit .env with your actual values

  The judge client reads `.env` via `python-dotenv`. Required variables:

  | Purpose | Keys |
  |---|---|
  | Judge API | `JUDGE_API_KEY` (defaults to `"dummy"` for local endpoints) |
  | Judge model | `JUDGE_MODEL` (defaults to `gpt-4-turbo`) |
  | Optional base URL | `JUDGE_BASE_URL` (OpenAI-compatible endpoint) |

- Download ITBench-Lite dataset

  Download benchmark scenarios from Hugging Face (https://huggingface.co/datasets/ibm-research/ITBench-Lite):

      # Download all scenarios (50 total: 35 SRE + 15 FinOps)
      uv run hf download \
        ibm-research/ITBench-Lite \
        --repo-type dataset \
        --local-dir ./ITBench-Lite

      # Or download only specific scenarios (faster for testing)
      uv run hf download \
        ibm-research/ITBench-Lite \
        --repo-type dataset \
        --include "snapshots/sre/v0.2-*/Scenario-1/*" \
        --include "snapshots/sre/v0.2-*/Scenario-2/*" \
        --local-dir ./ITBench-Lite

- Prepare data

  Ground truth supports:

  - A single JSON/YAML file with one incident (must include `id`, or the filename is used).
  - A JSON/YAML list of incidents (each must include `id`).
  - A directory of scenario subfolders containing `ground_truth.yaml|yml|json` (or `gt.yaml|json`).
  - A consolidated `ground_truths.json` inside a directory.

  Agent outputs should be structured as:

      agent-outputs/
        <incident_id>/
          <trial_number>/
            outputs/agent_output.json   # or outputs/agent_response.json
            # or agent_output.json / agent_response.json

  Supported incident folder naming is case-insensitive and accepts patterns like `scenario-i<id>`, `scenario_<id>`, `scenario-<id>`, `scenario<id>`, and `incident-<id>`.

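To make the folder-naming rules concrete, here is a minimal sketch of how an incident ID could be extracted from a directory name. The function name `incident_id_from_dirname` and the regular expression are hypothetical and are not taken from `loader.py`; the sketch also assumes numeric IDs, which the README does not state.

```python
import re
from typing import Optional

# Hypothetical pattern for the documented folder names (not the loader's actual code).
# Assumption: incident IDs are numeric. "scenario-i" is tried before the generic
# "scenario" prefix so that names like "scenario-i12" are handled.
_INCIDENT_DIR = re.compile(
    r"^(?:scenario-i|scenario[-_]?|incident-)(?P<id>\d+)$",
    re.IGNORECASE,
)


def incident_id_from_dirname(name: str) -> Optional[str]:
    """Return the incident ID encoded in a folder name, or None if it does not match."""
    match = _INCIDENT_DIR.match(name.strip())
    return match.group("id") if match else None


if __name__ == "__main__":
    for folder in ["Scenario-1", "scenario_2", "scenario3", "INCIDENT-4", "notes"]:
        print(folder, "->", incident_id_from_dirname(folder))
```

Returning `None` for unrecognized names lets a caller skip stray directories instead of failing the whole run; whether the real loader does this is not documented here.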
Run an SRE evaluation (the default domain):

    uv run itbench-evaluations \
      --ground-truth path/to/ground_truths.json \
      --outputs path/to/agent-outputs \
      --eval-criteria ROOT_CAUSE_ENTITY ROOT_CAUSE_REASONING

Run a FinOps evaluation:

    uv run itbench-evaluations \
      --domain finops \
      --ground-truth path/to/finops_ground_truths.json \
      --outputs path/to/finops-agent-outputs

FinOps evaluates cost anomaly RCA by comparing predicted resources against ground-truth resources using name and type matching. It produces a single binary metric (`root_cause_resource`: 0 or 1). If the agent output is raw text rather than structured JSON, an LLM-based preprocessing step extracts resource information automatically.

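The name-and-type matching behind the FinOps metric can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's implementation: the function name `root_cause_resource_score` is hypothetical, matching is assumed to be case-insensitive, and extra predicted resources are assumed not to be penalized.

```python
from typing import Iterable, Mapping, Tuple


def _key(resource: Mapping[str, object]) -> Tuple[str, str]:
    # Assumption: names and types are compared case-insensitively after trimming.
    name = str(resource.get("name", "")).strip().lower()
    kind = str(resource.get("type", "")).strip().lower()
    return name, kind


def root_cause_resource_score(
    predicted: Iterable[Mapping[str, object]],
    ground_truth: Iterable[Mapping[str, object]],
) -> int:
    """Return 1 if every ground-truth (name, type) pair appears in the predictions, else 0."""
    expected = {_key(r) for r in ground_truth}
    found = {_key(r) for r in predicted}
    # An empty ground truth is scored 0 here; that is a choice made for this sketch.
    return int(bool(expected) and expected <= found)


if __name__ == "__main__":
    gt = [{"name": "vm-a", "type": "compute"}]
    pred = [{"name": "VM-A", "type": "Compute"}, {"name": "db-1", "type": "database"}]
    print(root_cause_resource_score(pred, gt))
```

Comparing sets of `(name, type)` pairs keeps the score independent of the order in which an agent lists resources.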
Run a CISO evaluation:

    uv run itbench-evaluations \
      --domain ciso \
      --scenario-dir path/to/ITBench-Lite/snapshots/ciso/v0.1/k8s-opa-static-cis-5.1.1 \
      --outputs path/to/ciso-agent-outputs \
      --ground-truth dummy  # Not used for CISO

CISO evaluates Kubernetes compliance checking solutions using deterministic OPA (Open Policy Agent) evaluation instead of LLM-as-a-Judge. It produces 4 binary metrics:

- `required_files_exist`: Checks if `fetch.sh` and `policy.rego` exist.
- `fetch_generates_data`: Verifies `fetch.sh` creates a valid `collected_data.json`.
- `violation_detection`: OPA policy correctly detects violations (returns `false`).
- `compliance_validation`: OPA policy confirms compliance with compliant resources (returns `true`).

Note: CISO does not use ground truth files; OPA evaluation is deterministic. The `--scenario-dir` must point to a scenario containing a `static-resources-compliant/` directory.

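As a rough illustration of two of the CISO checks, the sketch below tests for the required files and maps OPA policy results to the two policy metrics. The function names and the idea of passing in already-evaluated boolean results are assumptions made for this example; running `fetch.sh` and invoking OPA are deliberately left out.

```python
from pathlib import Path
from typing import Dict, Union

# File names come from the metric description above.
REQUIRED_FILES = ("fetch.sh", "policy.rego")


def required_files_exist(solution_dir: Union[str, Path]) -> bool:
    """Hypothetical check: both required files are present in the solution directory."""
    root = Path(solution_dir)
    return all((root / name).is_file() for name in REQUIRED_FILES)


def policy_metrics(result_on_violating: bool, result_on_compliant: bool) -> Dict[str, int]:
    """Map OPA policy results (assumed to be booleans) to binary metrics.

    A correct policy returns False on violating resources and True on compliant ones.
    """
    return {
        "violation_detection": int(result_on_violating is False),
        "compliance_validation": int(result_on_compliant is True),
    }


if __name__ == "__main__":
    print(required_files_exist("."))
    print(policy_metrics(result_on_violating=False, result_on_compliant=True))
```

Because both checks are plain predicates over files and booleans, the same inputs always produce the same scores, which matches the deterministic behavior described for this domain.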
- `--domain` selects the evaluation domain: `sre` (default), `finops`, or `ciso`.
- `--result-file` sets the output file (default: `evaluation_results.json`).
- `--eval-criteria` accepts any of the following:
  - SRE: `ROOT_CAUSE_ENTITY`, `ROOT_CAUSE_REASONING`, `PROPAGATION_CHAIN`, `FAULT_LOCALIZATION`, `ROOT_CAUSE_REASONING_PARTIAL`, `ROOT_CAUSE_PROXIMITY`, `ROOT_CAUSE_PROXIMITY_FP`.
  - FinOps: `ROOT_CAUSE_RESOURCE`.
  - CISO: `required_files_exist`, `fetch_generates_data`, `violation_detection`, `compliance_validation`.
- `--scenario-dir` (required for CISO) specifies the scenario directory containing compliant resources.
- `--k` is kept for backward compatibility (entity@k metrics are derived from `ROOT_CAUSE_ENTITY`).
- `--max-concurrent` controls evaluation concurrency (default: 5).
- `--verbose` enables debug logging.

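When evaluations are launched from a script, the options above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the package; it only uses flags listed in this README and assumes `uv` is on the `PATH`.

```python
from typing import List, Optional, Sequence


def build_eval_command(
    ground_truth: str,
    outputs: str,
    domain: str = "sre",
    eval_criteria: Optional[Sequence[str]] = None,
    result_file: Optional[str] = None,
    max_concurrent: Optional[int] = None,
    scenario_dir: Optional[str] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for the itbench-evaluations CLI (illustrative helper)."""
    if domain == "ciso" and scenario_dir is None:
        # The README states that --scenario-dir is required for CISO.
        raise ValueError("--scenario-dir is required when domain is 'ciso'")
    cmd = [
        "uv", "run", "itbench-evaluations",
        "--domain", domain,
        "--ground-truth", str(ground_truth),
        "--outputs", str(outputs),
    ]
    if scenario_dir is not None:
        cmd += ["--scenario-dir", str(scenario_dir)]
    if eval_criteria:
        cmd += ["--eval-criteria", *eval_criteria]
    if result_file is not None:
        cmd += ["--result-file", str(result_file)]
    if max_concurrent is not None:
        cmd += ["--max-concurrent", str(max_concurrent)]
    if verbose:
        cmd.append("--verbose")
    return cmd


if __name__ == "__main__":
    print(" ".join(build_eval_command("gt.json", "agent-outputs", eval_criteria=["ROOT_CAUSE_ENTITY"])))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`; using a list rather than a single string avoids shell-quoting problems with paths that contain spaces.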
The CLI writes a JSON report containing raw results and aggregated statistics:

    {
      "raw_results": [
        {"incident_id": "1", "trial_id": 0, "scores": { "...": "..." }}
      ],
      "statistics": {
        "per_incident": { "...": "..." },
        "overall": { "...": "..." }
      },
      "config": {
        "domain": "sre",
        "ground_truth_path": "...",
        "outputs_path": "...",
        "eval_criteria": ["..."],
        "k": 3
      }
    }

`aggregator.py` computes macro-level mean, stderr, and pass@1 (for categorical metrics) for the following:

- `root_cause_entity` (precision/recall/F1 + pass@1): Whether the correct root cause entity was identified.
- `root_cause_entity_k` (precision/recall/F1 + pass@1, configurable `k`): Whether the correct root cause entity was identified in the first k (k = 1, ..., 5) model predictions.
- `root_cause_reasoning`: Whether the reasoning for the root cause was correct (0, 0.5, or 1).
- `propagation_chain`: Scores the full propagation chain.
- `fault_localization_component_identification`: Checks if the model correctly identified the first semantic component to exhibit a significant failure symptom.
- `root_cause_reasoning_partial`: Awards partial credit for reasoning if the model correctly analyzed a downstream symptom when it missed the root cause entity.
- `root_cause_proximity` (precision/recall/F1): Computes closeness between model root cause entities and the ground-truth (GT) root-cause entities, based on the distance (number of hops) between the model entity's component and any GT root-cause component.
- `root_cause_proximity_with_fp` (precision/recall/F1): Similar to `root_cause_proximity_no_fp`, but distance is relative to the GT path length.
- `root_cause_resource` (binary 0/1 + pass@1): Whether all cost anomaly root cause resources were correctly identified by matching `name` and `type` against the ground truth resource list.
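
For readers who want to sanity-check the reported numbers, the sketch below shows one way a macro-level mean and standard error could be derived from `raw_results`. It is not the code in `aggregator.py`: the function name `macro_stats` is hypothetical, `scores` is assumed to map a metric name directly to a number (the report example above elides its contents), and "macro" is assumed to mean averaging per-incident means. pass@1 is not covered.

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import Dict, List, Optional


def macro_stats(raw_results: List[dict], metric: str) -> Dict[str, Optional[float]]:
    """Average a metric per incident, then across incidents (assumed macro definition)."""
    per_incident: Dict[str, List[float]] = defaultdict(list)
    for row in raw_results:
        value = row.get("scores", {}).get(metric)
        if isinstance(value, (int, float)):  # skip missing or non-numeric scores
            per_incident[str(row["incident_id"])].append(float(value))

    incident_means = [mean(values) for values in per_incident.values()]
    n = len(incident_means)
    if n == 0:
        return {"mean": None, "stderr": None, "n_incidents": 0}
    # Standard error of the mean over incidents; defined as 0.0 for a single incident.
    stderr = stdev(incident_means) / math.sqrt(n) if n > 1 else 0.0
    return {"mean": mean(incident_means), "stderr": stderr, "n_incidents": n}


if __name__ == "__main__":
    results = [
        {"incident_id": "1", "trial_id": 0, "scores": {"root_cause_reasoning": 1}},
        {"incident_id": "1", "trial_id": 1, "scores": {"root_cause_reasoning": 0}},
        {"incident_id": "2", "trial_id": 0, "scores": {"root_cause_reasoning": 1}},
    ]
    print(macro_stats(results, "root_cause_reasoning"))
```

Averaging within each incident first keeps incidents with many trials from dominating the overall figure.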