ITBench Evaluations

Toolkit for running LLM-as-a-Judge evaluations on ITBench root-cause analysis (RCA) agent outputs. Supports both SRE (incident RCA) and FinOps (cost anomaly RCA) evaluation domains. The CLI loads ground-truth scenarios, reads agent trial outputs, repairs malformed JSON when possible, scores multiple metrics with a judge model, and aggregates macro statistics.

Repository layout

Path	Description
`itbench_evaluations/__main__.py`	CLI entrypoint (`python -m itbench_evaluations`).
`itbench_evaluations/agent.py`	Judge workflow and evaluation logic
`itbench_evaluations/client.py`	OpenAI-compatible judge client and model selection.
`itbench_evaluations/loader.py`	Ground-truth and agent output loading.
`itbench_evaluations/aggregator.py`	Computes per-incident and overall statistics.
`itbench_evaluations/json_fixer.py`	JSON repair utilities for malformed model outputs.
`itbench_evaluations/namespace_filter.py`	Optional post-processing helpers for filtering and recalculating metrics.
`itbench_evaluations/prompts/main.py`	SRE system prompt and evaluation template.
`itbench_evaluations/prompts/finops_main.py`	FinOps system prompt, evaluation template, and parse template.
`itbench_evaluations/prompts/`	Per-metric criterion templates (entity, reasoning, proximity, etc.).
`itbench_evaluations/data/`	Example incident output hierarchy.
`.env.tmpl`	Environment configuration template (copy to `.env`).
`pyproject.toml`	Python package configuration and dependencies.

Quick start

Install prerequisites

uv sync

# Install huggingface_hub for downloading datasets
uv pip install huggingface_hub

Configure environment variables

cp .env.tmpl .env
# Edit .env with your actual values

The judge client reads .env via python-dotenv. Required variables:

Purpose	Keys
Judge API	`JUDGE_API_KEY` (defaults to `"dummy"` for local endpoints)
Judge model	`JUDGE_MODEL` (defaults to `gpt-4-turbo`)
Optional base URL	`JUDGE_BASE_URL` (OpenAI-compatible endpoint)

Download ITBench-Lite dataset

Download benchmark scenarios from Hugging Face (https://huggingface.co/datasets/ibm-research/ITBench-Lite):

# Download all scenarios (50 total: 35 SRE + 15 FinOps)
uv run hf download \
  ibm-research/ITBench-Lite \
  --repo-type dataset \
  --local-dir ./ITBench-Lite

# Or download only specific scenarios (faster for testing)
uv run hf download \
  ibm-research/ITBench-Lite \
  --repo-type dataset \
  --include "snapshots/sre/v0.2-*/Scenario-1/*" \
  --include "snapshots/sre/v0.2-*/Scenario-2/*" \
  --local-dir ./ITBench-Lite

Prepare data Ground truth supports:
- A single JSON/YAML file with one incident (must include id or the filename is used).
- A JSON/YAML list of incidents (each must include id).
- A directory of scenario subfolders containing ground_truth.yaml|yml|json (or gt.yaml|json).
- A consolidated ground_truths.json inside a directory.
Agent outputs should be structured as:
```
agent-outputs/
  <incident_id>/
    <trial_number>/
      outputs/agent_output.json
      # or outputs/agent_response.json
      # or agent_output.json / agent_response.json
```
Supported incident folder naming is case-insensitive and accepts patterns like scenario-i<id>, scenario_<id>, scenario-<id>, scenario<id>, and incident-<id>.

Running evaluations

SRE evaluation (default)

uv run itbench-evaluations \
  --ground-truth path/to/ground_truths.json \
  --outputs path/to/agent-outputs \
  --eval-criteria ROOT_CAUSE_ENTITY ROOT_CAUSE_REASONING

FinOps evaluation

uv run itbench-evaluations \
  --domain finops \
  --ground-truth path/to/finops_ground_truths.json \
  --outputs path/to/finops-agent-outputs

FinOps evaluates cost anomaly RCA by comparing predicted resources against ground-truth resources using name and type matching. It produces a single binary metric (root_cause_resource: 0 or 1). If agent output is raw text rather than structured JSON, an LLM-based preprocessing step extracts resource information automatically.

CISO evaluation

uv run itbench-evaluations \
  --domain ciso \
  --scenario-dir path/to/ITBench-Lite/snapshots/ciso/v0.1/k8s-opa-static-cis-5.1.1 \
  --outputs path/to/ciso-agent-outputs \
  --ground-truth dummy  # Not used for CISO

CISO evaluates Kubernetes compliance checking solutions using deterministic OPA (Open Policy Agent) evaluation instead of LLM-as-a-Judge. It produces 4 binary metrics:

required_files_exist: Checks if fetch.sh and policy.rego exist
fetch_generates_data: Verifies fetch.sh creates valid collected_data.json
violation_detection: OPA policy correctly detects violations (returns false)
compliance_validation: OPA policy confirms compliance with compliant resources (returns true)

Note: CISO does not use ground truth files - OPA evaluation is deterministic. The --scenario-dir must point to a scenario containing static-resources-compliant/ directory.

Key options

--domain selects the evaluation domain: sre (default), finops, or ciso.
--result-file sets the output file (default: evaluation_results.json).
--eval-criteria accepts any of the following:
- SRE: ROOT_CAUSE_ENTITY, ROOT_CAUSE_REASONING, PROPAGATION_CHAIN, FAULT_LOCALIZATION, ROOT_CAUSE_REASONING_PARTIAL, ROOT_CAUSE_PROXIMITY, ROOT_CAUSE_PROXIMITY_FP.
- FinOps: ROOT_CAUSE_RESOURCE.
- CISO: required_files_exist, fetch_generates_data, violation_detection, compliance_validation.
--scenario-dir (required for CISO) specifies the scenario directory containing compliant resources.
--k is kept for backward compatibility (entity@k metrics are derived from ROOT_CAUSE_ENTITY).
--max-concurrent controls evaluation concurrency (default: 5).
--verbose enables debug logging.

Outputs

The CLI writes a JSON report containing raw results and aggregated statistics:

{
  "raw_results": [
    {"incident_id": "1", "trial_id": 0, "scores": { "...": "..." }}
  ],
  "statistics": {
    "per_incident": { "...": "..." },
    "overall": { "...": "..." }
  },
  "config": {
    "domain": "sre",
    "ground_truth_path": "...",
    "outputs_path": "...",
    "eval_criteria": ["..."],
    "k": 3
  }
}

Metrics covered

aggregator.py computes macro-level mean, stderr, and pass@1 (for categorical metrics) for the following:

SRE metrics

root_cause_entity (precision/recall/F1 + pass@1): Whether the correct root cause entity was identified
root_cause_entity_k (precision/recall/F1 + pass@1, configurable k): Whether the correct root cause entity was identified in the first k=(1,..,5) model predictions
root_cause_reasoning: Whether the reasoning for the root cause was correct (0, 0.5 or 1).
propagation_chain: Scores the full propagation chain
fault_localization_component_identification: Checks if the model correctly identified the first semantic component to exhibit a significant failure symptom
root_cause_reasoning_partial: Awards partial credit for reasoning if the model correctly analyzed a downstream symptom when it missed the root cause entity.
root_cause_proximity (precision/recall/F1): Compute closeness between model root cause entities and the Ground-Truth (GT) root-cause entities based on distance (number of hops) between the model entity's component and any GT root-cause component
root_cause_proximity_with_fp (precision/recall/F1): Similar to root_cause_proximity_no_fp but distance is relative to the GT path length

FinOps metrics

root_cause_resource (binary 0/1 + pass@1): Whether all cost anomaly root cause resources were correctly identified by matching name and type against the ground truth resource list

Name		Name	Last commit message	Last commit date
Latest commit History 19 Commits
itbench_evaluations		itbench_evaluations
.env.tmpl		.env.tmpl
.gitignore		.gitignore
DESIGN.md		DESIGN.md
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ITBench Evaluations

Repository layout

Quick start

Running evaluations

SRE evaluation (default)

FinOps evaluation

CISO evaluation

Key options

Outputs

Metrics covered

SRE metrics

FinOps metrics

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ITBench Evaluations

Repository layout

Quick start

Running evaluations

SRE evaluation (default)

FinOps evaluation

CISO evaluation

Key options

Outputs

Metrics covered

SRE metrics

FinOps metrics

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages