3 changes: 3 additions & 0 deletions src/aks-agent/.gitignore
@@ -0,0 +1,3 @@
# Ignore Poetry artifacts
poetry.lock
pyproject.toml
142 changes: 142 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/README.md
@@ -0,0 +1,142 @@
# AKS Agent Evals

## Environment Setup

Create and activate a virtual environment (example shown for bash-compatible shells):

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

Optional tooling used by the eval harness (Braintrust uploads and semantic classifier helpers):

```bash
python -m pip install braintrust openai autoevals
```

## Running Live Scenarios

```bash
RUN_LIVE=true \
MODEL=azure/gpt-4.1 \
CLASSIFIER_MODEL=azure/gpt-4o \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 01_list_all_nodes -m aks_eval
```

Per-scenario overrides (`resource_group`, `cluster_name`, `kubeconfig`, `test_env_vars`) still apply. Use `--skip-setup` or `--skip-cleanup` to bypass hooks. Expect the test to log iteration progress, classifier scores, and (on the final iteration) a Braintrust link when uploads are enabled.
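
For example, to rerun a scenario against an already-prepared cluster and leave it in place afterwards, pass both skip flags (a sketch; it assumes the same Azure variables as the live run above are still exported):

```bash
# Skip the scenario's setup and cleanup hooks; reuse existing cluster state
RUN_LIVE=true MODEL=azure/gpt-4.1 \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py \
  -k 01_list_all_nodes -m aks_eval --skip-setup --skip-cleanup
```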

**Example output (live run with classifier)**

```
[iteration 1/3] running setup commands for 01_list_all_nodes
[iteration 1/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 1/3] classifier score for 01_list_all_nodes: 1
[iteration 2/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 2/3] classifier score for 01_list_all_nodes: 1
[iteration 3/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 3/3] classifier score for 01_list_all_nodes: 1
...
🔍 Braintrust: https://www.braintrust.dev/app/<org>/p/aks-agent/experiments/aks-agent/...
```

## Mock Workflow

```bash
# Generate fresh mocks from a live run
RUN_LIVE=true GENERATE_MOCKS=true \
MODEL=azure/gpt-4.1 \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval

# Re-run offline using the recorded response
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval
```

If a mock is missing, pytest skips the scenario with instructions to regenerate it.

**Regression guardrails**

- Mocked answers make iterations deterministic, so you can update parsing or prompts without waiting on live infrastructure.
- If you check in a new mock after behavior changes, reviewers see the exact diff in `mocks/response.txt`, making regressions obvious.
- CI can run with `RUN_LIVE` off by default, catching logical regressions early without needing cluster credentials; a sketch of such a run is shown below.
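
A minimal offline invocation of that kind needs no cluster or Azure credentials:

```bash
# With RUN_LIVE unset/false, scenarios replay mocks/response.txt instead of calling the live agent
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -m aks_eval
```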


**Example skip (no mock present)**

```
azext_aks_agent/tests/evals/test_ask_agent.py::test_ask_agent_live[02_list_clusters]
SKIPPED: Mock response missing for scenario 02_list_clusters; rerun with RUN_LIVE=true GENERATE_MOCKS=true
```

## Braintrust Uploads

Set the following environment variables to push results:

- `BRAINTRUST_API_KEY` and `BRAINTRUST_ORG` (required)
- Optional overrides: `BRAINTRUST_PROJECT` (default `aks-agent`), `BRAINTRUST_DATASET` (default `aks-agent/ask`), `EXPERIMENT_ID`

Each iteration logs to Braintrust; when uploads succeed, the console prints a clickable link (in terminals that support hyperlinks).
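
A typical shell setup before invoking pytest (all values are placeholders):

```bash
export BRAINTRUST_API_KEY=<key>
export BRAINTRUST_ORG=<org>
# Optional overrides; these are the defaults
export BRAINTRUST_PROJECT=aks-agent
export BRAINTRUST_DATASET=aks-agent/ask
# Leave EXPERIMENT_ID unset to get a fresh experiment name per run
```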

**Tips**

- Leave `EXPERIMENT_ID` unset to generate a fresh experiment name each run (`aks-agent/<model>/<run-id>`).
- Use `BRAINTRUST_RUN_ID=<custom>` if you want deterministic experiment names across retries.
- The upload payload includes classifier score, rationale, raw CLI output, cluster, and resource group metadata for later filtering.

## Semantic Classifier

- Enabled by default; set `ENABLE_CLASSIFIER=false` to opt out.
- Requires Azure OpenAI credentials: `AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`, and a classifier deployment specified via `CLASSIFIER_MODEL` (e.g. `azure/<deployment>`). Defaults to the same deployment as `MODEL` when not provided.
- Install the classifier dependencies ahead of time if they are not already present (see Environment Setup above).

- Scenarios can override the grading style by adding:

```yaml
evaluation:
  correctness:
    type: loose # or strict (default)
```
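
On the command line, the judge deployment can be swapped or the classifier disabled for a single run (the deployment name below is a placeholder):

```bash
CLASSIFIER_MODEL=azure/<classifier-deployment> python -m pytest ... -m aks_eval
ENABLE_CLASSIFIER=false python -m pytest ... -m aks_eval
```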

Classifier scores and rationales are attached to Braintrust uploads and printed in the pytest output metadata.

**Debugging classifiers**

```bash
python -m pytest ... -o log_cli=true -o log_cli_level=DEBUG -s
```

Look for `classifier score ...` lines to confirm the semantic judge executed.

## Iterations & Tags

- `ITERATIONS=<n>` repeats every scenario `n` times, which is useful for non-deterministic models.
- Filter suites with pytest markers: `-m aks_eval`, `-m easy`, `-m medium`, and so on; markers combine with iterations as in the example below.
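
For example, three passes over only the easy scenarios:

```bash
ITERATIONS=3 python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -m "aks_eval and easy"
```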

## Troubleshooting

- Missing mocks: rerun with `RUN_LIVE=true GENERATE_MOCKS=true`.
- Cleanup always executes unless `--skip-cleanup` is provided; check the `[cleanup]` log line.
- "Braintrust disabled" messages mean credentials or the SDK are missing.
- "Classifier disabled" messages usually indicate missing Azure settings (`AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`).

## Quick Checklist

- Install dependencies inside a virtual environment (`python -m pip install -e .`) and, if needed, the optional tooling (`python -m pip install braintrust openai autoevals`).
- `RUN_LIVE=true`: set Azure creds (`AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`), `MODEL`, kubeconfig, and optional Braintrust vars.
- `RUN_LIVE` unset/false: ensure each scenario directory has `mocks/response.txt`.
- Classifier overrides: `CLASSIFIER_MODEL` (defaults to `MODEL`) and per-scenario `evaluation.correctness.type`.
- Optional: `BRAINTRUST_RUN_ID=<identifier>` to reuse experiment names across retries.
4 changes: 4 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/__init__.py
@@ -0,0 +1,4 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------
216 changes: 216 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/braintrust_uploader.py
@@ -0,0 +1,216 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from __future__ import annotations

import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional
from urllib.parse import quote

LOGGER = logging.getLogger(__name__)


@dataclass
class BraintrustMetadata:
    project: str
    dataset: str
    experiment: Optional[str]
    api_key: str
    org: str


class BraintrustUploader:
    """Uploads eval results to Braintrust when credentials and SDK are available."""

    def __init__(self, env: Mapping[str, str | None]) -> None:
        self._env = env
        self._metadata = self._load_metadata(env)
        self._enabled = self._metadata is not None
        self._braintrust = None
        self._dataset = None
        self._experiments: Dict[str, Any] = {}
        self._warning_emitted = False

    @staticmethod
    def _load_metadata(env: Mapping[str, str | None]) -> Optional[BraintrustMetadata]:
        api_key = env.get("BRAINTRUST_API_KEY") or ""
        org = env.get("BRAINTRUST_ORG") or ""
        if not api_key or not org:
            return None
        project = env.get("BRAINTRUST_PROJECT") or "aks-agent"
        dataset = env.get("BRAINTRUST_DATASET") or "aks-agent/ask"
        experiment = env.get("EXPERIMENT_ID")
        return BraintrustMetadata(
            project=project,
            dataset=dataset,
            experiment=experiment,
            api_key=api_key,
            org=org,
        )

    @property
    def enabled(self) -> bool:
        return self._enabled

    def _warn_once(self, message: str) -> None:
        if not self._warning_emitted:
            LOGGER.warning("[braintrust] %s", message)
            self._warning_emitted = True

    def _ensure_braintrust(self) -> bool:
        if self._braintrust is not None:
            return True
        if not self._enabled or not self._metadata:
            return False
        try:
            import braintrust  # type: ignore
        except ImportError:
            self._warn_once(
                "braintrust package not installed; skipping Braintrust uploads"
            )
            self._enabled = False
            return False

        # Configure environment for braintrust SDK
        os.environ.setdefault("BRAINTRUST_API_KEY", self._metadata.api_key)
        os.environ.setdefault("BRAINTRUST_ORG", self._metadata.org)
        self._braintrust = braintrust
        return True

    def _ensure_dataset(self) -> Optional[Any]:
        if not self._ensure_braintrust():
            return None
        if self._dataset is None and self._metadata:
            try:
                self._dataset = self._braintrust.init_dataset(  # type: ignore[attr-defined]
                    project=self._metadata.project,
                    name=self._metadata.dataset,
                )
            except Exception as exc:  # pragma: no cover - SDK specific failure
                self._warn_once(f"Unable to initialise Braintrust dataset: {exc}")
                self._enabled = False
                return None
        return self._dataset

    def _get_experiment(self, experiment_name: str) -> Optional[Any]:
        if experiment_name in self._experiments:
            return self._experiments[experiment_name]
        dataset = self._ensure_dataset()
        if dataset is None or not self._metadata:
            return None
        try:
            experiment = self._braintrust.init(  # type: ignore[attr-defined]
                project=self._metadata.project,
                experiment=experiment_name,
                dataset=dataset,
                open=False,
                update=True,
                metadata={"aks_agent": True},
            )
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Unable to initialise Braintrust experiment: {exc}")
            self._enabled = False
            return None
        self._experiments[experiment_name] = experiment
        return experiment

    def _build_url(
        self,
        experiment_name: str,
        span_id: Optional[str],
        root_span_id: Optional[str],
    ) -> Optional[str]:
        if not self._metadata:
            return None
        encoded_exp = quote(experiment_name, safe="")
        base = (
            f"https://www.braintrust.dev/app/{self._metadata.org}/p/"
            f"{self._metadata.project}/experiments/{encoded_exp}"
        )
        if span_id and root_span_id:
            return f"{base}?r={span_id}&s={root_span_id}"
        return base

    def record(
        self,
        *,
        scenario_name: str,
        iteration: int,
        total_iterations: int,
        prompt: str,
        answer: str,
        expected_output: list[str],
        model: str,
        tags: list[str],
        passed: bool,
        run_live: bool,
        raw_output: str,
        resource_group: str,
        cluster_name: str,
        error_message: Optional[str] = None,
        classifier_score: Optional[float] = None,
        classifier_rationale: Optional[str] = None,
    ) -> Optional[Dict[str, Optional[str]]]:
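        """Log one eval iteration to Braintrust.

        Returns span identifiers plus a link URL, or None when uploads are
        disabled or the SDK call fails.
        """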
        if not self._enabled:
            return None
        metadata = self._metadata
        if not metadata:
            return None

        if metadata.experiment:
            experiment_name = metadata.experiment
        else:
            # Derive a run token: explicit override, then CI run id, else a process-local fallback
            iteration_token = (
                os.environ.get("BRAINTRUST_RUN_ID")
                or os.environ.get("GITHUB_RUN_ID")
                or os.environ.get("CI_PIPELINE_ID")
            )
            if not iteration_token:
                iteration_token = f"{model}-{os.getpid()}"
            experiment_name = f"aks-agent/{model}/{iteration_token}"
        experiment = self._get_experiment(experiment_name)
        if experiment is None:
            return None

        span = experiment.start_span(
            name=f"{scenario_name} [iter {iteration + 1}/{total_iterations}]"
        )
        # Per-span metadata attached to the uploaded record
        span_metadata: Dict[str, Any] = {
            "raw_output": raw_output,
            "resource_group": resource_group,
            "cluster_name": cluster_name,
            "error": error_message,
        }
        if classifier_score is not None:
            span_metadata["classifier_score"] = classifier_score
        if classifier_rationale:
            span_metadata["classifier_rationale"] = classifier_rationale

        span.log(
            input=prompt,
            output=answer,
            expected="\n".join(expected_output),
            dataset_record_id=scenario_name,
            scores={
                "correctness": 1 if passed else 0,
                "classifier": classifier_score,
            },
            tags=list(tags) + [f"model:{model}", f"run_live:{run_live}"],
            metadata=span_metadata,
        )
        span_id = getattr(span, "id", None)
        root_span_id = getattr(span, "root_span_id", None)
        span.end()
        try:
            experiment.flush()
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Failed flushing Braintrust experiment: {exc}")
            self._enabled = False
            return None
        return {
            "span_id": span_id,
            "root_span_id": root_span_id,
            "url": self._build_url(experiment_name, span_id, root_span_id),
            "classifier_score": classifier_score,
            "classifier_rationale": classifier_rationale,
        }