3 changes: 3 additions & 0 deletions src/aks-agent/.gitignore
@@ -0,0 +1,3 @@
# Ignore Poetry artifacts
poetry.lock
pyproject.toml
142 changes: 142 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/README.md
@@ -0,0 +1,142 @@
# AKS Agent Evals

## Environment Setup

Create and activate a virtual environment (example shown for bash-compatible shells):

```bash
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -e .
```

Optional tooling used by the eval harness (Braintrust uploads and semantic classifier helpers):

```bash
python -m pip install braintrust openai autoevals
```

## Running Live Scenarios

```bash
RUN_LIVE=true \
MODEL=azure/gpt-4.1 \
CLASSIFIER_MODEL=azure/gpt-4o \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 01_list_all_nodes -m aks_eval
```

Per-scenario overrides (`resource_group`, `cluster_name`, `kubeconfig`, `test_env_vars`) still apply. Use `--skip-setup` or `--skip-cleanup` to bypass hooks. Expect the test to log iteration progress, classifier scores, and (on the final iteration) a Braintrust link when uploads are enabled.
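
For example, to rerun a scenario against an already-prepared cluster and leave it in place afterwards, pass both skip flags (a sketch; it assumes the same Azure variables as the live run above are still exported):

```bash
# Skip the scenario's setup and cleanup hooks; reuse existing cluster state
RUN_LIVE=true MODEL=azure/gpt-4.1 \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py \
  -k 01_list_all_nodes -m aks_eval --skip-setup --skip-cleanup
```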

**Example output (live run with classifier)**

```
[iteration 1/3] running setup commands for 01_list_all_nodes
[iteration 1/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 1/3] classifier score for 01_list_all_nodes: 1
[iteration 2/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 2/3] classifier score for 01_list_all_nodes: 1
[iteration 3/3] invoking AKS Agent CLI for 01_list_all_nodes
[iteration 3/3] classifier score for 01_list_all_nodes: 1
...
🔍 Braintrust: https://www.braintrust.dev/app/<org>/p/aks-agent/experiments/aks-agent/...
```

## Mock Workflow

```bash
# Generate fresh mocks from a live run
RUN_LIVE=true GENERATE_MOCKS=true \
MODEL=azure/gpt-4.1 \
AKS_AGENT_RESOURCE_GROUP=<rg> \
AKS_AGENT_CLUSTER=<cluster> \
KUBECONFIG=<path-to-kubeconfig> \
AZURE_API_KEY=<key> \
AZURE_API_BASE=<endpoint> \
AZURE_API_VERSION=<version> \
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval

# Re-run offline using the recorded response
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -k 02_list_clusters -m aks_eval
```

If a mock is missing, pytest skips the scenario with instructions to regenerate it.

**Regression guardrails**

- Mocked answers make iterations deterministic, so you can update parsing or prompts without waiting on live infrastructure.
- If you check in a new mock after behavior changes, reviewers see the exact diff in `mocks/response.txt`, making regressions obvious.
- CI can run with `RUN_LIVE` off by default, catching logical regressions early without needing cluster credentials; a sketch of such a run is shown below.
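
A minimal offline invocation of that kind needs no cluster or Azure credentials:

```bash
# With RUN_LIVE unset/false, scenarios replay mocks/response.txt instead of calling the live agent
python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -m aks_eval
```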


**Example skip (no mock present)**

```
azext_aks_agent/tests/evals/test_ask_agent.py::test_ask_agent_live[02_list_clusters]
SKIPPED: Mock response missing for scenario 02_list_clusters; rerun with RUN_LIVE=true GENERATE_MOCKS=true
```

## Braintrust Uploads

Set the following environment variables to push results:

- `BRAINTRUST_API_KEY` and `BRAINTRUST_ORG` (required)
- Optional overrides: `BRAINTRUST_PROJECT` (default `aks-agent`), `BRAINTRUST_DATASET` (default `aks-agent/ask`), `EXPERIMENT_ID`

Each iteration logs to Braintrust; when uploads succeed, the console prints a clickable link (in terminals that support hyperlinks).
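
A typical shell setup before invoking pytest (all values are placeholders):

```bash
export BRAINTRUST_API_KEY=<key>
export BRAINTRUST_ORG=<org>
# Optional overrides; these are the defaults
export BRAINTRUST_PROJECT=aks-agent
export BRAINTRUST_DATASET=aks-agent/ask
# Leave EXPERIMENT_ID unset to get a fresh experiment name per run
```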

**Tips**

- Leave `EXPERIMENT_ID` unset to generate a fresh experiment name each run (`aks-agent/<model>/<run-id>`).
- Use `BRAINTRUST_RUN_ID=<custom>` if you want deterministic experiment names across retries.
- The upload payload includes classifier score, rationale, raw CLI output, cluster, and resource group metadata for later filtering.

## Semantic Classifier

- Enabled by default; set `ENABLE_CLASSIFIER=false` to opt out.
- Requires Azure OpenAI credentials: `AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`, and a classifier deployment specified via `CLASSIFIER_MODEL` (e.g. `azure/<deployment>`). Defaults to the same deployment as `MODEL` when not provided.
- Install the classifier dependencies ahead of time if they are not already present (see Environment Setup above).

- Scenarios can override the grading style by adding:

```yaml
evaluation:
  correctness:
    type: loose # or strict (default)
```
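
On the command line, the judge deployment can be swapped or the classifier disabled for a single run (the deployment name below is a placeholder):

```bash
CLASSIFIER_MODEL=azure/<classifier-deployment> python -m pytest ... -m aks_eval
ENABLE_CLASSIFIER=false python -m pytest ... -m aks_eval
```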

Classifier scores and rationales are attached to Braintrust uploads and printed in the pytest output metadata.

**Debugging classifiers**

```bash
python -m pytest ... -o log_cli=true -o log_cli_level=DEBUG -s
```

Look for `classifier score ...` lines to confirm the semantic judge executed.

## Iterations & Tags

- `ITERATIONS=<n>` repeats every scenario `n` times, which is useful for non-deterministic models.
- Filter suites with pytest markers: `-m aks_eval`, `-m easy`, `-m medium`, and so on; markers combine with iterations as in the example below.
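
For example, three passes over only the easy scenarios:

```bash
ITERATIONS=3 python -m pytest azext_aks_agent/tests/evals/test_ask_agent.py -m "aks_eval and easy"
```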

## Troubleshooting

- Missing mocks: rerun with `RUN_LIVE=true GENERATE_MOCKS=true`.
- Cleanup always executes unless `--skip-cleanup` is provided; check the `[cleanup]` log line.
- "Braintrust disabled" messages mean credentials or the SDK are missing.
- "Classifier disabled" messages usually indicate missing Azure settings (`AZURE_API_BASE`, `AZURE_API_KEY`, `AZURE_API_VERSION`).

## Quick Checklist

- Install dependencies inside a virtual environment (`python -m pip install -e .`) and, if needed, the optional tooling (`python -m pip install braintrust openai autoevals`).
- `RUN_LIVE=true`: set Azure creds (`AZURE_API_KEY`, `AZURE_API_BASE`, `AZURE_API_VERSION`), `MODEL`, kubeconfig, and optional Braintrust vars.
- `RUN_LIVE` unset/false: ensure each scenario directory has `mocks/response.txt`.
- Classifier overrides: `CLASSIFIER_MODEL` (defaults to `MODEL`) and per-scenario `evaluation.correctness.type`.
- Optional: `BRAINTRUST_RUN_ID=<identifier>` to reuse experiment names across retries.
4 changes: 4 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/__init__.py
@@ -0,0 +1,4 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------
216 changes: 216 additions & 0 deletions src/aks-agent/azext_aks_agent/tests/evals/braintrust_uploader.py
@@ -0,0 +1,216 @@
# --------------------------------------------------------------------------------------------
# Copyright (c) Microsoft Corporation. All rights reserved.
# Licensed under the MIT License. See License.txt in the project root for license information.
# --------------------------------------------------------------------------------------------

from __future__ import annotations

import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional
from urllib.parse import quote

LOGGER = logging.getLogger(__name__)


@dataclass
class BraintrustMetadata:
    project: str
    dataset: str
    experiment: Optional[str]
    api_key: str
    org: str


class BraintrustUploader:
    """Uploads eval results to Braintrust when credentials and SDK are available."""

    def __init__(self, env: Mapping[str, str | None]) -> None:
        self._env = env
        self._metadata = self._load_metadata(env)
        self._enabled = self._metadata is not None
        self._braintrust = None
        self._dataset = None
        self._experiments: Dict[str, Any] = {}
        self._warning_emitted = False

    @staticmethod
    def _load_metadata(env: Mapping[str, str | None]) -> Optional[BraintrustMetadata]:
        api_key = env.get("BRAINTRUST_API_KEY") or ""
        org = env.get("BRAINTRUST_ORG") or ""
        if not api_key or not org:
            return None
        project = env.get("BRAINTRUST_PROJECT") or "aks-agent"
        dataset = env.get("BRAINTRUST_DATASET") or "aks-agent/ask"
        experiment = env.get("EXPERIMENT_ID")
        return BraintrustMetadata(
            project=project,
            dataset=dataset,
            experiment=experiment,
            api_key=api_key,
            org=org,
        )

    @property
    def enabled(self) -> bool:
        return self._enabled

    def _warn_once(self, message: str) -> None:
        if not self._warning_emitted:
            LOGGER.warning("[braintrust] %s", message)
            self._warning_emitted = True

    def _ensure_braintrust(self) -> bool:
        if self._braintrust is not None:
            return True
        if not self._enabled or not self._metadata:
            return False
        try:
            import braintrust  # type: ignore
        except ImportError:
            self._warn_once(
                "braintrust package not installed; skipping Braintrust uploads"
            )
            self._enabled = False
            return False

        # Configure environment for braintrust SDK
        os.environ.setdefault("BRAINTRUST_API_KEY", self._metadata.api_key)
        os.environ.setdefault("BRAINTRUST_ORG", self._metadata.org)
        self._braintrust = braintrust
        return True

    def _ensure_dataset(self) -> Optional[Any]:
        if not self._ensure_braintrust():
            return None
        if self._dataset is None and self._metadata:
            try:
                self._dataset = self._braintrust.init_dataset(  # type: ignore[attr-defined]
                    project=self._metadata.project,
                    name=self._metadata.dataset,
                )
            except Exception as exc:  # pragma: no cover - SDK specific failure
                self._warn_once(f"Unable to initialise Braintrust dataset: {exc}")
                self._enabled = False
                return None
        return self._dataset

    def _get_experiment(self, experiment_name: str) -> Optional[Any]:
        if experiment_name in self._experiments:
            return self._experiments[experiment_name]
        dataset = self._ensure_dataset()
        if dataset is None or not self._metadata:
            return None
        try:
            experiment = self._braintrust.init(  # type: ignore[attr-defined]
                project=self._metadata.project,
                experiment=experiment_name,
                dataset=dataset,
                open=False,
                update=True,
                metadata={"aks_agent": True},
            )
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Unable to initialise Braintrust experiment: {exc}")
            self._enabled = False
            return None
        self._experiments[experiment_name] = experiment
        return experiment

    def _build_url(
        self,
        experiment_name: str,
        span_id: Optional[str],
        root_span_id: Optional[str],
    ) -> Optional[str]:
        if not self._metadata:
            return None
        encoded_exp = quote(experiment_name, safe="")
        base = (
            f"https://www.braintrust.dev/app/{self._metadata.org}/p/"
            f"{self._metadata.project}/experiments/{encoded_exp}"
        )
        if span_id and root_span_id:
            return f"{base}?r={span_id}&s={root_span_id}"
        return base

    def record(
        self,
        *,
        scenario_name: str,
        iteration: int,
        total_iterations: int,
        prompt: str,
        answer: str,
        expected_output: list[str],
        model: str,
        tags: list[str],
        passed: bool,
        run_live: bool,
        raw_output: str,
        resource_group: str,
        cluster_name: str,
        error_message: Optional[str] = None,
        classifier_score: Optional[float] = None,
        classifier_rationale: Optional[str] = None,
    ) -> Optional[Dict[str, Optional[str]]]:
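        """Log one eval iteration to Braintrust.

        Returns span identifiers plus a link URL, or None when uploads are
        disabled or the SDK call fails.
        """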
        if not self._enabled:
            return None
        metadata = self._metadata
        if not metadata:
            return None

        if metadata.experiment:
            experiment_name = metadata.experiment
        else:
            # Derive a run token: explicit override, then CI run id, else a process-local fallback
            iteration_token = (
                os.environ.get("BRAINTRUST_RUN_ID")
                or os.environ.get("GITHUB_RUN_ID")
                or os.environ.get("CI_PIPELINE_ID")
            )
            if not iteration_token:
                iteration_token = f"{model}-{os.getpid()}"
            experiment_name = f"aks-agent/{model}/{iteration_token}"
        experiment = self._get_experiment(experiment_name)
        if experiment is None:
            return None

        span = experiment.start_span(
            name=f"{scenario_name} [iter {iteration + 1}/{total_iterations}]"
        )
        # Per-span metadata attached to the uploaded record
        span_metadata: Dict[str, Any] = {
            "raw_output": raw_output,
            "resource_group": resource_group,
            "cluster_name": cluster_name,
            "error": error_message,
        }
        if classifier_score is not None:
            span_metadata["classifier_score"] = classifier_score
        if classifier_rationale:
            span_metadata["classifier_rationale"] = classifier_rationale

        span.log(
            input=prompt,
            output=answer,
            expected="\n".join(expected_output),
            dataset_record_id=scenario_name,
            scores={
                "correctness": 1 if passed else 0,
                "classifier": classifier_score,
            },
            tags=list(tags) + [f"model:{model}", f"run_live:{run_live}"],
            metadata=span_metadata,
        )
        span_id = getattr(span, "id", None)
        root_span_id = getattr(span, "root_span_id", None)
        span.end()
        try:
            experiment.flush()
        except Exception as exc:  # pragma: no cover - SDK specific failure
            self._warn_once(f"Failed flushing Braintrust experiment: {exc}")
            self._enabled = False
            return None
        return {
            "span_id": span_id,
            "root_span_id": root_span_id,
            "url": self._build_url(experiment_name, span_id, root_span_id),
            "classifier_score": classifier_score,
            "classifier_rationale": classifier_rationale,
        }