From 0426aeab2997e9f08bd2b4c5e25850220a19f84b Mon Sep 17 00:00:00 2001
From: Timo Derstappen <teemow@gmail.com>
Date: Tue, 23 Jun 2026 15:31:59 +0200
Subject: [PATCH] feat(agents): port eval-tuned muster prompt into sre-agent
 systemMessage
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Give the bundled sre-agent the same muster meta-tool contract that was
eval-tuned for the Backstage devportal AI chat (giantswarm/backstage#1775):
a preference for a single-call workflow_<name> discovered via
filter_tools(query=…), the exact x_kubernetes_* argument contract
(management_cluster: <mc>-mcp-kubernetes, resourceType, podName, tailLines),
and the cheap-listing recipe. The agent shares muster's meta-tool interface
with the devportal chat, so the same guidance cuts wrong-arg retries and
avoidable tool round trips.

Clusters that enable the agent without overriding systemMessage (e.g. glean)
inherit the improved default.

Co-authored-by: Cursor <cursoragent@cursor.com>
---
 CHANGELOG.md                                  |  1 +
 .../agentic-platform-connectivity/values.yaml | 31 +++++++++++++++++++
 2 files changed, 32 insertions(+)
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0b9f7ac..1329b6f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 
+- `agents.definitions.sre-agent`: the default `systemMessage` now carries the muster meta-tool contract that was eval-tuned for the Backstage devportal AI chat ([giantswarm/backstage#1775](https://github.com/giantswarm/backstage/issues/1775)): a preference for a single-call `workflow_<name>` (discovered via `filter_tools(query=…)`), the exact `x_kubernetes_*` argument contract (`management_cluster: <mc>-mcp-kubernetes`, `resourceType`, `podName`, `tailLines`), and the cheap-listing recipe (`summary: true`, `fieldSelector: status.phase!=Running`, and an events query with `fieldSelector: reason=BackOff` for CrashLoopBackOff). The agent shares muster's meta-tool interface with the devportal chat, so the same guidance reduces wrong-arg retries and avoidable tool round trips. Clusters that enable the agent without overriding `systemMessage` (e.g. glean) inherit the new default.
 - `agentic-platform-connectivity`: the agentgateway data-plane container now sets `ephemeral-storage` requests/limits (new `gateway.parameters.dataPlaneResources` value, merged onto the generated proxy container via `AgentgatewayParameters`). The agentgateway controller injects a writable `/tmp` emptyDir (readOnlyRootFilesystem) without a `sizeLimit`, which tripped the `require-emptydir-requests-and-limits` Kyverno policy; declaring ephemeral-storage on the mounting container clears the audit warning. Part of the namespace-wide Kyverno cleanup (giantswarm/giantswarm#36885).
 - The bundled `valkey` (`muster-valkey`) component now documents that its RDB-only persistence is a deliberate cache-only choice and warns against enabling AOF to "fix" data loss: AOF lives on the same PVC so it does not survive PVC loss/recreation (the failure mode that actually occurs), and flipping `appendonly` in config + restart is a data-loss footgun (Valkey loads the empty AOF dir and ignores the RDB). Comment-only; no behaviour change. See [giantswarm/muster#884](https://github.com/giantswarm/muster/issues/884).
 - `agentic-platform-connectivity`: the kagent controller `NetworkPolicy` (both `cilium` and `kubernetes` flavors) scopes intra-namespace ingress to agent pods (`app: kagent`) and the kagent UI on the controller API port (8083), instead of allowing every pod in the kagent namespace on all ports.
diff --git a/helm/agentic-platform-connectivity/values.yaml b/helm/agentic-platform-connectivity/values.yaml
index d78f290..279669c 100644
--- a/helm/agentic-platform-connectivity/values.yaml
+++ b/helm/agentic-platform-connectivity/values.yaml
@@ -751,11 +751,42 @@ agents:
       # creates from kagent.providers.anthropic.
       modelConfig: default-model-config
       description: "Giant Swarm SRE agent"
+      # The muster-tool guidance below is the same contract that was eval-tuned
+      # for the Backstage devportal AI chat (giantswarm/backstage#1775): prefer a
+      # single-call workflow, use the exact x_kubernetes_* argument contract, and
+      # list cheaply. It cuts wrong-arg retries and avoidable tool round trips
+      # because this agent shares muster's meta-tool interface with that chat.
       systemMessage: |-
         You are a Giant Swarm SRE agent. You investigate cluster, workload, and
         observability issues using the tools exposed through the muster MCP
         gateway. Prefer read-only diagnosis; explain findings concisely and cite
         the evidence (resources, metrics, logs) you base conclusions on.
+
+        Your MCP tools come from an aggregator, muster, via meta-tools:
+        `filter_tools` discovers tools (pass a natural-language `query`; it returns
+        a short, relevance-ranked list), `describe_tool` returns a tool's input
+        schema, and `call_tool` runs a tool by its exact `name`.
+
+        Prefer a workflow. A `workflow_<name>` tool answers a whole question (pod
+        health, cluster health, a failing app) in a single call. Always look for
+        one first with `filter_tools(query="<the question's topic>")` and prefer it
+        over driving the raw `x_kubernetes_*` / `x_prometheus_*` tools yourself.
+        Fall back to raw tools only when no workflow fits.
+
+        Kubernetes tool contract. Raw Kubernetes tools are `x_kubernetes_*`
+        (`list`/`get`/`describe`/`logs`), with two easy-to-miss requirements:
+        - `management_cluster` is required, and its value is the full server name
+          `<mc>-mcp-kubernetes` — not the bare `<mc>`, and not `server`/`cluster`.
+          Prometheus tools take `<mc>-mcp-prometheus`.
+        - Argument names are exact: `x_kubernetes_list` selects with `resourceType`
+          (not `kind`); pod logs use `podName` (not `name`) and `tailLines`
+          (not `tail`).
+
+        List cheaply.
+        - `summary: true` for a per-kind overview; `fieldSelector:
+          status.phase!=Running` to surface only the pods needing attention.
+        - A CrashLoopBackOff pod is reported as `Running` by summary/phase — catch
+          it via events with `fieldSelector: reason=BackOff`.
       # Tools to expose from the agent's muster server. Empty = all tools the
       # server advertises (muster's meta-tools: list_tools, filter_tools, ...).
       toolNames: []