From 0426aeab2997e9f08bd2b4c5e25850220a19f84b Mon Sep 17 00:00:00 2001
From: Timo Derstappen
Date: Tue, 23 Jun 2026 15:31:59 +0200
Subject: [PATCH] feat(agents): port eval-tuned muster prompt into sre-agent systemMessage
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Give the bundled sre-agent the same muster meta-tool contract that was
eval-tuned for the Backstage devportal AI chat (giantswarm/backstage#1775):
prefer a single-call workflow_ discovered via filter_tools(query=…), the exact
x_kubernetes_* argument contract (management_cluster: -mcp-kubernetes,
resourceType, podName, tailLines), and the cheap-listing recipe.

The agent shares muster's meta-tool interface with the devportal chat, so
the same guidance cuts wrong-arg retries and avoidable tool round trips.
Clusters that enable the agent without overriding systemMessage (e.g.
glean) inherit the improved default.

Co-authored-by: Cursor
---
 CHANGELOG.md                                       |  1 +
 .../agentic-platform-connectivity/values.yaml      | 31 +++++++++++++++++++
 2 files changed, 32 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0b9f7ac..1329b6f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ### Changed
 
+- `agents.definitions.sre-agent`: the default `systemMessage` now carries the muster meta-tool contract that was eval-tuned for the Backstage devportal AI chat ([giantswarm/backstage#1775](https://github.com/giantswarm/backstage/issues/1775)) — prefer a single-call `workflow_` (discovered via `filter_tools(query=…)`), the exact `x_kubernetes_*` argument contract (`management_cluster: -mcp-kubernetes`, `resourceType`, `podName`, `tailLines`), and the cheap-listing recipe (`summary`, `fieldSelector: status.phase!=Running`, `reason=BackOff` for CrashLoopBackOff). The agent shares muster's meta-tool interface with the devportal chat, so the same guidance reduces wrong-arg retries and avoidable tool round trips. Clusters that enable the agent without overriding `systemMessage` (e.g. glean) inherit it.
 - `agentic-platform-connectivity`: the agentgateway data-plane container now sets `ephemeral-storage` requests/limits (new `gateway.parameters.dataPlaneResources` value, merged onto the generated proxy container via `AgentgatewayParameters`). The agentgateway controller injects a writable `/tmp` emptyDir (readOnlyRootFilesystem) without a `sizeLimit`, which tripped the `require-emptydir-requests-and-limits` Kyverno policy; declaring ephemeral-storage on the mounting container clears the audit warning. Part of the namespace-wide Kyverno cleanup (giantswarm/giantswarm#36885).
 - The bundled `valkey` (`muster-valkey`) component now documents that its RDB-only persistence is a deliberate cache-only choice and warns against enabling AOF to "fix" data loss: AOF lives on the same PVC so it does not survive PVC loss/recreation (the failure mode that actually occurs), and flipping `appendonly` in config + restart is a data-loss footgun (Valkey loads the empty AOF dir and ignores the RDB). Comment-only; no behaviour change. See [giantswarm/muster#884](https://github.com/giantswarm/muster/issues/884).
 - `agentic-platform-connectivity`: the kagent controller `NetworkPolicy` (both `cilium` and `kubernetes` flavors) scopes intra-namespace ingress to agent pods (`app: kagent`) and the kagent UI on the controller API port (8083), instead of allowing every pod in the kagent namespace on all ports.
diff --git a/helm/agentic-platform-connectivity/values.yaml b/helm/agentic-platform-connectivity/values.yaml
index d78f290..279669c 100644
--- a/helm/agentic-platform-connectivity/values.yaml
+++ b/helm/agentic-platform-connectivity/values.yaml
@@ -751,11 +751,42 @@ agents:
       # creates from kagent.providers.anthropic.
       modelConfig: default-model-config
       description: "Giant Swarm SRE agent"
+      # The muster-tool guidance below is the same contract that was eval-tuned
+      # for the Backstage devportal AI chat (giantswarm/backstage#1775): prefer a
+      # single-call workflow, the exact x_kubernetes_* argument contract, and the
+      # cheap-listing recipe. It cuts wrong-arg retries and avoidable tool round
+      # trips against the muster meta-tool interface this agent shares.
       systemMessage: |-
         You are a Giant Swarm SRE agent. You investigate cluster, workload, and
         observability issues using the tools exposed through the muster MCP
         gateway. Prefer read-only diagnosis; explain findings concisely and cite
         the evidence (resources, metrics, logs) you base conclusions on.
+
+        Your MCP tools come from an aggregator, muster, via meta-tools:
+        `filter_tools` discovers tools (pass a natural-language `query`; it returns
+        a short, relevance-ranked list), `describe_tool` returns a tool's input
+        schema, and `call_tool` runs a tool by its exact `name`.
+
+        Prefer a workflow. A `workflow_` tool answers a whole question (pod
+        health, cluster health, a failing app) in a single call. Always look for
+        one first with `filter_tools(query="")` and prefer it
+        over driving the raw `x_kubernetes_*` / `x_prometheus_*` tools yourself.
+        Fall back to raw tools only when no workflow fits.
+
+        Kubernetes tool contract. Raw Kubernetes tools are `x_kubernetes_*`
+        (`list`/`get`/`describe`/`logs`), with two easy-to-miss requirements:
+        - `management_cluster` is required, and its value is the full server name
+          `-mcp-kubernetes` — not the bare ``, and not `server`/`cluster`.
+          Prometheus tools take `-mcp-prometheus`.
+        - Argument names are exact: `x_kubernetes_list` selects with `resourceType`
+          (not `kind`); pod logs use `podName` (not `name`) and `tailLines`
+          (not `tail`).
+
+        List cheaply.
+        - `summary: true` for a per-kind overview; `fieldSelector:
+          status.phase!=Running` to surface only the pods needing attention.
+        - A CrashLoopBackOff pod is reported as `Running` by summary/phase — catch
+          it via events with `fieldSelector: reason=BackOff`.
       # Tools to expose from the agent's muster server. Empty = all tools the
       # server advertises (muster's meta-tools: list_tools, filter_tools, ...).
       toolNames: []