Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -13,6 +13,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Changed

- `agents.definitions.sre-agent`: the default `systemMessage` now carries the muster meta-tool contract that was eval-tuned for the Backstage devportal AI chat ([giantswarm/backstage#1775](https://github.com/giantswarm/backstage/issues/1775)). It covers three points: preferring a single-call `workflow_<name>` (discovered via `filter_tools(query=…)`), the exact `x_kubernetes_*` argument contract (`management_cluster: <mc>-mcp-kubernetes`, `resourceType`, `podName`, `tailLines`), and the cheap-listing recipe (`summary`, `fieldSelector: status.phase!=Running`, `reason=BackOff` for CrashLoopBackOff). The agent shares muster's meta-tool interface with the devportal chat, so the same guidance reduces wrong-arg retries and avoidable tool round trips. Clusters that enable the agent without overriding `systemMessage` (e.g. glean) inherit it.
- `agentic-platform-connectivity`: the agentgateway data-plane container now sets `ephemeral-storage` requests/limits (new `gateway.parameters.dataPlaneResources` value, merged onto the generated proxy container via `AgentgatewayParameters`). The agentgateway controller injects a writable `/tmp` emptyDir (readOnlyRootFilesystem) without a `sizeLimit`, which tripped the `require-emptydir-requests-and-limits` Kyverno policy; declaring ephemeral-storage on the mounting container clears the audit warning. Part of the namespace-wide Kyverno cleanup (giantswarm/giantswarm#36885).
- The bundled `valkey` (`muster-valkey`) component now documents that its RDB-only persistence is a deliberate cache-only choice and warns against enabling AOF to "fix" data loss: AOF lives on the same PVC so it does not survive PVC loss/recreation (the failure mode that actually occurs), and flipping `appendonly` in config + restart is a data-loss footgun (Valkey loads the empty AOF dir and ignores the RDB). Comment-only; no behaviour change. See [giantswarm/muster#884](https://github.com/giantswarm/muster/issues/884).
- `agentic-platform-connectivity`: the kagent controller `NetworkPolicy` (both `cilium` and `kubernetes` flavors) scopes intra-namespace ingress to agent pods (`app: kagent`) and the kagent UI on the controller API port (8083), instead of allowing every pod in the kagent namespace on all ports.
31 changes: 31 additions & 0 deletions helm/agentic-platform-connectivity/values.yaml
@@ -751,11 +751,42 @@ agents:
# creates from kagent.providers.anthropic.
modelConfig: default-model-config
description: "Giant Swarm SRE agent"
# The muster-tool guidance below is the same contract that was eval-tuned
# for the Backstage devportal AI chat (giantswarm/backstage#1775): prefer a
# single-call workflow, the exact x_kubernetes_* argument contract, and the
# cheap-listing recipe. It cuts wrong-arg retries and avoidable tool round
# trips against the muster meta-tool interface this agent shares.
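#
# Illustrative sketch only, not part of the tuned prompt: the call sequence
# this guidance steers the agent toward. Assumptions, none confirmed by this
# chart: the management cluster name "mymc", the `arguments` wrapper key on
# `call_tool`, the `x_kubernetes_logs` tool name (inferred from the
# `x_kubernetes_*` list/get/describe/logs pattern), the pod name, and the
# `pods` / `events` values for `resourceType`.
#
#   filter_tools:            # 1. look for a workflow_<name> first
#     query: "pod health"
#   call_tool:               # 2. no workflow fits: list only pods needing attention
#     name: x_kubernetes_list
#     arguments:
#       management_cluster: mymc-mcp-kubernetes   # full server name, not bare "mymc"
#       resourceType: pods                        # not `kind`
#       fieldSelector: status.phase!=Running
#   call_tool:               # 3. CrashLoopBackOff pods show as Running, so check events
#     name: x_kubernetes_list
#     arguments:
#       management_cluster: mymc-mcp-kubernetes
#       resourceType: events
#       fieldSelector: reason=BackOff
#   call_tool:               # 4. tail the logs of a pod found above
#     name: x_kubernetes_logs
#     arguments:
#       management_cluster: mymc-mcp-kubernetes
#       podName: example-pod                      # not `name`
#       tailLines: 100                            # not `tail`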
systemMessage: |-
You are a Giant Swarm SRE agent. You investigate cluster, workload, and
observability issues using the tools exposed through the muster MCP
gateway. Prefer read-only diagnosis; explain findings concisely and cite
the evidence (resources, metrics, logs) you base conclusions on.

Your MCP tools come from an aggregator, muster, via meta-tools:
`filter_tools` discovers tools (pass a natural-language `query`; it returns
a short, relevance-ranked list), `describe_tool` returns a tool's input
schema, and `call_tool` runs a tool by its exact `name`.

Prefer a workflow. A `workflow_<name>` tool answers a whole question (pod
health, cluster health, a failing app) in a single call. Always look for
one first with `filter_tools(query="<the question's topic>")` and prefer it
over driving the raw `x_kubernetes_*` / `x_prometheus_*` tools yourself.
Fall back to raw tools only when no workflow fits.

Kubernetes tool contract. Raw Kubernetes tools are `x_kubernetes_*`
(`list`/`get`/`describe`/`logs`), with two easy-to-miss requirements:
- `management_cluster` is required, and its value is the full server name
`<mc>-mcp-kubernetes` — not the bare `<mc>`, and not `server`/`cluster`.
Prometheus tools take `<mc>-mcp-prometheus`.
- Argument names are exact: `x_kubernetes_list` selects with `resourceType`
(not `kind`); pod logs use `podName` (not `name`) and `tailLines`
(not `tail`).

List cheaply.
- `summary: true` for a per-kind overview; `fieldSelector:
status.phase!=Running` to surface only the pods needing attention.
- A CrashLoopBackOff pod is reported as `Running` by summary/phase — catch
it via events with `fieldSelector: reason=BackOff`.
# Tools to expose from the agent's muster server. Empty = all tools the
# server advertises (muster's meta-tools: list_tools, filter_tools, ...).
toolNames: []