Part of #459.
Problem
There is no operator-facing documentation for wiring a Prometheus scraper to
the platform, so each install reinvents its own annotations, rules, and alerts.
The deliverable is plain Kubernetes manifests only, with no Helm.
Acceptance criteria
New directory deployments/observability/ containing:
- pod-annotations.yaml - example Deployment patch showing
prometheus.io/scrape: "true", prometheus.io/port: "9090",
prometheus.io/path: "/metrics".
- recording-rules.yaml - ConfigMap with the starter recording rules.
- alert-rules.yaml - ConfigMap with the starter alert rules.
- README.md - how to apply, where to mount, how to confirm Prometheus is
actually scraping (check up{job=...} query).
Plus:
- Plain Kubernetes resources only. NO Helm chart. NO HelmRelease. Kustomize is
optional; the raw files must work standalone with kubectl apply -f.
- Recording rule naming follows Prometheus best practice
level:metric_name:operations style.
- All rules pass promtool check rules.
- Underlying metrics documented in docs/observability.md.
- Per CLAUDE.md doc-update rule: docs/llms.txt and docs/llms-full.txt also
updated.
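
As a rough illustration of the pod-annotations.yaml item above, the patch could look something like the sketch below. The three annotation values come from this issue; the Deployment name is an assumption (borrowed from the job label used in the alert rules) and would need to match the real workload.

```yaml
# pod-annotations.yaml (sketch): partial Deployment manifest intended as a patch.
# Only the annotations are taken from the issue; the name is an assumption.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-data-platform  # assumed Deployment name, replace with the real one
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
```

Because this is a partial manifest, it would most likely be applied as a patch (for example with kubectl patch deployment ... --patch-file pod-annotations.yaml, or as a Kustomize patch) rather than with kubectl apply -f on its own. The prometheus.io/* annotations are a convention, so they only take effect when the Prometheus scrape config contains relabeling rules that honor them.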
Starter recording rules
- mcp:tool_call_duration:p95_5m
- mcp:tool_call_error_rate:5m grouped by tool
- apigateway:inbound_duration:p50_5m
- apigateway:inbound_duration:p95_5m
- apigateway:inbound_duration:p99_5m
- apigateway:inbound_error_rate:5m grouped by connection
- apigateway:outbound_duration:p95_5m grouped by connection
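
A minimal sketch of how two of the rules listed above might be packaged in recording-rules.yaml. The recorded names come from this issue. Everything else is a placeholder: the ConfigMap name, the group name, the underlying metric names (apigateway_inbound_duration_seconds_bucket, apigateway_inbound_requests_total), the status label, and the choice to define "error" as a 5xx response are all assumptions until the dependent metrics sub-issues land.

```yaml
# recording-rules.yaml (sketch): ConfigMap wrapping a Prometheus rule file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-recording-rules  # hypothetical name
data:
  recording-rules.yaml: |
    groups:
      - name: apigateway_recording_rules  # hypothetical group name
        rules:
          # p95 inbound latency over 5m, computed from an assumed histogram metric.
          - record: apigateway:inbound_duration:p95_5m
            expr: |
              histogram_quantile(
                0.95,
                sum by (le) (rate(apigateway_inbound_duration_seconds_bucket[5m]))
              )
          # Share of inbound requests per connection that returned 5xx over 5m.
          # The counter name and the status label are assumptions.
          - record: apigateway:inbound_error_rate:5m
            expr: |
              sum by (connection) (rate(apigateway_inbound_requests_total{status=~"5.."}[5m]))
              /
              sum by (connection) (rate(apigateway_inbound_requests_total[5m]))
```

Note that promtool check rules validates a rule file rather than a ConfigMap wrapper, so the embedded document under data would need to be extracted (or kept as a separate source file) before running the check.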
Starter alert rules
- High 5xx rate per connection sustained over 5m.
- Inbound p95 latency regression vs trailing 1h baseline.
- Auth failure spike of any kind over 5m.
- DB pool wait_count growing (saturation indicator).
- up{job="mcp-data-platform"} == 0 (pod down).
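
To make the list above more concrete, here is a hedged sketch of three of the alerts in alert-rules.yaml. The up expression and the recorded series names are taken from this issue. The ConfigMap name, group name, alert names, for durations, severity labels, and thresholds (5% error rate, 1.5x the trailing-hour average) are illustrative assumptions, not agreed values.

```yaml
# alert-rules.yaml (sketch): ConfigMap wrapping a Prometheus alerting rule file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-alert-rules  # hypothetical name
data:
  alert-rules.yaml: |
    groups:
      - name: mcp_data_platform_alerts  # hypothetical group name
        rules:
          # Pod down: the scrape target has stopped responding.
          - alert: McpDataPlatformDown
            expr: up{job="mcp-data-platform"} == 0
            for: 5m  # assumed hold time
            labels:
              severity: critical  # assumed severity
            annotations:
              summary: "Target {{ $labels.instance }} is down"
          # High 5xx rate per connection, sustained over 5m.
          - alert: ApiGatewayHighErrorRate
            expr: apigateway:inbound_error_rate:5m > 0.05  # assumed 5% threshold
            for: 5m
            labels:
              severity: warning
          # Inbound p95 regression versus the trailing 1h average of the same series.
          - alert: ApiGatewayInboundLatencyRegression
            expr: |
              apigateway:inbound_duration:p95_5m
                > 1.5 * avg_over_time(apigateway:inbound_duration:p95_5m[1h])
            for: 5m
            labels:
              severity: warning
```

The latency alert compares the recorded p95 against its own one-hour average, which is one simple way to express a "trailing baseline". Other formulations, such as comparing against an offset value, would also satisfy the requirement.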
Depends on
- Inbound API gateway metrics and toolkit metrics sub-issues, so the rules can
reference real metric names.
Files
- deployments/observability/*
- docs/observability.md
- docs/llms.txt
- docs/llms-full.txt