docs(deploy): example Kubernetes manifests for Prometheus scrape, recording rules, alert rules #463

@cjimti

Description

Part of #459.

Problem

There is no operator-facing documentation for wiring a Prometheus scraper to
the platform, so each install reinvents annotations, rules, and alerts. The
examples must be plain Kubernetes manifests only, with no Helm.

Acceptance criteria

New directory deployments/observability/ containing:

  • pod-annotations.yaml - example Deployment patch showing
    prometheus.io/scrape: "true", prometheus.io/port: "9090",
    prometheus.io/path: "/metrics".
  • recording-rules.yaml - ConfigMap with the starter recording rules.
  • alert-rules.yaml - ConfigMap with the starter alert rules.
  • README.md - how to apply the manifests, where to mount the rule ConfigMaps,
    and how to confirm Prometheus is actually scraping (query up{job=...}).
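
A minimal sketch of what pod-annotations.yaml could look like, written as a
strategic-merge patch. The Deployment name is an assumption borrowed from the
job label used in the alert list; only the three annotations come from this
issue. The prometheus.io/* annotations are a convention, so they only take
effect if the Prometheus scrape config contains relabeling rules that honor
them.

```yaml
# pod-annotations.yaml (sketch, not the final manifest)
# Assumption: the Deployment is named mcp-data-platform.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-data-platform
spec:
  template:
    metadata:
      # Annotations belong on the pod template, not on the Deployment itself,
      # because annotation-based discovery reads pod metadata.
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
```

Because this is a partial Deployment, it would likely be applied with
kubectl patch --patch-file rather than kubectl apply -f; the README should
state which form the shipped file takes.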

Plus:

  • Plain Kubernetes resources only. NO Helm chart. NO HelmRelease. Kustomize is
    optional; the raw files must work standalone with kubectl apply -f.
  • Recording rule naming follows the Prometheus best-practice
    level:metric_name:operations convention.
  • All rules pass promtool check rules.
  • Underlying metrics documented in docs/observability.md.
  • Per the CLAUDE.md doc-update rule, docs/llms.txt and docs/llms-full.txt are also updated.

Starter recording rules

  • mcp:tool_call_duration:p95_5m
  • mcp:tool_call_error_rate:5m grouped by tool
  • apigateway:inbound_duration:p50_5m
  • apigateway:inbound_duration:p95_5m
  • apigateway:inbound_duration:p99_5m
  • apigateway:inbound_error_rate:5m grouped by connection
  • apigateway:outbound_duration:p95_5m grouped by connection
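
As an illustration, the first two rules in this list might be packaged like
the sketch below. The ConfigMap name, group name, evaluation interval, and
the source metric names (mcp_tool_call_duration_seconds_bucket,
mcp_tool_calls_total, and its status label) are all hypothetical placeholders;
the real names depend on the metrics sub-issues listed under "Depends on".

```yaml
# recording-rules.yaml (sketch; metric and label names are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-recording-rules   # hypothetical name
data:
  recording-rules.yaml: |
    groups:
      - name: mcp-data-platform.recording   # hypothetical group name
        interval: 30s                        # assumed evaluation interval
        rules:
          # p95 tool-call latency over a 5m window, from histogram buckets.
          - record: mcp:tool_call_duration:p95_5m
            expr: |
              histogram_quantile(
                0.95,
                sum by (le) (rate(mcp_tool_call_duration_seconds_bucket[5m]))
              )
          # Fraction of tool calls that failed over 5m, grouped by tool.
          - record: mcp:tool_call_error_rate:5m
            expr: |
              sum by (tool) (rate(mcp_tool_calls_total{status="error"}[5m]))
              /
              sum by (tool) (rate(mcp_tool_calls_total[5m]))
```

Note that promtool check rules validates the inner rule file (the value under
data), not the ConfigMap wrapper, so the check needs to run against the
extracted rule content.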

Starter alert rules

  • High 5xx rate per connection sustained over 5m.
  • Inbound p95 latency regression vs trailing 1h baseline.
  • Auth failure spike of any kind over 5m.
  • DB pool wait_count growing (saturation indicator).
  • up{job="mcp-data-platform"} == 0 (pod down).
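
A sketch of how two of these alerts could be expressed, reusing the recording
rule names from the list above. The ConfigMap name, alert names, for
durations, severity labels, and the 1.5x regression threshold are assumptions
chosen for illustration, not values specified by this issue.

```yaml
# alert-rules.yaml (sketch; thresholds and names are assumptions)
apiVersion: v1
kind: ConfigMap
metadata:
  name: mcp-alert-rules   # hypothetical name
data:
  alert-rules.yaml: |
    groups:
      - name: mcp-data-platform.alerts   # hypothetical group name
        rules:
          # Pod down: the scrape target has been unreachable for 2 minutes.
          - alert: McpDataPlatformDown
            expr: up{job="mcp-data-platform"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "mcp-data-platform target {{ $labels.instance }} is down"
          # Latency regression: current p95 exceeds 1.5x its trailing 1h average.
          - alert: ApiGatewayInboundP95Regression
            expr: |
              apigateway:inbound_duration:p95_5m
                > 1.5 * avg_over_time(apigateway:inbound_duration:p95_5m[1h])
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Inbound p95 latency is above its trailing 1h baseline"
```

Comparing against avg_over_time of the recorded series keeps the baseline
cheap to evaluate, since the quantile is computed once by the recording rule
rather than again inside the alert expression.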

Depends on

  • Inbound API gateway metrics and toolkit metrics sub-issues, so the rules can
    reference real metric names.

Files

  • deployments/observability/*
  • docs/observability.md
  • docs/llms.txt
  • docs/llms-full.txt

Labels

  • documentation (improvements or additions to documentation)