Epic: observability, alerts, and recovery runbooks #574

Description

Goal

Make failures visible, diagnosable, and recoverable through metrics, alerts, and documented operational flows.

Background

The observed failure combined unhealthy nodes, stuck pods, stale processes, and CPU being consumed while the service was not functional. These states need first-class metrics and runbooks so operators can detect and recover from them without relying on manual investigation as the primary tool.

Outcomes

  • Metrics expose quorum, Raft role, epoch, duplicate identity, probe state, repair actions, and stuck lifecycle phases.
  • Alerts cover degraded quorum, duplicate identity, stuck terminating pods, NodeNotReady, and failed repairs.
  • Runbooks document drain, decommission, fencing, repair, restore, and post-recovery validation.
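
As a rough illustration of how the metric and alert outcomes above could fit together, the sketch below evaluates one snapshot of cluster state against the five alert conditions listed. Everything in it is an assumption made for illustration: the `ClusterSnapshot` fields, the alert names, the split of degraded quorum into "degraded" and "lost", and the 600-second stuck-terminating threshold are all hypothetical and not taken from this project.

```python
from collections import Counter
from dataclasses import dataclass, field

# Assumed threshold: a pod in Terminating longer than this counts as stuck.
STUCK_TERMINATING_SECONDS = 600.0


@dataclass
class ClusterSnapshot:
    """One scrape of cluster state. All field names are hypothetical."""
    voters_total: int      # configured Raft voters
    voters_healthy: int    # voters currently passing probes
    node_identities: list[str] = field(default_factory=list)   # identity claimed by each process
    terminating_pod_seconds: dict[str, float] = field(default_factory=dict)  # pod -> seconds in Terminating
    nodes_not_ready: list[str] = field(default_factory=list)   # nodes reporting NotReady
    failed_repairs: int = 0                                     # repair actions that ended in failure


def evaluate_alerts(s: ClusterSnapshot) -> list[str]:
    """Return the names of alerts that would fire for this snapshot."""
    alerts: list[str] = []

    # Quorum: a Raft group needs a strict majority of voters to make progress.
    majority = s.voters_total // 2 + 1
    if s.voters_healthy < majority:
        alerts.append("QuorumLost")
    elif s.voters_healthy < s.voters_total:
        alerts.append("QuorumDegraded")

    # Duplicate identity: the same identity claimed by more than one process.
    if any(count > 1 for count in Counter(s.node_identities).values()):
        alerts.append("DuplicateIdentity")

    # Stuck lifecycle phase: pods terminating past the assumed threshold.
    if any(sec > STUCK_TERMINATING_SECONDS for sec in s.terminating_pod_seconds.values()):
        alerts.append("StuckTerminatingPod")

    if s.nodes_not_ready:
        alerts.append("NodeNotReady")

    if s.failed_repairs > 0:
        alerts.append("RepairFailed")

    return alerts


if __name__ == "__main__":
    snapshot = ClusterSnapshot(
        voters_total=3,
        voters_healthy=2,
        node_identities=["node-a", "node-b", "node-b"],
        terminating_pod_seconds={"pod-1": 900.0},
    )
    print(evaluate_alerts(snapshot))
```

In a real deployment these conditions would more likely live in the monitoring system's own rule language than in application code; the point of the sketch is only that each alert should map to a metric that is explicitly exported, so a runbook can reference the same signal the alert fired on.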

Child issues

Labels

  • docs (Documentation: architecture, plugin authoring, CLI ref)
  • epic (Epic-level tracking issue)
  • ops (Observability and operations)
  • p1 (Should have)