Epic: observability, alerts, and recovery runbooks #574

## Goal
Make failures visible, diagnosable, and recoverable through metrics, alerts, and documented operational flows.

## Background
The observed failure combined unhealthy nodes, stuck pods, stale processes, and CPU consumption with no functional service behind it. These states need first-class metrics and runbooks so that operators can detect and recover from them without manual investigation being the primary tool.

## Outcomes
- Metrics expose quorum, Raft role, epoch, duplicate identity, probe state, repair actions, and stuck lifecycle phases.
- Alerts cover degraded quorum, duplicate identity, stuck terminating pods, NodeNotReady, and failed repairs.
- Runbooks document drain, decommission, fencing, repair, restore, and post-recovery validation.
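
As a rough illustration of how the alert conditions listed above could be derived from per-node samples, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `NodeSample` fields, the alert names, the 600-second stuck threshold, and the definition of "degraded quorum" as fewer ready members than expected. It is not the project's actual metric schema or alerting rules, which are left to #588.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSample:
    """Hypothetical per-node sample; field names are illustrative only."""
    node_id: str
    identity: str                     # cluster identity claimed by the node
    ready: bool                       # False maps to NodeNotReady
    terminating_seconds: float = 0.0  # 0 when the pod is not terminating
    failed_repairs: int = 0           # repair attempts that did not succeed


def evaluate_alerts(samples, expected_members, stuck_after_seconds=600.0):
    """Return (alert_name, subject) tuples for the given samples.

    Assumption: quorum counts as degraded whenever fewer members are
    ready than expected_members.
    """
    alerts = []

    ready_count = sum(1 for s in samples if s.ready)
    if ready_count < expected_members:
        alerts.append(("DegradedQuorum", f"{ready_count}/{expected_members} ready"))

    # Two or more nodes claiming the same identity is a duplicate.
    identity_counts = Counter(s.identity for s in samples)
    for identity, count in sorted(identity_counts.items()):
        if count > 1:
            alerts.append(("DuplicateIdentity", identity))

    for s in samples:
        if not s.ready:
            alerts.append(("NodeNotReady", s.node_id))
        if s.terminating_seconds > stuck_after_seconds:
            alerts.append(("StuckTerminating", s.node_id))
        if s.failed_repairs > 0:
            alerts.append(("FailedRepair", s.node_id))

    return alerts
```

Keeping the evaluation as a pure function over samples makes each condition easy to unit-test before it is translated into real alerting rules.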

## Child issues
- #588 Add resilience metrics, dashboards, and alerts
- #589 Write recovery runbooks for drain, fencing, repair, and restore

