Parent epic: #574
Problem
Operators need direct signals for the states that caused the recent incident.
Acceptance
- Metrics cover quorum, leader, Raft term, applied index, identity epoch, duplicate identity, probe state, repair action count, and shutdown phase.
- Alerts cover quorum loss/degradation, duplicate identity, stuck terminating pod, NodeNotReady, failed repair, and long non-ready duration.
- Dashboard shows cluster health, node distribution, repair state, and traffic-serving status.
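
To make the quorum and duplicate-identity alert conditions concrete, here is a minimal sketch of how they could be evaluated from a cluster snapshot. It is only an illustration, not the planned implementation. The function names, the `healthy` / `degraded` / `lost` labels, and the pod-to-identity mapping are all hypothetical. It also assumes "degraded" means a majority is still present but at least one voter is unhealthy.

```python
from collections import defaultdict
from typing import Dict, List


def classify_quorum(voters: int, healthy: int) -> str:
    """Classify quorum state for a majority-based (Raft-style) cluster.

    Assumption: 'degraded' means a majority is still healthy but at least
    one voter is not; 'lost' means fewer than a majority are healthy.
    """
    if voters <= 0:
        raise ValueError("voters must be positive")
    majority = voters // 2 + 1
    if healthy < majority:
        return "lost"
    if healthy < voters:
        return "degraded"
    return "healthy"


def duplicate_identities(claims: Dict[str, str]) -> Dict[str, List[str]]:
    """Return each identity claimed by more than one pod.

    `claims` is a hypothetical mapping of pod name -> claimed node identity.
    The result maps every duplicated identity to the sorted list of pods
    that claim it; an empty dict means no duplicates.
    """
    pods_by_identity = defaultdict(list)
    for pod, identity in claims.items():
        pods_by_identity[identity].append(pod)
    return {
        identity: sorted(pods)
        for identity, pods in pods_by_identity.items()
        if len(pods) > 1
    }
```

For example, a five-voter cluster with three healthy voters would classify as `degraded` and page at a lower severity than `lost`. A non-empty result from `duplicate_identities` would back the duplicate-identity alert.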