Add resilience metrics, dashboards, and alerts #588

Description

Parent epic: #574

Problem

Operators need direct signals for the states that caused the recent incident.

Acceptance

  • Metrics cover quorum, leader, Raft term, applied index, identity epoch, duplicate identity, probe state, repair action count, and shutdown phase.
  • Alerts cover quorum loss/degradation, duplicate identity, stuck terminating pod, NodeNotReady, failed repair, and long non-ready duration.
  • Dashboard shows cluster health, node distribution, repair state, and traffic-serving status.
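
As a rough illustration of how a few of the alert conditions above could be evaluated, here is a minimal Python sketch. Everything in it is an assumption made for illustration and not part of this issue: the `NodeSnapshot` fields, the function names, the alert strings, the 300-second threshold, and the simplification that every node is a Raft voter. It covers only quorum loss/degradation, duplicate identity, and long non-ready duration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class NodeSnapshot:
    """Hypothetical per-node sample; field names are illustrative only."""
    name: str
    identity: str                   # identity claimed by this node
    ready: bool                     # whether the node currently reports ready
    not_ready_seconds: float = 0.0  # how long the node has been non-ready


def quorum_state(voters: int, healthy: int) -> str:
    """Classify quorum, assuming a simple majority of voters is required."""
    majority = voters // 2 + 1
    if healthy < majority:
        return "lost"
    if healthy < voters:
        return "degraded"
    return "ok"


def duplicate_identities(nodes: List[NodeSnapshot]) -> List[str]:
    """Return identities claimed by more than one node, sorted."""
    counts = Counter(n.identity for n in nodes)
    return sorted(ident for ident, c in counts.items() if c > 1)


def evaluate_alerts(nodes: List[NodeSnapshot],
                    max_not_ready_seconds: float = 300.0) -> List[str]:
    """Build a list of alert names from one snapshot of the cluster.

    Assumes every node in the snapshot is a voter (a simplification).
    """
    alerts: List[str] = []
    healthy = sum(1 for n in nodes if n.ready)
    state = quorum_state(len(nodes), healthy)
    if state != "ok":
        alerts.append(f"quorum_{state}")
    for ident in duplicate_identities(nodes):
        alerts.append(f"duplicate_identity:{ident}")
    for n in nodes:
        if not n.ready and n.not_ready_seconds > max_not_ready_seconds:
            alerts.append(f"long_not_ready:{n.name}")
    return alerts
```

In a real deployment these conditions would more likely be expressed as alerting rules over exported metrics; the sketch only pins down the logic each alert would need to encode.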

Metadata

Assignees

No one assigned

Labels

ops (Observability and operations), p1 (Should have)