fix(observability): rebuild etcd-availability as fleet summary by swiencki · Pull Request #5241 · Azure/ARO-HCP

swiencki · 2026-05-12T19:14:06Z

What

Rebuild etcd-availability as a fleet summary and move it to observability/grafana-dashboards/sre/user-journey/.

Why

Each stat panel rendered as N unlabeled tiles in prod (one per regional HCP Prom workspace) because Mixed datasource + multi-select All fanned queries out with no reduction step.

Stat panels: [merge, reduce reduceFields <sum|max>] collapses N regional frames into one tile.
Tables: merge unions per-region rows; sort fields fixed.
Timeseries: {{cluster}} legend; p99 panels aggregate by (le, cluster).
Stale per-region thresholds and legendFormats recalibrated.
Worst Regional P99 documented honestly (not a true fleet p99; PromQL cannot federate histograms across independent workspaces).
observability.yaml retargets the User Journey folder to the new path.
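The merge-then-reduce chain described above can be modelled outside Grafana. The following is a minimal sketch, not the dashboard's actual transform code: the frame shape, the `Value` field name, and the numbers are all made up for illustration.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical per-region results: one single-value frame per regional workspace.
regional_frames: List[Dict[str, float]] = [
    {"Value": 12.0},  # region A (made-up number)
    {"Value": 7.0},   # region B (made-up number)
    {"Value": 21.0},  # region C (made-up number)
]


def merge(frames: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Union the rows of all frames into one table (a list of rows)."""
    return [dict(frame) for frame in frames]


def reduce_fields(
    rows: List[Dict[str, float]],
    reducer: Callable[[Iterable[float]], float],
) -> Dict[str, float]:
    """Collapse every column of the merged table to a single value."""
    columns: Dict[str, List[float]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: reducer(values) for name, values in columns.items()}


merged = merge(regional_frames)
fleet_total = reduce_fields(merged, sum)  # sum-style tile, e.g. a cluster count
fleet_worst = reduce_fields(merged, max)  # max-style tile, e.g. a worst-case latency
print(fleet_total, fleet_worst)
```

Without the reduce step, the three frames would each render as their own tile, which is the behaviour this PR removes.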

Testing

make -C observability verify passes. JSON valid at schemaVersion 41. Confirmed in dev Grafana that each stat tile renders as a single number; warning triangles in dev are the documented stale-datasource hazard, not a dashboard issue.
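A rough local equivalent of the "JSON valid at schemaVersion 41" check might look like the sketch below. It is not what `make -C observability verify` runs (the source does not describe that target); the function name and the sample document are hypothetical.

```python
import json


def check_dashboard(raw: str, expected_schema: int = 41) -> dict:
    """Parse a dashboard document and confirm its schemaVersion."""
    dashboard = json.loads(raw)  # raises json.JSONDecodeError on invalid JSON
    found = dashboard.get("schemaVersion")
    if found != expected_schema:
        raise ValueError(f"schemaVersion is {found!r}, expected {expected_schema}")
    return dashboard


# Hypothetical minimal dashboard document, not the real file contents.
sample = '{"title": "etcd-availability", "schemaVersion": 41, "panels": []}'
checked = check_dashboard(sample)
```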

Special notes for your reviewer

The Worst Regional P99 tiles use Reduce(max) across per-region p99s. This is the only mathematically defensible single-number representation without federation infra (Thanos/Mimir). Called out in the Dashboard Info panel.
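To illustrate why the max of per-region p99s differs from a fleet p99, here is a small sketch on synthetic latency samples. The region names and values are invented, and the nearest-rank percentile is an assumption made for the example; Prometheus estimates quantiles from histogram buckets instead.

```python
from typing import Dict, List


def p99(samples: List[float]) -> float:
    """Nearest-rank p99: smallest sample with at least 99% of samples at or below it."""
    ordered = sorted(samples)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


# Synthetic latencies in seconds (made-up data).
regions: Dict[str, List[float]] = {
    "region-a": [0.002] * 980 + [0.050] * 20,  # large region, mild tail
    "region-b": [0.004] * 95 + [0.400] * 5,    # small region, slow tail
}

regional_p99 = {name: p99(samples) for name, samples in regions.items()}
worst_regional_p99 = max(regional_p99.values())

# A true fleet p99 needs every raw sample (or mergeable buckets) in one place.
fleet_p99 = p99([x for samples in regions.values() for x in samples])

print(regional_p99, worst_regional_p99, fleet_p99)
```

In this example the small region dominates the max even though it contributes few samples, so the tile reads higher than the pooled p99. Under the nearest-rank definition the max of regional p99s is never below the pooled p99, which makes it a conservative upper bound rather than an estimate of the fleet value.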

… sre/

The merged etcd-availability dashboard rendered every stat panel as N unlabeled tiles in prod (one per regional HCP Prometheus workspace), because Mixed datasource + multi-select All fanned queries out without any reduction step.

Rebuild as a true fleet summary:

- Multi-select datasource var with includeAll, fans queries to every regional HCP Prom workspace matching ^Managed_Prometheus_hcps-.*$.
- Stat panels: [merge, reduce reduceFields <sum|max>] transform chain collapses N regional frames into one tile.
  - Sums: Fleet Clusters Monitored, Fleet Total Stable Clusters (24h), Fleet Total Leader Changes (24h).
  - Max: Worst Regional P99 WAL Fsync / Backend Commit / Peer RTT, Fleet Worst Cluster DB Size.
- Tables: merge transform unions per-region rows; sort fields fixed.
- Per-cluster timeseries disambiguate via {{cluster}} legend.
- Dashboard Info documents that Worst Regional P99 is NOT a true fleet p99 (PromQL cannot federate histograms across independent workspaces; a true fleet p99 requires Thanos / Mimir / federated workspace work).
- Stale per-region thresholds and legend strings recalibrated for new fleet semantics.
- Move under observability/grafana-dashboards/sre/user-journey/ and retarget the User Journey folder registration accordingly.
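As a quick illustration of which datasources the variable's regex would select, the sketch below applies the pattern from the commit message to a list of datasource names. The names are hypothetical; only the pattern comes from the source.

```python
import re

# Pattern quoted in the commit message.
HCP_WORKSPACE = re.compile(r"^Managed_Prometheus_hcps-.*$")

# Hypothetical datasource names, invented for this example.
datasources = [
    "Managed_Prometheus_hcps-region-a",
    "Managed_Prometheus_hcps-region-b",
    "Managed_Prometheus_services-region-a",
    "grafana-azure-monitor",
]

# With includeAll, each matching datasource receives its own copy of the query.
selected = [name for name in datasources if HCP_WORKSPACE.match(name)]
print(selected)
```

Each selected datasource returns its own frame, which is why the stat panels need the merge and reduce transforms to show one number.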

openshift-ci · 2026-05-12T19:14:16Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swiencki
Once this PR has been reviewed and has the lgtm label, please assign roivaz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

observability/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR refactors the etcd-availability Grafana dashboard into a fleet summary that correctly aggregates across multiple regional Managed Prometheus workspaces (via Mixed datasource + transforms), and relocates it under the SRE “User Journey” dashboards folder.

Changes:

Move/retarget the “User Journey” dashboard folder to observability/grafana-dashboards/sre/user-journey.
Rebuild etcd-availability.json to reduce multi-region queries into single-value fleet tiles (sum/max) and clarify “worst regional p99” semantics.
Update table/timeseries panels to merge multi-region data and improve legends/aggregation grouping.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File	Description
observability/observability.yaml	Retargets the “User Journey” dashboard folder to the new SRE path so Grafana publishes dashboards from the new location.
observability/grafana-dashboards/sre/user-journey/etcd-availability.json	Reworked dashboard to provide fleet-summary aggregations across regional Prometheus workspaces and updated panel queries/transforms accordingly.

Comments suppressed due to low confidence (3)

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:59

This stat panel now represents a fleet-wide count of stable clusters, but the panel is still configured with decimals: 1, which will render values like 123.0. Set decimals to 0 so the value displays as an integer count.

This issue also appears in the following locations of the same file:

line 146
line 958

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:150

This stat panel shows a total number of leader changes (an integer), but decimals is still set to 1. Set decimals to 0 so the UI doesn’t show fractional leader-change counts.

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:962

This table is displaying leader-change counts over the last hour (from changes(...)), but the field config uses decimals: 2, which will show fractional counts (e.g., 4.00). Set decimals to 0 for an integer display.

Set decimals=0 on stat panels showing integer counts (Fleet Total Stable Clusters, Fleet Total Leader Changes, Fleet Clusters Monitored) and on the Leader Instability hourly table, so values render as whole numbers instead of e.g. 4.0 / 4.00.

swiencki · 2026-05-12T19:35:48Z

Addressed in 9092b54: set decimals=0 on the integer-count panels (Fleet Total Stable Clusters, Fleet Total Leader Changes, Fleet Clusters Monitored, Leader Instability hourly table). Counts now render as whole numbers.
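A change of this kind could be scripted rather than hand-edited. The sketch below is an assumption about how one might do it, not the method used in 9092b54: it sets `fieldConfig.defaults.decimals` to 0 on top-level panels selected by title, and the helper name and sample dashboard are hypothetical.

```python
from typing import Any, Dict, List, Set

# Panel titles taken from the PR text; the exact titles in the JSON may differ.
INTEGER_PANELS: Set[str] = {
    "Fleet Clusters Monitored",
    "Fleet Total Stable Clusters (24h)",
    "Fleet Total Leader Changes (24h)",
}


def set_integer_decimals(dashboard: Dict[str, Any], titles: Set[str]) -> List[str]:
    """Set decimals=0 on matching top-level panels and return the titles changed."""
    changed: List[str] = []
    for panel in dashboard.get("panels", []):
        if panel.get("title") in titles:
            defaults = panel.setdefault("fieldConfig", {}).setdefault("defaults", {})
            defaults["decimals"] = 0
            changed.append(panel["title"])
    return changed


# Hypothetical two-panel dashboard used only to exercise the helper.
dashboard = {
    "panels": [
        {"title": "Fleet Total Leader Changes (24h)",
         "fieldConfig": {"defaults": {"decimals": 1}}},
        {"title": "Worst Regional P99 WAL Fsync",
         "fieldConfig": {"defaults": {"decimals": 1}}},
    ]
}
changed = set_integer_decimals(dashboard, INTEGER_PANELS)
print(changed)
```

The sketch leaves latency panels untouched, since fractional values are meaningful there, and it does not descend into panels nested inside collapsed rows.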

Copilot AI review requested due to automatic review settings May 12, 2026 19:14

openshift-ci Bot requested review from avollmer-redhat and hbhushan3 May 12, 2026 19:14

Copilot started reviewing on behalf of swiencki May 12, 2026 19:15 View session

Copilot AI reviewed May 12, 2026

swiencki wants to merge 2 commits into main from swiencki/etcd-fleet-summary
