
fix(observability): rebuild etcd-availability as fleet summary #5241

Open
swiencki wants to merge 2 commits into main from swiencki/etcd-fleet-summary

@swiencki
Collaborator

What

Rebuild etcd-availability as a fleet summary and move it to observability/grafana-dashboards/sre/user-journey/.

Why

Each stat panel rendered as N unlabeled tiles in prod (one per regional HCP Prom workspace) because the Mixed datasource combined with a multi-select All variable fanned every query out to each workspace with no reduction step.

  • Stat panels: [merge, reduce reduceFields <sum|max>] collapses N regional frames into one tile.
  • Tables: merge unions per-region rows; sort fields fixed.
  • Timeseries: {{cluster}} legend; p99 panels aggregate by (le, cluster).
  • Stale per-region thresholds and legendFormats recalibrated.
  • Worst Regional P99 documented honestly (not a true fleet p99; PromQL cannot federate histograms across independent workspaces).
  • observability.yaml retargets the User Journey folder to the new path.
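
To illustrate the stat-panel change, here is a minimal Python sketch of what a merge step followed by a reduce step does to N regional results. It is a model of the idea only, not Grafana's implementation; the frame shape, the function names, and the sample numbers are all assumptions made up for this example.

```python
from typing import Callable, Dict, List

# Hypothetical shape: one single-value frame per regional workspace.
Frame = Dict[str, float]


def merge(frames: List[Frame]) -> List[float]:
    """Union the values of every regional frame into one flat list."""
    return [value for frame in frames for value in frame.values()]


def reduce_fields(values: List[float], reducer: Callable[[List[float]], float]) -> float:
    """Collapse the merged values into a single number (sum or max)."""
    return reducer(values)


# Made-up per-region cluster counts: without a reduction this is three tiles.
regional_cluster_counts = [{"Value": 12.0}, {"Value": 7.0}, {"Value": 30.0}]
fleet_clusters = reduce_fields(merge(regional_cluster_counts), sum)

# Made-up per-region p99 latencies in seconds: the tile shows the worst one.
regional_p99 = [{"Value": 0.004}, {"Value": 0.019}, {"Value": 0.008}]
worst_regional_p99 = reduce_fields(merge(regional_p99), max)

print(fleet_clusters, worst_regional_p99)
```

Sum fits additive counts such as clusters monitored or leader changes, while max fits "worst case" tiles such as latency and DB size.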

Testing

make -C observability verify passes. JSON valid at schemaVersion 41. Confirmed in dev Grafana that each stat tile renders as a single number; warning triangles in dev are the documented stale-datasource hazard, not a dashboard issue.

Special notes for your reviewer

The Worst Regional P99 tiles use Reduce(max) across per-region p99s. This is the only mathematically defensible single-number representation without federation infra (Thanos/Mimir). Called out in the Dashboard Info panel.
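
For reviewers who want to see why the max of per-region p99s is not a fleet p99, the sketch below compares the two on made-up histogram data. The `histogram_quantile` function here is a simplified re-implementation of bucket interpolation written for this example (it assumes the lowest bucket starts at 0), and the bucket counts are invented. Summing buckets across regions is exactly the step that is unavailable without federation.

```python
from typing import List, Tuple

# (le upper bound in seconds, cumulative count), sorted by le.
Buckets = List[Tuple[float, float]]


def histogram_quantile(q: float, buckets: Buckets) -> float:
    """Estimate a quantile by linear interpolation inside the matching bucket."""
    total = buckets[-1][1]
    if total == 0:
        raise ValueError("empty histogram")
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]


def sum_buckets(a: Buckets, b: Buckets) -> Buckets:
    """Add two histograms that share the same bucket boundaries."""
    return [(le, ca + cb) for (le, ca), (_, cb) in zip(a, b)]


# Invented data: a busy, fast region and a small, slow region.
region_a: Buckets = [(0.01, 9900.0), (0.1, 10000.0), (1.0, 10000.0)]
region_b: Buckets = [(0.01, 0.0), (0.1, 50.0), (1.0, 100.0)]

worst_regional = max(histogram_quantile(0.99, region_a),
                     histogram_quantile(0.99, region_b))
true_fleet = histogram_quantile(0.99, sum_buckets(region_a, region_b))

print(f"worst regional p99: {worst_regional:.4f}s, fleet p99: {true_fleet:.4f}s")
```

With this data the small slow region dominates the Reduce(max) tile even though it contributes about 1% of all observations, which is why the tile is labeled "Worst Regional P99" rather than "Fleet P99".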

… sre/

The merged etcd-availability dashboard rendered every stat panel as N
unlabeled tiles in prod (one per regional HCP Prometheus workspace),
because Mixed datasource + multi-select All fanned queries out without
any reduction step.

Rebuild as a true fleet summary:

- Multi-select datasource variable with includeAll fans queries out to
  every regional HCP Prom workspace matching
  ^Managed_Prometheus_hcps-.*$.
- Stat panels: [merge, reduce reduceFields <sum|max>] transform chain
  collapses N regional frames into one tile.
  - Sums: Fleet Clusters Monitored, Fleet Total Stable Clusters (24h),
    Fleet Total Leader Changes (24h).
  - Max: Worst Regional P99 WAL Fsync / Backend Commit / Peer RTT,
    Fleet Worst Cluster DB Size.
- Tables: merge transform unions per-region rows; sort fields fixed.
- Per-cluster timeseries disambiguate via {{cluster}} legend.
- Dashboard Info documents that Worst Regional P99 is NOT a true fleet
  p99 (PromQL cannot federate histograms across independent workspaces;
  a true fleet p99 requires Thanos / Mimir / federated workspace work).
- Stale per-region thresholds and legend strings recalibrated for new
  fleet semantics.
- Move under observability/grafana-dashboards/sre/user-journey/ and
  retarget the User Journey folder registration accordingly.
Copilot AI review requested due to automatic review settings May 12, 2026 19:14
@openshift-ci

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: swiencki
Once this PR has been reviewed and has the lgtm label, please assign roivaz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR refactors the etcd-availability Grafana dashboard into a fleet summary that correctly aggregates across multiple regional Managed Prometheus workspaces (via Mixed datasource + transforms), and relocates it under the SRE “User Journey” dashboards folder.

Changes:

  • Move/retarget the “User Journey” dashboard folder to observability/grafana-dashboards/sre/user-journey.
  • Rebuild etcd-availability.json to reduce multi-region queries into single-value fleet tiles (sum/max) and clarify “worst regional p99” semantics.
  • Update table/timeseries panels to merge multi-region data and improve legends/aggregation grouping.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| observability/observability.yaml | Retargets the “User Journey” dashboard folder to the new SRE path so Grafana publishes dashboards from the new location. |
| observability/grafana-dashboards/sre/user-journey/etcd-availability.json | Reworked dashboard to provide fleet-summary aggregations across regional Prometheus workspaces and updated panel queries/transforms accordingly. |

Comments suppressed due to low confidence (3)

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:59

  • This stat panel now represents a fleet-wide count of stable clusters, but the panel is still configured with decimals: 1, which will render values like 123.0. Set decimals to 0 so the value displays as an integer count.

This issue also appears in the following locations of the same file:

  • line 146
  • line 958

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:150

  • This stat panel shows a total number of leader changes (an integer), but decimals is still set to 1. Set decimals to 0 so the UI doesn’t show fractional leader-change counts.

observability/grafana-dashboards/sre/user-journey/etcd-availability.json:962

  • This table is displaying leader-change counts over the last hour (from changes(...)), but the field config uses decimals: 2, which will show fractional counts (e.g., 4.00). Set decimals to 0 for an integer display.

Set decimals=0 on stat panels showing integer counts (Fleet Total
Stable Clusters, Fleet Total Leader Changes, Fleet Clusters Monitored)
and on the Leader Instability hourly table, so values render as whole
numbers instead of e.g. 4.0 / 4.00.
@swiencki
Collaborator Author

Addressed in 9092b54: set decimals=0 on the integer-count panels (Fleet Total Stable Clusters, Fleet Total Leader Changes, Fleet Clusters Monitored, Leader Instability hourly table). Counts now render as whole numbers.
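
As an illustration of the kind of change described here (not the actual diff in 9092b54), the sketch below sets `fieldConfig.defaults.decimals` to 0 on panels selected by title in a dashboard loaded as a Python dict. The helper name and the sample dashboard are hypothetical, and the helper only walks top-level panels, so panels nested inside rows would need extra handling.

```python
from typing import Any, Dict, Set


def zero_decimals(dashboard: Dict[str, Any], titles: Set[str]) -> int:
    """Set decimals=0 on matching top-level panels; return how many changed."""
    changed = 0
    for panel in dashboard.get("panels", []):
        if panel.get("title") in titles:
            defaults = panel.setdefault("fieldConfig", {}).setdefault("defaults", {})
            if defaults.get("decimals") != 0:
                defaults["decimals"] = 0
                changed += 1
    return changed


# Hypothetical, heavily trimmed dashboard used only to exercise the helper.
sample_dashboard = {
    "panels": [
        {"title": "Fleet Total Leader Changes (24h)",
         "fieldConfig": {"defaults": {"decimals": 1}}},
        {"title": "Worst Regional P99 WAL Fsync",
         "fieldConfig": {"defaults": {"decimals": 3}}},
    ]
}

changed_count = zero_decimals(sample_dashboard, {"Fleet Total Leader Changes (24h)"})
print(changed_count)
```

Latency panels are deliberately left out of the title set, since fractional values are meaningful there.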
