fix(observability): rebuild etcd-availability as fleet summary #5241
swiencki wants to merge 2 commits
… sre/
The merged etcd-availability dashboard rendered every stat panel as N
unlabeled tiles in prod (one per regional HCP Prometheus workspace),
because Mixed datasource + multi-select All fanned queries out without
any reduction step.
Rebuild as a true fleet summary:
- Multi-select datasource var with includeAll, fans queries to every
regional HCP Prom workspace matching ^Managed_Prometheus_hcps-.*$.
- Stat panels: [merge, reduce reduceFields <sum|max>] transform chain
collapses N regional frames into one tile.
- Sums: Fleet Clusters Monitored, Fleet Total Stable Clusters (24h),
Fleet Total Leader Changes (24h).
- Max: Worst Regional P99 WAL Fsync / Backend Commit / Peer RTT,
Fleet Worst Cluster DB Size.
- Tables: merge transform unions per-region rows; sort fields fixed.
- Per-cluster timeseries disambiguate via {{cluster}} legend.
- Dashboard Info documents that Worst Regional P99 is NOT a true fleet
p99 (PromQL cannot federate histograms across independent workspaces;
a true fleet p99 requires Thanos / Mimir / federated workspace work).
- Stale per-region thresholds and legend strings recalibrated for new
fleet semantics.
- Move under observability/grafana-dashboards/sre/user-journey/ and
retarget the User Journey folder registration accordingly.
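The stat-panel reduction described above can be sketched in Python. This is only an illustration, not the dashboard's actual JSON: the transformer ids `merge` and `reduce` and the `reduceFields` mode come from the commit message, while the option key names, the helper names, the region names, and the sample numbers are assumptions.

```python
import json


def stat_transform_chain(reducer: str) -> list:
    """Build a [merge, reduce] transformation list for a stat panel.

    Assumption: the option keys mirror what Grafana typically exports;
    check them against the real etcd-availability.json before reuse.
    """
    if reducer not in ("sum", "max"):
        raise ValueError(f"unsupported reducer: {reducer}")
    return [
        {"id": "merge", "options": {}},
        {"id": "reduce", "options": {"mode": "reduceFields", "reducers": [reducer]}},
    ]


def collapse(regional_values: dict, reducer: str) -> float:
    """Mimic what the chain does to N single-value regional frames."""
    values = list(regional_values.values())
    if reducer == "sum":
        return sum(values)
    if reducer == "max":
        return max(values)
    raise ValueError(f"unsupported reducer: {reducer}")


# Hypothetical per-region query results (one value per regional workspace).
clusters_monitored = {"region-a": 40, "region-b": 25, "region-c": 12}
wal_fsync_p99_seconds = {"region-a": 0.008, "region-b": 0.021, "region-c": 0.011}

fleet_clusters = collapse(clusters_monitored, "sum")   # one tile instead of three
worst_p99 = collapse(wal_fsync_p99_seconds, "max")     # worst region wins

print(json.dumps(stat_transform_chain("sum"), indent=2))
```

Without the `reduce` step, each regional frame would still render as its own tile, which is the behavior this PR fixes.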
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: swiencki
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Pull request overview
This PR refactors the etcd-availability Grafana dashboard into a fleet summary that correctly aggregates across multiple regional Managed Prometheus workspaces (via Mixed datasource + transforms), and relocates it under the SRE “User Journey” dashboards folder.
Changes:
- Move/retarget the “User Journey” dashboard folder to `observability/grafana-dashboards/sre/user-journey`.
- Rebuild `etcd-availability.json` to reduce multi-region queries into single-value fleet tiles (sum/max) and clarify “worst regional p99” semantics.
- Update table/timeseries panels to merge multi-region data and improve legends/aggregation grouping.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| observability/observability.yaml | Retargets the “User Journey” dashboard folder to the new SRE path so Grafana publishes dashboards from the new location. |
| observability/grafana-dashboards/sre/user-journey/etcd-availability.json | Reworked dashboard to provide fleet-summary aggregations across regional Prometheus workspaces and updated panel queries/transforms accordingly. |
Comments suppressed due to low confidence (3)
observability/grafana-dashboards/sre/user-journey/etcd-availability.json:59
- This stat panel now represents a fleet-wide count of stable clusters, but the panel is still configured with `decimals: 1`, which will render values like `123.0`. Set decimals to 0 so the value displays as an integer count.
This issue also appears in the following locations of the same file:
- line 146
- line 958
observability/grafana-dashboards/sre/user-journey/etcd-availability.json:150
- This stat panel shows a total number of leader changes (an integer), but `decimals` is still set to 1. Set decimals to 0 so the UI doesn’t show fractional leader-change counts.
observability/grafana-dashboards/sre/user-journey/etcd-availability.json:962
- This table is displaying leader-change counts over the last hour (from `changes(...)`), but the field config uses `decimals: 2`, which will show fractional counts (e.g., `4.00`). Set decimals to 0 for an integer display.
Set decimals=0 on stat panels showing integer counts (Fleet Total Stable Clusters, Fleet Total Leader Changes, Fleet Clusters Monitored) and on the Leader Instability hourly table, so values render as whole numbers instead of e.g. 4.0 / 4.00.
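A fix like this one can be checked, or applied in bulk, with a small script. The sketch below assumes panels keep the setting at `fieldConfig.defaults.decimals` and that the panel titles match the names used in this PR exactly; `iter_panels`, `force_integer_display`, and the sample dashboard are hypothetical and not part of the repository.

```python
# Titles taken from the commit message; exact strings in the JSON are an assumption.
INTEGER_COUNT_TITLES = {
    "Fleet Clusters Monitored",
    "Fleet Total Stable Clusters (24h)",
    "Fleet Total Leader Changes (24h)",
}


def iter_panels(node: dict):
    """Yield every panel, descending into row panels that nest their own."""
    for panel in node.get("panels", []):
        yield panel
        yield from iter_panels(panel)


def force_integer_display(dashboard: dict, titles: set) -> list:
    """Set decimals to 0 on the named panels; return the titles that changed."""
    changed = []
    for panel in iter_panels(dashboard):
        if panel.get("title") not in titles:
            continue
        defaults = panel.setdefault("fieldConfig", {}).setdefault("defaults", {})
        if defaults.get("decimals") != 0:
            defaults["decimals"] = 0
            changed.append(panel["title"])
    return changed


# Hypothetical, heavily trimmed dashboard used only to exercise the helper.
dashboard = {
    "panels": [
        {"title": "Fleet Clusters Monitored",
         "fieldConfig": {"defaults": {"decimals": 1}}},
        {"title": "Worst Regional P99 WAL Fsync",
         "fieldConfig": {"defaults": {"decimals": 1}}},
        {"title": "Leaders", "type": "row", "panels": [
            {"title": "Fleet Total Leader Changes (24h)",
             "fieldConfig": {"defaults": {"decimals": 1}}},
        ]},
    ]
}

changed = force_integer_display(dashboard, INTEGER_COUNT_TITLES)
print(changed)
```

Latency panels are left alone on purpose, since fractional values are meaningful there.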
Addressed in 9092b54: set |
What
Rebuild `etcd-availability` as a fleet summary and move it to `observability/grafana-dashboards/sre/user-journey/`.
Why
Each stat panel rendered as N unlabeled tiles in prod (one per regional HCP Prom workspace) because Mixed datasource + multi-select All fanned queries out with no reduction step.
- Stat panels: `[merge, reduce reduceFields <sum|max>]` collapses N regional frames into one tile.
- Tables: `merge` unions per-region rows; sort fields fixed.
- Per-cluster timeseries: `{{cluster}}` legend; p99 panels aggregate `by (le, cluster)`.
- `observability.yaml` retargets the User Journey folder to the new path.
Testing
`make -C observability verify` passes. JSON valid at schemaVersion 41. Confirmed in dev Grafana that each stat tile renders as a single number; warning triangles in dev are the documented stale-datasource hazard, not a dashboard issue.
Special notes for your reviewer
The Worst Regional P99 tiles use Reduce(max) across per-region p99s. This is the only mathematically defensible single-number representation without federation infra (Thanos/Mimir). Called out in the Dashboard Info panel.
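To make the distinction concrete, here is a small Python sketch with made-up latency samples. It uses a nearest-rank percentile on raw samples, whereas the dashboard derives p99 from Prometheus histogram buckets, so treat it as an illustration of the semantics only; the region data and the `p99` helper are hypothetical.

```python
def p99(samples: list) -> float:
    """Nearest-rank 99th percentile, using integer math to avoid float rounding."""
    ordered = sorted(samples)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds.
busy_region = [5] * 990 + [10] * 10    # 1000 samples, mild tail
small_region = [5] * 90 + [200] * 10   # 100 samples, heavy tail

regional_p99s = [p99(busy_region), p99(small_region)]
worst_regional_p99 = max(regional_p99s)          # what the tile shows
true_fleet_p99 = p99(busy_region + small_region)  # needs pooled data

print(worst_regional_p99, true_fleet_p99)
```

In this example the small region dominates the tile even though it contributes few samples. The maximum of the regional p99s can never be lower than the pooled p99, so the tile is a conservative upper bound and not a fleet percentile.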