
ROSAENG-15456: add per-investigation-type 'triage acceleration' Grafa…#850

Open
rolandmkunkel wants to merge 1 commit into
openshift:mainfrom
rolandmkunkel:ROSAENG-15456-add-dashboard-for-new-investigations


@rolandmkunkel rolandmkunkel commented Jun 19, 2026

Contributor

…na dashboard panel

What type of PR is this?

feature

What this PR does / Why we need it?

Adds a new Grafana dashboard panel titled "Estimated SRE hours saved (triage acceleration)" to the existing CAD dashboard. The existing "Estimated SRE hours saved" panel only counts prevented alerts (where CAD took a terminal action like setting limited support or sending a service log) using a flat 15-minute heuristic. The three new informational investigations from SREP-4516 accelerate triage but don't prevent pages, so their value isn't captured by the existing panel.

The new panel computes estimated hours saved using per-investigation manual-time estimates:

  • consoleerrorbudgetburn: 30 min/alert (SREP-4517); this estimate is still pending, so 30 min is a placeholder
  • expiredcertificates: 20 min/alert (SREP-4521)
  • pdbblockingnodedrain: 20 min/alert (SREP-4522)

Each term uses OR on() vector(0) to gracefully handle investigation types that have no metric data yet (e.g., pdbblockingnodedrain is not yet merged).
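The arithmetic behind the panel can be sketched in Python as a sanity check. This is only an illustration of the weighting described above, not the dashboard's actual PromQL: the function name and the input shape (a mapping from alert type to the number of investigated alerts in the selected time range) are assumptions. A missing alert type contributes zero, which mirrors what the `OR on() vector(0)` fallback is meant to achieve.

```python
# Manual-triage estimates in minutes per alert, as listed in the PR description.
# The consoleerrorbudgetburn value is a placeholder according to that description.
MINUTES_PER_ALERT = {
    "consoleerrorbudgetburn": 30,
    "expiredcertificates": 20,
    "pdbblockingnodedrain": 20,
}


def estimated_hours_saved(alert_increase: dict[str, float]) -> float:
    """Sum the weighted alert counts and convert minutes to hours.

    `alert_increase` is a hypothetical input: alert type -> investigated
    alerts in the time range. Types with no data are treated as zero,
    similar to the `OR on() vector(0)` fallback in the panel query.
    """
    total_minutes = sum(
        minutes * alert_increase.get(alert_type, 0.0)
        for alert_type, minutes in MINUTES_PER_ALERT.items()
    )
    return total_minutes / 60


# pdbblockingnodedrain has no metric data yet, so it is absent from the input.
print(estimated_hours_saved({"consoleerrorbudgetburn": 4, "expiredcertificates": 3}))
```

With four consoleerrorbudgetburn alerts and three expiredcertificates alerts, the sketch yields (4 × 30 + 3 × 20) / 60 = 3 hours. Without the zero fallback, a missing series would make the whole sum empty in PromQL rather than merely dropping one term.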

Resolves https://redhat.atlassian.net/browse/ROSAENG-15456

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

  • New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
  • Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

  • Added tests
  • Created jira card to add unit test
  • This PR may not need unit tests

Pre-checks (if applicable)

  • Ran unit tests locally
  • Validated the changes in a cluster
  • Included documentation changes with PR

Summary by CodeRabbit

  • New Features

    • Added a new "Estimated SRE hours saved (triage acceleration)" stat panel to the anomaly detection dashboard to measure efficiency gains from alert investigation improvements.
  • UI Updates

    • Reorganized existing dashboard panels to accommodate the new visualization.

@openshift-ci-robot

openshift-ci-robot commented Jun 19, 2026


@rolandmkunkel: This pull request references ROSAENG-15456 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026
@coderabbitai

coderabbitai Bot commented Jun 19, 2026


No actionable comments were generated in the recent review. 🎉


📥 Commits

Reviewing files that changed from the base of the PR and between 8b04176 and 18479dc.

📒 Files selected for processing (1)
  • dashboards/grafana-dashboard-configuration-anomaly-detection.configmap.yaml

Walkthrough

A new Grafana stat panel ("Estimated SRE hours saved (triage acceleration)") is added to the dashboard ConfigMap. It queries cad_investigate_alerts_total weighted by alert_type and converts minutes to hours. Five existing panels have their gridPos.y values increased to accommodate the insertion.

Changes

Grafana Dashboard Panel Addition

| Layer / File(s) | Summary |
| --- | --- |
| New stat panel definition (dashboards/grafana-dashboard-configuration-anomaly-detection.configmap.yaml) | Adds panel id: 23 with a PromQL expression that sums weighted cad_investigate_alerts_total increases for consoleerrorbudgetburn (30 min), expiredcertificates (20 min), and pdbblockingnodedrain (20 min), divides by 60, and applies absolute threshold coloring (green default, red at 80). |
| Existing panel layout shifts (dashboards/grafana-dashboard-configuration-anomaly-detection.configmap.yaml) | Increments gridPos.y for five panels (ids 12, 18, 13, 17, 21), moving them from rows 17/25/33 to 25/33/41, to make room for the inserted panel. |
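The layout shift described above can be illustrated with a small Python sketch that pushes every panel at or below the insertion row down by the new panel's height. The helper name, the new panel's height of 8, and the assignment of panel ids to rows are assumptions for illustration only; the summary gives the ids and the rows but does not say which panel sits on which row.

```python
import copy


def insert_panel(panels: list[dict], new_panel: dict) -> list[dict]:
    """Return a new panel list with `new_panel` added and lower panels shifted down.

    Hypothetical helper: any panel whose gridPos.y is at or below the
    insertion row moves down by the new panel's height.
    """
    insert_y = new_panel["gridPos"]["y"]
    height = new_panel["gridPos"]["h"]
    shifted = copy.deepcopy(panels)  # leave the caller's data untouched
    for panel in shifted:
        if panel["gridPos"]["y"] >= insert_y:
            panel["gridPos"]["y"] += height
    return shifted + [copy.deepcopy(new_panel)]


# Assumed row assignment for the five existing panels (rows 17, 25, 33).
existing = [
    {"id": 12, "gridPos": {"y": 17}},
    {"id": 18, "gridPos": {"y": 17}},
    {"id": 13, "gridPos": {"y": 25}},
    {"id": 17, "gridPos": {"y": 25}},
    {"id": 21, "gridPos": {"y": 33}},
]
# Assumed position and height for the new stat panel.
new_stat_panel = {"id": 23, "gridPos": {"y": 17, "h": 8}}

result = insert_panel(existing, new_stat_panel)
rows_after = sorted({p["gridPos"]["y"] for p in result if p["id"] != 23})
print(rows_after)
```

Under these assumptions the existing panels end up on rows 25, 33, and 41, which matches the shift reported in the summary.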

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Microshift Test Compatibility | ⚠️ Warning | The PR adds new Ginkgo e2e test that triggers investigations using unsupported MicroShift APIs: Machine (machine.openshift.io), etcd pods/openshift-etcd namespace, and HyperShift (hypershift.opensh... | Add [Skipped:MicroShift] label to test name, or add runtime platform check with exutil.IsMicroShiftCluster() and g.Skip() to guard against unsupported APIs, or guard investigations with per-investigation API group checks. |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly references the specific Jira ticket (ROSAENG-15456) and accurately describes the main change: adding a per-investigation-type 'triage acceleration' Grafana dashboard panel. |
| Description check | ✅ Passed | The description covers all required template sections including PR type (feature), comprehensive explanation of changes with references to related tickets, and test coverage checklist acknowledgment, though no tests were added. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | All 12 Ginkgo test files in the PR use stable, static test names with no dynamic content (fmt.Sprintf, string concatenation, variables, timestamps, etc.). |
| Test Structure And Quality | ✅ Passed | PR only changes Grafana dashboard YAML; no Ginkgo test files were added or modified, so the test-structure checklist is not applicable. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | This PR adds no Ginkgo e2e tests; it only modifies a Grafana dashboard YAML configuration file. The SNO compatibility check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies only a Grafana dashboard ConfigMap (grafana-dashboard-configuration-anomaly-detection.configmap.yaml) to add a new stat panel and adjust existing panel positions. It does not intro... |
| Ote Binary Stdout Contract | ✅ Passed | PR only modifies Grafana dashboard configuration YAML file with no Go code, test code, or process-level stdout writes; OTE stdout contract check does not apply. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds only a Grafana dashboard ConfigMap YAML file and does not add any Ginkgo e2e tests. The custom check is not applicable. |
| No-Weak-Crypto | ✅ Passed | PR only modifies a Grafana dashboard ConfigMap YAML file with dashboard panel configurations. No cryptographic code, weak crypto algorithms (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto... |
| Container-Privileges | ✅ Passed | The modified file is a Kubernetes ConfigMap containing Grafana dashboard JSON configuration, not a container manifest. It lacks container specs, security contexts, and privilege-related fields chec... |
| No-Sensitive-Data-In-Logs | ✅ Passed | New Grafana panel displays only aggregated alert counts and calculated SRE hours, with no passwords, tokens, PII, customer data, or sensitive metric labels exposed. |


@openshift-ci openshift-ci Bot requested review from Makdaam and joshbranham June 19, 2026 13:58
@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rolandmkunkel


@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026
@openshift-ci

openshift-ci Bot commented Jun 19, 2026

Contributor

@rolandmkunkel: all tests passed!


@codecov-commenter


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.14%. Comparing base (0bcd8eb) to head (18479dc).
⚠️ Report is 4 commits behind head on main.


@@            Coverage Diff             @@
##             main     #850      +/-   ##
==========================================
+ Coverage   43.09%   43.14%   +0.04%     
==========================================
  Files          71       71              
  Lines        8254     8270      +16     
==========================================
+ Hits         3557     3568      +11     
- Misses       4484     4489       +5     
  Partials      213      213              

see 5 files with indirect coverage changes

