ROSAENG-14171: add probe history and LB security group checks to console-ErrorBudgetBurn investigation by MateSaary · Pull Request #849 · openshift/configuration-anomaly-detection

MateSaary · 2026-06-19T10:44:02Z

What type of PR is this?

feature

What this PR does / Why we need it?

Adds two new informational checks to the consoleerrorbudgetburn investigation, automating manual SOP troubleshooting steps (sections 3.3 and 3.6-3.7 of console-ErrorBudgetBurn.md):
Probe History (classic only):
- Queries in-cluster Prometheus for 2 hours of historical probe_success range data via exec into the cluster-monitoring-operator pod
- Classifies the failure pattern as persistent, intermittent, or already recovered
- Uses a regex match on the probe_url label (probe_url=~"..") to handle varying label formats across RMO versions (the label may include scheme and path, e.g. https://console-openshift-console.apps.cluster.example.com/health)
- Skipped on HCP clusters, where probe_success is forwarded to RHOBS and is not available in in-cluster Prometheus

Load Balancer Security Group checks (AWS Classic clusters only):
- CLB path: retrieves CLB security groups and verifies TCP 443 inbound is allowed from 0.0.0.0/0 or machine CIDR
- NLB path: warns if the NLB has security groups attached (unusual per SOP), retrieves target instance security groups, and verifies the NodePort range (30000-32767) is allowed inbound
- Extends the AWS Client interface with GetCLBSecurityGroupIDs, GetSecurityGroupRules, GetInstanceSecurityGroupIDs
- Extends FindNLBByDNSName to also return NLB security groups

Both checks are informational-only: no automated actions (no service logs, no silencing).
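The probe history classification described above can be sketched in Go roughly as follows. This is a minimal illustration, not the PR's code: the names `classifyProbeSamples` and `probePattern` are hypothetical, and treating an all-success window as "already recovered" is an assumption made for the example.

```go
package main

import "fmt"

// probePattern names the failure pattern seen in a window of probe_success
// samples. The type and its values are hypothetical, not taken from the PR.
type probePattern string

const (
	patternNoData       probePattern = "no data"
	patternRecovered    probePattern = "already recovered"
	patternIntermittent probePattern = "intermittent"
	patternPersistent   probePattern = "persistent"
)

// classifyProbeSamples inspects probe_success values (1 = success, 0 = failure)
// and reports whether every sample failed, none failed, or the window is mixed.
func classifyProbeSamples(samples []float64) probePattern {
	if len(samples) == 0 {
		return patternNoData
	}
	failures := 0
	for _, v := range samples {
		if v < 1 {
			failures++
		}
	}
	switch {
	case failures == 0:
		// Assumption: the alert fired, but the whole window is now healthy.
		return patternRecovered
	case failures == len(samples):
		return patternPersistent
	default:
		return patternIntermittent
	}
}

func main() {
	fmt.Println(classifyProbeSamples([]float64{1, 0, 1, 1}))
}
```

An empty window is reported separately here so that a missing metric is not mistaken for a healthy probe.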

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

Added tests
Created jira card to add unit test
This PR may not need unit tests

Pre-checks (if applicable)

Ran unit tests locally
Validated the changes in a cluster
Included documentation changes with PR

Summary by CodeRabbit

New Features
- Enhanced console error budget burn investigation with AWS security group validation for load balancers and instances.
- Added Prometheus-based probe history analysis to identify persistent vs. intermittent failures in classic clusters.
Tests
- Added comprehensive test coverage for security group rule validation and probe history checking.

openshift-ci-robot · 2026-06-19T10:44:07Z

@MateSaary: This pull request references ROSAENG-14171 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.


coderabbitai · 2026-06-19T10:44:17Z

Walkthrough

The PR extends the AWS Client interface and SdkClient with security group retrieval methods (GetCLBSecurityGroupIDs, GetSecurityGroupRules, GetInstanceSecurityGroupIDs) and updates FindNLBByDNSName to return attached security group IDs. These are wired into the console error budget burn investigation to evaluate whether NLB instance NodePort TCP traffic and CLB TCP 443 inbound are permitted. A pluggable probeHistoryChecker is added to query Prometheus probe history on classic clusters. Mocks, tests, and an unrelated OCM mock import alias fix are included.

Changes

Security Group Validation and Probe History Investigation

| Layer / File(s) | Summary |
| --- | --- |
| AWS Client interface and SdkClient security group methods `pkg/aws/aws.go` | `FindNLBByDNSName` gains a `securityGroups []string` return value. Three new methods are added to `Client` and implemented on `SdkClient`: `GetCLBSecurityGroupIDs` (ELBv1), `GetSecurityGroupRules` (EC2 DescribeSecurityGroups), and `GetInstanceSecurityGroupIDs` (EC2 DescribeInstances → instance-id map). |
| AWS MockClient updates `pkg/aws/mock/aws.go` | `MockClient.FindNLBByDNSName` updated to return the new `[]string` third value. Mock methods and recorder stubs added for `GetCLBSecurityGroupIDs`, `GetInstanceSecurityGroupIDs`, and `GetSecurityGroupRules`. |
| `sgAllowsTCPInbound` helper and `probeHistoryChecker` abstraction `pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go` | Introduces the `probeHistoryChecker` interface and `probeHistoryCheck` field on `Investigation`. Adds `sgAllowsTCPInbound` to evaluate inbound TCP rules with protocol `-1` and CIDR containment support. |
| NLB and CLB health check security group integration `pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go` | `checkNLBHealth` fetches instance security groups and evaluates NodePort TCP (30000–32767) access. `checkCLBHealth` fetches CLB security groups and evaluates TCP 443 inbound from `0.0.0.0/0` or the machine CIDR. |
| Classic-cluster Prometheus probe history querying `pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go` | After the blackbox probe check on classic clusters, `queryProbeHistory` constructs a Prometheus range query URL, unmarshals the response, and classifies the last ~2 hours of `probe_success` samples as all-success, intermittent, or persistent failure. |
| Unit and integration tests `pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn_test.go` | Adds `mockProbeHistoryChecker`, Prometheus fixtures, and six `checkProbeHistory` unit tests. Extends existing NLB/CLB health tests with security group mock expectations. Adds a large suite of CLB/NLB security group and `sgAllowsTCPInbound` unit tests. Updates Run-level tests with `noopProbeHistoryChecker` wiring and new security group expectations. |

OCM Mock Import Alias Fix

| Layer / File(s) | Summary |
| --- | --- |
| OCM mock `GetConnection` import alias fix `pkg/ocm/mock/ocmmock.go` | Renames the `ocm-sdk-go` import alias from `ocm_sdk_go` to `sdk`; updates `GetConnection` return type and type assertion to `*sdk.Connection`. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.79% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (14 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main changes: adding probe history and load balancer security group checks to the console-ErrorBudgetBurn investigation. |
| Description check | ✅ Passed | The description comprehensively addresses all template sections with detailed explanations of the feature, test coverage confirmation, and pre-checks verification. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Test suite uses Go testing framework, not Ginkgo. All 113 test names are static and deterministic with no timestamps, UUIDs, pod/node names, or IP addresses. |
| Test Structure And Quality | ✅ Passed | All test code quality requirements are met: tests have single responsibility (narrowly focused on specific behaviors), setup/cleanup is proper (31 gomock controllers with matching defers), no actua... |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR adds 20 standard Go unit tests using testing.T framework, which are outside the scope of this MicroShift compatibility check designed for Ginkgo e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds only standard Go unit tests (func Test*()), not Ginkgo e2e tests. SNO compatibility check applies only to Ginkgo e2e tests, so check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies investigation/diagnostic code and AWS SDK wrappers, not deployment manifests, operators, or controllers. No scheduling constraints, affinity rules, or topology-related changes are... |
| Ote Binary Stdout Contract | ✅ Passed | PR contains no process-level code (main, init, TestMain, BeforeSuite) that writes to stdout. All changes are normal library methods, mocks, and test code with writes properly isolated within test b... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Only standard Go unit tests were added to consoleerrorbudgetburn_test.go, which are not subject to this IPv6/disconnected network compatibility check. |
| No-Weak-Crypto | ✅ Passed | No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom implementations, or insecure comparisons found. All code is AWS SDK wrappers and standard networking utilities. |
| Container-Privileges | ✅ Passed | PR modifies only Go source files (AWS clients, investigation logic, tests); no Kubernetes/container manifests with privilege settings are present. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Code review found no sensitive data (passwords, tokens, API keys, PII, session IDs) in logs. Bearer tokens use safe shell expansion, not hardcoded values; hostnames and security group IDs are infra... |

openshift-ci · 2026-06-19T10:44:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MateSaary

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [MateSaary]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

🧹 Nitpick comments (1)

pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go (1)
256-301: ⚡ Quick win

Consider handling net.ParseCIDR error to avoid silent failures.

Line 262 silently discards the error from net.ParseCIDR. If machineCIDR is non-empty but malformed, machineNet will be nil and CIDR-based matching will silently fail—only 0.0.0.0/0 rules will match, potentially producing misleading warnings.

Additionally, the function only inspects IpRanges (IPv4). Security group rules can also specify Ipv6Ranges, PrefixListIds, or UserIdGroupPairs, which would be missed. This is acceptable for an informational check but worth documenting.
Proposed fix for CIDR error handling
 func sgAllowsTCPInbound(securityGroups []ec2v2types.SecurityGroup, fromPort, toPort int32, machineCIDR string) bool {
 	var machineIP net.IP
 	var machineNet *net.IPNet
 	if machineCIDR != "" {
-		machineIP, machineNet, _ = net.ParseCIDR(machineCIDR)
+		var err error
+		machineIP, machineNet, err = net.ParseCIDR(machineCIDR)
+		if err != nil {
+			// Log or handle; for now treat as empty to fall back to 0.0.0.0/0 matching only
+			machineNet = nil
+		}
 	}
Based on learnings from the coding guidelines: "Never ignore error returns". While this specific failure mode is unlikely (cluster config should provide valid CIDRs), handling the error explicitly makes the fallback behavior intentional rather than accidental.
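One way to make the fallback explicit is to return the parse error to the caller. The sketch below is an assumption-laden illustration, not the code under review: it uses a simplified, hypothetical `ingressRule` struct in place of the AWS SDK security group types, and the names `allowsTCPInbound` and `cidrCovers` are invented for the example. Like the reviewed function, it considers IPv4 CIDR ranges only.

```go
package main

import (
	"fmt"
	"net"
)

// ingressRule is a hypothetical, simplified stand-in for an inbound
// security group permission.
type ingressRule struct {
	Protocol string // "tcp", or "-1" for all protocols and ports
	FromPort int32
	ToPort   int32
	CIDRs    []string // IPv4 CIDR ranges only
}

// cidrCovers reports whether ruleCIDR is 0.0.0.0/0 or fully contains the machine network.
func cidrCovers(ruleCIDR string, machineIP net.IP, machineNet *net.IPNet) bool {
	if ruleCIDR == "0.0.0.0/0" {
		return true
	}
	if machineNet == nil {
		return false
	}
	_, ruleNet, err := net.ParseCIDR(ruleCIDR)
	if err != nil {
		return false
	}
	ruleOnes, ruleBits := ruleNet.Mask.Size()
	machineOnes, machineBits := machineNet.Mask.Size()
	// The rule covers the machine network when it is at least as wide
	// and contains the machine network's base address.
	return ruleBits == machineBits && ruleOnes <= machineOnes && ruleNet.Contains(machineIP)
}

// allowsTCPInbound reports whether any rule admits TCP fromPort..toPort from
// 0.0.0.0/0 or the machine CIDR. A malformed machineCIDR is returned as an
// error instead of silently degrading to 0.0.0.0/0 matching.
func allowsTCPInbound(rules []ingressRule, fromPort, toPort int32, machineCIDR string) (bool, error) {
	var machineIP net.IP
	var machineNet *net.IPNet
	if machineCIDR != "" {
		var err error
		machineIP, machineNet, err = net.ParseCIDR(machineCIDR)
		if err != nil {
			return false, fmt.Errorf("parse machine CIDR %q: %w", machineCIDR, err)
		}
	}
	for _, r := range rules {
		switch r.Protocol {
		case "-1":
			// All protocols: the port range does not apply.
		case "tcp":
			if r.FromPort > fromPort || r.ToPort < toPort {
				continue // rule does not span the requested port range
			}
		default:
			continue // other protocols cannot admit TCP traffic
		}
		for _, c := range r.CIDRs {
			if cidrCovers(c, machineIP, machineNet) {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	rules := []ingressRule{{Protocol: "tcp", FromPort: 30000, ToPort: 32767, CIDRs: []string{"10.0.0.0/8"}}}
	fmt.Println(allowsTCPInbound(rules, 30000, 32767, "10.0.0.0/16"))
}
```

Returning the error lets the investigation decide whether to log a note or skip the CIDR-based part of the check, rather than emitting a potentially misleading warning.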
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go` around
lines 256 - 301, In the sgAllowsTCPInbound function, the error returned from
net.ParseCIDR is being silently discarded with underscore on the line where
machineCIDR is parsed. If machineCIDR is provided but malformed, machineNet will
be nil and the CIDR-based matching logic will silently fail, producing incorrect
results. Capture and explicitly handle the error from net.ParseCIDR—either by
logging it, returning an error from the function, or making the fallback
behavior intentional in a comment, so that callers understand when CIDR
validation fails and the function falls back to only matching against 0.0.0.0/0.
Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7fd85dbd-2075-481a-800c-59ea29d6e44c

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcd8eb and 32f388f.

📒 Files selected for processing (5)

pkg/aws/aws.go
pkg/aws/mock/aws.go
pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go
pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn_test.go
pkg/ocm/mock/ocmmock.go

openshift-ci · 2026-06-19T10:55:31Z

@MateSaary: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

codecov-commenter · 2026-06-19T10:56:01Z

Codecov Report

❌ Patch coverage is 60.67961% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.52%. Comparing base (0bcd8eb) to head (32f388f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/consoleerrorbudgetburn/consoleerrorbudgetburn.go | 68.11% | 39 Missing and 5 partials ⚠️ |
| pkg/aws/aws.go | 0.00% | 35 Missing ⚠️ |
| pkg/ocm/mock/ocmmock.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
+ Coverage   43.09%   43.52%   +0.43%     
==========================================
  Files          71       71              
  Lines        8254     8450     +196     
==========================================
+ Hits         3557     3678     +121     
- Misses       4484     4554      +70     
- Partials      213      218       +5

| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| pkg/aws/mock/aws.go | `42.72% <100.00%> (+4.10%)` | ⬆️ |
| pkg/ocm/mock/ocmmock.go | `34.24% <0.00%> (ø)` | |
| pkg/aws/aws.go | `3.87% <0.00%> (-0.27%)` | ⬇️ |
| ...s/consoleerrorbudgetburn/consoleerrorbudgetburn.go | `70.14% <68.11%> (-0.49%)` | ⬇️ |

consoleerrorbudgetburn: add probe history and LB security group checks

32f388f

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026

openshift-ci Bot requested review from RaphaelBut and typeid June 19, 2026 10:44

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026

coderabbitai Bot reviewed Jun 19, 2026

MateSaary wants to merge 1 commit into openshift:main from MateSaary:srep-4517-followup

3 participants
