ROSAENG-14171: add probe history and LB security group checks to console-ErrorBudgetBurn investigation #849

Open
MateSaary wants to merge 1 commit into openshift:main from MateSaary:srep-4517-followup
Conversation

@MateSaary MateSaary commented Jun 19, 2026

Member

What type of PR is this?

feature

What this PR does / Why we need it?

Adds two new informational checks to the consoleerrorbudgetburn investigation, automating manual SOP troubleshooting steps (sections 3.3 and 3.6-3.7 of console-ErrorBudgetBurn.md):

Probe History (classic only):

  • Queries in-cluster Prometheus for 2 hours of historical probe_success range data via exec into the cluster-monitoring-operator pod
  • Classifies the failure pattern as persistent, intermittent, or already recovered
  • Uses regex match on the probe_url label (probe_url=~"..") to handle varying label formats across RMO versions (the label may include scheme and path, e.g. https://console-openshift-console.apps.cluster.example.com/health)
  • Skipped on HCP clusters where probe_success is forwarded to RHOBS and not available in in-cluster Prometheus

Load Balancer Security Group checks (AWS Classic clusters only):

  • CLB path: retrieves CLB security groups and verifies TCP 443 inbound is allowed from 0.0.0.0/0 or machine CIDR
  • NLB path: warns if the NLB has security groups attached (unusual per SOP), retrieves target instance security groups, and verifies the NodePort range (30000-32767) is allowed inbound
  • Extends the AWS Client interface with GetCLBSecurityGroupIDs, GetSecurityGroupRules, GetInstanceSecurityGroupIDs
  • Extends FindNLBByDNSName to also return NLB security groups

Both checks are informational-only: no automated actions (no service logs, no silencing).
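The classification step in the Probe History check could look roughly like the sketch below. This is an illustration only, not the code in this PR: the names `probePattern`, `classifyProbeSamples`, and `recoveredTail`, the extra "healthy" and "no-data" outcomes, and the rule that three trailing successes count as "recovered" are all assumptions made for the example.

```go
package main

import "fmt"

// probePattern is a hypothetical label for the outcome of a probe history window.
type probePattern string

const (
	patternNoData       probePattern = "no-data"
	patternHealthy      probePattern = "healthy"
	patternPersistent   probePattern = "persistent"
	patternIntermittent probePattern = "intermittent"
	patternRecovered    probePattern = "recovered"
)

// recoveredTail is an assumed threshold: a mixed window whose last
// recoveredTail samples all succeeded is treated as "already recovered".
const recoveredTail = 3

// classifyProbeSamples inspects probe_success samples ordered oldest to
// newest, where 1 means the probe succeeded and 0 means it failed.
func classifyProbeSamples(samples []float64) probePattern {
	if len(samples) == 0 {
		return patternNoData
	}
	failures := 0
	for _, s := range samples {
		if s < 1 {
			failures++
		}
	}
	switch {
	case failures == 0:
		return patternHealthy
	case failures == len(samples):
		return patternPersistent
	}
	// Mixed window: look at the most recent samples only.
	tail := recoveredTail
	if tail > len(samples) {
		tail = len(samples)
	}
	for _, s := range samples[len(samples)-tail:] {
		if s < 1 {
			return patternIntermittent
		}
	}
	return patternRecovered
}

func main() {
	fmt.Println(classifyProbeSamples([]float64{0, 0, 0, 0}))
	fmt.Println(classifyProbeSamples([]float64{1, 0, 1, 0, 1}))
	fmt.Println(classifyProbeSamples([]float64{0, 0, 1, 1, 1}))
}
```

Keeping the classifier a pure function over a slice of samples makes it easy to unit test without a Prometheus endpoint.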

Special notes for your reviewer

Test Coverage

Guidelines for CAD investigations

  • New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
  • Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

  • Added tests
  • Created jira card to add unit test
  • This PR may not need unit tests

Pre-checks (if applicable)

  • Ran unit tests locally
  • Validated the changes in a cluster
  • Included documentation changes with PR

Summary by CodeRabbit

  • New Features

    • Enhanced console error budget burn investigation with AWS security group validation for load balancers and instances.
    • Added Prometheus-based probe history analysis to identify persistent vs. intermittent failures in classic clusters.
  • Tests

    • Added comprehensive test coverage for security group rule validation and probe history checking.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 19, 2026
openshift-ci-robot commented Jun 19, 2026


@MateSaary: This pull request references ROSAENG-14171 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "5.0.0" version, but no target version was set.

coderabbitai Bot commented Jun 19, 2026


Walkthrough

The PR extends the AWS Client interface and SdkClient with security group retrieval methods (GetCLBSecurityGroupIDs, GetSecurityGroupRules, GetInstanceSecurityGroupIDs) and updates FindNLBByDNSName to return attached security group IDs. These are wired into the console error budget burn investigation to evaluate whether NLB instance NodePort TCP traffic and CLB TCP 443 inbound are permitted. A pluggable probeHistoryChecker is added to query Prometheus probe history on classic clusters. Mocks, tests, and an unrelated OCM mock import alias fix are included.

Changes

Security Group Validation and Probe History Investigation

| Layer / File(s) | Summary |
| --- | --- |
| AWS Client interface and SdkClient security group methods (pkg/aws/aws.go) | FindNLBByDNSName gains a securityGroups []string return value. Three new methods are added to Client and implemented on SdkClient: GetCLBSecurityGroupIDs (ELBv1), GetSecurityGroupRules (EC2 DescribeSecurityGroups), and GetInstanceSecurityGroupIDs (EC2 DescribeInstances → instance-id map). |
| AWS MockClient updates (pkg/aws/mock/aws.go) | MockClient.FindNLBByDNSName updated to return the new []string third value. Mock methods and recorder stubs added for GetCLBSecurityGroupIDs, GetInstanceSecurityGroupIDs, and GetSecurityGroupRules. |
| sgAllowsTCPInbound helper and probeHistoryChecker abstraction (pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go) | Introduces the probeHistoryChecker interface and probeHistoryCheck field on Investigation. Adds sgAllowsTCPInbound to evaluate inbound TCP rules with protocol -1 and CIDR containment support. |
| NLB and CLB health check security group integration (pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go) | checkNLBHealth fetches instance security groups and evaluates NodePort TCP (30000–32767) access. checkCLBHealth fetches CLB security groups and evaluates TCP 443 inbound from 0.0.0.0/0 or the machine CIDR. |
| Classic-cluster Prometheus probe history querying (pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go) | After the blackbox probe check on classic clusters, queryProbeHistory constructs a Prometheus range query URL, unmarshals the response, and classifies the last ~2 hours of probe_success samples as all-success, intermittent, or persistent failure. |
| Unit and integration tests (pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn_test.go) | Adds mockProbeHistoryChecker, Prometheus fixtures, and six checkProbeHistory unit tests. Extends existing NLB/CLB health tests with security group mock expectations. Adds a large suite of CLB/NLB security group and sgAllowsTCPInbound unit tests. Updates Run-level tests with noopProbeHistoryChecker wiring and new security group expectations. |
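For readers unfamiliar with the Prometheus range API that the probe history row refers to, here is a minimal sketch of building a `/api/v1/query_range` URL and decoding the matrix response. It is not the PR's queryProbeHistory: the helper names `buildRangeQueryURL` and `parseSamples`, the `localhost:9090` base URL, and the `.*console.*` label matcher are placeholders assumed for the example, and the exec-into-pod transport is left out entirely.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strconv"
	"time"
)

// rangeResponse mirrors the matrix result shape of /api/v1/query_range.
// Each value pair is [unix timestamp (number), sample value (string)].
type rangeResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Values [][2]interface{}  `json:"values"`
		} `json:"result"`
	} `json:"data"`
}

// buildRangeQueryURL returns a query_range URL covering the windowSec
// seconds before endUnix, sampled every stepSec seconds.
func buildRangeQueryURL(base, query string, endUnix, windowSec, stepSec int64) string {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", strconv.FormatInt(endUnix-windowSec, 10))
	params.Set("end", strconv.FormatInt(endUnix, 10))
	params.Set("step", strconv.FormatInt(stepSec, 10))
	return base + "/api/v1/query_range?" + params.Encode()
}

// parseSamples flattens every series in the response into a slice of
// sample values, in the order they appear in the response.
func parseSamples(body []byte) ([]float64, error) {
	var resp rangeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode range response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected status %q", resp.Status)
	}
	var samples []float64
	for _, series := range resp.Data.Result {
		for _, pair := range series.Values {
			raw, ok := pair[1].(string)
			if !ok {
				return nil, fmt.Errorf("sample value is not a string: %v", pair[1])
			}
			v, err := strconv.ParseFloat(raw, 64)
			if err != nil {
				return nil, fmt.Errorf("parse sample %q: %w", raw, err)
			}
			samples = append(samples, v)
		}
	}
	return samples, nil
}

func main() {
	// Placeholder matcher; the PR's actual probe_url regex is not shown here.
	query := `probe_success{probe_url=~".*console.*"}`
	fmt.Println(buildRangeQueryURL("http://localhost:9090", query, time.Now().Unix(), 7200, 60))

	body := []byte(`{"status":"success","data":{"resultType":"matrix","result":[{"metric":{},"values":[[1750000000,"1"],[1750000060,"0"]]}]}}`)
	samples, err := parseSamples(body)
	fmt.Println(samples, err)
}
```

Using `url.Values` for the query string avoids hand-escaping the braces, quotes, and regex characters in the PromQL selector.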

OCM Mock Import Alias Fix

| Layer / File(s) | Summary |
| --- | --- |
| OCM mock GetConnection import alias fix (pkg/ocm/mock/ocmmock.go) | Renames the ocm-sdk-go import alias from ocm_sdk_go to sdk; updates GetConnection return type and type assertion to *sdk.Connection. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 29.79% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (14 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main changes: adding probe history and load balancer security group checks to the console-ErrorBudgetBurn investigation. |
| Description check | ✅ Passed | The description comprehensively addresses all template sections with detailed explanations of the feature, test coverage confirmation, and pre-checks verification. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Test suite uses Go testing framework, not Ginkgo. All 113 test names are static and deterministic with no timestamps, UUIDs, pod/node names, or IP addresses. |
| Test Structure And Quality | ✅ Passed | All test code quality requirements are met: tests have single responsibility (narrowly focused on specific behaviors), setup/cleanup is proper (31 gomock controllers with matching defers), no actua... |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests added. PR adds 20 standard Go unit tests using testing.T framework, which are outside the scope of this MicroShift compatibility check designed for Ginkgo e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | PR adds only standard Go unit tests (func Test*()), not Ginkgo e2e tests. SNO compatibility check applies only to Ginkgo e2e tests, so check is not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | This PR modifies investigation/diagnostic code and AWS SDK wrappers, not deployment manifests, operators, or controllers. No scheduling constraints, affinity rules, or topology-related changes are... |
| Ote Binary Stdout Contract | ✅ Passed | PR contains no process-level code (main, init, TestMain, BeforeSuite) that writes to stdout. All changes are normal library methods, mocks, and test code with writes properly isolated within test b... |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo e2e tests were added in this PR. Only standard Go unit tests were added to consoleerrorbudgetburn_test.go, which are not subject to this IPv6/disconnected network compatibility check. |
| No-Weak-Crypto | ✅ Passed | No weak crypto (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom implementations, or insecure comparisons found. All code is AWS SDK wrappers and standard networking utilities. |
| Container-Privileges | ✅ Passed | PR modifies only Go source files (AWS clients, investigation logic, tests); no Kubernetes/container manifests with privilege settings are present. |
| No-Sensitive-Data-In-Logs | ✅ Passed | Code review found no sensitive data (passwords, tokens, API keys, PII, session IDs) in logs. Bearer tokens use safe shell expansion, not hardcoded values; hostnames and security group IDs are infra... |

@openshift-ci openshift-ci Bot requested review from RaphaelBut and typeid June 19, 2026 10:44
openshift-ci Bot commented Jun 19, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: MateSaary

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2026

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go (1)

256-301: ⚡ Quick win

Consider handling net.ParseCIDR error to avoid silent failures.

Line 262 silently discards the error from net.ParseCIDR. If machineCIDR is non-empty but malformed, machineNet will be nil and CIDR-based matching will silently fail—only 0.0.0.0/0 rules will match, potentially producing misleading warnings.

Additionally, the function only inspects IpRanges (IPv4). Security group rules can also specify Ipv6Ranges, PrefixListIds, or UserIdGroupPairs, which would be missed. This is acceptable for an informational check but worth documenting.

Proposed fix for CIDR error handling
 func sgAllowsTCPInbound(securityGroups []ec2v2types.SecurityGroup, fromPort, toPort int32, machineCIDR string) bool {
 	var machineIP net.IP
 	var machineNet *net.IPNet
 	if machineCIDR != "" {
-		machineIP, machineNet, _ = net.ParseCIDR(machineCIDR)
+		var err error
+		machineIP, machineNet, err = net.ParseCIDR(machineCIDR)
+		if err != nil {
+			// Log or handle; for now treat as empty to fall back to 0.0.0.0/0 matching only
+			machineNet = nil
+		}
 	}

Based on learnings from the coding guidelines: "Never ignore error returns". While this specific failure mode is unlikely (cluster config should provide valid CIDRs), handling the error explicitly makes the fallback behavior intentional rather than accidental.
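One way to make the CIDR failure mode explicit is to return the parse error to the caller, which is one of the options this comment lists. The sketch below illustrates that idea under simplifying assumptions: `ingressRule` is a hypothetical stand-in for the EC2 permission type, `allowsTCPInbound` is not the PR's sgAllowsTCPInbound, and only IPv4 ranges are inspected.

```go
package main

import (
	"fmt"
	"net"
)

// ingressRule is a simplified, hypothetical stand-in for an EC2 ingress
// permission; it carries only the fields this sketch needs.
type ingressRule struct {
	Protocol string // "tcp", or "-1" for all protocols
	FromPort int32
	ToPort   int32
	CIDRs    []string // IPv4 ranges only
}

// allowsTCPInbound reports whether any rule admits TCP on the whole range
// [fromPort, toPort] from 0.0.0.0/0 or from a range that contains machineCIDR.
// A malformed machineCIDR is returned as an error instead of being ignored.
func allowsTCPInbound(rules []ingressRule, fromPort, toPort int32, machineCIDR string) (bool, error) {
	var machineIP net.IP
	machineBits := 0
	if machineCIDR != "" {
		_, machineNet, err := net.ParseCIDR(machineCIDR)
		if err != nil {
			return false, fmt.Errorf("parse machine CIDR %q: %w", machineCIDR, err)
		}
		machineIP = machineNet.IP
		machineBits, _ = machineNet.Mask.Size()
	}
	for _, r := range rules {
		// Protocol "-1" matches all traffic regardless of port range.
		if r.Protocol != "-1" {
			if r.Protocol != "tcp" || r.FromPort > fromPort || r.ToPort < toPort {
				continue
			}
		}
		for _, c := range r.CIDRs {
			if c == "0.0.0.0/0" {
				return true, nil
			}
			if machineIP == nil {
				continue
			}
			_, ruleNet, err := net.ParseCIDR(c)
			if err != nil {
				continue // skip a malformed rule range rather than fail the whole check
			}
			// The rule covers the machine network only if it contains the
			// network address and its prefix is no longer than the machine's.
			ruleBits, _ := ruleNet.Mask.Size()
			if ruleNet.Contains(machineIP) && ruleBits <= machineBits {
				return true, nil
			}
		}
	}
	return false, nil
}

func main() {
	rules := []ingressRule{{Protocol: "tcp", FromPort: 30000, ToPort: 32767, CIDRs: []string{"10.0.0.0/8"}}}
	fmt.Println(allowsTCPInbound(rules, 30000, 32767, "10.0.0.0/16"))
	fmt.Println(allowsTCPInbound(rules, 30000, 32767, "not-a-cidr"))
}
```

Returning the error lets the investigation report "could not evaluate" separately from "not allowed", so a bad machine CIDR does not surface as a misleading security group warning.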

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go` around
lines 256 - 301, In the sgAllowsTCPInbound function, the error returned from
net.ParseCIDR is being silently discarded with underscore on the line where
machineCIDR is parsed. If machineCIDR is provided but malformed, machineNet will
be nil and the CIDR-based matching logic will silently fail, producing incorrect
results. Capture and explicitly handle the error from net.ParseCIDR—either by
logging it, returning an error from the function, or making the fallback
behavior intentional in a comment, so that callers understand when CIDR
validation fails and the function falls back to only matching against 0.0.0.0/0.

Source: Coding guidelines

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7fd85dbd-2075-481a-800c-59ea29d6e44c

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcd8eb and 32f388f.

📒 Files selected for processing (5)
  • pkg/aws/aws.go
  • pkg/aws/mock/aws.go
  • pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn.go
  • pkg/investigations/consoleerrorbudgetburn/consoleerrorbudgetburn_test.go
  • pkg/ocm/mock/ocmmock.go

openshift-ci Bot commented Jun 19, 2026

Contributor

@MateSaary: all tests passed!

@codecov-commenter


Codecov Report

❌ Patch coverage is 60.67961% with 81 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.52%. Comparing base (0bcd8eb) to head (32f388f).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...s/consoleerrorbudgetburn/consoleerrorbudgetburn.go | 68.11% | 39 Missing and 5 partials ⚠️ |
| pkg/aws/aws.go | 0.00% | 35 Missing ⚠️ |
| pkg/ocm/mock/ocmmock.go | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #849      +/-   ##
==========================================
+ Coverage   43.09%   43.52%   +0.43%     
==========================================
  Files          71       71              
  Lines        8254     8450     +196     
==========================================
+ Hits         3557     3678     +121     
- Misses       4484     4554      +70     
- Partials      213      218       +5     
| Files with missing lines | Coverage Δ | |
| --- | --- | --- |
| pkg/aws/mock/aws.go | 42.72% <100.00%> (+4.10%) | ⬆️ |
| pkg/ocm/mock/ocmmock.go | 34.24% <0.00%> (ø) | |
| pkg/aws/aws.go | 3.87% <0.00%> (-0.27%) | ⬇️ |
| ...s/consoleerrorbudgetburn/consoleerrorbudgetburn.go | 70.14% <68.11%> (-0.49%) | ⬇️ |