ROSAENG-14105: Add investigation for HCPNodepoolUpgradeDelay by rolandmkunkel · Pull Request #847 · openshift/configuration-anomaly-detection

rolandmkunkel · 2026-06-18T12:31:57Z

What type of PR is this?

feature

What this PR does / Why we need it?

Adds a new investigation pdbblockingnodedrain for the HCPNodepoolUpgradeDelay alert (ROSAENG-14105). When a node drain stalls during an HCP cluster upgrade, this investigation automatically identifies which PodDisruptionBudgets are blocking eviction, classifies them as customer-managed or platform-managed, checks node health, and assesses whether scaling the workload would unblock the drain.

Investigation steps:

Draining nodes — identifies nodes with the unschedulable taint stalled longer than 10 minutes
Blocking PDBs — enumerates PDBs with disruptionsAllowed=0 whose selected pods are on stalled nodes, resolves owner workloads (Deployment, StatefulSet, etc.), and classifies each as customer or platform-managed
Node health — checks non-draining nodes for NotReady or resource pressure conditions that would prevent rescheduling
Scaling assessment — for each blocking PDB, evaluates whether scaling up the workload would unblock the drain without requiring a PDB change
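
As a rough illustration of the first step above, the sketch below classifies draining nodes as stalled once they pass the 10-minute threshold. It is not the code from this PR: the `node` struct, `drainingSince`, `stalledNodes`, and `demo` are hypothetical stand-ins for the real Kubernetes types, and the fallback to node creation time follows the behaviour described in the review walkthrough.

```go
package main

import (
	"fmt"
	"time"
)

// drainStallThreshold mirrors the 10-minute threshold named in the PR description.
const drainStallThreshold = 10 * time.Minute

// node is a hypothetical, simplified stand-in for a Kubernetes Node object.
type node struct {
	Name         string
	Created      time.Time
	Draining     bool       // true when the unschedulable taint is present
	TaintAddedAt *time.Time // nil when the taint carries no timestamp
}

// drainingSince returns the best-known start of the drain: the taint
// timestamp when present, otherwise the node's creation time.
func drainingSince(n node) time.Time {
	if n.TaintAddedAt != nil {
		return *n.TaintAddedAt
	}
	return n.Created
}

// stalledNodes returns the names of draining nodes whose drain has lasted
// longer than drainStallThreshold as of now.
func stalledNodes(nodes []node, now time.Time) []string {
	var stalled []string
	for _, n := range nodes {
		if !n.Draining {
			continue
		}
		if now.Sub(drainingSince(n)) > drainStallThreshold {
			stalled = append(stalled, n.Name)
		}
	}
	return stalled
}

// demo builds a few made-up nodes and returns the ones considered stalled.
func demo() []string {
	now := time.Date(2026, 6, 18, 12, 0, 0, 0, time.UTC)
	created := now.Add(-48 * time.Hour)
	old := now.Add(-25 * time.Minute)
	recent := now.Add(-3 * time.Minute)
	nodes := []node{
		{Name: "worker-a", Created: created, Draining: true, TaintAddedAt: &old},
		{Name: "worker-b", Created: created, Draining: true, TaintAddedAt: &recent},
		{Name: "worker-c", Created: created, Draining: true}, // no taint timestamp
		{Name: "worker-d", Created: created},                 // not draining
	}
	return stalledNodes(nodes, now)
}

func main() {
	fmt.Println(demo())
}
```

Passing `now` explicitly rather than calling `time.Now()` inside the helper keeps the classification deterministic in unit tests.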

Special notes for your reviewer

Eviction event correlation (ticket step 2) intentionally omitted. These events live on the management cluster. Accessing them requires RHOBS integration (potentially blocked by ROSAENG-15920) and MC-to-RHOBS log forwarding on staging (not yet available). The blocking PDB check already provides definitive detection via disruptionsAllowed=0 + pod-to-node correlation, making event correlation supplementary.
API server audit log check (also ticket step 2) covered by the same reasoning. Audit logs showing rejected eviction requests would live on the management cluster and require the same RHOBS integration.
Node health check added (not in original ticket). During testing, I identified that scaling recommendations could be misleading if non-draining nodes are unhealthy. The investigation now checks node health and qualifies the scaling assessment accordingly.
Scheduling constraints not checked (future improvement). Pod scheduling constraints (anti-affinity, node selectors, resource limits) can prevent evicted pods from rescheduling, causing disruptionsAllowed to stay at 0. The investigation catches this state but does not diagnose why rescheduling failed; the scaling assessment messaging hints at possible causes to guide SREs.
Should also run for UpgradeNodeDrainFailedSRE alert. This investigation is equally applicable to that alert. Adding multi-alert support per investigation is feasible once ROSAENG-59444 is done.

Test Coverage

Guidelines for CAD investigations

New investigations should be accompanied by unit tests and/or step-by-step manual tests in the investigation README.
Actioning investigations should be locally tested in staging, and E2E testing is desired. See README for more info on investigation graduation process.

Test coverage checks

Added tests
Created jira card to add unit test
This PR may not need unit tests

Pre-checks (if applicable)

Ran unit tests locally
Validated the changes in a cluster
Included documentation changes with PR

Summary by CodeRabbit

New Features
- Added a new investigation to detect stalled node drains during upgrades when blocking PodDisruptionBudgets are present, including guidance on potential scaling-based mitigation.
- Registered the investigation for discoverability and extended incident generation to recognize the related upgrade-delay alert.
Documentation
- Added a runbook describing the investigation workflow, escalation behavior, and local end-to-end testing steps.
Tests
- Added comprehensive unit tests covering draining detection, PDB correlation/classification, node health handling, and scaling assessment logic.

openshift-ci-robot · 2026-06-18T12:32:01Z

@rolandmkunkel: This pull request references ROSAENG-14105 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.


coderabbitai · 2026-06-18T12:32:11Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1f94a5fd-0c40-4a45-97a5-d94e959d3b4a

📥 Commits

Reviewing files that changed from the base of the PR and between 6a990f9 and 04b6898.

📒 Files selected for processing (7)

pkg/investigations/pdbblockingnodedrain/README.md
pkg/investigations/pdbblockingnodedrain/metadata.yaml
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go
pkg/investigations/pdbblockingnodedrain/testing/README.md
pkg/investigations/registry.go
test/generate_incident.sh

✅ Files skipped from review due to trivial changes (3)

pkg/investigations/pdbblockingnodedrain/README.md
pkg/investigations/pdbblockingnodedrain/metadata.yaml
pkg/investigations/pdbblockingnodedrain/testing/README.md

🚧 Files skipped from review as they are similar to previous changes (4)

test/generate_incident.sh
pkg/investigations/registry.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go

Walkthrough

Adds a new pdbblockingnodedrain investigation that detects stalled HCP node drains caused by PodDisruptionBudgets with disruptionsAllowed=0. The implementation includes RBAC metadata, core logic for draining-node detection/node health/PDB matching/scaling assessment, a unit test suite, investigation registry wiring, and full documentation.

Changes

PDB Blocking Node Drain Investigation

Layer / File(s)	Summary
Investigation metadata and contract `pkg/investigations/pdbblockingnodedrain/metadata.yaml`, `pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`	Defines investigation name, read-only RBAC permissions for nodes/pods/PDBs/workload resources, and exported contract methods (`Name`, `AlertTitle`, `Description`, `IsExperimental`). Constants define the 10-minute drain stall threshold and internal data structures carry stalled-node and PDB details through the workflow.
Investigation orchestration and draining-node detection `pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`	Implements `Run` orchestration with cluster-access error handling, then detects nodes with the `node.kubernetes.io/unschedulable` NoSchedule taint and classifies them as stalled (>10 minutes) vs recently draining based on taint `TimeAdded`, falling back to node creation time when taint time is unavailable.
Node health evaluation and blocking PDB analysis `pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`	Evaluates non-draining nodes for health issues (`NodeReady` and Disk/Memory/PID pressure), lists PDBs with `disruptionsAllowed=0` and matches them against pods scheduled on stalled nodes via label selectors, resolves top-level workload owners (including ReplicaSet→Deployment chains), classifies PDBs as platform vs customer-managed by namespace prefix rules, and assesses scaling viability by interpreting `minAvailable`/`maxUnavailable` constraints against `expectedPods` vs `currentHealthy` replica counts.
Comprehensive unit test suite `pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go`	Provides 40+ tests covering metadata, taint detection, drain stall classification (mixed/multiple scenarios), node health evaluation, PDB matching logic, workload owner resolution with ReplicaSet/Deployment/StatefulSet variants, namespace classification, scaling decision logic with integer and percentage constraint representations, and note generation for various PDB/health states.
Registry and test infrastructure integration `pkg/investigations/registry.go`, `test/generate_incident.sh`	Registers `&pdbblockingnodedrain.Investigation{}` in `availableInvestigations` list and adds `HCPNodepoolUpgradeDelay` to the test incident generator's `alert_mapping` associative array.
Documentation and testing guidance `pkg/investigations/pdbblockingnodedrain/README.md`, `pkg/investigations/pdbblockingnodedrain/testing/README.md`	Documents the investigation flow, alert silencing vs escalation behavior, design decision to omit eviction-event correlation, and single-phase resource-build architecture; provides comprehensive manual end-to-end testing guide with drain-stall scenario setup, output verification steps, scaling-based remediation validation, and platform/customer PDB classification test variants.
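
The namespace prefix rule mentioned in the table can be sketched as follows. This is a minimal illustration, assuming the rule set named in the review thread (`openshift-` prefix, `kube-` prefix, or exactly `openshift`); the function names and the `customer` tag are hypothetical, and only the `[platform]` tag appears in the PR's own documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// isPlatformNamespace reports whether a namespace is treated as
// platform-managed under the assumed prefix rules.
func isPlatformNamespace(namespace string) bool {
	return namespace == "openshift" ||
		strings.HasPrefix(namespace, "openshift-") ||
		strings.HasPrefix(namespace, "kube-")
}

// classify returns the tag used when reporting a blocking PDB.
// "customer" is an assumed label for everything that is not platform-managed.
func classify(namespace string) string {
	if isPlatformNamespace(namespace) {
		return "platform"
	}
	return "customer"
}

func main() {
	for _, ns := range []string{"openshift-monitoring", "kube-system", "default", "my-app"} {
		fmt.Printf("%s -> [%s]\n", ns, classify(ns))
	}
}
```

Note that `default` falls through to the customer side here, which matches the reviewer's point that customer workloads commonly run in that namespace.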

Sequence Diagram(s)

sequenceDiagram
  participant cadctl
  participant Run as Investigation.Run
  participant K8sAPI as Service Cluster K8s API
  participant NoteWriter

  cadctl->>Run: Run(ResourceBuilder)
  Run->>K8sAPI: List Nodes
  K8sAPI-->>Run: nodeList
  Run->>Run: checkDrainingNodes → stalledNodes
  alt No stalled nodes
    Run->>NoteWriter: append success note
    Run-->>cadctl: escalate/manual review
  else Stalled nodes found
    Run->>K8sAPI: List Nodes (health check)
    K8sAPI-->>Run: unhealthyNodeCount
    Run->>K8sAPI: List PodDisruptionBudgets
    Run->>K8sAPI: List Pods
    K8sAPI-->>Run: PDBs + Pods
    Run->>Run: match PDB selectors → pods on stalled nodes
    Run->>K8sAPI: Get ReplicaSet (owner resolution)
    K8sAPI-->>Run: ownerKind/ownerName
    Run->>Run: assessScaling per blocking PDB
    Run->>NoteWriter: append blocking PDB warnings + scaling notes
    Run-->>cadctl: manual review escalation
  end
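
The "match PDB selectors → pods on stalled nodes" step in the diagram can be approximated as below. This is a sketch under simplifying assumptions, not the PR's implementation: the `pod` and `pdb` structs are hypothetical, only `matchLabels`-style selectors are handled (no `matchExpressions`), and an empty selector is conservatively treated as matching nothing, which a real implementation would need to reconcile with the policy API's semantics.

```go
package main

import "fmt"

// pod is a hypothetical, simplified view of a Pod.
type pod struct {
	Name      string
	Namespace string
	NodeName  string
	Labels    map[string]string
}

// pdb is a hypothetical, simplified view of a PodDisruptionBudget.
type pdb struct {
	Name               string
	Namespace          string
	MatchLabels        map[string]string
	DisruptionsAllowed int32
}

// selects reports whether the PDB's label selector matches the pod.
// Assumption: an empty selector matches nothing in this sketch.
func selects(b pdb, p pod) bool {
	if b.Namespace != p.Namespace || len(b.MatchLabels) == 0 {
		return false
	}
	for key, want := range b.MatchLabels {
		got, ok := p.Labels[key]
		if !ok || got != want {
			return false
		}
	}
	return true
}

// blockingPDBs returns "namespace/name" for every PDB that allows zero
// disruptions and selects at least one pod on a stalled node.
func blockingPDBs(pdbs []pdb, pods []pod, stalled map[string]bool) []string {
	var out []string
	for _, b := range pdbs {
		if b.DisruptionsAllowed != 0 {
			continue
		}
		for _, p := range pods {
			if stalled[p.NodeName] && selects(b, p) {
				out = append(out, b.Namespace+"/"+b.Name)
				break // one matching pod is enough to mark the PDB as blocking
			}
		}
	}
	return out
}

func main() {
	pods := []pod{
		{Name: "web-1", Namespace: "shop", NodeName: "worker-a", Labels: map[string]string{"app": "web"}},
		{Name: "db-1", Namespace: "shop", NodeName: "worker-b", Labels: map[string]string{"app": "db"}},
	}
	pdbs := []pdb{
		{Name: "web-pdb", Namespace: "shop", MatchLabels: map[string]string{"app": "web"}},
		{Name: "db-pdb", Namespace: "shop", MatchLabels: map[string]string{"app": "db"}},
	}
	fmt.Println(blockingPDBs(pdbs, pods, map[string]bool{"worker-a": true}))
}
```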

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

openshift/configuration-anomaly-detection#842: Also modifies test/generate_incident.sh to add a new entry to the alert_mapping associative array, the same pattern used here for HCPNodepoolUpgradeDelay.

Suggested labels

lgtm

Suggested reviewers

Nikokolas3270
zmird-r
MateSaary

🚥 Pre-merge checks | ✅ 4 | ❌ 11

❌ Failed checks (1 warning, 10 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 21.05% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Stable And Deterministic Test Names	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Test Structure And Quality	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Microshift Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Single Node Openshift (Sno) Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Topology-Aware Scheduling Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ote Binary Stdout Contract	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Ipv6 And Disconnected Network Test Compatibility	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Weak-Crypto	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
Container-Privileges	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.
No-Sensitive-Data-In-Logs	❓ Inconclusive	Repository clone failed, so this custom check could not run with code access.	Retry the review run. If this persists, inspect pre-merge custom-check logs for infrastructure or agent runtime failures.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a new investigation for the HCPNodepoolUpgradeDelay alert, which is the primary objective of the PR.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


openshift-ci · 2026-06-18T12:34:22Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rolandmkunkel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rolandmkunkel]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (2)

pkg/investigations/pdbblockingnodedrain/testing/README.md (2)

182-183: ⚡ Quick win

Complete the platform-managed PDB testing variant with step-by-step instructions.

The platform-managed PDB variant (lines 182–183) provides only a high-level description without detailed steps. To match the structure and clarity of the well-configured PDB variant (lines 161–180), add full instructions on how to deploy the workload in an openshift-* namespace, create the blocking PDB, cordon a node, and verify the investigation classifies it as [platform].

📝 Proposed expansion

 ### Platform-managed PDB
 
-Deploy a workload in an `openshift-*` namespace to test platform classification. The investigation should report it as `[platform]` with guidance to escalate to engineering.
+Deploy a workload in an `openshift-*` namespace to test platform classification. The investigation should report it as `[platform]` with guidance to escalate to engineering.
+
+```bash
+oc new-project openshift-test-pdb
+```
+
+```bash
+oc create deployment platform-blocker --image=registry.access.redhat.com/ubi9/ubi-minimal:latest --replicas=2 -n openshift-test-pdb -- sleep infinity
+```
+
+Create a PDB in the `openshift-*` namespace:
+
+```bash
+cat <<'EOF' | oc create -f -
+apiVersion: policy/v1
+kind: PodDisruptionBudget
+metadata:
+  name: platform-blocker-pdb
+  namespace: openshift-test-pdb
+spec:
+  minAvailable: 2
+  selector:
+    matchLabels:
+      app: platform-blocker
+EOF
+```
+
+Cordon a node and wait 10+ minutes, then generate and run the investigation (steps 3–5 above). The PDB should be reported as `[platform]` with guidance to escalate.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` around lines 182 -
183, The Platform-managed PDB section header at line 182 lacks the detailed
step-by-step instructions that are present in the well-configured PDB variant
(lines 161-180). Expand the Platform-managed PDB section by adding sequential
bash commands and instructions that cover: creating a new project in an
openshift-* namespace, deploying a test workload with multiple replicas,
creating a PodDisruptionBudget resource with minAvailable constraint in that
namespace, cordoning a node and waiting for the investigation to run, and
verifying that the output correctly classifies the blocking PDB as [platform]
with escalation guidance. Structure this expansion to match the format and
clarity of the well-configured PDB variant section.

126-126: ⚡ Quick win

Add language specifier to the example output code block.

The fenced code block at line 126 (Example output) is missing a language tag. Markdown linters expect all code blocks to specify a language for syntax highlighting.

📝 Proposed fix

 Example output:
-```
+```text
 ⚠️ Draining Nodes: 1/1 draining node(s) stalled (>10m0s):

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` at line 126, The
fenced code block in the Example output section (at line 126) is missing a
language specifier tag, which causes markdown linters to fail. Add the language
specifier "text" to the opening backticks of the code block containing the
example output that starts with "⚠️ Draining Nodes". Change the opening ``` to
```text to properly declare the code block language for syntax highlighting.

Source: Linters/SAST tools

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go`:
- Around line 141-145: The TestIsExperimental function has an inverted assertion
that causes the test to fail despite the Investigation.IsExperimental()
implementation being correct. In the TestIsExperimental function, remove the
logical NOT operator (!) from the condition `if !inv.IsExperimental()` so the
test properly asserts that the IsExperimental method returns false as expected.
Update the error message accordingly to reflect the correct expected behavior.

In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`:
- Around line 425-428: The return statement in the namespace classification
logic incorrectly includes `namespace == "default"` as a platform-managed
namespace check, which causes customer-owned resources to be misclassified.
Remove the condition `namespace == "default"` from the boolean expression so
that only namespaces starting with "openshift-", "kube-", or exactly matching
"openshift" are treated as platform-managed.
- Around line 164-179: The filter in the checkNodeHealth function only excludes
stalled nodes from the unhealthy node analysis but should also exclude all
actively draining nodes, not just stalled ones. Create an additional map for all
draining nodes (similar to how stalledNodeNames is created from stalledNodes),
then update the condition in the for loop iterating through nodeList.Items to
skip both stalled nodes and actively draining nodes so that only truly unhealthy
nodes are included in the findings.
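
One way to read the checkNodeHealth suggestion above is sketched here. The `nodeStatus` struct and both helper names are hypothetical; the point is only that the exclusion set is built from every draining node, stalled or not, before unhealthy nodes are collected.

```go
package main

import "fmt"

// nodeStatus is a hypothetical, simplified view of a node's health conditions.
type nodeStatus struct {
	Name     string
	Ready    bool
	Pressure []string // e.g. "DiskPressure", "MemoryPressure", "PIDPressure"
}

// drainingSet merges stalled and recently draining node names into one lookup set.
func drainingSet(stalled, recentlyDraining []string) map[string]bool {
	set := make(map[string]bool, len(stalled)+len(recentlyDraining))
	for _, name := range stalled {
		set[name] = true
	}
	for _, name := range recentlyDraining {
		set[name] = true
	}
	return set
}

// unhealthyNodes returns non-draining nodes that are NotReady or report
// any resource pressure condition.
func unhealthyNodes(nodes []nodeStatus, draining map[string]bool) []string {
	var unhealthy []string
	for _, n := range nodes {
		if draining[n.Name] {
			continue // draining nodes are not rescheduling targets
		}
		if !n.Ready || len(n.Pressure) > 0 {
			unhealthy = append(unhealthy, n.Name)
		}
	}
	return unhealthy
}

func main() {
	nodes := []nodeStatus{
		{Name: "worker-a", Ready: false},                                  // stalled drain
		{Name: "worker-b", Ready: true, Pressure: []string{"DiskPressure"}}, // recently draining
		{Name: "worker-c", Ready: true, Pressure: []string{"MemoryPressure"}},
		{Name: "worker-d", Ready: true},
	}
	draining := drainingSet([]string{"worker-a"}, []string{"worker-b"})
	fmt.Println(unhealthyNodes(nodes, draining))
}
```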


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 956e0f47-8fe4-46ca-9582-59e6facea772

📥 Commits

Reviewing files that changed from the base of the PR and between 0bcd8eb and 632631f.

📒 Files selected for processing (7)

pkg/investigations/pdbblockingnodedrain/README.md
pkg/investigations/pdbblockingnodedrain/metadata.yaml
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go
pkg/investigations/pdbblockingnodedrain/testing/README.md
pkg/investigations/registry.go
test/generate_incident.sh

coderabbitai · 2026-06-18T12:54:18Z

+func TestIsExperimental(t *testing.T) {
+	inv := &Investigation{}
+	if !inv.IsExperimental() {
+		t.Error("expected IsExperimental to be true")
+	}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix inverted IsExperimental assertion.

Line 143 asserts the opposite behavior. Investigation.IsExperimental() returns false, so this test fails while implementation is correct.

Proposed fix

 func TestIsExperimental(t *testing.T) {
 	inv := &Investigation{}
-	if !inv.IsExperimental() {
-		t.Error("expected IsExperimental to be true")
+	if inv.IsExperimental() {
+		t.Error("expected IsExperimental to be false")
 	}
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go` around lines 141 - 145, The TestIsExperimental function has an inverted assertion that causes the test to fail despite the Investigation.IsExperimental() implementation being correct. In the TestIsExperimental function, remove the logical NOT operator (!) from the condition `if !inv.IsExperimental()` so the test properly asserts that the IsExperimental method returns false as expected. Update the error message accordingly to reflect the correct expected behavior.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

pkg/investigations/pdbblockingnodedrain/testing/README.md (1)
126-126: ⚡ Quick win

Add language specifier to fenced code block.

Line 126 starts a code fence without a language specifier. Since the content is example output text, add a language identifier for proper Markdown rendering.
📝 Proposed fix
-Example output:
-```
+Example output:
+```text
 ⚠️ Draining Nodes: 1/1 draining node(s) stalled (>10m0s):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` at line 126, The
fenced code block in the README.md file at line 126 lacks a language specifier
after the opening triple backticks. Add "text" as the language identifier to the
opening code fence (change ``` to ```text) to ensure proper Markdown rendering
of the example output content that follows.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`:
- Around line 116-135: The issue is in the stalledNode construction where
unschedulableSince is initialized: when taint.TimeAdded is nil, using now as the
fallback causes DrainDuration to be zero, which prevents the node from ever
exceeding drainStallThreshold. Instead of defaulting to now when TimeAdded is
nil, you should either skip processing that node entirely or use an alternative
time reference such as the node's creation time to accurately calculate how long
the node has actually been draining. This ensures that nodes with missing
TimeAdded values don't produce false negatives in stall detection.
- Around line 398-404: The current logic in the maxUnavailable check
unconditionally marks scaling as not helpful for any non-zero maxUnavailable
value, but percentage-based values (ending with "%") actually allow more
disruptions when scaling up. Modify the logic to differentiate between absolute
values and percentage values: for absolute values like "1" or "2", keep
returning canScale=false since scaling doesn't change the fixed disruption
limit, but for percentage values, calculate unhealthy pods and return
canScale=true only if all pods are healthy (unhealthy == 0), since scaling can
increase the computed disruption allowance; otherwise return canScale=false for
percentage values with unhealthy pods. Check if bp.MaxUnavailable contains "%"
to distinguish between the two cases.
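
The maxUnavailable branch suggested above could look roughly like this. It is one possible reading of the review comment, not the code in this PR: `canScaleMaxUnavailable` and its parameters are hypothetical names, and the minAvailable branch is deliberately left out because the thread does not spell out its rules.

```go
package main

import (
	"fmt"
	"strings"
)

// canScaleMaxUnavailable reports whether scaling the workload up could raise
// the number of allowed disruptions for a PDB that uses maxUnavailable.
func canScaleMaxUnavailable(maxUnavailable string, expectedPods, currentHealthy int) bool {
	value := strings.TrimSpace(maxUnavailable)

	// A zero budget never allows disruptions, whatever the replica count.
	if value == "" || value == "0" || value == "0%" {
		return false
	}

	// Absolute values are a fixed limit, so extra replicas do not change them.
	if !strings.HasSuffix(value, "%") {
		return false
	}

	// Percentage values grow with the replica count, but that only helps
	// when every expected pod is already healthy.
	return currentHealthy >= expectedPods
}

func main() {
	fmt.Println(canScaleMaxUnavailable("1", 3, 3))
	fmt.Println(canScaleMaxUnavailable("25%", 3, 3))
	fmt.Println(canScaleMaxUnavailable("25%", 3, 2))
}
```

Using `currentHealthy >= expectedPods` rather than a strict equality check tolerates a transient surplus of healthy pods during a rollout; whether that is desirable is a judgment call for the author.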


ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 3436a9bb-4c84-4700-8ecc-989475c7c8fd

📥 Commits

Reviewing files that changed from the base of the PR and between 632631f and 6a990f9.

📒 Files selected for processing (7)

pkg/investigations/pdbblockingnodedrain/README.md
pkg/investigations/pdbblockingnodedrain/metadata.yaml
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go
pkg/investigations/pdbblockingnodedrain/testing/README.md
pkg/investigations/registry.go
test/generate_incident.sh

✅ Files skipped from review due to trivial changes (1)

pkg/investigations/pdbblockingnodedrain/README.md

🚧 Files skipped from review as they are similar to previous changes (4)

pkg/investigations/pdbblockingnodedrain/metadata.yaml
test/generate_incident.sh
pkg/investigations/registry.go
pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go

codecov-commenter · 2026-06-18T13:38:52Z

Codecov Report

❌ Patch coverage is 78.86179% with 52 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.12%. Comparing base (0bcd8eb) to head (04b6898).

Files with missing lines	Patch %	Lines
...tions/pdbblockingnodedrain/pdbblockingnodedrain.go	78.86%	46 Missing and 6 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #847      +/-   ##
==========================================
+ Coverage   43.09%   44.12%   +1.03%     
==========================================
  Files          71       72       +1     
  Lines        8254     8500     +246     
==========================================
+ Hits         3557     3751     +194     
- Misses       4484     4530      +46     
- Partials      213      219       +6

Files with missing lines	Coverage Δ
pkg/investigations/registry.go	`0.00% <ø> (ø)`
...tions/pdbblockingnodedrain/pdbblockingnodedrain.go	`78.86% <78.86%> (ø)`


openshift-ci · 2026-06-18T13:54:18Z

@rolandmkunkel: all tests passed!


openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 18, 2026

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2026

openshift-ci Bot requested review from AlexSmithGH and Nikokolas3270 June 18, 2026 12:34

rolandmkunkel force-pushed the ROSAENG-14105-investigate-pdb-blocking-drain branch from ea4a828 to 632631f Compare June 18, 2026 12:34

openshift-ci Bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 18, 2026

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

rolandmkunkel force-pushed the ROSAENG-14105-investigate-pdb-blocking-drain branch from 632631f to 6a990f9 Compare June 18, 2026 13:28

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Comment thread pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go Outdated

Comment thread pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go

ROSAENG-14105: Add investigation for HCPNodepoolUpgradeDelay

04b6898

rolandmkunkel force-pushed the ROSAENG-14105-investigate-pdb-blocking-drain branch from 6a990f9 to 04b6898 Compare June 18, 2026 13:43
