ROSAENG-14105: Add investigation for HCPNodepoolUpgradeDelay#847
ROSAENG-14105: Add investigation for HCPNodepoolUpgradeDelay#847rolandmkunkel wants to merge 1 commit into
Conversation
|
@rolandmkunkel: This pull request references ROSAENG-14105 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository: openshift/coderabbit/.coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (7)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (4)
WalkthroughAdds a new ChangesPDB Blocking Node Drain Investigation
Sequence Diagram(s)sequenceDiagram
participant cadctl
participant Run as Investigation.Run
participant K8sAPI as Service Cluster K8s API
participant NoteWriter
cadctl->>Run: Run(ResourceBuilder)
Run->>K8sAPI: List Nodes
K8sAPI-->>Run: nodeList
Run->>Run: checkDrainingNodes → stalledNodes
alt No stalled nodes
Run->>NoteWriter: append success note
Run-->>cadctl: escalate/manual review
else Stalled nodes found
Run->>K8sAPI: List Nodes (health check)
K8sAPI-->>Run: unhealthyNodeCount
Run->>K8sAPI: List PodDisruptionBudgets
Run->>K8sAPI: List Pods
K8sAPI-->>Run: PDBs + Pods
Run->>Run: match PDB selectors → pods on stalled nodes
Run->>K8sAPI: Get ReplicaSet (owner resolution)
K8sAPI-->>Run: ownerKind/ownerName
Run->>Run: assessScaling per blocking PDB
Run->>NoteWriter: append blocking PDB warnings + scaling notes
Run-->>cadctl: manual review escalation
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 11❌ Failed checks (1 warning, 10 inconclusive)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning Tools execution failed with the following error: Failed to run tools: 13 INTERNAL: Received RST_STREAM with code 2 (Internal server error) Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: rolandmkunkel The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
ea4a828 to
632631f
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (2)
pkg/investigations/pdbblockingnodedrain/testing/README.md (2)
182-183: ⚡ Quick winComplete the platform-managed PDB testing variant with step-by-step instructions.
The platform-managed PDB variant (lines 182–183) provides only a high-level description without detailed steps. To match the structure and clarity of the well-configured PDB variant (lines 161–180), add full instructions on how to deploy the workload in an
openshift-*namespace, create the blocking PDB, cordon a node, and verify the investigation classifies it as[platform].📝 Proposed expansion
### Platform-managed PDB -Deploy a workload in an `openshift-*` namespace to test platform classification. The investigation should report it as `[platform]` with guidance to escalate to engineering. +Deploy a workload in an `openshift-*` namespace to test platform classification. The investigation should report it as `[platform]` with guidance to escalate to engineering. + +```bash +oc new-project openshift-test-pdb +``` + +```bash +oc create deployment platform-blocker --image=registry.access.redhat.com/ubi9/ubi-minimal:latest --replicas=2 -n openshift-test-pdb -- sleep infinity +``` + +Create a PDB in the `openshift-*` namespace: + +```bash +cat <<'EOF' | oc create -f - +apiVersion: policy/v1 +kind: PodDisruptionBudget +metadata: + name: platform-blocker-pdb + namespace: openshift-test-pdb +spec: + minAvailable: 2 + selector: + matchLabels: + app: platform-blocker +EOF +``` + +Cordon a node and wait 10+ minutes, then generate and run the investigation (steps 3–5 above). The PDB should be reported as `[platform]` with guidance to escalate.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` around lines 182 - 183, The Platform-managed PDB section header at line 182 lacks the detailed step-by-step instructions that are present in the well-configured PDB variant (lines 161-180). Expand the Platform-managed PDB section by adding sequential bash commands and instructions that cover: creating a new project in an openshift-* namespace, deploying a test workload with multiple replicas, creating a PodDisruptionBudget resource with minAvailable constraint in that namespace, cordoning a node and waiting for the investigation to run, and verifying that the output correctly classifies the blocking PDB as [platform] with escalation guidance. Structure this expansion to match the format and clarity of the well-configured PDB variant section.
126-126: ⚡ Quick winAdd language specifier to the example output code block.
The fenced code block at line 126 (Example output) is missing a language tag. Markdown linters expect all code blocks to specify a language for syntax highlighting.
📝 Proposed fix
Example output: -``` +```text ⚠️ Draining Nodes: 1/1 draining node(s) stalled (>10m0s):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` at line 126, The fenced code block in the Example output section (at line 126) is missing a language specifier tag, which causes markdown linters to fail. Add the language specifier "text" to the opening backticks of the code block containing the example output that starts with "⚠️ Draining Nodes". Change the opening ``` to ```text to properly declare the code block language for syntax highlighting.Source: Linters/SAST tools
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go`:
- Around line 141-145: The TestIsExperimental function has an inverted assertion
that causes the test to fail despite the Investigation.IsExperimental()
implementation being correct. In the TestIsExperimental function, remove the
logical NOT operator (!) from the condition `if !inv.IsExperimental()` so the
test properly asserts that the IsExperimental method returns false as expected.
Update the error message accordingly to reflect the correct expected behavior.
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`:
- Around line 425-428: The return statement in the namespace classification
logic incorrectly includes `namespace == "default"` as a platform-managed
namespace check, which causes customer-owned resources to be misclassified.
Remove the condition `namespace == "default"` from the boolean expression so
that only namespaces starting with "openshift-", "kube-", or exactly matching
"openshift" are treated as platform-managed.
- Around line 164-179: The filter in the checkNodeHealth function only excludes
stalled nodes from the unhealthy node analysis but should also exclude all
actively draining nodes, not just stalled ones. Create an additional map for all
draining nodes (similar to how stalledNodeNames is created from stalledNodes),
then update the condition in the for loop iterating through nodeList.Items to
skip both stalled nodes and actively draining nodes so that only truly unhealthy
nodes are included in the findings.
---
Nitpick comments:
In `@pkg/investigations/pdbblockingnodedrain/testing/README.md`:
- Around line 182-183: The Platform-managed PDB section header at line 182 lacks
the detailed step-by-step instructions that are present in the well-configured
PDB variant (lines 161-180). Expand the Platform-managed PDB section by adding
sequential bash commands and instructions that cover: creating a new project in
an openshift-* namespace, deploying a test workload with multiple replicas,
creating a PodDisruptionBudget resource with minAvailable constraint in that
namespace, cordoning a node and waiting for the investigation to run, and
verifying that the output correctly classifies the blocking PDB as [platform]
with escalation guidance. Structure this expansion to match the format and
clarity of the well-configured PDB variant section.
- Line 126: The fenced code block in the Example output section (at line 126) is
missing a language specifier tag, which causes markdown linters to fail. Add the
language specifier "text" to the opening backticks of the code block containing
the example output that starts with "⚠️ Draining Nodes". Change the opening ```
to ```text to properly declare the code block language for syntax highlighting.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 956e0f47-8fe4-46ca-9582-59e6facea772
📒 Files selected for processing (7)
pkg/investigations/pdbblockingnodedrain/README.mdpkg/investigations/pdbblockingnodedrain/metadata.yamlpkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.gopkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.gopkg/investigations/pdbblockingnodedrain/testing/README.mdpkg/investigations/registry.gotest/generate_incident.sh
| func TestIsExperimental(t *testing.T) { | ||
| inv := &Investigation{} | ||
| if !inv.IsExperimental() { | ||
| t.Error("expected IsExperimental to be true") | ||
| } |
There was a problem hiding this comment.
Fix inverted IsExperimental assertion.
Line 143 asserts the opposite behavior. Investigation.IsExperimental() returns false, so this test fails while implementation is correct.
Proposed fix
func TestIsExperimental(t *testing.T) {
inv := &Investigation{}
- if !inv.IsExperimental() {
- t.Error("expected IsExperimental to be true")
+ if inv.IsExperimental() {
+ t.Error("expected IsExperimental to be false")
}
}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go` around
lines 141 - 145, The TestIsExperimental function has an inverted assertion that
causes the test to fail despite the Investigation.IsExperimental()
implementation being correct. In the TestIsExperimental function, remove the
logical NOT operator (!) from the condition `if !inv.IsExperimental()` so the
test properly asserts that the IsExperimental method returns false as expected.
Update the error message accordingly to reflect the correct expected behavior.
632631f to
6a990f9
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
pkg/investigations/pdbblockingnodedrain/testing/README.md (1)
126-126: ⚡ Quick winAdd language specifier to fenced code block.
Line 126 starts a code fence without a language specifier. Since the content is example output text, add a language identifier for proper Markdown rendering.
📝 Proposed fix
-Example output: -``` +Example output: +```text ⚠️ Draining Nodes: 1/1 draining node(s) stalled (>10m0s):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/investigations/pdbblockingnodedrain/testing/README.md` at line 126, The fenced code block in the README.md file at line 126 lacks a language specifier after the opening triple backticks. Add "text" as the language identifier to the opening code fence (change ``` to ```text) to ensure proper Markdown rendering of the example output content that follows.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.go`:
- Around line 116-135: The issue is in the stalledNode construction where
unschedulableSince is initialized: when taint.TimeAdded is nil, using now as the
fallback causes DrainDuration to be zero, which prevents the node from ever
exceeding drainStallThreshold. Instead of defaulting to now when TimeAdded is
nil, you should either skip processing that node entirely or use an alternative
time reference such as the node's creation time to accurately calculate how long
the node has actually been draining. This ensures that nodes with missing
TimeAdded values don't produce false negatives in stall detection.
- Around line 398-404: The current logic in the maxUnavailable check
unconditionally marks scaling as not helpful for any non-zero maxUnavailable
value, but percentage-based values (ending with "%") actually allow more
disruptions when scaling up. Modify the logic to differentiate between absolute
values and percentage values: for absolute values like "1" or "2", keep
returning canScale=false since scaling doesn't change the fixed disruption
limit, but for percentage values, calculate unhealthy pods and return
canScale=true only if all pods are healthy (unhealthy == 0), since scaling can
increase the computed disruption allowance; otherwise return canScale=false for
percentage values with unhealthy pods. Check if bp.MaxUnavailable contains "%"
to distinguish between the two cases.
---
Nitpick comments:
In `@pkg/investigations/pdbblockingnodedrain/testing/README.md`:
- Line 126: The fenced code block in the README.md file at line 126 lacks a
language specifier after the opening triple backticks. Add "text" as the
language identifier to the opening code fence (change ``` to ```text) to ensure
proper Markdown rendering of the example output content that follows.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 3436a9bb-4c84-4700-8ecc-989475c7c8fd
📒 Files selected for processing (7)
pkg/investigations/pdbblockingnodedrain/README.mdpkg/investigations/pdbblockingnodedrain/metadata.yamlpkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain.gopkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.gopkg/investigations/pdbblockingnodedrain/testing/README.mdpkg/investigations/registry.gotest/generate_incident.sh
✅ Files skipped from review due to trivial changes (1)
- pkg/investigations/pdbblockingnodedrain/README.md
🚧 Files skipped from review as they are similar to previous changes (4)
- pkg/investigations/pdbblockingnodedrain/metadata.yaml
- test/generate_incident.sh
- pkg/investigations/registry.go
- pkg/investigations/pdbblockingnodedrain/pdbblockingnodedrain_test.go
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #847 +/- ##
==========================================
+ Coverage 43.09% 44.12% +1.03%
==========================================
Files 71 72 +1
Lines 8254 8500 +246
==========================================
+ Hits 3557 3751 +194
- Misses 4484 4530 +46
- Partials 213 219 +6
🚀 New features to boost your workflow:
|
6a990f9 to
04b6898
Compare
|
@rolandmkunkel: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What type of PR is this?
feature
What this PR does / Why we need it?
Adds a new investigation pdbblockingnodedrain for the HCPNodepoolUpgradeDelay alert (ROSAENG-14105). When a node drain stalls during an HCP cluster upgrade, this investigation automatically identifies which PodDisruptionBudgets are blocking eviction, classifies them as customer-managed or platform-managed, checks node health, and assesses whether scaling the workload would unblock the drain.
Investigation steps:
Special notes for your reviewer
The scaling assessment messaging hints at possible causes to guide SREs.
Test Coverage
Guidelines for CAD investigations
Test coverage checks
Pre-checks (if applicable)
Summary by CodeRabbit
New Features
Documentation
Tests