Fix longrunning suite for node test cases#31339
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: automatic mode |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (11)
🚧 Files skipped from review as they are similar to previous changes (6)
WalkthroughNode execution helpers now accept ChangesNode context plumbing and readiness
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested reviewers
🚥 Pre-merge checks | ✅ 14 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d432cf0-703b-11f1-86c3-0f1b17f4c2e5-0 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/node_utils.go`:
- Around line 163-183: The `oc debug` call in `node_utils.go` can keep running
after the timeout because `Outputs()` is launched in a goroutine without any
cancellation, so the command and goroutine may leak. Update both helper call
sites around `oc.AsAdmin().WithoutNamespace().Run("debug")`/`Outputs()` to use a
cancelable context or otherwise terminate the underlying command when the
timeout fires, and make sure the timeout path in the `select` fully stops the
in-flight `oc debug` execution.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 51bee914-d5e0-439b-a0c2-9b9586929a25
📒 Files selected for processing (1)
test/extended/node/node_utils.go
|
/payload-abort periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning |
|
@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info. |
|
/payload-abort |
|
@ngopalak-redhat: aborted 1 active payload job(s) for pull request #31339 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dcd16630-703e-11f1-9549-931a9d3f68e2-0 |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/node_e2e/node.go`:
- Line 183: The mirror-set teardown in the “[OTP] Create ImageDigestMirrorSet
and ImageTagMirrorSet…” spec is closing over the interruptible test ctx, so
cleanup can be canceled before deleting IDMS/ITMS and waiting for MCP recovery.
Update the g.DeferCleanup block to use a cleanup-scoped context derived for
teardown instead of the main ctx, and pass that cleanup context through the
delete-and-wait logic so the cluster-scoped resources are always removed even if
the test times out.
In `@test/extended/node/node_swap_cnv.go`:
- Around line 700-705: The deferred teardown blocks in the swap test are using
the main test ctx, so cleanup can stop early if the spec is canceled or times
out. Update each defer that performs teardown work around removeDropInFile,
swapoff, swapfile deletion, kubelet restarts, and ready waits to create and use
a cleanup-only context with its own timeout. Keep the deferred cleanup
independent of the test lifecycle, and make sure all cleanup calls still check
and handle returned errors.
In `@test/extended/node/node_utils.go`:
- Around line 161-190: The timeout branch in the debug helper is conflating
parent context cancellation with the 2-minute deadline and is also ignoring
failures from terminating the spawned process. Update the `RunDebug`-style flow
in `node_utils.go` so the `select` distinguishes `ctx.Done()` from
`timeoutCtx.Done()` and returns the appropriate cancellation error instead of
always reporting a timeout; also handle and propagate any error from
`execCmd.Process.Kill()` on the timeout path to avoid hiding a still-running `oc
debug` process.
- Around line 304-306: The cleanup helper cleanupDropInAndRestartKubelet is
driven by the spec context, so deferred teardown can be skipped once that
context is canceled. Change cleanupDropInAndRestartKubelet to use a context that
is independent of the spec lifecycle for the removeDropInFile and restart path,
so cleanup still runs after timeouts/interruption. Keep handling the returned
errors from removeDropInFile and the restart path rather than ignoring them.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 27b2b914-31d1-4aa3-b336-32141b6de457
📒 Files selected for processing (8)
test/extended/node/node_e2e/container_runtime_config.gotest/extended/node/node_e2e/image_registry_config.gotest/extended/node/node_e2e/initcontainer.gotest/extended/node/node_e2e/node.gotest/extended/node/node_sizing.gotest/extended/node/node_swap_cnv.gotest/extended/node/node_utils.gotest/extended/node/system_compressible.go
| defer func() { | ||
| g.By("Cleaning up multiple config files") | ||
| framework.Logf("Removing: %s", file98) | ||
| removeDropInFile(oc, cnvWorkerNode, file98) | ||
| removeDropInFile(ctx, oc, cnvWorkerNode, file98) | ||
| framework.Logf("Removing: %s", file99) | ||
| removeDropInFile(oc, cnvWorkerNode, file99) | ||
| removeDropInFile(ctx, oc, cnvWorkerNode, file99) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Use a cleanup-only context in deferred teardown blocks.
These defers reuse the test ctx for drop-in removal, swapoff, swapfile deletion, kubelet restarts, and ready waits. Once the spec is canceled or times out, teardown can abort immediately and later cases can inherit leftover swap/drop-in state.
Suggested fix
defer func() {
+ cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+ defer cancel()
g.By("Final cleanup")
framework.Logf("Disabling test swap file if present")
- ExecOnNodeWithNsenter(ctx, oc, cnvWorkerNode, "swapoff", swapFilePath)
- ExecOnNodeWithChroot(ctx, oc, cnvWorkerNode, "rm", "-f", swapFilePath)
+ if _, err := ExecOnNodeWithNsenter(cleanupCtx, oc, cnvWorkerNode, "swapoff", swapFilePath); err != nil {
+ framework.Logf("Warning: failed to disable swap on %s: %v", cnvWorkerNode, err)
+ }
+ if _, err := ExecOnNodeWithChroot(cleanupCtx, oc, cnvWorkerNode, "rm", "-f", swapFilePath); err != nil {
+ framework.Logf("Warning: failed to remove swap file %s on %s: %v", swapFilePath, cnvWorkerNode, err)
+ }
framework.Logf("Removing drop-in file: %s", cnvDropInFilePath)
- removeDropInFile(ctx, oc, cnvWorkerNode, cnvDropInFilePath)
+ if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, cnvDropInFilePath); err != nil {
+ framework.Logf("Warning: failed to remove drop-in %s on %s: %v", cnvDropInFilePath, cnvWorkerNode, err)
+ }
framework.Logf("Restarting kubelet")
- restartKubeletOnNode(ctx, oc, cnvWorkerNode)
- waitForNodeToBeReady(ctx, oc, cnvWorkerNode)
+ restartKubeletOnNode(cleanupCtx, oc, cnvWorkerNode)
+ waitForNodeToBeReady(cleanupCtx, oc, cnvWorkerNode)
}()As per path instructions, **/*.go: Never ignore error returns; context.Context for cancellation and timeouts.
Also applies to: 777-777, 940-943, 1077-1085, 1279-1287
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/node/node_swap_cnv.go` around lines 700 - 705, The deferred
teardown blocks in the swap test are using the main test ctx, so cleanup can
stop early if the spec is canceled or times out. Update each defer that performs
teardown work around removeDropInFile, swapoff, swapfile deletion, kubelet
restarts, and ready waits to create and use a cleanup-only context with its own
timeout. Keep the deferred cleanup independent of the test lifecycle, and make
sure all cleanup calls still check and handle returned errors.
Source: Path instructions
There was a problem hiding this comment.
Are you sure "context.Background()" is fine. I was always encouraged to pass context when possible
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
if [ -f go.mod ]; then
rg -n '^go [0-9]+\.[0-9]+' go.mod
else
fd '^go\.mod$' -a
fiLength of output: 268
@ngopalak-redhat Good call — for this repo, context.Background() is not the best recommendation.
I checked go.mod, and this PR is on go 1.25.0, so context.WithoutCancel is available. In this case, the better pattern is:
defer func() {
cleanupBaseCtx := context.WithoutCancel(ctx)
cleanupCtx, cancel := context.WithTimeout(cleanupBaseCtx, 5*time.Minute)
defer cancel()
// cleanup work using cleanupCtx
}()Why this is better here:
- it keeps cleanup independent from the spec’s cancellation/timeout
- it still preserves any useful values carried on
ctx - the explicit timeout prevents teardown from hanging forever
So my earlier context.Background() suggestion was safe in the narrow sense of “cleanup should outlive the test context”, but context.WithoutCancel(ctx) is the cleaner choice for this codebase.
Thanks for pushing on that nuance.
✏️ Learnings added
Learnt from: ngopalak-redhat
Repo: openshift/origin PR: 31339
File: test/extended/node/node_swap_cnv.go:700-705
Timestamp: 2026-06-25T02:53:16.767Z
Learning: In the openshift/origin Go codebase, when deferred teardown must continue after the main test `context.Context` is canceled or times out, prefer deriving cleanup context with `context.WithoutCancel(ctx)` and then bounding it with `context.WithTimeout(...)` instead of starting from `context.Background()`, so cleanup remains independent while preserving context values.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.
|
Scheduling required tests: |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
test/extended/node/node_utils.go (1)
311-317: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winPropagate failures from
cleanupDropInAndRestartKubelet.Line 313 and Line 315 still drop the returned errors from the mutating steps. If removing the drop-in or restarting kubelet fails, later cases can inherit the modified node state with no direct signal from this helper.
Suggested fix
-func cleanupDropInAndRestartKubelet(ctx context.Context, oc *exutil.CLI, nodeName, filePath string) { +func cleanupDropInAndRestartKubelet(ctx context.Context, oc *exutil.CLI, nodeName, filePath string) error { framework.Logf("Removing drop-in file: %s", filePath) - removeDropInFile(ctx, oc, nodeName, filePath) + if err := removeDropInFile(ctx, oc, nodeName, filePath); err != nil { + return fmt.Errorf("remove drop-in %s on %s: %w", filePath, nodeName, err) + } framework.Logf("Restarting kubelet on node: %s", nodeName) - restartKubeletOnNode(ctx, oc, nodeName) + if err := restartKubeletOnNode(ctx, oc, nodeName); err != nil { + return fmt.Errorf("restart kubelet on %s: %w", nodeName, err) + } framework.Logf("Waiting for node to be ready...") waitForNodeToBeReady(ctx, oc, nodeName) + return nil }As per path instructions,
**/*.go: Never ignore error returns.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/node_utils.go` around lines 311 - 317, Propagate the errors from cleanupDropInAndRestartKubelet instead of discarding them: update the helper to return an error and handle failures from removeDropInFile, restartKubeletOnNode, and waitForNodeToBeReady explicitly. Use the existing helper names to thread the error back to callers so any failed drop-in removal or kubelet restart stops the test flow immediately.Source: Path instructions
test/extended/node/node_e2e/node.go (1)
221-234: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winSkip the MCP wait when mirror-set deletion failed.
The new
cleanupCtxonly bounds the delete RPCs. If Line 227 or Line 230 returns a real error, teardown still goes on toWaitForMCPConfigSpecChangeAndUpdated(...)for resources that may still exist, which can stretch cleanup and hide the root failure.Suggested fix
g.DeferCleanup(func() { cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Minute) defer cancel() g.By("Cleanup: Delete IDMS and ITMS resources") cleanupWorkerSpec := imagepolicy.GetMCPCurrentSpecConfigName(oc, "worker") cleanupMasterSpec := imagepolicy.GetMCPCurrentSpecConfigName(oc, "master") - if delErr := configClient.ImageTagMirrorSets().Delete(cleanupCtx, itmsName, metav1.DeleteOptions{}); delErr != nil { + waitForRollback := true + if delErr := configClient.ImageTagMirrorSets().Delete(cleanupCtx, itmsName, metav1.DeleteOptions{}); delErr != nil { e2e.Logf("Warning: failed to delete ImageTagMirrorSet: %v", delErr) + waitForRollback = false } - if delErr := configClient.ImageDigestMirrorSets().Delete(cleanupCtx, idmsName, metav1.DeleteOptions{}); delErr != nil { + if delErr := configClient.ImageDigestMirrorSets().Delete(cleanupCtx, idmsName, metav1.DeleteOptions{}); delErr != nil { e2e.Logf("Warning: failed to delete ImageDigestMirrorSet: %v", delErr) + waitForRollback = false + } + if waitForRollback { + imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "worker", cleanupWorkerSpec) + imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "master", cleanupMasterSpec) } - imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "worker", cleanupWorkerSpec) - imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "master", cleanupMasterSpec) })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/node_e2e/node.go` around lines 221 - 234, In the DeferCleanup teardown block in node.go, the current cleanup flow always calls imagepolicy.WaitForMCPConfigSpecChangeAndUpdated after attempting to delete ImageTagMirrorSets and ImageDigestMirrorSets, even when one of those Delete calls fails. Track whether either delete in the cleanup lambda returns an error and only run the MCP wait when both deletions succeed; otherwise log the failure and skip the waits so teardown doesn’t block on resources that may still exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/node_swap_cnv.go`:
- Around line 704-719: The deferred cleanup in node_swap_cnv.go ignores the
result of restartKubeletOnNode, which violates the never-ignore-errors rule and
hides the real cleanup failure. Update each deferred cleanup block that calls
restartKubeletOnNode to capture and handle its error the same way
removeDropInFile does, with a warning log or fail-fast behavior before calling
waitForNodeToBeReady; apply this to all matching cleanup sections in the node
swap tests.
- Line 353: The node-exec setup/probe paths are still discarding command errors
by assigning them to `_`, which can hide real failures and cause incorrect state
detection. Update the affected calls around ExecOnNodeWithChroot and the swap
probe logic so each command’s error is checked and propagated or handled
explicitly, especially the swapon -s check that can currently misclassify nodes
when the probe itself fails. Make the same fix for the other listed probe/setup
sites in this Go test file so no remaining node-exec error is ignored.
---
Outside diff comments:
In `@test/extended/node/node_e2e/node.go`:
- Around line 221-234: In the DeferCleanup teardown block in node.go, the
current cleanup flow always calls
imagepolicy.WaitForMCPConfigSpecChangeAndUpdated after attempting to delete
ImageTagMirrorSets and ImageDigestMirrorSets, even when one of those Delete
calls fails. Track whether either delete in the cleanup lambda returns an error
and only run the MCP wait when both deletions succeed; otherwise log the failure
and skip the waits so teardown doesn’t block on resources that may still exist.
In `@test/extended/node/node_utils.go`:
- Around line 311-317: Propagate the errors from cleanupDropInAndRestartKubelet
instead of discarding them: update the helper to return an error and handle
failures from removeDropInFile, restartKubeletOnNode, and waitForNodeToBeReady
explicitly. Use the existing helper names to thread the error back to callers so
any failed drop-in removal or kubelet restart stops the test flow immediately.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 925afdf0-eebd-4d54-868c-665638f7545d
📒 Files selected for processing (8)
test/extended/node/node_e2e/container_runtime_config.gotest/extended/node/node_e2e/image_registry_config.gotest/extended/node/node_e2e/initcontainer.gotest/extended/node/node_e2e/node.gotest/extended/node/node_sizing.gotest/extended/node/node_swap_cnv.gotest/extended/node/node_utils.gotest/extended/node/system_compressible.go
|
|
||
| g.By(fmt.Sprintf("Creating drop-in directory on %s if not exists", cpNodeName)) | ||
| _, _ = ExecOnNodeWithChroot(oc, cpNodeName, "mkdir", "-p", cnvDropInDir) | ||
| _, _ = ExecOnNodeWithChroot(ctx, oc, cpNodeName, "mkdir", "-p", cnvDropInDir) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Stop discarding the remaining node-exec errors.
These setup/state probes still assign _ to command errors. That makes later failures misleading, and Line 1263 can even misclassify a node as having no preexisting swap when swapon -s actually failed.
As per path instructions, **/*.go: Never ignore error returns.
Also applies to: 380-380, 408-408, 945-945, 1157-1157, 1263-1263
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/node/node_swap_cnv.go` at line 353, The node-exec setup/probe
paths are still discarding command errors by assigning them to `_`, which can
hide real failures and cause incorrect state detection. Update the affected
calls around ExecOnNodeWithChroot and the swap probe logic so each command’s
error is checked and propagated or handled explicitly, especially the swapon -s
check that can currently misclassify nodes when the probe itself fails. Make the
same fix for the other listed probe/setup sites in this Go test file so no
remaining node-exec error is ignored.
Source: Path instructions
| defer func() { | ||
| cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute) | ||
| defer cancel() | ||
| g.By("Cleaning up multiple config files") | ||
| framework.Logf("Removing: %s", file98) | ||
| removeDropInFile(oc, cnvWorkerNode, file98) | ||
| if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, file98); err != nil { | ||
| framework.Logf("Warning: failed to remove %s: %v", file98, err) | ||
| } | ||
| framework.Logf("Removing: %s", file99) | ||
| removeDropInFile(oc, cnvWorkerNode, file99) | ||
| if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, file99); err != nil { | ||
| framework.Logf("Warning: failed to remove %s: %v", file99, err) | ||
| } | ||
| framework.Logf("Running: systemctl restart kubelet") | ||
| restartKubeletOnNode(ctx, oc, cnvWorkerNode) | ||
| waitForNodeToBeReady(ctx, oc, cnvWorkerNode) | ||
| restartKubeletOnNode(cleanupCtx, oc, cnvWorkerNode) | ||
| waitForNodeToBeReady(cleanupCtx, oc, cnvWorkerNode) | ||
| framework.Logf("Cleanup completed") |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Handle restartKubeletOnNode failures in deferred cleanup.
These defers check file-removal errors, but still ignore the kubelet restart result. If restart fails, the next symptom is usually a readiness wait or a polluted node for the next longrunning case, not the real cause.
As per path instructions, **/*.go: Never ignore error returns.
Also applies to: 783-799, 954-967, 1094-1113
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/node/node_swap_cnv.go` around lines 704 - 719, The deferred
cleanup in node_swap_cnv.go ignores the result of restartKubeletOnNode, which
violates the never-ignore-errors rule and hides the real cleanup failure. Update
each deferred cleanup block that calls restartKubeletOnNode to capture and
handle its error the same way removeDropInFile does, with a warning log or
fail-fast behavior before calling waitForNodeToBeReady; apply this to all
matching cleanup sections in the node swap tests.
Source: Path instructions
|
Scheduling required tests: |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/948eb880-7072-11f1-83e6-7191af33d1d1-0 |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: ngopalak-redhat The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 207-210: The ordered cleanup in the kubelet secret pulled images
test is swallowing rollback failures because both deleteKC and waitForMCP errors
are ignored. Update the DeferCleanup registered in the Case 4 setup to handle
and surface errors from deleteKC and waitForMCP instead of discarding them, so
the g.Ordered suite does not continue with a partially rolled back worker pool.
Use the existing cleanup block near the Case 4 setup to keep the cluster state
consistent for later cases.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: b9105c28-a543-45f9-9502-ae785817fab4
📒 Files selected for processing (4)
test/extended/imagepolicy/imagepolicy.gotest/extended/node/kubelet_secret_pulled_images.gotest/extended/node/kubeletconfig_tls.gotest/extended/node/node_utils.go
| g.DeferCleanup(func() { | ||
| _ = deleteKC(oc, kcName) | ||
| _ = waitForMCP(ctx, mcClient, "worker", 30*time.Minute) | ||
| _ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute) | ||
| }) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Don't swallow rollback failures in this ordered cleanup.
This DeferCleanup drops both the deleteKC and waitForMCP errors, so Case 4 can leave the worker pool mid-rollout and still pass cleanup. In an g.Ordered suite, that leaks mutated cluster state into later cases.
Suggested fix
g.DeferCleanup(func() {
- _ = deleteKC(oc, kcName)
- _ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute)
+ if err := deleteKC(oc, kcName); err != nil {
+ framework.Failf("cleanup: failed to delete KubeletConfig %s: %v", kcName, err)
+ }
+ if err := waitForMCP(ctx, mcClient, "worker", 15*time.Minute); err != nil {
+ framework.Failf("cleanup: worker MCP did not recover after deleting %s: %v", kcName, err)
+ }
})As per path instructions, **/*.go: Never ignore error returns.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| g.DeferCleanup(func() { | |
| _ = deleteKC(oc, kcName) | |
| _ = waitForMCP(ctx, mcClient, "worker", 30*time.Minute) | |
| _ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute) | |
| }) | |
| g.DeferCleanup(func() { | |
| if err := deleteKC(oc, kcName); err != nil { | |
| framework.Failf("cleanup: failed to delete KubeletConfig %s: %v", kcName, err) | |
| } | |
| if err := waitForMCP(ctx, mcClient, "worker", 15*time.Minute); err != nil { | |
| framework.Failf("cleanup: worker MCP did not recover after deleting %s: %v", kcName, err) | |
| } | |
| }) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 207 - 210,
The ordered cleanup in the kubelet secret pulled images test is swallowing
rollback failures because both deleteKC and waitForMCP errors are ignored.
Update the DeferCleanup registered in the Case 4 setup to handle and surface
errors from deleteKC and waitForMCP instead of discarding them, so the g.Ordered
suite does not continue with a partially rolled back worker pool. Use the
existing cleanup block near the Case 4 setup to keep the cluster state
consistent for later cases.
Source: Path instructions
|
Scheduling required tests: |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
test/extended/node/node_e2e/image_registry_config.go (1)
43-44: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick winUse the Ginkgo-provided context instead of
context.Background()in the test body.The
BeforeEachwas updated to acceptctx context.Context, but thisg.Itstill usesfunc()and createsctx := context.Background(). TheExecOnNodeWithChroot(ctx, ...)calls at Lines 110 and 124 therefore run against a non-cancelable, deadline-less context, so the spec-level timeout/cancellation this PR aims to enforce won't propagate to the node exec calls. Prefer wiring the spec context through.♻️ Proposed change
- g.It("[OTP] change container registry config [OCP-44820]", func() { - ctx := context.Background() + g.It("[OTP] change container registry config [OCP-44820]", func(ctx context.Context) { searchRegistry := "qe.quay.io"Note:
g.DeferCleanupat Line 54 also captures thisctx; verify cleanup remains bounded if you switch to the spec context, since the spec context is canceled when the test ends.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/extended/node/node_e2e/image_registry_config.go` around lines 43 - 44, The test body in g.It still creates its own context with context.Background(), which bypasses the Ginkgo spec context and prevents timeout/cancellation from reaching ExecOnNodeWithChroot. Update this spec to accept and use the Ginkgo-provided ctx consistently (matching the BeforeEach(ctx context.Context) change), pass that ctx through all node exec calls, and ensure g.DeferCleanup continues to use the same bounded spec context so cleanup is canceled when the test ends.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@test/extended/node/node_e2e/image_registry_config.go`:
- Around line 43-44: The test body in g.It still creates its own context with
context.Background(), which bypasses the Ginkgo spec context and prevents
timeout/cancellation from reaching ExecOnNodeWithChroot. Update this spec to
accept and use the Ginkgo-provided ctx consistently (matching the BeforeEach(ctx
context.Context) change), pass that ctx through all node exec calls, and ensure
g.DeferCleanup continues to use the same bounded spec context so cleanup is
canceled when the test ends.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 6b94b459-25b6-4ec3-b2a1-5efc72330f74
📒 Files selected for processing (11)
test/extended/node/image_volume.gotest/extended/node/node_e2e/container_runtime_config.gotest/extended/node/node_e2e/image_registry_config.gotest/extended/node/node_e2e/initcontainer.gotest/extended/node/node_e2e/node.gotest/extended/node/node_e2e/pdb_drain.gotest/extended/node/node_e2e/probe_termination.gotest/extended/node/node_sizing.gotest/extended/node/node_swap.gotest/extended/node/node_utils.gotest/extended/node/system_compressible.go
🚧 Files skipped from review as they are similar to previous changes (4)
- test/extended/node/node_e2e/initcontainer.go
- test/extended/node/node_e2e/container_runtime_config.go
- test/extended/node/node_e2e/node.go
- test/extended/node/node_utils.go
|
|
||
| // execOnNodeWithDebug is a common helper that runs oc debug with timeout and process cleanup | ||
| func execOnNodeWithDebug(ctx context.Context, oc *exutil.CLI, nodeName string, args []string) (string, error) { | ||
| timeoutCtx, cancel := context.WithTimeout(ctx, 2*time.Minute) |
There was a problem hiding this comment.
I wonder if this is a legit value.
Some command may take some time, I guess it's better to use just a given context and let the caller decide the timeout.
| isMicroShift, err := exutil.IsMicroShiftCluster(oc.AdminKubeClient()) | ||
| o.Expect(err).NotTo(o.HaveOccurred(), "failed to detect MicroShift cluster") | ||
| if isMicroShift { | ||
| g.Skip("Skipping test on MicroShift cluster - MachineConfig resources are not available") | ||
| } | ||
|
|
||
| notReadyNodes, err := nodeutils.GetNotReadyNodes(ctx, oc) | ||
| o.Expect(err).NotTo(o.HaveOccurred(), "failed to check node readiness") | ||
| o.Expect(notReadyNodes).To(o.BeEmpty(), | ||
| "Cannot start test: nodes not Ready: %v. Cluster may be recovering from previous test.", notReadyNodes) |
|
/payload-abort |
|
@ngopalak-redhat: aborted 1 active payload job(s) for pull request #31339 |
|
/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning |
|
@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/de03c3b0-7096-11f1-8bc9-1d5f884a2a3a-0 |
|
Scheduling required tests: |
Ensure that the Exec calls timeout after 2 mins. Fixes the issue where the longrunning test suite keeps running more than 4 hours when there are issues.
https://sippy.dptools.openshift.org/sippy-ng/jobs/5.0/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning%22%7D%5D%7D&sortField=timestamp&sort=desc
https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning/2069752664094150656
Also fixed:
Summary by CodeRabbit
Summary (Updated)
Bug Fixes
Test Improvements