Skip to content

Fix longrunning suite for node test cases#31339

Open
ngopalak-redhat wants to merge 6 commits into
openshift:mainfrom
ngopalak-redhat:ngopalak/fix_long_running_timeout
Open

Fix longrunning suite for node test cases#31339
ngopalak-redhat wants to merge 6 commits into
openshift:mainfrom
ngopalak-redhat:ngopalak/fix_long_running_timeout

Conversation

@ngopalak-redhat

@ngopalak-redhat ngopalak-redhat commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Ensure that the Exec calls timeout after 2 mins. Fixes the issue where the longrunning test suite keeps running more than 4 hours when there are issues.
https://sippy.dptools.openshift.org/sippy-ng/jobs/5.0/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning%22%7D%5D%7D&sortField=timestamp&sort=desc

https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning/2069752664094150656

Also fixed:

  • Made sure to check if the nodes are in ready state before starting the test. Fail fast is better
  • Also ensure that context is passed correctly in the tests
  • Also reduced the timeouts to 15 mins max

Summary by CodeRabbit

Summary (Updated)

  • Bug Fixes

    • Improved end-to-end node debugging and remote command execution by making node operations fully context-aware, with better timeout handling and more reliable cleanup on failures.
    • Enhanced error output for node command failures with clearer node-specific stdout/stderr details.
  • Test Improvements

    • Added consistent “nodes are ready” pre-checks across more node test suites.
    • Reduced MachineConfigPool-related wait times to 15 minutes to speed up feedback during slow rollouts.

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6a1acf07-50a8-4037-9162-dd886b657646

📥 Commits

Reviewing files that changed from the base of the PR and between c6ec636 and c4d3b5b.

📒 Files selected for processing (11)
  • test/extended/node/image_volume.go
  • test/extended/node/node_e2e/container_runtime_config.go
  • test/extended/node/node_e2e/image_registry_config.go
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_e2e/pdb_drain.go
  • test/extended/node/node_e2e/probe_termination.go
  • test/extended/node/node_sizing.go
  • test/extended/node/node_swap.go
  • test/extended/node/node_utils.go
  • test/extended/node/system_compressible.go
🚧 Files skipped from review as they are similar to previous changes (6)
  • test/extended/node/image_volume.go
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/system_compressible.go
  • test/extended/node/node_e2e/image_registry_config.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_utils.go

Walkthrough

Node execution helpers now accept context.Context, node suites pass context through readiness checks and command execution, CNV swap tests use context-aware cleanup and execution, and several MachineConfigPool waits were shortened.

Changes

Node context plumbing and readiness

Layer / File(s) Summary
Node exec helpers and readiness utilities
test/extended/node/node_utils.go
oc debug execution becomes context-aware, chroot/nsenter helpers accept ctx, drop-in helpers pass ctx through, EnsureNodesReady uses GetNotReadyNodes, and CNV wait timeouts are reduced.
Suite readiness gates
test/extended/node/node_e2e/container_runtime_config.go, test/extended/node/node_e2e/image_registry_config.go, test/extended/node/node_e2e/initcontainer.go, test/extended/node/node_e2e/node.go, test/extended/node/node_e2e/pdb_drain.go, test/extended/node/node_e2e/probe_termination.go, test/extended/node/node_sizing.go, test/extended/node/node_swap.go, test/extended/node/image_volume.go, test/extended/node/system_compressible.go
BeforeEach hooks now accept ctx and call EnsureNodesReady(ctx, oc) after MicroShift skips.
Context propagation in node tests
test/extended/node/node_e2e/container_runtime_config.go, test/extended/node/node_e2e/image_registry_config.go, test/extended/node/node_e2e/initcontainer.go, test/extended/node/node_e2e/node.go, test/extended/node/node_sizing.go, test/extended/node/system_compressible.go, test/extended/node/node_swap_cnv.go
Node command execution, cleanup paths, and helper calls pass ctx through chroot/nsenter execution and timeout-scoped cleanup contexts across container runtime, image registry, init container, node, node sizing, system-compressible, and CNV swap tests.
MCP wait reductions
test/extended/imagepolicy/imagepolicy.go, test/extended/node/kubelet_secret_pulled_images.go, test/extended/node/kubeletconfig_tls.go
MachineConfigPool wait timeouts are reduced from longer durations to 15 minutes in the MCP update helpers and kubelet rollout tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • sjenning
  • p0lyn0mial
  • neisw
🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title matches the main change: fixing the longrunning node test suite.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed No touched Ginkgo titles embed dynamic values; the only concatenated suite label uses ثابت config labels and other titles are static strings.
Test Structure And Quality ✅ Passed PASS: The PR adds readiness gating, propagates ctx, uses explicit Eventually timeouts, and enforces 2-minute node exec timeouts with cleanup contexts.
Microshift Test Compatibility ✅ Passed All modified e2e suites using non-MicroShift APIs are guarded by MicroShift skips or apigroup tags; no new unguarded incompatible tests were added.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new SNO-incompatible Ginkgo cases were added; the touched suites are mostly context/timeout plumbing, and explicit multi-node cases already skip SNO.
Topology-Aware Scheduling Compatibility ✅ Passed The PR only changes test helpers and test cases to add ctx/timeouts/readiness checks; it does not modify manifests, operators, controllers, or add new scheduling constraints.
Ote Binary Stdout Contract ✅ Passed Touched files only change test hooks/helpers; no init/TestMain/BeforeSuite/RunSpecs or stdout-printing calls were found in process-level code.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PASS: The PR only changes context/timeouts/readiness; scans found no new IPv4 literals, IPv4-only parsing, or public-internet calls in the modified hunks.
No-Weak-Crypto ✅ Passed The changed node test files only adjust contexts/timeouts and node-exec helpers; searches found no weak-crypto APIs or secret/token comparisons.
Container-Privileges ✅ Passed Touched files are Go test/helpers only; no added K8s/container manifest fields like privileged, hostNetwork, allowPrivilegeEscalation, SYS_ADMIN, or runAsUser:0 were found.
No-Sensitive-Data-In-Logs ✅ Passed PASS: The patch only adds ctx/timeouts/readiness checks; no new explicit logging of secrets, PII, or host data appears in the changed paths.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/3d432cf0-703b-11f1-86c3-0f1b17f4c2e5-0

@openshift-ci openshift-ci Bot requested review from MaysaMacedo and bitoku June 25, 2026 02:12

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/node_utils.go`:
- Around line 163-183: The `oc debug` call in `node_utils.go` can keep running
after the timeout because `Outputs()` is launched in a goroutine without any
cancellation, so the command and goroutine may leak. Update both helper call
sites around `oc.AsAdmin().WithoutNamespace().Run("debug")`/`Outputs()` to use a
cancelable context or otherwise terminate the underlying command when the
timeout fires, and make sure the timeout path in the `select` fully stops the
in-flight `oc debug` execution.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 51bee914-d5e0-439b-a0c2-9b9586929a25

📥 Commits

Reviewing files that changed from the base of the PR and between 91cd015 and 14bf640.

📒 Files selected for processing (1)
  • test/extended/node/node_utils.go

Comment thread test/extended/node/node_utils.go Outdated
@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-abort periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-abort

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: aborted 1 active payload job(s) for pull request #31339

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/dcd16630-703e-11f1-9549-931a9d3f68e2-0

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/node_e2e/node.go`:
- Line 183: The mirror-set teardown in the “[OTP] Create ImageDigestMirrorSet
and ImageTagMirrorSet…” spec is closing over the interruptible test ctx, so
cleanup can be canceled before deleting IDMS/ITMS and waiting for MCP recovery.
Update the g.DeferCleanup block to use a cleanup-scoped context derived for
teardown instead of the main ctx, and pass that cleanup context through the
delete-and-wait logic so the cluster-scoped resources are always removed even if
the test times out.

In `@test/extended/node/node_swap_cnv.go`:
- Around line 700-705: The deferred teardown blocks in the swap test are using
the main test ctx, so cleanup can stop early if the spec is canceled or times
out. Update each defer that performs teardown work around removeDropInFile,
swapoff, swapfile deletion, kubelet restarts, and ready waits to create and use
a cleanup-only context with its own timeout. Keep the deferred cleanup
independent of the test lifecycle, and make sure all cleanup calls still check
and handle returned errors.

In `@test/extended/node/node_utils.go`:
- Around line 161-190: The timeout branch in the debug helper is conflating
parent context cancellation with the 2-minute deadline and is also ignoring
failures from terminating the spawned process. Update the `RunDebug`-style flow
in `node_utils.go` so the `select` distinguishes `ctx.Done()` from
`timeoutCtx.Done()` and returns the appropriate cancellation error instead of
always reporting a timeout; also handle and propagate any error from
`execCmd.Process.Kill()` on the timeout path to avoid hiding a still-running `oc
debug` process.
- Around line 304-306: The cleanup helper cleanupDropInAndRestartKubelet is
driven by the spec context, so deferred teardown can be skipped once that
context is canceled. Change cleanupDropInAndRestartKubelet to use a context that
is independent of the spec lifecycle for the removeDropInFile and restart path,
so cleanup still runs after timeouts/interruption. Keep handling the returned
errors from removeDropInFile and the restart path rather than ignoring them.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 27b2b914-31d1-4aa3-b336-32141b6de457

📥 Commits

Reviewing files that changed from the base of the PR and between 14bf640 and e1ccb62.

📒 Files selected for processing (8)
  • test/extended/node/node_e2e/container_runtime_config.go
  • test/extended/node/node_e2e/image_registry_config.go
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_sizing.go
  • test/extended/node/node_swap_cnv.go
  • test/extended/node/node_utils.go
  • test/extended/node/system_compressible.go

Comment thread test/extended/node/node_e2e/node.go
Comment thread test/extended/node/node_swap_cnv.go Outdated
Comment on lines +700 to +705
defer func() {
g.By("Cleaning up multiple config files")
framework.Logf("Removing: %s", file98)
removeDropInFile(oc, cnvWorkerNode, file98)
removeDropInFile(ctx, oc, cnvWorkerNode, file98)
framework.Logf("Removing: %s", file99)
removeDropInFile(oc, cnvWorkerNode, file99)
removeDropInFile(ctx, oc, cnvWorkerNode, file99)

@coderabbitai coderabbitai Bot Jun 25, 2026

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Use a cleanup-only context in deferred teardown blocks.

These defers reuse the test ctx for drop-in removal, swapoff, swapfile deletion, kubelet restarts, and ready waits. Once the spec is canceled or times out, teardown can abort immediately and later cases can inherit leftover swap/drop-in state.

Suggested fix
 defer func() {
+	cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
+	defer cancel()
 	g.By("Final cleanup")
 	framework.Logf("Disabling test swap file if present")
-	ExecOnNodeWithNsenter(ctx, oc, cnvWorkerNode, "swapoff", swapFilePath)
-	ExecOnNodeWithChroot(ctx, oc, cnvWorkerNode, "rm", "-f", swapFilePath)
+	if _, err := ExecOnNodeWithNsenter(cleanupCtx, oc, cnvWorkerNode, "swapoff", swapFilePath); err != nil {
+		framework.Logf("Warning: failed to disable swap on %s: %v", cnvWorkerNode, err)
+	}
+	if _, err := ExecOnNodeWithChroot(cleanupCtx, oc, cnvWorkerNode, "rm", "-f", swapFilePath); err != nil {
+		framework.Logf("Warning: failed to remove swap file %s on %s: %v", swapFilePath, cnvWorkerNode, err)
+	}
 	framework.Logf("Removing drop-in file: %s", cnvDropInFilePath)
-	removeDropInFile(ctx, oc, cnvWorkerNode, cnvDropInFilePath)
+	if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, cnvDropInFilePath); err != nil {
+		framework.Logf("Warning: failed to remove drop-in %s on %s: %v", cnvDropInFilePath, cnvWorkerNode, err)
+	}
 	framework.Logf("Restarting kubelet")
-	restartKubeletOnNode(ctx, oc, cnvWorkerNode)
-	waitForNodeToBeReady(ctx, oc, cnvWorkerNode)
+	restartKubeletOnNode(cleanupCtx, oc, cnvWorkerNode)
+	waitForNodeToBeReady(cleanupCtx, oc, cnvWorkerNode)
 }()

As per path instructions, **/*.go: Never ignore error returns; context.Context for cancellation and timeouts.

Also applies to: 777-777, 940-943, 1077-1085, 1279-1287

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_swap_cnv.go` around lines 700 - 705, The deferred
teardown blocks in the swap test are using the main test ctx, so cleanup can
stop early if the spec is canceled or times out. Update each defer that performs
teardown work around removeDropInFile, swapoff, swapfile deletion, kubelet
restarts, and ready waits to create and use a cleanup-only context with its own
timeout. Keep the deferred cleanup independent of the test lifecycle, and make
sure all cleanup calls still check and handle returned errors.

Source: Path instructions

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure "context.Background()" is fine. I was always encouraged to pass context when possible

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

if [ -f go.mod ]; then
  rg -n '^go [0-9]+\.[0-9]+' go.mod
else
  fd '^go\.mod$' -a
fi

Length of output: 268



@ngopalak-redhat Good call — for this repo, context.Background() is not the best recommendation.

I checked go.mod, and this PR is on go 1.25.0, so context.WithoutCancel is available. In this case, the better pattern is:

defer func() {
	cleanupBaseCtx := context.WithoutCancel(ctx)
	cleanupCtx, cancel := context.WithTimeout(cleanupBaseCtx, 5*time.Minute)
	defer cancel()

	// cleanup work using cleanupCtx
}()

Why this is better here:

  • it keeps cleanup independent from the spec’s cancellation/timeout
  • it still preserves any useful values carried on ctx
  • the explicit timeout prevents teardown from hanging forever

So my earlier context.Background() suggestion was safe in the narrow sense of “cleanup should outlive the test context”, but context.WithoutCancel(ctx) is the cleaner choice for this codebase.

Thanks for pushing on that nuance.


✏️ Learnings added
Learnt from: ngopalak-redhat
Repo: openshift/origin PR: 31339
File: test/extended/node/node_swap_cnv.go:700-705
Timestamp: 2026-06-25T02:53:16.767Z
Learning: In the openshift/origin Go codebase, when deferred teardown must continue after the main test `context.Context` is canceled or times out, prefer deriving cleanup context with `context.WithoutCancel(ctx)` and then bounding it with `context.WithTimeout(...)` instead of starting from `context.Background()`, so cleanup remains independent while preserving context values.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.


🧠 Learnings used
Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When reviewing Go code that uses OpenShift configv1 (`github.com/openshift/api/config/v1`), note that `ImageSpec.RegistrySources` is declared as a non-pointer value (`RegistrySources RegistrySources`). It is always safely zero-initialized by Go, so it can never be `nil`. Therefore, do not add nil-guards for `imageSpec.RegistrySources` before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`).

Learnt from: Chandan9112
Repo: openshift/origin PR: 31182
File: test/extended/node/node_e2e/node.go:0-0
Timestamp: 2026-05-19T12:28:55.867Z
Learning: When using the OpenShift `configv1` API (`github.com/openshift/api/config/v1`), treat `ImageSpec.RegistrySources` as a non-nil Go struct value (`RegistrySources`, not `*RegistrySources`). Because it can never be nil (it’s always zero-initialized), don’t add nil-guards before accessing its fields (e.g., `AllowedRegistries`, `ContainerRuntimeSearchRegistries`). You may still need to handle zero-value contents, but a nil check on `RegistrySources` itself is unnecessary.

Comment thread test/extended/node/node_utils.go Outdated
Comment thread test/extended/node/node_utils.go
@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
test/extended/node/node_utils.go (1)

311-317: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Propagate failures from cleanupDropInAndRestartKubelet.

Line 313 and Line 315 still drop the returned errors from the mutating steps. If removing the drop-in or restarting kubelet fails, later cases can inherit the modified node state with no direct signal from this helper.

Suggested fix
-func cleanupDropInAndRestartKubelet(ctx context.Context, oc *exutil.CLI, nodeName, filePath string) {
+func cleanupDropInAndRestartKubelet(ctx context.Context, oc *exutil.CLI, nodeName, filePath string) error {
 	framework.Logf("Removing drop-in file: %s", filePath)
-	removeDropInFile(ctx, oc, nodeName, filePath)
+	if err := removeDropInFile(ctx, oc, nodeName, filePath); err != nil {
+		return fmt.Errorf("remove drop-in %s on %s: %w", filePath, nodeName, err)
+	}
 	framework.Logf("Restarting kubelet on node: %s", nodeName)
-	restartKubeletOnNode(ctx, oc, nodeName)
+	if err := restartKubeletOnNode(ctx, oc, nodeName); err != nil {
+		return fmt.Errorf("restart kubelet on %s: %w", nodeName, err)
+	}
 	framework.Logf("Waiting for node to be ready...")
 	waitForNodeToBeReady(ctx, oc, nodeName)
+	return nil
 }

As per path instructions, **/*.go: Never ignore error returns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 311 - 317, Propagate the
errors from cleanupDropInAndRestartKubelet instead of discarding them: update
the helper to return an error and handle failures from removeDropInFile,
restartKubeletOnNode, and waitForNodeToBeReady explicitly. Use the existing
helper names to thread the error back to callers so any failed drop-in removal
or kubelet restart stops the test flow immediately.

Source: Path instructions

test/extended/node/node_e2e/node.go (1)

221-234: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Skip the MCP wait when mirror-set deletion failed.

The new cleanupCtx only bounds the delete RPCs. If Line 227 or Line 230 returns a real error, teardown still goes on to WaitForMCPConfigSpecChangeAndUpdated(...) for resources that may still exist, which can stretch cleanup and hide the root failure.

Suggested fix
 		g.DeferCleanup(func() {
 			cleanupCtx, cancel := context.WithTimeout(context.Background(), 10*time.Minute)
 			defer cancel()
 			g.By("Cleanup: Delete IDMS and ITMS resources")
 			cleanupWorkerSpec := imagepolicy.GetMCPCurrentSpecConfigName(oc, "worker")
 			cleanupMasterSpec := imagepolicy.GetMCPCurrentSpecConfigName(oc, "master")
-			if delErr := configClient.ImageTagMirrorSets().Delete(cleanupCtx, itmsName, metav1.DeleteOptions{}); delErr != nil {
+			waitForRollback := true
+			if delErr := configClient.ImageTagMirrorSets().Delete(cleanupCtx, itmsName, metav1.DeleteOptions{}); delErr != nil {
 				e2e.Logf("Warning: failed to delete ImageTagMirrorSet: %v", delErr)
+				waitForRollback = false
 			}
-			if delErr := configClient.ImageDigestMirrorSets().Delete(cleanupCtx, idmsName, metav1.DeleteOptions{}); delErr != nil {
+			if delErr := configClient.ImageDigestMirrorSets().Delete(cleanupCtx, idmsName, metav1.DeleteOptions{}); delErr != nil {
 				e2e.Logf("Warning: failed to delete ImageDigestMirrorSet: %v", delErr)
+				waitForRollback = false
+			}
+			if waitForRollback {
+				imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "worker", cleanupWorkerSpec)
+				imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "master", cleanupMasterSpec)
 			}
-			imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "worker", cleanupWorkerSpec)
-			imagepolicy.WaitForMCPConfigSpecChangeAndUpdated(oc, "master", cleanupMasterSpec)
 		})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_e2e/node.go` around lines 221 - 234, In the
DeferCleanup teardown block in node.go, the current cleanup flow always calls
imagepolicy.WaitForMCPConfigSpecChangeAndUpdated after attempting to delete
ImageTagMirrorSets and ImageDigestMirrorSets, even when one of those Delete
calls fails. Track whether either delete in the cleanup lambda returns an error
and only run the MCP wait when both deletions succeed; otherwise log the failure
and skip the waits so teardown doesn’t block on resources that may still exist.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/node_swap_cnv.go`:
- Around line 704-719: The deferred cleanup in node_swap_cnv.go ignores the
result of restartKubeletOnNode, which violates the never-ignore-errors rule and
hides the real cleanup failure. Update each deferred cleanup block that calls
restartKubeletOnNode to capture and handle its error the same way
removeDropInFile does, with a warning log or fail-fast behavior before calling
waitForNodeToBeReady; apply this to all matching cleanup sections in the node
swap tests.
- Line 353: The node-exec setup/probe paths are still discarding command errors
by assigning them to `_`, which can hide real failures and cause incorrect state
detection. Update the affected calls around ExecOnNodeWithChroot and the swap
probe logic so each command’s error is checked and propagated or handled
explicitly, especially the swapon -s check that can currently misclassify nodes
when the probe itself fails. Make the same fix for the other listed probe/setup
sites in this Go test file so no remaining node-exec error is ignored.

---

Outside diff comments:
In `@test/extended/node/node_e2e/node.go`:
- Around line 221-234: In the DeferCleanup teardown block in node.go, the
current cleanup flow always calls
imagepolicy.WaitForMCPConfigSpecChangeAndUpdated after attempting to delete
ImageTagMirrorSets and ImageDigestMirrorSets, even when one of those Delete
calls fails. Track whether either delete in the cleanup lambda returns an error
and only run the MCP wait when both deletions succeed; otherwise log the failure
and skip the waits so teardown doesn’t block on resources that may still exist.

In `@test/extended/node/node_utils.go`:
- Around line 311-317: Propagate the errors from cleanupDropInAndRestartKubelet
instead of discarding them: update the helper to return an error and handle
failures from removeDropInFile, restartKubeletOnNode, and waitForNodeToBeReady
explicitly. Use the existing helper names to thread the error back to callers so
any failed drop-in removal or kubelet restart stops the test flow immediately.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 925afdf0-eebd-4d54-868c-665638f7545d

📥 Commits

Reviewing files that changed from the base of the PR and between 91cd015 and 97e2b72.

📒 Files selected for processing (8)
  • test/extended/node/node_e2e/container_runtime_config.go
  • test/extended/node/node_e2e/image_registry_config.go
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_sizing.go
  • test/extended/node/node_swap_cnv.go
  • test/extended/node/node_utils.go
  • test/extended/node/system_compressible.go


g.By(fmt.Sprintf("Creating drop-in directory on %s if not exists", cpNodeName))
_, _ = ExecOnNodeWithChroot(oc, cpNodeName, "mkdir", "-p", cnvDropInDir)
_, _ = ExecOnNodeWithChroot(ctx, oc, cpNodeName, "mkdir", "-p", cnvDropInDir)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Stop discarding the remaining node-exec errors.

These setup/state probes still assign _ to command errors. That makes later failures misleading, and Line 1263 can even misclassify a node as having no preexisting swap when swapon -s actually failed.

As per path instructions, **/*.go: Never ignore error returns.

Also applies to: 380-380, 408-408, 945-945, 1157-1157, 1263-1263

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_swap_cnv.go` at line 353, The node-exec setup/probe
paths are still discarding command errors by assigning them to `_`, which can
hide real failures and cause incorrect state detection. Update the affected
calls around ExecOnNodeWithChroot and the swap probe logic so each command’s
error is checked and propagated or handled explicitly, especially the swapon -s
check that can currently misclassify nodes when the probe itself fails. Make the
same fix for the other listed probe/setup sites in this Go test file so no
remaining node-exec error is ignored.

Source: Path instructions

Comment on lines 704 to 719
defer func() {
cleanupCtx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
defer cancel()
g.By("Cleaning up multiple config files")
framework.Logf("Removing: %s", file98)
removeDropInFile(oc, cnvWorkerNode, file98)
if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, file98); err != nil {
framework.Logf("Warning: failed to remove %s: %v", file98, err)
}
framework.Logf("Removing: %s", file99)
removeDropInFile(oc, cnvWorkerNode, file99)
if err := removeDropInFile(cleanupCtx, oc, cnvWorkerNode, file99); err != nil {
framework.Logf("Warning: failed to remove %s: %v", file99, err)
}
framework.Logf("Running: systemctl restart kubelet")
restartKubeletOnNode(ctx, oc, cnvWorkerNode)
waitForNodeToBeReady(ctx, oc, cnvWorkerNode)
restartKubeletOnNode(cleanupCtx, oc, cnvWorkerNode)
waitForNodeToBeReady(cleanupCtx, oc, cnvWorkerNode)
framework.Logf("Cleanup completed")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Handle restartKubeletOnNode failures in deferred cleanup.

These defers check file-removal errors, but still ignore the kubelet restart result. If restart fails, the next symptom is usually a readiness wait or a polluted node for the next longrunning case, not the real cause.

As per path instructions, **/*.go: Never ignore error returns.

Also applies to: 783-799, 954-967, 1094-1113

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_swap_cnv.go` around lines 704 - 719, The deferred
cleanup in node_swap_cnv.go ignores the result of restartKubeletOnNode, which
violates the never-ignore-errors rule and hides the real cleanup failure. Update
each deferred cleanup block that calls restartKubeletOnNode to capture and
handle its error the same way removeDropInFile does, with a warning log or
fail-fast behavior before calling waitForNodeToBeReady; apply this to all
matching cleanup sections in the node swap tests.

Source: Path instructions

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/948eb880-7072-11f1-83e6-7191af33d1d1-0

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ngopalak-redhat
Once this PR has been reviewed and has the lgtm label, please assign miyadav for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/kubelet_secret_pulled_images.go`:
- Around line 207-210: The ordered cleanup in the kubelet secret pulled images
test is swallowing rollback failures because both deleteKC and waitForMCP errors
are ignored. Update the DeferCleanup registered in the Case 4 setup to handle
and surface errors from deleteKC and waitForMCP instead of discarding them, so
the g.Ordered suite does not continue with a partially rolled back worker pool.
Use the existing cleanup block near the Case 4 setup to keep the cluster state
consistent for later cases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: b9105c28-a543-45f9-9502-ae785817fab4

📥 Commits

Reviewing files that changed from the base of the PR and between 97e2b72 and 7800b52.

📒 Files selected for processing (4)
  • test/extended/imagepolicy/imagepolicy.go
  • test/extended/node/kubelet_secret_pulled_images.go
  • test/extended/node/kubeletconfig_tls.go
  • test/extended/node/node_utils.go

Comment on lines 207 to 210
g.DeferCleanup(func() {
_ = deleteKC(oc, kcName)
_ = waitForMCP(ctx, mcClient, "worker", 30*time.Minute)
_ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute)
})

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Don't swallow rollback failures in this ordered cleanup.

This DeferCleanup drops both the deleteKC and waitForMCP errors, so Case 4 can leave the worker pool mid-rollout and still pass cleanup. In an g.Ordered suite, that leaks mutated cluster state into later cases.

Suggested fix
 		g.DeferCleanup(func() {
-			_ = deleteKC(oc, kcName)
-			_ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute)
+			if err := deleteKC(oc, kcName); err != nil {
+				framework.Failf("cleanup: failed to delete KubeletConfig %s: %v", kcName, err)
+			}
+			if err := waitForMCP(ctx, mcClient, "worker", 15*time.Minute); err != nil {
+				framework.Failf("cleanup: worker MCP did not recover after deleting %s: %v", kcName, err)
+			}
 		})

As per path instructions, **/*.go: Never ignore error returns.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
g.DeferCleanup(func() {
_ = deleteKC(oc, kcName)
_ = waitForMCP(ctx, mcClient, "worker", 30*time.Minute)
_ = waitForMCP(ctx, mcClient, "worker", 15*time.Minute)
})
g.DeferCleanup(func() {
if err := deleteKC(oc, kcName); err != nil {
framework.Failf("cleanup: failed to delete KubeletConfig %s: %v", kcName, err)
}
if err := waitForMCP(ctx, mcClient, "worker", 15*time.Minute); err != nil {
framework.Failf("cleanup: worker MCP did not recover after deleting %s: %v", kcName, err)
}
})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/kubelet_secret_pulled_images.go` around lines 207 - 210,
The ordered cleanup in the kubelet secret pulled images test is swallowing
rollback failures because both deleteKC and waitForMCP errors are ignored.
Update the DeferCleanup registered in the Case 4 setup to handle and surface
errors from deleteKC and waitForMCP instead of discarding them, so the g.Ordered
suite does not continue with a partially rolled back worker pool. Use the
existing cleanup block near the Case 4 setup to keep the cluster state
consistent for later cases.

Source: Path instructions

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@ngopalak-redhat ngopalak-redhat changed the title Fix longrunning test case Exec timeout Fix longrunning suite for node test cases Jun 25, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
test/extended/node/node_e2e/image_registry_config.go (1)

43-44: 🚀 Performance & Scalability | 🔵 Trivial | ⚡ Quick win

Use the Ginkgo-provided context instead of context.Background() in the test body.

The BeforeEach was updated to accept ctx context.Context, but this g.It still uses func() and creates ctx := context.Background(). The ExecOnNodeWithChroot(ctx, ...) calls at Lines 110 and 124 therefore run against a non-cancelable, deadline-less context, so the spec-level timeout/cancellation this PR aims to enforce won't propagate to the node exec calls. Prefer wiring the spec context through.

♻️ Proposed change
-	g.It("[OTP] change container registry config [OCP-44820]", func() {
-		ctx := context.Background()
+	g.It("[OTP] change container registry config [OCP-44820]", func(ctx context.Context) {
 		searchRegistry := "qe.quay.io"

Note: g.DeferCleanup at Line 54 also captures this ctx; verify cleanup remains bounded if you switch to the spec context, since the spec context is canceled when the test ends.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_e2e/image_registry_config.go` around lines 43 - 44,
The test body in g.It still creates its own context with context.Background(),
which bypasses the Ginkgo spec context and prevents timeout/cancellation from
reaching ExecOnNodeWithChroot. Update this spec to accept and use the
Ginkgo-provided ctx consistently (matching the BeforeEach(ctx context.Context)
change), pass that ctx through all node exec calls, and ensure g.DeferCleanup
continues to use the same bounded spec context so cleanup is canceled when the
test ends.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@test/extended/node/node_e2e/image_registry_config.go`:
- Around line 43-44: The test body in g.It still creates its own context with
context.Background(), which bypasses the Ginkgo spec context and prevents
timeout/cancellation from reaching ExecOnNodeWithChroot. Update this spec to
accept and use the Ginkgo-provided ctx consistently (matching the BeforeEach(ctx
context.Context) change), pass that ctx through all node exec calls, and ensure
g.DeferCleanup continues to use the same bounded spec context so cleanup is
canceled when the test ends.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6b94b459-25b6-4ec3-b2a1-5efc72330f74

📥 Commits

Reviewing files that changed from the base of the PR and between 7800b52 and c6ec636.

📒 Files selected for processing (11)
  • test/extended/node/image_volume.go
  • test/extended/node/node_e2e/container_runtime_config.go
  • test/extended/node/node_e2e/image_registry_config.go
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_e2e/pdb_drain.go
  • test/extended/node/node_e2e/probe_termination.go
  • test/extended/node/node_sizing.go
  • test/extended/node/node_swap.go
  • test/extended/node/node_utils.go
  • test/extended/node/system_compressible.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • test/extended/node/node_e2e/initcontainer.go
  • test/extended/node/node_e2e/container_runtime_config.go
  • test/extended/node/node_e2e/node.go
  • test/extended/node/node_utils.go

Comment thread test/extended/node/node_utils.go Outdated

// execOnNodeWithDebug is a common helper that runs oc debug with timeout and process cleanup
func execOnNodeWithDebug(ctx context.Context, oc *exutil.CLI, nodeName string, args []string) (string, error) {
timeoutCtx, cancel := context.WithTimeout(ctx, 2*time.Minute)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this is a legit value.
Some command may take some time, I guess it's better to use just a given context and let the caller decide the timeout.

Comment on lines +32 to +41
isMicroShift, err := exutil.IsMicroShiftCluster(oc.AdminKubeClient())
o.Expect(err).NotTo(o.HaveOccurred(), "failed to detect MicroShift cluster")
if isMicroShift {
g.Skip("Skipping test on MicroShift cluster - MachineConfig resources are not available")
}

notReadyNodes, err := nodeutils.GetNotReadyNodes(ctx, oc)
o.Expect(err).NotTo(o.HaveOccurred(), "failed to check node readiness")
o.Expect(notReadyNodes).To(o.BeEmpty(),
"Cannot start test: nodes not Ready: %v. Cluster may be recovering from previous test.", notReadyNodes)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make it DRY?

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-abort

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: aborted 1 active payload job(s) for pull request #31339

@ngopalak-redhat

Copy link
Copy Markdown
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

@ngopalak-redhat: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-5.0-e2e-aws-disruptive-longrunning

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/de03c3b0-7096-11f1-8bc9-1d5f884a2a3a-0

@openshift-merge-bot

Copy link
Copy Markdown
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants