CORENET-7140: feat(cluster-diagnostics): add nodes_debug_exec tool for node-level debugging #333
CORENET-7140: feat(cluster-diagnostics): add nodes_debug_exec tool for node-level debugging #333arkadeepsen wants to merge 3 commits into
Conversation
|
@arkadeepsen: This pull request references CORENET-7140 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: arkadeepsen The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (20)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (14)
📝 WalkthroughWalkthroughAdds a new cluster-diagnostics MCP toolset with a Changescluster-diagnostics Toolset
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (8)
pkg/ocp/nodesdebug/nodes_debug.go (3)
176-176: ⚡ Quick winSleep command creates a long-lived container that wastes resources.
The debug pod runs
sleep infinity(line 176) and relies solely on the defer cleanup. If cleanup fails (e.g., process crash, network partition), the pod runs indefinitely. Consider using a finite sleep duration as a safety net.♻️ Proposed improvement
- sleepCommand := []string{"sleep", "infinity"} + // Use a finite sleep as a safety net in case cleanup fails + sleepCommand := []string{"sleep", "3600"} // 1 hour🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ocp/nodesdebug/nodes_debug.go` at line 176, The debug pod currently uses a perpetual sleep command stored in sleepCommand (sleepCommand := []string{"sleep", "infinity"}) which can leave pods running forever if deferred cleanup fails; change this to a finite, reasonable timeout (e.g., sleep for a configurable duration like 1h or derived from a context/timeout value) and ensure the value is documented/configurable where sleepCommand is defined (or within the function that creates the debug pod) so pods automatically exit as a safety net if cleanup doesn't run.
155-160: ⚡ Quick winRetrieve logs with parent context instead of Background.
Line 155 retrieves logs using
context.Background(), which means log retrieval continues even if the parent operation was canceled. This can waste resources and delay error reporting.♻️ Proposed refactor
- logs, _ := n.retrieveLogs(context.Background(), ns, created.Name) + logs, _ := n.retrieveLogs(timeoutCtx, ns, created.Name)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ocp/nodesdebug/nodes_debug.go` around lines 155 - 160, The log retrieval is using context.Background() which ignores cancellation of the parent operation; change the call to n.retrieveLogs to pass the parent context (the ctx used in the surrounding function) instead of context.Background() so log retrieval is cancelled when the parent is cancelled; update the call in the error handling branch that references retrieveLogs(ctx?, ns, created.Name) and ensure any variable named execErr, created.Name and ns are preserved in the returned error formatting.
215-219: ⚖️ Poor tradeoffExcessive tolerations may schedule debug pods on incompatible nodes.
Lines 215-219 add three tolerations that collectively tolerate all taints (any key, any value, any effect). This means debug pods can land on nodes with critical taints like
node.kubernetes.io/unschedulableor custom infrastructure taints. While broad scheduling is intentional for debugging, consider documenting this behavior or adding a constraint.The first toleration (line 216,
Operator: TolerationOpExistswith no Effect) already tolerates everything, making lines 217-218 redundant. Consider simplifying:♻️ Optional simplification
Tolerations: []corev1.Toleration{ {Operator: corev1.TolerationOpExists}, - {Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoSchedule}, - {Operator: corev1.TolerationOpExists, Effect: corev1.TaintEffectNoExecute}, },🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ocp/nodesdebug/nodes_debug.go` around lines 215 - 219, The Tolerations slice currently contains redundant and overly-broad entries (corev1.Toleration with Operator: corev1.TolerationOpExists plus separate entries for corev1.TaintEffectNoSchedule and corev1.TaintEffectNoExecute) which effectively tolerates all taints and may place debug pods on incompatible nodes; update the Tolerations on the Pod spec (the Tolerations []corev1.Toleration literal in nodes_debug.go) by either keeping a single corev1.Toleration{Operator: corev1.TolerationOpExists} and removing the two redundant entries, or narrow them to explicit effects/keys you actually need (e.g., specify Effect or Key) and add a clear code comment documenting the deliberate broad tolerance behavior if you keep it.pkg/toolsets/openshift/nodes/nodes.go (2)
62-109: 💤 Low valueConsider validating node name format to prevent obvious typos.
The handler accepts any non-empty string as a node name. While Kubernetes will reject invalid node names, catching obvious errors early (e.g., leading/trailing whitespace, invalid DNS characters) would improve user experience.
♻️ Optional improvement
nodeArg := params.GetArguments()["node"] nodeName, ok := nodeArg.(string) if nodeArg == nil || !ok || nodeName == "" { return api.NewToolCallResult("", errors.New("missing required argument: node")), nil } +nodeName = strings.TrimSpace(nodeName) +if nodeName == "" { + return api.NewToolCallResult("", errors.New("node name cannot be empty or whitespace")), nil +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/openshift/nodes/nodes.go` around lines 62 - 109, In nodesDebugExec, validate the parsed nodeName for obvious typos before proceeding: trim leading/trailing whitespace (strings.TrimSpace) and reject empty results, then enforce a Kubernetes DNS-1123-like pattern (e.g. regex ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ and maximum length check) to catch invalid characters or lengths; if validation fails return api.NewToolCallResult("", errors.New("invalid node name: <brief reason>")), nil. Update the existing nodeArg handling in nodesDebugExec to perform these checks right after nodeName is extracted and use clear error messages for whitespace/format/length failures.
100-100: ⚖️ Poor tradeoffHard-coded empty hostPath and mountPath limit debugging capabilities.
Line 100 passes empty strings for
hostPathandmountPath, which means the tool always uses the default/→/hostmapping. This prevents users from mounting specific host directories (e.g.,/var/log→/logs) which could be useful for targeted debugging. Consider exposing these as optional tool parameters.♻️ Optional enhancement
Add
host_pathandmount_pathfields to the InputSchema and parse them from arguments, similar to hownamespaceandimageare handled.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/openshift/nodes/nodes.go` at line 100, The NodesDebugExec call currently passes empty strings for hostPath and mountPath (stdout, stderr, execErr := client.NodesDebugExec(params.Context, namespace, nodeName, image, command, "", "", timeout)), preventing custom host mounts; update the InputSchema to include optional host_path and mount_path fields, parse them from the CLI/args like namespace and image, and pass those variables instead of "" into NodesDebugExec so callers can mount arbitrary host directories (e.g., hostPath -> "/var/log", mountPath -> "/logs"); ensure defaults remain "/" -> "/host" when the new fields are omitted.pkg/toolsets/openshift/nodes/nodes_test.go (1)
1-75: ⚡ Quick winEnhance test coverage with success path and error propagation tests.
The current tests only validate input errors. Consider adding tests that:
- Verify successful command execution returns expected output
- Test error propagation from the underlying nodesdebug layer
- Verify timeout parameter is correctly passed through
- Test namespace and image parameter handling
Example test case structure:
🧪 Suggested additional tests
s.Run("successful execution returns stdout", func() { env := setupMockEnv() // Mock the underlying nodesdebug client params := api.ToolHandlerParams{ ToolCallRequest: staticRequest{args: map[string]any{ "node": "worker-0", "command": []interface{}{"echo", "test"}, }}, KubernetesClient: env.Client, Context: context.Background(), } result, err := nodesDebugExec(params) s.Require().NoError(err) s.Require().Nil(result.Error) s.Contains(result.Output, "test") })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/openshift/nodes/nodes_test.go` around lines 1 - 75, Add unit tests to nodes_test.go that cover the success and propagation paths: in NodesHandlerSuite add runs that (1) mock the underlying nodesdebug behavior (e.g., via a setupMockEnv or a fake client injected via ToolHandlerParams.KubernetesClient) and assert nodesDebugExec(params) returns no error, nil result.Error and expected stdout when command is ["echo","test"]; (2) simulate the nodesdebug layer returning an execution error and assert that nodesDebugExec returns a result with that error propagated; (3) pass a "timeout" argument and assert the fake client received the timeout value; and (4) pass "namespace" and "image" args and assert they are forwarded to the nodesdebug call—use the existing staticRequest, NodesHandlerSuite and nodesDebugExec identifiers to locate where to add these tests and wire the mock client via ToolHandlerParams.Context/KubernetesClient.evals/tasks/openshift/list-host-etc/cleanup.sh (1)
1-5: 💤 Low valueConsider extracting shared cleanup script to reduce duplication.
This cleanup script is identical to
evals/tasks/openshift/check-node-kernel/cleanup.sh. Both tasks share the same cleanup logic (deletingnode-debugpods). Consider one of these approaches:
- Create a shared cleanup script at
evals/tasks/openshift/cleanup.shand symlink to it from each task directory.- Reference a shared cleanup script path in the task YAML if the framework supports it.
- Accept the duplication if task-specific cleanup divergence is expected in the future.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/openshift/list-host-etc/cleanup.sh` around lines 1 - 5, Both cleanup scripts (evals/tasks/openshift/list-host-etc/cleanup.sh and evals/tasks/openshift/check-node-kernel/cleanup.sh) duplicate the same kubectl deletion logic; extract that shared logic into a single script (e.g., evals/tasks/openshift/cleanup.sh) preserving the shebang and set -euo pipefail, move the kubectl delete command into it, make the new file executable, then replace the two originals with either (a) symlinks pointing to the new evals/tasks/openshift/cleanup.sh or (b) update the task YAMLs to reference the shared script path; ensure permissions and behavior remain identical (ignore-not-found, wait=false) after the change.evals/tasks/openshift/list-host-etc/verify.sh (1)
1-10: 💤 Low valueConsider extracting shared verification script to reduce duplication.
This verification script is identical to
evals/tasks/openshift/check-node-kernel/verify.sh. Both tasks use the same verification logic (checking for leftovernode-debugpods). Similar to the cleanup scripts, consider extracting to a shared location if the framework supports it, or document the intentional duplication if task-specific verification is expected to diverge.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/openshift/list-host-etc/verify.sh` around lines 1 - 10, The verify.sh script duplicates the same node-debug pod check (using the label selector app.kubernetes.io/component=node-debug and the remaining variable computed via kubectl) found in the other task's verify.sh; extract that logic into a single shared verification script or library (e.g., a common script that exports a function like check_node_debug_pods) and have both verify.sh files source and call that shared function, or if the framework doesn't support shared scripts, add a short comment documenting intentional duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/ocp/nodesdebug/nodes_debug_test.go`:
- Around line 92-170: Add negative tests that call NodeDebug.NodesDebugExec to
ensure validatePath rejects unsafe paths: create cases for invalid hostPath
(e.g., "../etc"), invalid mountPath (e.g., "relative/path" or paths with ".." or
unsafe chars), instantiate env via NewNodeDebugTestEnv, set env.Pods.Running =
true, call nd := &NodeDebug{NodeDebugClient: env.Client} and invoke
nd.NodesDebugExec(...) and assert an error is returned and its message contains
the expected validation clue (e.g., "path traversal" or "invalid path"); add one
test per invalid scenario to the TestCreateDebugPod table or as separate s.Run
blocks so validatePath behavior is covered.
- Line 136: The tests currently call the unexported method createDebugPod
directly; change them to exercise that behavior via the public API
(NodesDebugExec) instead or, if the behavior must be part of the package
contract, export the function (rename createDebugPod -> CreateDebugPod) and
update callers/tests accordingly; specifically, update the test cases at the
sites calling createDebugPod to invoke NodesDebugExec with the same inputs and
assert the same outcomes (or promote the method to exported and adjust
imports/tests), ensuring you reference the createDebugPod implementation for
mapping inputs/results to the public NodesDebugExec parameters and expected
outputs.
In `@pkg/ocp/nodesdebug/nodes_debug.go`:
- Around line 304-311: Remove the unused helper function containerStatusByName
from the codebase: delete the function definition and any associated comments or
imports that only exist for it, then run build/static-analysis/tests to confirm
no remaining references; if this helper was intended for future use, either add
a TODO comment linking to the planned usage or move it to a clearly named
utilities file to avoid dead code.
- Line 147: The call to n.pollForRunning currently passes a hard-coded
time.Minute which ignores the user-provided timeout parameter; update the
n.pollForRunning invocation to use the function's timeout variable (convert the
timeout_seconds int/uint to a time.Duration if needed) instead of time.Minute so
pollForRunning(timeoutCtx, ns, created.Name, <converted timeout>) honors the
user's timeout setting; ensure the timeoutCtx and timeout variable remain
consistent when passed to pollForRunning.
- Line 329: The current traversal check uses slices.Contains(strings.Split(path,
string(filepath.Separator)), "..") which is OS-dependent; replace this with a
POSIX-safe check by normalizing and splitting using the Unix separator or the
path package: call path.Clean(path) (from the standard path, not filepath), then
split on "/" (e.g. strings.Split(cleanPath, "/")) and use slices.Contains on the
resulting segments to detect ".."; update the check that references the variable
path and the slices.Contains call (and remove or stop using filepath.Separator
here) so it correctly detects upward traversal regardless of host OS.
- Around line 140-144: The deferred pod-delete currently creates a fixed 30s
context and discards any error, which can leave orphaned pods; update the defer
in the defer func() around deleteCtx/delete cancel so it derives its timeout
from the parent context (or uses the parent context directly) and
increase/parameterize the timeout, call n.Pods(ns).Delete(deleteCtx,
created.Name, metav1.DeleteOptions{}) and capture the returned error, and log it
(using the existing logger or context logger) instead of ignoring it so deletion
failures are visible for debugging.
---
Nitpick comments:
In `@evals/tasks/openshift/list-host-etc/cleanup.sh`:
- Around line 1-5: Both cleanup scripts
(evals/tasks/openshift/list-host-etc/cleanup.sh and
evals/tasks/openshift/check-node-kernel/cleanup.sh) duplicate the same kubectl
deletion logic; extract that shared logic into a single script (e.g.,
evals/tasks/openshift/cleanup.sh) preserving the shebang and set -euo pipefail,
move the kubectl delete command into it, make the new file executable, then
replace the two originals with either (a) symlinks pointing to the new
evals/tasks/openshift/cleanup.sh or (b) update the task YAMLs to reference the
shared script path; ensure permissions and behavior remain identical
(ignore-not-found, wait=false) after the change.
In `@evals/tasks/openshift/list-host-etc/verify.sh`:
- Around line 1-10: The verify.sh script duplicates the same node-debug pod
check (using the label selector app.kubernetes.io/component=node-debug and the
remaining variable computed via kubectl) found in the other task's verify.sh;
extract that logic into a single shared verification script or library (e.g., a
common script that exports a function like check_node_debug_pods) and have both
verify.sh files source and call that shared function, or if the framework
doesn't support shared scripts, add a short comment documenting intentional
duplication.
In `@pkg/ocp/nodesdebug/nodes_debug.go`:
- Line 176: The debug pod currently uses a perpetual sleep command stored in
sleepCommand (sleepCommand := []string{"sleep", "infinity"}) which can leave
pods running forever if deferred cleanup fails; change this to a finite,
reasonable timeout (e.g., sleep for a configurable duration like 1h or derived
from a context/timeout value) and ensure the value is documented/configurable
where sleepCommand is defined (or within the function that creates the debug
pod) so pods automatically exit as a safety net if cleanup doesn't run.
- Around line 155-160: The log retrieval is using context.Background() which
ignores cancellation of the parent operation; change the call to n.retrieveLogs
to pass the parent context (the ctx used in the surrounding function) instead of
context.Background() so log retrieval is cancelled when the parent is cancelled;
update the call in the error handling branch that references retrieveLogs(ctx?,
ns, created.Name) and ensure any variable named execErr, created.Name and ns are
preserved in the returned error formatting.
- Around line 215-219: The Tolerations slice currently contains redundant and
overly-broad entries (corev1.Toleration with Operator: corev1.TolerationOpExists
plus separate entries for corev1.TaintEffectNoSchedule and
corev1.TaintEffectNoExecute) which effectively tolerates all taints and may
place debug pods on incompatible nodes; update the Tolerations on the Pod spec
(the Tolerations []corev1.Toleration literal in nodes_debug.go) by either
keeping a single corev1.Toleration{Operator: corev1.TolerationOpExists} and
removing the two redundant entries, or narrow them to explicit effects/keys you
actually need (e.g., specify Effect or Key) and add a clear code comment
documenting the deliberate broad tolerance behavior if you keep it.
In `@pkg/toolsets/openshift/nodes/nodes_test.go`:
- Around line 1-75: Add unit tests to nodes_test.go that cover the success and
propagation paths: in NodesHandlerSuite add runs that (1) mock the underlying
nodesdebug behavior (e.g., via a setupMockEnv or a fake client injected via
ToolHandlerParams.KubernetesClient) and assert nodesDebugExec(params) returns no
error, nil result.Error and expected stdout when command is ["echo","test"]; (2)
simulate the nodesdebug layer returning an execution error and assert that
nodesDebugExec returns a result with that error propagated; (3) pass a "timeout"
argument and assert the fake client received the timeout value; and (4) pass
"namespace" and "image" args and assert they are forwarded to the nodesdebug
call—use the existing staticRequest, NodesHandlerSuite and nodesDebugExec
identifiers to locate where to add these tests and wire the mock client via
ToolHandlerParams.Context/KubernetesClient.
In `@pkg/toolsets/openshift/nodes/nodes.go`:
- Around line 62-109: In nodesDebugExec, validate the parsed nodeName for
obvious typos before proceeding: trim leading/trailing whitespace
(strings.TrimSpace) and reject empty results, then enforce a Kubernetes
DNS-1123-like pattern (e.g. regex ^[a-z0-9]([-a-z0-9]*[a-z0-9])?$ and maximum
length check) to catch invalid characters or lengths; if validation fails return
api.NewToolCallResult("", errors.New("invalid node name: <brief reason>")), nil.
Update the existing nodeArg handling in nodesDebugExec to perform these checks
right after nodeName is extracted and use clear error messages for
whitespace/format/length failures.
- Line 100: The NodesDebugExec call currently passes empty strings for hostPath
and mountPath (stdout, stderr, execErr := client.NodesDebugExec(params.Context,
namespace, nodeName, image, command, "", "", timeout)), preventing custom host
mounts; update the InputSchema to include optional host_path and mount_path
fields, parse them from the CLI/args like namespace and image, and pass those
variables instead of "" into NodesDebugExec so callers can mount arbitrary host
directories (e.g., hostPath -> "/var/log", mountPath -> "/logs"); ensure
defaults remain "/" -> "/host" when the new fields are omitted.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b26f1bea-6a61-4dc9-9705-6623d36ebb04
📒 Files selected for processing (21)
.github/workflows/mcpchecker.yamlREADME.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/openshift/README.mdevals/tasks/openshift/check-node-kernel/cleanup.shevals/tasks/openshift/check-node-kernel/task.yamlevals/tasks/openshift/check-node-kernel/verify.shevals/tasks/openshift/list-host-etc/cleanup.shevals/tasks/openshift/list-host-etc/task.yamlevals/tasks/openshift/list-host-etc/verify.shinternal/tools/update-readme/main.gopkg/config/config_default_overrides.gopkg/ocp/nodesdebug/nodes_debug.gopkg/ocp/nodesdebug/nodes_debug_test.gopkg/ocp/nodesdebug/nodes_debug_testhelpers_test.gopkg/toolsets/openshift/nodes/nodes.gopkg/toolsets/openshift/nodes/nodes_test.gopkg/toolsets/openshift/toolset.go
009980b to
f24340d
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/ocp/nodesdebug/nodes_debug_test.go (1)
22-261: 🏗️ Heavy liftConsider adopting envtest for integration test coverage.
The tests currently use a fake client (
NodeDebugTestEnv) for all test cases. As per coding guidelines, tests should use real implementations and integration testing where possible, avoiding mocks, and specifically use thesetup-envtesttooling fromsigs.k8s.io/controller-runtimefor test setup.While envtest cannot test actual pod execution (no container runtime), it can provide a real Kubernetes API server to test:
- Pod creation/deletion API calls and validation
- Resource spec constraints and API-level errors
- Namespace and node targeting behavior
Consider a hybrid approach:
- Use envtest-based tests for API interactions (pod CRUD, input validation, spec generation)
- Retain fake-based tests for execution behavior testing (since envtest cannot run containers)
This would provide stronger integration coverage and align with the repository's testing guidelines.
As per coding guidelines: "Use real implementations and integration testing where possible - avoid mocks" and "Use the
setup-envtesttooling fromsigs.k8s.io/controller-runtimefor test setup and ensure network access is available for downloading the Kubernetesenvtestenvironment"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/ocp/nodesdebug/nodes_debug_test.go` around lines 22 - 261, Tests in NodesDebugSuite currently rely entirely on the fake test harness (NewNodeDebugTestEnv and its Pods behavior) for all cases; update test coverage to use envtest (setup-envtest from sigs.k8s.io/controller-runtime) for API-level validation and pod CRUD checks while keeping the fake env for runtime exec behavior: create new test variants (e.g., TestNodesDebugExec_APIValidation or TestNodesDebugExec_CreationWithEnvtest) that initialize an envtest control plane, use the real client to call NodesDebugExec to assert pod creation, namespace, volume and securityContext fields and node targeting, and retain existing NewNodeDebugTestEnv-based tests only for execution/exec error scenarios (where container runtime is required); ensure the new envtest tests wire up the same NodeDebug implementation (the nodeDebug helper) but point it at the envtest client and clean up namespaced resources after each run.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/ocp/nodesdebug/nodes_debug_test.go`:
- Around line 22-261: Tests in NodesDebugSuite currently rely entirely on the
fake test harness (NewNodeDebugTestEnv and its Pods behavior) for all cases;
update test coverage to use envtest (setup-envtest from
sigs.k8s.io/controller-runtime) for API-level validation and pod CRUD checks
while keeping the fake env for runtime exec behavior: create new test variants
(e.g., TestNodesDebugExec_APIValidation or
TestNodesDebugExec_CreationWithEnvtest) that initialize an envtest control
plane, use the real client to call NodesDebugExec to assert pod creation,
namespace, volume and securityContext fields and node targeting, and retain
existing NewNodeDebugTestEnv-based tests only for execution/exec error scenarios
(where container runtime is required); ensure the new envtest tests wire up the
same NodeDebug implementation (the nodeDebug helper) but point it at the envtest
client and clean up namespaced resources after each run.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 264958aa-1e2a-47f2-b269-fca1dd97322b
📒 Files selected for processing (17)
.github/workflows/mcpchecker.yamlevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/openshift/README.mdevals/tasks/openshift/check-node-kernel/cleanup.shevals/tasks/openshift/check-node-kernel/task.yamlevals/tasks/openshift/check-node-kernel/verify.shevals/tasks/openshift/list-host-etc/cleanup.shevals/tasks/openshift/list-host-etc/task.yamlevals/tasks/openshift/list-host-etc/verify.shpkg/ocp/nodesdebug/nodes_debug.gopkg/ocp/nodesdebug/nodes_debug_test.gopkg/ocp/nodesdebug/nodes_debug_testhelpers_test.gopkg/toolsets/openshift/nodes/nodes.gopkg/toolsets/openshift/nodes/nodes_test.gopkg/toolsets/openshift/toolset.go
✅ Files skipped from review due to trivial changes (4)
- evals/tasks/openshift/list-host-etc/task.yaml
- evals/tasks/openshift/list-host-etc/cleanup.sh
- evals/claude-code/eval.yaml
- .github/workflows/mcpchecker.yaml
🚧 Files skipped from review as they are similar to previous changes (12)
- evals/tasks/openshift/check-node-kernel/task.yaml
- evals/tasks/openshift/list-host-etc/verify.sh
- evals/tasks/openshift/check-node-kernel/cleanup.sh
- pkg/toolsets/openshift/toolset.go
- evals/tasks/openshift/check-node-kernel/verify.sh
- evals/openai-agent/eval.yaml
- evals/gemini-agent/eval.yaml
- pkg/toolsets/openshift/nodes/nodes_test.go
- evals/tasks/openshift/README.md
- pkg/toolsets/openshift/nodes/nodes.go
- pkg/ocp/nodesdebug/nodes_debug_testhelpers_test.go
- pkg/ocp/nodesdebug/nodes_debug.go
Add a privileged debug pod tool that mimics `oc debug node/<name>` for running commands on OpenShift nodes. The tool creates a temporary pod with the UBI9 toolbox image, executes the command, collects output, and cleans up automatically. Key design decisions based on PR review feedback: - All new code lives in pkg/ocp/ and pkg/toolsets/openshift/ only, no upstream packages (pkg/kubernetes/, pkg/mcp/) are modified - Tool is in the "openshift" toolset (not "core") to avoid conflicts - NodeDebugClient interface decouples from concrete k8s client for testability - Error messages include command output/logs for better diagnostics - Default timeout is 1 minute (within MCP client timeout limits) - Gated behind ReadOnly config (read_only=false required to use) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/tasks/openshift/check-node-kernel/cleanup.sh`:
- Line 4: The cleanup command currently uses "kubectl delete pods -A -l
app.kubernetes.io/component=node-debug --ignore-not-found --wait=false" which
returns immediately and can leave node-debug pods around; update the command to
wait for deletion (e.g. use "--wait=true" or remove the flag and add a sensible
"--timeout=<duration>" such as "--timeout=2m") so that the call blocks until
pods are deleted or the timeout elapses, ensuring deterministic cleanup for
subsequent tasks.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bfdbdb07-b4f3-4b4c-af79-f0a09f5d2d08
📒 Files selected for processing (22)
.github/workflows/mcpchecker.yamlREADME.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/openshift/README.mdevals/tasks/openshift/check-node-kernel/cleanup.shevals/tasks/openshift/check-node-kernel/task.yamlevals/tasks/openshift/check-node-kernel/verify.shevals/tasks/openshift/list-host-etc/cleanup.shevals/tasks/openshift/list-host-etc/task.yamlevals/tasks/openshift/list-host-etc/verify.shinternal/tools/update-readme/main.gopkg/config/config_default_overrides.gopkg/mcp/testdata/toolsets-openshift-tools.jsonpkg/ocp/nodesdebug/nodes_debug.gopkg/ocp/nodesdebug/nodes_debug_test.gopkg/ocp/nodesdebug/nodes_debug_testhelpers_test.gopkg/toolsets/openshift/nodes/nodes.gopkg/toolsets/openshift/nodes/nodes_test.gopkg/toolsets/openshift/toolset.go
✅ Files skipped from review due to trivial changes (2)
- evals/tasks/openshift/list-host-etc/cleanup.sh
- evals/openai-agent/eval.yaml
🚧 Files skipped from review as they are similar to previous changes (19)
- docs/configuration.md
- evals/tasks/openshift/list-host-etc/verify.sh
- pkg/config/config_default_overrides.go
- internal/tools/update-readme/main.go
- evals/tasks/openshift/check-node-kernel/task.yaml
- evals/claude-code/eval.yaml
- README.md
- evals/tasks/openshift/list-host-etc/task.yaml
- pkg/toolsets/openshift/nodes/nodes_test.go
- .github/workflows/mcpchecker.yaml
- evals/tasks/openshift/check-node-kernel/verify.sh
- pkg/mcp/testdata/toolsets-openshift-tools.json
- evals/tasks/openshift/README.md
- pkg/toolsets/openshift/toolset.go
- pkg/ocp/nodesdebug/nodes_debug.go
- pkg/ocp/nodesdebug/nodes_debug_test.go
- evals/gemini-agent/eval.yaml
- pkg/ocp/nodesdebug/nodes_debug_testhelpers_test.go
- pkg/toolsets/openshift/nodes/nodes.go
d9df4eb to
b4e45d7
Compare
|
@coderabbitai resume |
|
✅ Action performedReviews resumed. Review finished.
|
There was a problem hiding this comment.
Actionable comments posted: 12
🧹 Nitpick comments (1)
pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go (1)
60-62: ⚡ Quick winPreserve log request options in the fake to keep test behavior faithful.
Line 60 accepts
tailbut drops it, and Line 118 ignoresname/opts. This makes log-path tests less reliable for catching regressions around log option wiring.Proposed refactor
func (f *FakeNodeDebugClient) PodsLog(ctx context.Context, namespace, name, container string, previous bool, tail int64) (string, error) { - req := f.pods.GetLogs(name, &corev1.PodLogOptions{Container: container, Previous: previous}) + var tailLines *int64 + if tail > 0 { + tailLines = &tail + } + req := f.pods.GetLogs(name, &corev1.PodLogOptions{ + Container: container, + Previous: previous, + TailLines: tailLines, + }) res := req.Do(ctx) @@ type FakePodInterface struct { corev1client.PodInterface @@ ExecError error + LastLogName string + LastLogOpts *corev1.PodLogOptions } @@ func (f *FakePodInterface) GetLogs(name string, opts *corev1.PodLogOptions) *restclient.Request { + f.LastLogName = name + if opts != nil { + c := *opts + f.LastLogOpts = &c + } body := io.NopCloser(strings.NewReader(f.Logs))Also applies to: 118-129
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go` around lines 60 - 62, The fake log implementation in nodes_debug_testhelpers_test.go currently accepts parameters like tail, name, and opts but drops them, which breaks log-path tests; update the fake (the test helper function that handles log requests) to preserve and store/forward the incoming log request options (tail, name, opts) exactly as passed so test assertions can verify wiring, i.e., change the fake's handler to keep the tail value and the name/opts map/struct instead of ignoring them and adjust any recorded request struct used by tests to include those fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/tasks/cluster-diagnostics/check-node-kernel/task.yaml`:
- Around line 10-12: The verification currently only checks for the substring
"Linux", which is too permissive; update the verifier (the inline check or
verify.sh referenced in the task) to require a version-bearing kernel string
instead of just "Linux". Replace the simple contains test with a regex check
that matches kernel-version patterns (for example, /Linux\s+\d+\.\d+(\.\d+)?/ or
similar) and fail when the output doesn't include digits/dots after "Linux";
ensure this change is applied to the verify logic that currently asserts
"Linux".
In `@evals/tasks/cluster-diagnostics/check-node-kernel/verify.sh`:
- Line 4: The kubectl pod-count invocation in verify.sh is currently silencing
errors with a 2>/dev/null while the script runs under set -euo pipefail, which
hides why kubectl may fail; remove the stderr redirection (or instead capture
stderr into a variable and print a clear error) so failures are visible in CI
logs—locate the kubectl command in verify.sh that uses 2>/dev/null and either
delete the redirection or wrap the call to print the captured error before
exiting.
In `@evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh`:
- Line 4: The current cleanup.sh deletes all "node-debug" pods across all
namespaces; change the kubectl delete invocation in cleanup.sh to target only
the task's namespace or a task-specific label instead of using --all-namespaces
— e.g., use -n "$TASK_NAMESPACE" to restrict to the current task namespace or
add -l "task=$TASK_ID" (or run/$RUN_ID) to select pods by a task/run-specific
label; update any environment variable names (TASK_NAMESPACE, TASK_ID)
referenced in the script and ensure the node-debug pod deletion uses that
namespace/label selector so other evals' debug pods are not removed.
In `@evals/tasks/cluster-diagnostics/list-host-etc/verify.sh`:
- Line 4: The verify.sh script currently suppresses kubectl stderr with
"2>/dev/null", which hides useful error details; remove that stderr redirection
from any kubectl invocations in verify.sh (or alternatively capture stderr into
a variable and echo it on failure) so that authentication/context/API errors are
visible—look for kubectl commands in verify.sh and either delete "2>/dev/null"
or wrap the call to capture >&2 into a variable and print the variable before
exiting.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go`:
- Around line 1-20: The test currently lives in package nodesdebug and
constructs the internal NodeDebug type directly; change the test package to
nodesdebug_test and stop referencing NodeDebug internals — instead exercise only
exported entry points (e.g., call exported functions like RunNodeDebug or
NewPublicNodeDebug); if no exported constructor exists, add an exported
constructor (e.g., NewNodeDebug or NewPublicNodeDebug) to the package and use
that from the test; also update imports in the test to import the package as
nodesdebug and replace direct struct instantiation with calls to the exported
API.
- Around line 187-191: The test indexes VolumeMounts[0] after calling
s.Require().NotNil(...) which still allows an empty slice and can panic; before
any access add an explicit length guard (e.g.
s.Require().Len(<theObject>.VolumeMounts, 1) or
s.Require().NotEmpty(<theObject>.VolumeMounts)) immediately before the existing
access to VolumeMounts[0] so the test fails cleanly instead of panicking; keep
the existing NotNil check but insert the s.Require().Len(...) call (referencing
the VolumeMounts slice and the test's s.Require() helper) just prior to the
index usage.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go`:
- Around line 99-116: The fake pod client's Get and Delete handlers in
nodes_debug_testhelpers_test.go currently ignore the requested pod name and
operate on the last-created pod, allowing false-positive tests; update the fake
methods (e.g., the fake Get and Delete implementations used in tests) to verify
that the requested pod name equals the stored pod's Name and if not, return a
NotFound error (or nil pod + error) instead of operating on the wrong pod,
ensuring callers using an incorrect pod name fail the test reliably.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug.go`:
- Around line 156-164: retrieveLogs' returned error is currently discarded;
update the call in nodes_debug.go to log that error instead of ignoring it by
capturing the returned err from retrieveLogs and emitting a descriptive log
entry including node/pod identifiers and the error (use the file's existing
logger variable, e.g., d.logger or log, and include the error in the message) so
the diagnostic context is preserved for debugging.
In `@pkg/mcp/toolsets_test.go`:
- Around line 246-253: Replace the non-fatal assertion s.NoError(err) used when
creating the dependent test state with a fatal assertion so downstream checks
don't run on invalid state: change the call that checks the returned err (the
assertion currently using s.NoError(err) where err is produced on the setup at
line ~246) to s.Require().NoError(err) so the test stops on failure before the
dependent assertions around the later checks (currently at line ~252) execute.
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go`:
- Around line 22-70: The test currently calls the unexported nodesDebugExec
directly; change it to exercise the public toolset API by registering the
handler via the toolset registration function and invoking it using the MCP tool
invocation path consumers use (instead of calling nodesDebugExec), then
replace/augment existing error-only cases with at least one positive successful
execution test that asserts the returned stdout/stderr and exit status; locate
usages of nodesDebugExec, the tool handler registration code (the public
Register/Handle function for this tool), and the MCP tool call helper in the
test suite (e.g., NodesHandlerSuite) to implement these changes and optionally
add an envtest-based integration test if you need end-to-end pod creation
validation.
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go`:
- Around line 41-45: The handler that reads the "timeout_seconds" request
parameter must enforce the documented contract: if the parameter is missing set
a default of 60, reject non-integer/fractional values, and enforce a minimum of
1. Update the code in the request handler in nodes.go (the code that currently
parses "timeout_seconds" around the existing parsing block) to: 1) check for
empty/missing and set timeoutSecs = 60; 2) attempt to parse the value as an
integer (use integer parsing, not float) and if parsing fails return a 400
error; and 3) if parsed value < 1 clamp or reject per project convention (prefer
returning 400 for invalid input or set to 1 if you choose to clamp). Ensure the
variable used downstream is an int and that any timeouts use this validated
integer value.
- Around line 115-118: The code that parses the node `command` currently only
handles values of type []interface{} and thus rejects valid []string inputs;
update the parsing logic (the `command` handling block in nodes.go) to accept
both []interface{} and []string (and also a single string) by using a type
switch on the raw `command` value, converting []interface{} entries to strings,
using []string directly when present, and trimming/validating each entry before
storing into the existing command slice; ensure nil or empty inputs are handled
gracefully and keep the final type as []string used elsewhere.
---
Nitpick comments:
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go`:
- Around line 60-62: The fake log implementation in
nodes_debug_testhelpers_test.go currently accepts parameters like tail, name,
and opts but drops them, which breaks log-path tests; update the fake (the test
helper function that handles log requests) to preserve and store/forward the
incoming log request options (tail, name, opts) exactly as passed so test
assertions can verify wiring, i.e., change the fake's handler to keep the tail
value and the name/opts map/struct instead of ignoring them and adjust any
recorded request struct used by tests to include those fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24fbdc2f-698b-486a-920f-5e27b8b5c106
📒 Files selected for processing (22)
README.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/cluster-diagnostics/README.mdevals/tasks/cluster-diagnostics/check-node-kernel/cleanup.shevals/tasks/cluster-diagnostics/check-node-kernel/task.yamlevals/tasks/cluster-diagnostics/check-node-kernel/verify.shevals/tasks/cluster-diagnostics/list-host-etc/cleanup.shevals/tasks/cluster-diagnostics/list-host-etc/task.yamlevals/tasks/cluster-diagnostics/list-host-etc/verify.shinternal/tools/update-readme/main.gopkg/cluster-diagnostics/nodesdebug/nodes_debug.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_test.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.gopkg/mcp/modules.gopkg/mcp/testdata/toolsets-cluster-diagnostics-tools.jsonpkg/mcp/toolsets_test.gopkg/toolsets/cluster-diagnostics/nodes/nodes.gopkg/toolsets/cluster-diagnostics/nodes/nodes_test.gopkg/toolsets/cluster-diagnostics/toolset.go
✅ Files skipped from review due to trivial changes (4)
- docs/configuration.md
- evals/tasks/cluster-diagnostics/list-host-etc/task.yaml
- evals/tasks/cluster-diagnostics/README.md
- evals/openai-agent/eval.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/tools/update-readme/main.go
- README.md
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 12
🧹 Nitpick comments (1)
pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go (1)
60-62: ⚡ Quick winPreserve log request options in the fake to keep test behavior faithful.
Line 60 accepts
tailbut drops it, and Line 118 ignoresname/opts. This makes log-path tests less reliable for catching regressions around log option wiring.Proposed refactor
func (f *FakeNodeDebugClient) PodsLog(ctx context.Context, namespace, name, container string, previous bool, tail int64) (string, error) { - req := f.pods.GetLogs(name, &corev1.PodLogOptions{Container: container, Previous: previous}) + var tailLines *int64 + if tail > 0 { + tailLines = &tail + } + req := f.pods.GetLogs(name, &corev1.PodLogOptions{ + Container: container, + Previous: previous, + TailLines: tailLines, + }) res := req.Do(ctx) @@ type FakePodInterface struct { corev1client.PodInterface @@ ExecError error + LastLogName string + LastLogOpts *corev1.PodLogOptions } @@ func (f *FakePodInterface) GetLogs(name string, opts *corev1.PodLogOptions) *restclient.Request { + f.LastLogName = name + if opts != nil { + c := *opts + f.LastLogOpts = &c + } body := io.NopCloser(strings.NewReader(f.Logs))Also applies to: 118-129
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go` around lines 60 - 62, The fake log implementation in nodes_debug_testhelpers_test.go currently accepts parameters like tail, name, and opts but drops them, which breaks log-path tests; update the fake (the test helper function that handles log requests) to preserve and store/forward the incoming log request options (tail, name, opts) exactly as passed so test assertions can verify wiring, i.e., change the fake's handler to keep the tail value and the name/opts map/struct instead of ignoring them and adjust any recorded request struct used by tests to include those fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/tasks/cluster-diagnostics/check-node-kernel/task.yaml`:
- Around line 10-12: The verification currently only checks for the substring
"Linux", which is too permissive; update the verifier (the inline check or
verify.sh referenced in the task) to require a version-bearing kernel string
instead of just "Linux". Replace the simple contains test with a regex check
that matches kernel-version patterns (for example, /Linux\s+\d+\.\d+(\.\d+)?/ or
similar) and fail when the output doesn't include digits/dots after "Linux";
ensure this change is applied to the verify logic that currently asserts
"Linux".
In `@evals/tasks/cluster-diagnostics/check-node-kernel/verify.sh`:
- Line 4: The kubectl pod-count invocation in verify.sh is currently silencing
errors with a 2>/dev/null while the script runs under set -euo pipefail, which
hides why kubectl may fail; remove the stderr redirection (or instead capture
stderr into a variable and print a clear error) so failures are visible in CI
logs—locate the kubectl command in verify.sh that uses 2>/dev/null and either
delete the redirection or wrap the call to print the captured error before
exiting.
In `@evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh`:
- Line 4: The current cleanup.sh deletes all "node-debug" pods across all
namespaces; change the kubectl delete invocation in cleanup.sh to target only
the task's namespace or a task-specific label instead of using --all-namespaces
— e.g., use -n "$TASK_NAMESPACE" to restrict to the current task namespace or
add -l "task=$TASK_ID" (or run/$RUN_ID) to select pods by a task/run-specific
label; update any environment variable names (TASK_NAMESPACE, TASK_ID)
referenced in the script and ensure the node-debug pod deletion uses that
namespace/label selector so other evals' debug pods are not removed.
In `@evals/tasks/cluster-diagnostics/list-host-etc/verify.sh`:
- Line 4: The verify.sh script currently suppresses kubectl stderr with
"2>/dev/null", which hides useful error details; remove that stderr redirection
from any kubectl invocations in verify.sh (or alternatively capture stderr into
a variable and echo it on failure) so that authentication/context/API errors are
visible—look for kubectl commands in verify.sh and either delete "2>/dev/null"
or wrap the call to capture >&2 into a variable and print the variable before
exiting.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go`:
- Around line 1-20: The test currently lives in package nodesdebug and
constructs the internal NodeDebug type directly; change the test package to
nodesdebug_test and stop referencing NodeDebug internals — instead exercise only
exported entry points (e.g., call exported functions like RunNodeDebug or
NewPublicNodeDebug); if no exported constructor exists, add an exported
constructor (e.g., NewNodeDebug or NewPublicNodeDebug) to the package and use
that from the test; also update imports in the test to import the package as
nodesdebug and replace direct struct instantiation with calls to the exported
API.
- Around line 187-191: The test indexes VolumeMounts[0] after calling
s.Require().NotNil(...) which still allows an empty slice and can panic; before
any access add an explicit length guard (e.g.
s.Require().Len(<theObject>.VolumeMounts, 1) or
s.Require().NotEmpty(<theObject>.VolumeMounts)) immediately before the existing
access to VolumeMounts[0] so the test fails cleanly instead of panicking; keep
the existing NotNil check but insert the s.Require().Len(...) call (referencing
the VolumeMounts slice and the test's s.Require() helper) just prior to the
index usage.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go`:
- Around line 99-116: The fake pod client's Get and Delete handlers in
nodes_debug_testhelpers_test.go currently ignore the requested pod name and
operate on the last-created pod, allowing false-positive tests; update the fake
methods (e.g., the fake Get and Delete implementations used in tests) to verify
that the requested pod name equals the stored pod's Name and if not, return a
NotFound error (or nil pod + error) instead of operating on the wrong pod,
ensuring callers using an incorrect pod name fail the test reliably.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug.go`:
- Around line 156-164: retrieveLogs' returned error is currently discarded;
update the call in nodes_debug.go to log that error instead of ignoring it by
capturing the returned err from retrieveLogs and emitting a descriptive log
entry including node/pod identifiers and the error (use the file's existing
logger variable, e.g., d.logger or log, and include the error in the message) so
the diagnostic context is preserved for debugging.
In `@pkg/mcp/toolsets_test.go`:
- Around line 246-253: Replace the non-fatal assertion s.NoError(err) used when
creating the dependent test state with a fatal assertion so downstream checks
don't run on invalid state: change the call that checks the returned err (the
assertion currently using s.NoError(err) where err is produced on the setup at
line ~246) to s.Require().NoError(err) so the test stops on failure before the
dependent assertions around the later checks (currently at line ~252) execute.
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go`:
- Around line 22-70: The test currently calls the unexported nodesDebugExec
directly; change it to exercise the public toolset API by registering the
handler via the toolset registration function and invoking it using the MCP tool
invocation path consumers use (instead of calling nodesDebugExec), then
replace/augment existing error-only cases with at least one positive successful
execution test that asserts the returned stdout/stderr and exit status; locate
usages of nodesDebugExec, the tool handler registration code (the public
Register/Handle function for this tool), and the MCP tool call helper in the
test suite (e.g., NodesHandlerSuite) to implement these changes and optionally
add an envtest-based integration test if you need end-to-end pod creation
validation.
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go`:
- Around line 41-45: The handler that reads the "timeout_seconds" request
parameter must enforce the documented contract: if the parameter is missing set
a default of 60, reject non-integer/fractional values, and enforce a minimum of
1. Update the code in the request handler in nodes.go (the code that currently
parses "timeout_seconds" around the existing parsing block) to: 1) check for
empty/missing and set timeoutSecs = 60; 2) attempt to parse the value as an
integer (use integer parsing, not float) and if parsing fails return a 400
error; and 3) if parsed value < 1 clamp or reject per project convention (prefer
returning 400 for invalid input or set to 1 if you choose to clamp). Ensure the
variable used downstream is an int and that any timeouts use this validated
integer value.
- Around line 115-118: The code that parses the node `command` currently only
handles values of type []interface{} and thus rejects valid []string inputs;
update the parsing logic (the `command` handling block in nodes.go) to accept
both []interface{} and []string (and also a single string) by using a type
switch on the raw `command` value, converting []interface{} entries to strings,
using []string directly when present, and trimming/validating each entry before
storing into the existing command slice; ensure nil or empty inputs are handled
gracefully and keep the final type as []string used elsewhere.
---
Nitpick comments:
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go`:
- Around line 60-62: The fake log implementation in
nodes_debug_testhelpers_test.go currently accepts parameters like tail, name,
and opts but drops them, which breaks log-path tests; update the fake (the test
helper function that handles log requests) to preserve and store/forward the
incoming log request options (tail, name, opts) exactly as passed so test
assertions can verify wiring, i.e., change the fake's handler to keep the tail
value and the name/opts map/struct instead of ignoring them and adjust any
recorded request struct used by tests to include those fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24fbdc2f-698b-486a-920f-5e27b8b5c106
📒 Files selected for processing (22)
README.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/cluster-diagnostics/README.mdevals/tasks/cluster-diagnostics/check-node-kernel/cleanup.shevals/tasks/cluster-diagnostics/check-node-kernel/task.yamlevals/tasks/cluster-diagnostics/check-node-kernel/verify.shevals/tasks/cluster-diagnostics/list-host-etc/cleanup.shevals/tasks/cluster-diagnostics/list-host-etc/task.yamlevals/tasks/cluster-diagnostics/list-host-etc/verify.shinternal/tools/update-readme/main.gopkg/cluster-diagnostics/nodesdebug/nodes_debug.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_test.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.gopkg/mcp/modules.gopkg/mcp/testdata/toolsets-cluster-diagnostics-tools.jsonpkg/mcp/toolsets_test.gopkg/toolsets/cluster-diagnostics/nodes/nodes.gopkg/toolsets/cluster-diagnostics/nodes/nodes_test.gopkg/toolsets/cluster-diagnostics/toolset.go
✅ Files skipped from review due to trivial changes (4)
- docs/configuration.md
- evals/tasks/cluster-diagnostics/list-host-etc/task.yaml
- evals/tasks/cluster-diagnostics/README.md
- evals/openai-agent/eval.yaml
🚧 Files skipped from review as they are similar to previous changes (2)
- internal/tools/update-readme/main.go
- README.md
🛑 Comments failed to post (12)
evals/tasks/cluster-diagnostics/check-node-kernel/task.yaml (1)
10-12:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winStrengthen verification to require an actual kernel version, not just
"Linux"On Line 11,
contains: "Linux"is too permissive for the prompt on Line 9; many non-answer outputs will still pass. Tightenverify.sh(or the inline check) to validate a version-bearing kernel string so the task measures the intended behavior.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/cluster-diagnostics/check-node-kernel/task.yaml` around lines 10 - 12, The verification currently only checks for the substring "Linux", which is too permissive; update the verifier (the inline check or verify.sh referenced in the task) to require a version-bearing kernel string instead of just "Linux". Replace the simple contains test with a regex check that matches kernel-version patterns (for example, /Linux\s+\d+\.\d+(\.\d+)?/ or similar) and fail when the output doesn't include digits/dots after "Linux"; ensure this change is applied to the verify logic that currently asserts "Linux".evals/tasks/cluster-diagnostics/check-node-kernel/verify.sh (1)
4-4:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winAvoid silent failure in the pod count command.
With
set -euo pipefail, a failingkubectlexits the script immediately;2>/dev/nullhides the root cause, making eval failures opaque in CI logs. Keep stderr visible (or handle and print a clear error) so verification failures are diagnosable.Suggested patch
-remaining=$(kubectl get pods -A -l app.kubernetes.io/component=node-debug --no-headers 2>/dev/null | wc -l | tr -d ' ') +remaining=$(kubectl get pods -A -l app.kubernetes.io/component=node-debug --no-headers | wc -l | tr -d ' ')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/cluster-diagnostics/check-node-kernel/verify.sh` at line 4, The kubectl pod-count invocation in verify.sh is currently silencing errors with a 2>/dev/null while the script runs under set -euo pipefail, which hides why kubectl may fail; remove the stderr redirection (or instead capture stderr into a variable and print a clear error) so failures are visible in CI logs—locate the kubectl command in verify.sh that uses 2>/dev/null and either delete the redirection or wrap the call to print the captured error before exiting.evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh (1)
4-4:
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftCleanup scope is too broad for shared clusters.
Line 4 deletes all
node-debugpods across all namespaces. That can terminate unrelated debug sessions and make concurrent evals flaky. Scope deletion to the task namespace and/or a task/run-specific label.Suggested direction
-kubectl delete pods -A -l app.kubernetes.io/component=node-debug --ignore-not-found --wait=true --timeout=120s +# Example: scope to a known namespace + task/run label populated by task setup +kubectl delete pods -n "${EVAL_NAMESPACE:?missing EVAL_NAMESPACE}" \ + -l 'app.kubernetes.io/component=node-debug,eval.task=list-host-etc,eval.run-id='"${EVAL_RUN_ID:?missing EVAL_RUN_ID}" \ + --ignore-not-found --wait=true --timeout=120s🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh` at line 4, The current cleanup.sh deletes all "node-debug" pods across all namespaces; change the kubectl delete invocation in cleanup.sh to target only the task's namespace or a task-specific label instead of using --all-namespaces — e.g., use -n "$TASK_NAMESPACE" to restrict to the current task namespace or add -l "task=$TASK_ID" (or run/$RUN_ID) to select pods by a task/run-specific label; update any environment variable names (TASK_NAMESPACE, TASK_ID) referenced in the script and ensure the node-debug pod deletion uses that namespace/label selector so other evals' debug pods are not removed.evals/tasks/cluster-diagnostics/list-host-etc/verify.sh (1)
4-4:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winDon’t suppress
kubectlstderr in verification scripts.
2>/dev/nullhides why verification failed (e.g., auth/context/API errors), which makes CI failures hard to triage. Keep stderr visible, or catch and print a clear error before exiting.Suggested patch
-remaining=$(kubectl get pods -A -l app.kubernetes.io/component=node-debug --no-headers 2>/dev/null | wc -l | tr -d ' ') +remaining=$(kubectl get pods -A -l app.kubernetes.io/component=node-debug --no-headers | wc -l | tr -d ' ')📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.remaining=$(kubectl get pods -A -l app.kubernetes.io/component=node-debug --no-headers | wc -l | tr -d ' ')🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@evals/tasks/cluster-diagnostics/list-host-etc/verify.sh` at line 4, The verify.sh script currently suppresses kubectl stderr with "2>/dev/null", which hides useful error details; remove that stderr redirection from any kubectl invocations in verify.sh (or alternatively capture stderr into a variable and echo it on failure) so that authentication/context/API errors are visible—look for kubectl commands in verify.sh and either delete "2>/dev/null" or wrap the call to capture >&2 into a variable and print the variable before exiting.pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go (2)
1-20:
⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftSwitch this suite to black-box API testing.
At Line 1 and Lines 18-20, the test runs in
package nodesdebugand directly constructsNodeDebug, which couples tests to internals. Move this topackage nodesdebug_testand exercise exported entry points only (or expose a constructor if needed).As per coding guidelines, tests should “Test the public API only - tests should be black-box and not access internal/private functions”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go` around lines 1 - 20, The test currently lives in package nodesdebug and constructs the internal NodeDebug type directly; change the test package to nodesdebug_test and stop referencing NodeDebug internals — instead exercise only exported entry points (e.g., call exported functions like RunNodeDebug or NewPublicNodeDebug); if no exported constructor exists, add an exported constructor (e.g., NewNodeDebug or NewPublicNodeDebug) to the package and use that from the test; also update imports in the test to import the package as nodesdebug and replace direct struct instantiation with calls to the exported API.Source: Coding guidelines
187-191:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd a length guard before indexing
VolumeMounts[0].At Lines 189-190,
NotNildoes not guarantee non-empty; an empty slice can still panic on[0]. Uses.Require().Len(..., 1)first.Suggested fix
- s.Require().NotNil(created.Spec.Containers[0].VolumeMounts) - s.Equal(tt.expectedMountPath, created.Spec.Containers[0].VolumeMounts[0].MountPath) + s.Require().Len(created.Spec.Containers[0].VolumeMounts, 1) + s.Equal(tt.expectedMountPath, created.Spec.Containers[0].VolumeMounts[0].MountPath)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.s.Require().Len(created.Spec.Containers[0].VolumeMounts, 1) s.Equal(tt.expectedMountPath, created.Spec.Containers[0].VolumeMounts[0].MountPath)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go` around lines 187 - 191, The test indexes VolumeMounts[0] after calling s.Require().NotNil(...) which still allows an empty slice and can panic; before any access add an explicit length guard (e.g. s.Require().Len(<theObject>.VolumeMounts, 1) or s.Require().NotEmpty(<theObject>.VolumeMounts)) immediately before the existing access to VolumeMounts[0] so the test fails cleanly instead of panicking; keep the existing NotNil check but insert the s.Require().Len(...) call (referencing the VolumeMounts slice and the test's s.Require() helper) just prior to the index usage.pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go (1)
99-116:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate pod identity in fake
Get/Deleteto avoid false-positive tests.Line 99 and Line 113 ignore the requested pod name and operate on whatever was created last. That can let tests pass even when caller code uses the wrong pod name.
Proposed fix
func (f *FakePodInterface) Get(_ context.Context, _ string, _ metav1.GetOptions) (*corev1.Pod, error) { if f.Created == nil { return nil, fmt.Errorf("pod not created yet") } + if f.Created.Name != name { + return nil, fmt.Errorf("unexpected pod name: got %q, want %q", name, f.Created.Name) + } pod := f.Created.DeepCopy() @@ func (f *FakePodInterface) Delete(_ context.Context, _ string, _ metav1.DeleteOptions) error { + if f.Created != nil && f.Created.Name != name { + return fmt.Errorf("unexpected pod name: got %q, want %q", name, f.Created.Name) + } f.Deleted = true return nil }📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.func (f *FakePodInterface) Get(_ context.Context, name string, _ metav1.GetOptions) (*corev1.Pod, error) { if f.Created == nil { return nil, fmt.Errorf("pod not created yet") } if f.Created.Name != name { return nil, fmt.Errorf("unexpected pod name: got %q, want %q", name, f.Created.Name) } pod := f.Created.DeepCopy() return pod, nil } func (f *FakePodInterface) Delete(_ context.Context, name string, _ metav1.DeleteOptions) error { if f.Created != nil && f.Created.Name != name { return fmt.Errorf("unexpected pod name: got %q, want %q", name, f.Created.Name) } f.Deleted = true return nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go` around lines 99 - 116, The fake pod client's Get and Delete handlers in nodes_debug_testhelpers_test.go currently ignore the requested pod name and operate on the last-created pod, allowing false-positive tests; update the fake methods (e.g., the fake Get and Delete implementations used in tests) to verify that the requested pod name equals the stored pod's Name and if not, return a NotFound error (or nil pod + error) instead of operating on the wrong pod, ensuring callers using an incorrect pod name fail the test reliably.pkg/cluster-diagnostics/nodesdebug/nodes_debug.go (1)
156-164:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winLog the error from
retrieveLogsinstead of discarding it.Per coding guidelines, errors should not be ignored in production code. When log retrieval fails, the diagnostic context is lost silently, making it harder to debug why the exec failed.
Suggested fix
stdout, stderr, execErr := n.PodsExec(timeoutCtx, ns, created.Name, NodeDebugContainerName, command) if execErr != nil { // Retrieve logs on exec errors — the pod may have produced partial output. - logs, _ := n.retrieveLogs(context.Background(), ns, created.Name) + logs, logErr := n.retrieveLogs(context.Background(), ns, created.Name) + if logErr != nil { + klog.Warningf("failed to retrieve logs for debug pod %s: %v", created.Name, logErr) + } if logs != "" { return "", "", fmt.Errorf("failed to execute command in debug pod %s in namespace %s: %w\nlogs:\n%s", created.Name, ns, execErr, logs) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug.go` around lines 156 - 164, retrieveLogs' returned error is currently discarded; update the call in nodes_debug.go to log that error instead of ignoring it by capturing the returned err from retrieveLogs and emitting a descriptive log entry including node/pod identifiers and the error (use the file's existing logger variable, e.g., d.logger or log, and include the error in the message) so the diagnostic context is preserved for debugging.Source: Coding guidelines
pkg/mcp/toolsets_test.go (1)
246-253:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse
s.Require().NoError(err)before dependent assertions.
Line 246produceserrthat must be fatal for this test path; usings.NoError(err)allowsLine 252to continue with invalid state and can obscure the primary failure.Suggested fix
s.InitMcpClient() tools, err := s.ListTools() + s.Require().NoError(err, "Expected no error from ListTools") + s.Require().NotNil(tools, "Expected tools from ListTools") + s.Run("ListTools returns tools", func() { - s.NotNil(tools, "Expected tools from ListTools") - s.NoError(err, "Expected no error from ListTools") + s.NotNil(tools, "Expected tools from ListTools") }) s.Run("ListTools returns correct Tool metadata", func() { s.assertJsonSnapshot("toolsets-openshift-mustgather-tools.json", tools.Tools) })As per coding guidelines, “Use
s.Require().NoError(err)for operations in tests that must succeed for the test to be valid”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/mcp/toolsets_test.go` around lines 246 - 253, Replace the non-fatal assertion s.NoError(err) used when creating the dependent test state with a fatal assertion so downstream checks don't run on invalid state: change the call that checks the returned err (the assertion currently using s.NoError(err) where err is produced on the setup at line ~246) to s.Require().NoError(err) so the test stops on failure before the dependent assertions around the later checks (currently at line ~252) execute.Source: Coding guidelines
pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go (1)
22-70: 🛠️ Refactor suggestion | 🟠 Major | 🏗️ Heavy lift
Test the public API instead of the private
nodesDebugExecfunction.This test directly invokes the unexported
nodesDebugExecfunction. As per coding guidelines, tests should be black-box and only test the public API surface, not internal/private functions. The proper approach is to test through the public toolset API that consumers will use.Additionally, this test suite only covers error validation paths but has no positive test cases for successful command execution. As per coding guidelines, aim for high test coverage including both error paths and successful scenarios.
♻️ Recommended approach
- Test through the public API: Register the tool handler through the toolset interface and invoke it via the MCP tool call mechanism that external consumers use
- Add positive test cases: Include at least one successful execution scenario that validates the tool returns expected output when given valid inputs
- Consider integration tests: For a tool that creates debug pods, consider using
envtest(per guidelines) to verify end-to-end behavior with a real Kubernetes APIExample structure:
func (s *NodesHandlerSuite) TestSuccessfulExecution() { // Setup toolset and register handler through public API // Invoke tool via public interface // Assert successful result with expected stdout }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go` around lines 22 - 70, The test currently calls the unexported nodesDebugExec directly; change it to exercise the public toolset API by registering the handler via the toolset registration function and invoking it using the MCP tool invocation path consumers use (instead of calling nodesDebugExec), then replace/augment existing error-only cases with at least one positive successful execution test that asserts the returned stdout/stderr and exit status; locate usages of nodesDebugExec, the tool handler registration code (the public Register/Handle function for this tool), and the MCP tool call helper in the test suite (e.g., NodesHandlerSuite) to implement these changes and optionally add an envtest-based integration test if you need end-to-end pod creation validation.Source: Coding guidelines
pkg/toolsets/cluster-diagnostics/nodes/nodes.go (2)
41-45:
⚠️ Potential issue | 🟠 Major | ⚡ Quick winEnforce
timeout_secondscontract in handler (default + integer/min validation).Line 43 documents a 60s default and Line 44 sets minimum 1 in schema, but Lines 85-97 don’t enforce either at runtime. This can accept fractional/invalid values and leaves default behavior implicit in downstream code.
Suggested fix
import ( "errors" "fmt" + "math" "time" @@ - var timeout time.Duration + timeout := 60 * time.Second if timeoutRaw, exists := params.GetArguments()["timeout_seconds"]; exists && timeoutRaw != nil { switch v := timeoutRaw.(type) { case float64: - timeout = time.Duration(int64(v)) * time.Second + if v < 1 || math.Trunc(v) != v { + return api.NewToolCallResult("", errors.New("timeout_seconds must be an integer >= 1")), nil + } + timeout = time.Duration(int64(v)) * time.Second case int: + if v < 1 { + return api.NewToolCallResult("", errors.New("timeout_seconds must be >= 1")), nil + } timeout = time.Duration(v) * time.Second case int64: + if v < 1 { + return api.NewToolCallResult("", errors.New("timeout_seconds must be >= 1")), nil + } timeout = time.Duration(v) * time.Second default: return api.NewToolCallResult("", errors.New("timeout_seconds must be a numeric value")), nil } }Also applies to: 85-97
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go` around lines 41 - 45, The handler that reads the "timeout_seconds" request parameter must enforce the documented contract: if the parameter is missing set a default of 60, reject non-integer/fractional values, and enforce a minimum of 1. Update the code in the request handler in nodes.go (the code that currently parses "timeout_seconds" around the existing parsing block) to: 1) check for empty/missing and set timeoutSecs = 60; 2) attempt to parse the value as an integer (use integer parsing, not float) and if parsing fails return a 400 error; and 3) if parsed value < 1 clamp or reject per project convention (prefer returning 400 for invalid input or set to 1 if you choose to clamp). Ensure the variable used downstream is an int and that any timeouts use this validated integer value.
115-118:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
commandparsing is too strict for valid in-process inputs.Lines 115-118 only accept
[]interface{}. If a caller passes[]string(still valid per schema intent), this path incorrectly rejects it.Suggested fix
func toStringSlice(arg any) ([]string, error) { if arg == nil { return nil, errors.New("command is required") } - raw, ok := arg.([]interface{}) - if !ok { - return nil, errors.New("command must be an array of strings") - } - if len(raw) == 0 { + switch v := arg.(type) { + case []string: + if len(v) == 0 { + return nil, errors.New("command array cannot be empty") + } + return v, nil + case []interface{}: + if len(v) == 0 { + return nil, errors.New("command array cannot be empty") + } + command := make([]string, 0, len(v)) + for _, item := range v { + str, ok := item.(string) + if !ok { + return nil, errors.New("command items must be strings") + } + command = append(command, str) + } + return command, nil + default: + return nil, errors.New("command must be an array of strings") + } - return nil, errors.New("command array cannot be empty") - } - command := make([]string, 0, len(raw)) - for _, item := range raw { - str, ok := item.(string) - if !ok { - return nil, errors.New("command items must be strings") - } - command = append(command, str) - } - return command, nil }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go` around lines 115 - 118, The code that parses the node `command` currently only handles values of type []interface{} and thus rejects valid []string inputs; update the parsing logic (the `command` handling block in nodes.go) to accept both []interface{} and []string (and also a single string) by using a type switch on the raw `command` value, converting []interface{} entries to strings, using []string directly when present, and trimming/validating each entry before storing into the existing command slice; ensure nil or empty inputs are handled gracefully and keep the final type as []string used elsewhere.
There was a problem hiding this comment.
Actionable comments posted: 3
🧹 Nitpick comments (1)
pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go (1)
22-69: ⚡ Quick winRoute these tests through the exported tool registration path.
These cases call
nodesDebugExecdirectly, which couples the suite to a private helper instead of the toolset’s public surface. Exercising the handler viaInitNodes()keeps the test black-box and also covers the registration contract that callers actually use.As per coding guidelines, "
**/*_test.go: Test the public API only - tests should be black-box and not access internal/private functions`."🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go` around lines 22 - 69, Tests currently call the private helper nodesDebugExec directly; change them to exercise the public registration path by calling InitNodes() to obtain the exported tool/handler and invoking it with the same api.ToolHandlerParams/staticRequest inputs (instead of nodesDebugExec). Replace each nodesDebugExec(params) invocation with the registered handler's execution method (e.g., the returned Tool.Handler or Tool.Handle/Execute function from InitNodes()) so the suite remains black-box and still asserts the same error results for missing node, bad command type, missing command, and empty command array.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@evals/tasks/cluster-diagnostics/list-host-etc/task.yaml`:
- Around line 3-4: The task's metadata label metadata.labels.suite is set to
"openshift" so the task is not matched by the new taskSets that filter on
"suite: cluster-diagnostics"; update metadata.labels.suite from "openshift" to
"cluster-diagnostics" (i.e., change the label value in task.yaml) so the task is
discoverable by the cluster-diagnostics runs.
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug.go`:
- Around line 156-163: The current error branch for PodsExec discards
stdout/stderr when execErr occurs; change the branch handling in function(s)
that call PodsExec (referencing PodsExec, retrieveLogs, NodeDebugContainerName,
created.Name and ns) so that you still return stdout and stderr alongside the
wrapped error (and include logs if available) instead of empty strings—i.e., on
execErr return stdout, stderr, and an error that wraps execErr and appends any
retrieved pod logs for context.
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go`:
- Around line 85-96: The timeout parsing currently accepts fractional and
non-positive values from params.GetArguments()["timeout_seconds"]; update the
switch in the timeout parsing block so that: when timeoutRaw is float64 you
reject it unless it represents an exact integer (e.g., math.Mod(v,1)==0) and its
integer value is >= 1; when timeoutRaw is int or int64 ensure the value is >= 1;
for any fractional or non-positive values return api.NewToolCallResult("",
errors.New("timeout_seconds must be an integer >= 1")) (same for unsupported
types) so NodesDebugExec receives only valid positive integer seconds. Ensure
you reference the same variables (timeoutRaw, timeout, params.GetArguments())
and keep the existing return path shape.
---
Nitpick comments:
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go`:
- Around line 22-69: Tests currently call the private helper nodesDebugExec
directly; change them to exercise the public registration path by calling
InitNodes() to obtain the exported tool/handler and invoking it with the same
api.ToolHandlerParams/staticRequest inputs (instead of nodesDebugExec). Replace
each nodesDebugExec(params) invocation with the registered handler's execution
method (e.g., the returned Tool.Handler or Tool.Handle/Execute function from
InitNodes()) so the suite remains black-box and still asserts the same error
results for missing node, bad command type, missing command, and empty command
array.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 24fbdc2f-698b-486a-920f-5e27b8b5c106
📒 Files selected for processing (22)
README.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/cluster-diagnostics/README.mdevals/tasks/cluster-diagnostics/check-node-kernel/cleanup.shevals/tasks/cluster-diagnostics/check-node-kernel/task.yamlevals/tasks/cluster-diagnostics/check-node-kernel/verify.shevals/tasks/cluster-diagnostics/list-host-etc/cleanup.shevals/tasks/cluster-diagnostics/list-host-etc/task.yamlevals/tasks/cluster-diagnostics/list-host-etc/verify.shinternal/tools/update-readme/main.gopkg/cluster-diagnostics/nodesdebug/nodes_debug.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_test.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.gopkg/mcp/modules.gopkg/mcp/testdata/toolsets-cluster-diagnostics-tools.jsonpkg/mcp/toolsets_test.gopkg/toolsets/cluster-diagnostics/nodes/nodes.gopkg/toolsets/cluster-diagnostics/nodes/nodes_test.gopkg/toolsets/cluster-diagnostics/toolset.go
✅ Files skipped from review due to trivial changes (8)
- evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh
- evals/tasks/cluster-diagnostics/check-node-kernel/cleanup.sh
- evals/tasks/cluster-diagnostics/list-host-etc/verify.sh
- evals/tasks/cluster-diagnostics/check-node-kernel/verify.sh
- pkg/mcp/modules.go
- evals/tasks/cluster-diagnostics/README.md
- docs/configuration.md
- pkg/mcp/testdata/toolsets-cluster-diagnostics-tools.json
🚧 Files skipped from review as they are similar to previous changes (2)
- README.md
- internal/tools/update-readme/main.go
There was a problem hiding this comment.
🧹 Nitpick comments (2)
pkg/toolsets/cluster-diagnostics/nodes/nodes.go (2)
114-117: ⚡ Quick winConsider including stderr when both stdout and stderr contain data.
The current logic prefers
stdoutand only falls back tostderrwhenstdoutis empty. When a command writes diagnostic information tostderrwhile also producing normal output tostdout, the diagnostic information is lost. Consider combining both streams when both are non-empty, or at minimum documenting this design choice.💡 Potential improvement
output := stdout if stdout == "" { output = stderr +} else if stderr != "" { + output = stdout + "\n\n--- stderr ---\n" + stderr }Alternatively, if preferring stdout-only is intentional, add a comment explaining the rationale.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go` around lines 114 - 117, The current selection logic in nodes.go (variables stdout, stderr and output) discards stderr when stdout is non-empty; change it so diagnostic stderr isn't lost by combining both streams when both are non-empty (e.g., set output to stdout + "\n" + stderr or similar), or if you intentionally prefer stdout-only, add a brief comment next to the output assignment explaining that choice and why stderr is ignored; update any downstream consumers of output (where output is used) to handle the combined content if you choose to concatenate.
110-110: Drop the concern about emptyhostPath/mountPath; focus on stdout/stderr handling.
hostPath/mountPathempty strings are intentionally supported:validatePath("")returnsnil, andcreateDebugPoddefaults them to"/"and"/host"when empty.pkg/toolsets/cluster-diagnostics/nodes/nodes.goreturns onlystdoutwhen it’s non-empty (stderris used only whenstdout == ""), so commands that emit diagnostics tostderrwhile also producingstdoutwill lose that information—consider combining/appendingstderrwhen both streams are present.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go` at line 110, The current call to client.NodesDebugExec in NodesDebugExec handling returns stdout and stderr separately but downstream code in nodes.go only returns stdout when non-empty, discarding stderr; update the logic that consumes stdout, stderr, execErr (from client.NodesDebugExec) to preserve both streams by appending or combining stderr into the returned output when stdout is non-empty (e.g., "stdout\n[stderr]\n" or similar), ensuring execErr is still returned/propagated; locate the handling around the stdout, stderr, execErr variables in nodes.go and change the return/response logic to include stderr whenever it contains data instead of dropping it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@pkg/toolsets/cluster-diagnostics/nodes/nodes.go`:
- Around line 114-117: The current selection logic in nodes.go (variables
stdout, stderr and output) discards stderr when stdout is non-empty; change it
so diagnostic stderr isn't lost by combining both streams when both are
non-empty (e.g., set output to stdout + "\n" + stderr or similar), or if you
intentionally prefer stdout-only, add a brief comment next to the output
assignment explaining that choice and why stderr is ignored; update any
downstream consumers of output (where output is used) to handle the combined
content if you choose to concatenate.
- Line 110: The current call to client.NodesDebugExec in NodesDebugExec handling
returns stdout and stderr separately but downstream code in nodes.go only
returns stdout when non-empty, discarding stderr; update the logic that consumes
stdout, stderr, execErr (from client.NodesDebugExec) to preserve both streams by
appending or combining stderr into the returned output when stdout is non-empty
(e.g., "stdout\n[stderr]\n" or similar), ensuring execErr is still
returned/propagated; locate the handling around the stdout, stderr, execErr
variables in nodes.go and change the return/response logic to include stderr
whenever it contains data instead of dropping it.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7d3ea014-549a-484b-9971-6df74508ce89
📒 Files selected for processing (20)
README.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/cluster-diagnostics/README.mdevals/tasks/cluster-diagnostics/check-kubelet-active/cleanup.shevals/tasks/cluster-diagnostics/check-kubelet-active/task.yamlevals/tasks/cluster-diagnostics/list-host-etc/cleanup.shevals/tasks/cluster-diagnostics/list-host-etc/task.yamlinternal/tools/update-readme/main.gopkg/cluster-diagnostics/nodesdebug/nodes_debug.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_test.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.gopkg/mcp/modules.gopkg/mcp/testdata/toolsets-cluster-diagnostics-tools.jsonpkg/mcp/toolsets_test.gopkg/toolsets/cluster-diagnostics/nodes/nodes.gopkg/toolsets/cluster-diagnostics/nodes/nodes_test.gopkg/toolsets/cluster-diagnostics/toolset.go
✅ Files skipped from review due to trivial changes (4)
- evals/tasks/cluster-diagnostics/list-host-etc/task.yaml
- docs/configuration.md
- evals/tasks/cluster-diagnostics/README.md
- evals/tasks/cluster-diagnostics/check-kubelet-active/cleanup.sh
🚧 Files skipped from review as they are similar to previous changes (13)
- pkg/mcp/toolsets_test.go
- evals/claude-code/eval.yaml
- evals/gemini-agent/eval.yaml
- pkg/mcp/testdata/toolsets-cluster-diagnostics-tools.json
- pkg/mcp/modules.go
- evals/openai-agent/eval.yaml
- pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go
- evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh
- pkg/toolsets/cluster-diagnostics/toolset.go
- README.md
- pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go
- pkg/cluster-diagnostics/nodesdebug/nodes_debug.go
- pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go
| { | ||
| Tool: api.Tool{ | ||
| Name: "nodes_debug_exec", | ||
| Description: "Run commands on an OpenShift node using a privileged debug pod with comprehensive troubleshooting utilities. The debug pod uses the UBI9 toolbox image which includes: systemd tools (systemctl, journalctl), networking tools (ss, ip, ping, traceroute, nmap), process tools (ps, top, lsof, strace), file system tools (find, tar, rsync), and debugging tools (gdb). The host filesystem is mounted at /host, allowing commands to chroot /host if needed to access node-level resources. Output is truncated to the most recent 100 lines, so prefer filters like grep when expecting large logs.", |
There was a problem hiding this comment.
@arkadeepsen maybe I am missing it, but I don't see the 100-line truncation logic? Can we either:
- add the truncation logic
- remove that from the tool description
My inclination would be to go with 2 as most harnesses are adding capabilities to better handle large tool outputs (e.g. claude code dumps it into a file which it can then read from selectively, rather than putting the full output into the context)
There was a problem hiding this comment.
I have removed the truncation part from the description.
| // pollForRunning polls the debug pod until it becomes running or times out. | ||
| func (n *NodeDebug) pollForRunning(ctx context.Context, namespace string, podName string, timeout time.Duration) error { | ||
| // Poll until the pod is running or the timeout is reached. | ||
| err := wait.PollUntilContextTimeout(ctx, 2*time.Second, timeout, true, func(ctx context.Context) (bool, error) { |
There was a problem hiding this comment.
can we exit early here if the pod is obviously failed?
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/cluster-diagnostics/nodesdebug/nodes_debug.go`:
- Around line 157-160: The retrieveLogs method call is ignoring its error return
value using a blank identifier on line 159, which violates production code
guidelines. Instead of using the blank identifier to discard the error, capture
the error returned from n.retrieveLogs and handle it appropriately by logging it
or including it in your error context, since this code is already in an error
handling path and suppressing the error can hide important debugging
information.
- Around line 269-272: The error formatting in the PodFailed case at the
fmt.Errorf call uses the %w verb with pod.Status.Message, which is a string
type, not an error type. The %w verb is designed only for error wrapping and
should not be used with strings. Replace the %w verb with %s to properly format
the string value of pod.Status.Message in the error message.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: d7904a84-b070-4dc5-93b6-f2f8cbcfcfdf
📒 Files selected for processing (20)
README.mddocs/configuration.mdevals/claude-code/eval.yamlevals/gemini-agent/eval.yamlevals/openai-agent/eval.yamlevals/tasks/cluster-diagnostics/README.mdevals/tasks/cluster-diagnostics/check-kubelet-active/cleanup.shevals/tasks/cluster-diagnostics/check-kubelet-active/task.yamlevals/tasks/cluster-diagnostics/list-host-etc/cleanup.shevals/tasks/cluster-diagnostics/list-host-etc/task.yamlinternal/tools/update-readme/main.gopkg/cluster-diagnostics/nodesdebug/nodes_debug.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_test.gopkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.gopkg/mcp/modules.gopkg/mcp/testdata/toolsets-cluster-diagnostics-tools.jsonpkg/mcp/toolsets_test.gopkg/toolsets/cluster-diagnostics/nodes/nodes.gopkg/toolsets/cluster-diagnostics/nodes/nodes_test.gopkg/toolsets/cluster-diagnostics/toolset.go
✅ Files skipped from review due to trivial changes (5)
- evals/tasks/cluster-diagnostics/check-kubelet-active/cleanup.sh
- pkg/mcp/modules.go
- pkg/mcp/testdata/toolsets-cluster-diagnostics-tools.json
- docs/configuration.md
- README.md
🚧 Files skipped from review as they are similar to previous changes (12)
- evals/openai-agent/eval.yaml
- evals/tasks/cluster-diagnostics/list-host-etc/cleanup.sh
- evals/claude-code/eval.yaml
- pkg/mcp/toolsets_test.go
- evals/tasks/cluster-diagnostics/list-host-etc/task.yaml
- evals/tasks/cluster-diagnostics/check-kubelet-active/task.yaml
- pkg/cluster-diagnostics/nodesdebug/nodes_debug_testhelpers_test.go
- internal/tools/update-readme/main.go
- evals/gemini-agent/eval.yaml
- pkg/toolsets/cluster-diagnostics/nodes/nodes_test.go
- pkg/toolsets/cluster-diagnostics/nodes/nodes.go
- pkg/cluster-diagnostics/nodesdebug/nodes_debug_test.go
A new toolset, cluster-diagnostics, is created and the node_debug_exec tool is now added as part of it. The code is now organized under packages related to node (pkg/cluster-diagnostics/nodesdebug and pkg/toolsets/cluster-diagnostics/nodes). Instead of directly creating debug pod with the passed command, pod is created with sleep command. Then, the passed command is executed inside the pod using pod_exec. Logs are now only fetched when there's pod_exec error. This way the stdout and stderr data can be easily obtained instead of processing the output from the pod logs. Though, the node_debug_exec tool doesn't accept hostPath and mountPath as inputs, the internal function now accepts these params which can be passed by other tools which need node debug.
|
@arkadeepsen: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR adds the add_debug_exec tool.
Cherry-picked the commit from #38 and refactored it. The refactor is added to a separate commit.
Summary by CodeRabbit
cluster-diagnosticstoolset with node debug execution support for privileged diagnostics.cluster-diagnosticsandopenshift.cluster-diagnosticseval suite documentation and added OpenShiftmust-gatherprompt guidance.cluster-diagnosticseval task coverage plus new unit/integration-style tests and pod cleanup scripts for task runs.