Summary
The agent e2e NodeReboot validation can flake when a single MachineOperation is reconciled more than once. In run 27785963400, job agent e2e (host ubuntu2404, node azlinux3) failed in Validate node restart operation, but the collected artifacts showed the operation eventually completed and the node became Ready.
Evidence
- Workflow run: https://github.com/Azure/unbounded/actions/runs/27785963400
- Failed job: agent e2e (host ubuntu2404, node azlinux3)
- Artifact: agent-e2e-kind-ubuntu2404-azlinux3-logs
  - machineoperations.txt showed e2e-node-reboot-1781813571 reached Complete.
  - nodes.txt and nodes-describe.txt showed the agent node Ready by collection time.
  - vm-unbounded-agent-daemon.log showed "restarting active node [machine=kube1]" twice for the same NodeReboot operation.
Suspected Cause
A duplicate queued reconcile can observe the operation after MarkInProgress but before (or right around) completion, and execute it again, because the reconciler skips only terminal phases. Slower Azure Linux restart timing appears to make this window easier to hit.
Suggested Follow-up
Make MachineOperation execution idempotent, for example by skipping operations already in InProgress, while preserving explicit Pending semantics if needed.
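The difference between the current guard and the suggested one can be sketched in Go. This is a simplified model, not the actual reconciler: the MachineOperation struct, the Failed phase, and the names terminalOnlyGuard, shouldExecute, reconcile, and countRestarts are all assumptions made for illustration. Only the Pending, InProgress, and Complete phases and the MarkInProgress step come from the report above.

```go
package main

// Phase is a stand-in for the MachineOperation phase field.
type Phase string

const (
	PhasePending    Phase = "Pending"
	PhaseInProgress Phase = "InProgress"
	PhaseComplete   Phase = "Complete"
	PhaseFailed     Phase = "Failed" // assumption: a second terminal phase
)

// MachineOperation is a hypothetical, minimal model of the real resource.
type MachineOperation struct {
	Name  string
	Phase Phase
}

// isTerminal reports whether the phase is one the reconciler already skips.
func isTerminal(p Phase) bool {
	return p == PhaseComplete || p == PhaseFailed
}

// terminalOnlyGuard models the suspected current behavior:
// anything that is not terminal gets executed.
func terminalOnlyGuard(p Phase) bool {
	return !isTerminal(p)
}

// shouldExecute models the suggested behavior: terminal and InProgress
// operations are skipped, while Pending operations still execute.
func shouldExecute(p Phase) bool {
	return !isTerminal(p) && p != PhaseInProgress
}

// reconcile runs one reconcile pass and reports whether it ran restart.
func reconcile(op *MachineOperation, guard func(Phase) bool, restart func()) bool {
	if !guard(op.Phase) {
		return false
	}
	op.Phase = PhaseInProgress // stands in for MarkInProgress
	restart()
	return true
}

// countRestarts delivers two queued reconciles for one operation before
// completion is recorded, and returns how many restarts were issued.
func countRestarts(guard func(Phase) bool) int {
	op := &MachineOperation{Name: "e2e-node-reboot", Phase: PhasePending}
	restarts := 0
	for i := 0; i < 2; i++ {
		reconcile(op, guard, func() { restarts++ })
	}
	return restarts
}
```

In this model, countRestarts(terminalOnlyGuard) yields 2, matching the double restart seen in the daemon log, while countRestarts(shouldExecute) yields 1. Pending operations still execute under both guards, so explicit Pending semantics are unchanged in the sketch.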