Flake: agent e2e NodeReboot can execute twice and time out #299

Description

@bcho

Summary

The agent e2e NodeReboot validation can flake when a single MachineOperation is reconciled more than once. In run 27785963400, the job "agent e2e (host ubuntu2404, node azlinux3)" failed in "Validate node restart operation", yet the collected artifacts show that the operation eventually completed and the node became Ready.

Evidence

  • Workflow run: https://github.com/Azure/unbounded/actions/runs/27785963400
  • Failed job: agent e2e (host ubuntu2404, node azlinux3)
  • Artifact: agent-e2e-kind-ubuntu2404-azlinux3-logs
  • machineoperations.txt showed e2e-node-reboot-1781813571 reached Complete.
  • nodes.txt and nodes-describe.txt showed the agent node Ready by collection time.
  • vm-unbounded-agent-daemon.log logged "restarting active node [machine=kube1]" twice for the same NodeReboot operation.

Suspected Cause

A duplicate queued reconcile can observe the operation after MarkInProgress but before (or around the time) completion is recorded, and execute it again, because the reconciler skips only terminal phases. Slower Azure Linux restart timing appears to widen this window and make the race easier to hit.
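
The suspected race can be illustrated with a small, self-contained model. This is only a sketch under assumptions: the Phase values, the Operation type, and reconcileTerminalOnly are hypothetical stand-ins, not the actual unbounded reconciler code. It assumes that completion has not yet been recorded when the duplicate reconcile runs.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the MachineOperation phase.
type Phase string

const (
	PhasePending    Phase = "Pending"
	PhaseInProgress Phase = "InProgress"
	PhaseComplete   Phase = "Complete"
	PhaseFailed     Phase = "Failed"
)

// Operation is a hypothetical, simplified MachineOperation.
type Operation struct {
	Name  string
	Phase Phase
}

// isTerminal reports whether the phase is a terminal one.
func isTerminal(p Phase) bool {
	return p == PhaseComplete || p == PhaseFailed
}

// reconcileTerminalOnly models the suspected behavior: it skips only
// terminal phases, so an operation that is already InProgress runs again.
// Completion is deliberately NOT recorded here, to model a duplicate
// reconcile arriving before the first execution has finished.
func reconcileTerminalOnly(op *Operation, restart func()) {
	if isTerminal(op.Phase) {
		return
	}
	op.Phase = PhaseInProgress // analogous to MarkInProgress
	restart()
}

// countRestarts runs the given number of reconciles against one operation
// and returns how many times the restart action fired.
func countRestarts(reconciles int) int {
	op := &Operation{Name: "e2e-node-reboot-1781813571", Phase: PhasePending}
	restarts := 0
	for i := 0; i < reconciles; i++ {
		reconcileTerminalOnly(op, func() { restarts++ })
	}
	return restarts
}

func main() {
	// Two queued reconciles for the same operation restart the node twice.
	fmt.Println("restarts:", countRestarts(2))
}
```

In this model, two reconciles of the same operation trigger two restarts, which matches the duplicated "restarting active node" log line.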

Suggested Follow-up

Make MachineOperation execution idempotent, for example by skipping operations already in InProgress, while preserving explicit Pending semantics if needed.
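
One possible shape for such a guard is sketched below. The names (Phase, Operation, shouldExecute, reconcile) are hypothetical and not taken from the repository, and the sketch assumes that only an unset or explicitly Pending phase should trigger execution.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the MachineOperation phase.
type Phase string

const (
	PhasePending    Phase = "Pending"
	PhaseInProgress Phase = "InProgress"
	PhaseComplete   Phase = "Complete"
	PhaseFailed     Phase = "Failed"
)

// Operation is a hypothetical, simplified MachineOperation.
type Operation struct {
	Name  string
	Phase Phase
}

// shouldExecute returns true only for operations that have not started.
// Assumption: an empty phase is treated the same as an explicit Pending.
func shouldExecute(p Phase) bool {
	switch p {
	case "", PhasePending:
		return true
	default:
		// InProgress, Complete, and Failed are all skipped.
		return false
	}
}

// reconcile executes the operation at most once, even if it is queued
// several times before completion is recorded.
func reconcile(op *Operation, restart func()) {
	if !shouldExecute(op.Phase) {
		return
	}
	op.Phase = PhaseInProgress // analogous to MarkInProgress
	restart()
}

// countRestarts runs the given number of reconciles against one operation
// and returns how many times the restart action fired.
func countRestarts(reconciles int) int {
	op := &Operation{Name: "e2e-node-reboot", Phase: PhasePending}
	restarts := 0
	for i := 0; i < reconciles; i++ {
		reconcile(op, func() { restarts++ })
	}
	return restarts
}

func main() {
	// A duplicate reconcile no longer restarts the node a second time.
	fmt.Println("restarts:", countRestarts(2))
}
```

A trade-off of this approach is that an agent that crashes mid-operation would leave the operation in InProgress with nothing to resume it, so a real fix would likely need a separate recovery or timeout path.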
