action/submit: Retry if provision fails by jpm-canonical · Pull Request #1143 · canonical/testflinger

jpm-canonical · 2026-06-11T11:39:39Z

Description

This PR addresses two common problems we face:

1. Our CI tests with Testflinger very often fail because provisioning of the machines fails. We then have to re-run the tests manually, hoping to get a different agent on the same queue that provisions successfully.

2. We also often see provisioning take very long and then fail. On average, a successful provisioning takes under 20 minutes, so whenever it takes longer we already know it will fail. It is currently not possible to cancel and rerun a specific run in a GitHub workflow job matrix. Removed in favour of an alternative fix in the maas2 connector (feature/dev-maas-more-detail).

Resolved issues

This PR solves these two issues by introducing:

1. An input variable that sets the number of times the job is re-submitted if it fails before reaching the test phase.

~~2. A timeout for the provisioning step. If the configured timeout is reached, the Testflinger job is cancelled, so the retry logic can retry much sooner.~~
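The re-submission behaviour described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `submit_and_wait_for_provision` is a hypothetical stand-in for the real submit, setup and provision steps, and `FAIL_FIRST` is only a simulation knob so that the loop can be run locally.

```shell
#!/usr/bin/env bash
# Sketch only. Nothing here calls Testflinger; the submit step is simulated.

FAIL_FIRST="${FAIL_FIRST:-2}"   # simulation knob: how many attempts fail

# Hypothetical stand-in for "submit the job and wait until provisioning ends".
# Succeeds only once the attempt number exceeds FAIL_FIRST.
submit_and_wait_for_provision() {
  local attempt="$1"
  [ "$attempt" -gt "$FAIL_FIRST" ]
}

# $1: number of retries, i.e. extra submissions after the first one.
run_with_retries() {
  local retries="$1"
  local attempt=1
  local max_attempts=$((retries + 1))

  while [ "$attempt" -le "$max_attempts" ]; do
    echo "Attempt ${attempt}/${max_attempts}"
    if submit_and_wait_for_provision "$attempt"; then
      echo "Provisioned on attempt ${attempt}"
      return 0
    fi
    attempt=$((attempt + 1))
  done

  echo "Giving up after ${max_attempts} attempts" >&2
  return 1
}

run_with_retries 3
```

Each pass through the loop stands for a fresh submission, so in the real action a new attempt may land on a different agent in the same queue.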

Documentation

Action README is updated.

Web service API changes

none

Tests

Manually tested:

No provisioning: https://github.com/canonical/inference-snaps-testing/actions/runs/27683162149
Provisioning no fail: https://github.com/canonical/inference-snaps-testing/actions/runs/27809664602/job/82296920146
Provisioning with retries: https://github.com/canonical/inference-snaps-testing/actions/runs/27809598276/job/82296717462

With retries set to 3, we see:

ajzobro · 2026-06-12T19:48:03Z

Hello, thank you for your feedback with respect to these systemic issues for the lab.

Please note that the GH actions were all updated to ensure that env vars are used for sensitive data. Because that change merged first, there appear to be conflicts that need to be addressed.

The maas2 provision type does tend to succeed in under 20 minutes if it succeeds at all; this is generally true. However, this PR does not seek to address or resolve the root cause. We have another branch with maas device connector changes that is intended to address the same issue as your timeout: feature/dev-maas-more-detail

Given that the timeout is not the best solution for this problem, I would ask that we consider your other changes separately from the addition of a timeout.

jpm-canonical · 2026-06-17T12:57:50Z

This branch has been rebased on main, and the timeout has been removed.

Three tests were run using the current head, and listed in the PR description.

Something I noticed from the test output is that we use "retries", so there will be N+1 tries. Technically this is correct, but this might be misinterpreted. Should we perhaps change the input variable name to provision-max-attempts, defaulting to 1, so that there will only be N provision attempts? In that case we also need to define what happens when it is set to 0 (no provisioning?).
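To make the naming question concrete, here is a small sketch of how the two conventions count tries. Both helper names are hypothetical and exist only for illustration; `max-attempts` refers to the alternative input name proposed above, not to an existing input.

```shell
#!/usr/bin/env bash
# Sketch only: compares the two counting conventions.

# retries=N: one initial try plus N re-submissions.
attempts_from_retries() {
  echo $(( $1 + 1 ))
}

# max-attempts=N (proposed alternative): exactly N tries in total.
attempts_from_max_attempts() {
  echo "$1"
}

for n in 0 1 3; do
  echo "retries=${n} gives $(attempts_from_retries "$n") tries;" \
       "max-attempts=${n} gives $(attempts_from_max_attempts "$n") tries"
done
```

With `retries`, a value of 0 still means one try, whereas `max-attempts=0` would mean no try at all, which is the edge case that would need a defined meaning.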

ajzobro · 2026-06-17T13:32:54Z

I believe it is important to distinguish between re-queuing (pre agent selection) the work (what this appears to do) and retrying to provision (an action taken on a given agent).

This may be helpful in a system with non-working assets left online, and thus may be useful today as a stop-gap, but the true problem of accurate resource health and availability still needs to be solved.

That said, "retries" is my vote over anything mentioning the word "provision".

- Replace ((++VAR)) arithmetic with VAR=$((VAR + 1)) to avoid set -e pitfalls with arithmetic expressions evaluating to zero
- Add input validation for 'retries' to catch non-numeric values early with a clear error message
- Rename duplicate group names to 'Retrieve setup phase exit status' and 'Retrieve provision phase exit status' for clarity in logs
- Guard testflinger results calls with error handling so network/server failures produce a meaningful error instead of a silent abort
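The first two points can be illustrated with a short sketch. It assumes bash; `COUNT` and `validate_retries` are hypothetical names chosen for this example and are not taken from the action.

```shell
#!/usr/bin/env bash
# Sketch only: demonstrates the set -e pitfall and a possible input check.

# Under `set -e`, (( expr )) returns status 1 when expr evaluates to 0,
# so the inner script aborts before reaching the echo.
bash -c 'set -e; COUNT=-1; ((++COUNT)); echo "reached"' \
  || echo "aborted by set -e"

# A plain assignment returns status 0 regardless of the resulting value.
bash -c 'set -e; COUNT=-1; COUNT=$((COUNT + 1)); echo "reached, COUNT=$COUNT"'

# Hypothetical validation: accept only non-negative integers.
validate_retries() {
  case "$1" in
    '' | *[!0-9]*)
      echo "Error: 'retries' must be a non-negative integer, got '$1'" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

validate_retries 3 && echo "3 accepted"
validate_retries abc 2>/dev/null || echo "abc rejected"
```

The same pitfall is more commonly hit with a post-increment such as ((COUNT++)) starting from 0, because the expression evaluates to the old value.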

Restore original behaviour where a failed 'testflinger results' call (e.g. due to networking issues) exits the step immediately rather than being swallowed and treated as a phase failure. Keep the jq '// 1' fallback for null/missing fields which is a separate concern.
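The jq fallback mentioned here can be sketched as follows. The JSON payloads and the `provision_status` field name are assumptions made for illustration, standing in for the output of 'testflinger results'; `phase_status` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: requires jq. The payloads below are made up for illustration.

with_status='{"provision_status": 0}'
missing_status='{"setup_status": 0}'
null_status='{"provision_status": null}'

# $1: JSON document, $2: field name.
# Prints the field value, or 1 when the field is null or missing.
phase_status() {
  printf '%s' "$1" | jq -r --arg key "$2" '.[$key] // 1'
}

phase_status "$with_status" provision_status      # prints 0
phase_status "$missing_status" provision_status   # prints 1
phase_status "$null_status" provision_status      # prints 1
```

jq's `//` operator only falls back on null or false, so a real exit status of 0 is passed through unchanged. A failing 'testflinger results' call is a separate case, because then there is no JSON to inspect at all.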

jpm-canonical changed the title ~~Wrap submit, setup, provision in a retry loop~~ action/submit: Retry if provision fails Jun 11, 2026

jpm-canonical marked this pull request as ready for review June 12, 2026 09:34

jpm-canonical added 9 commits June 17, 2026 12:07

Wrap submit, setup, provision in a retry loop

e5fb768

Add max duration for provision step

ccea5df

Fix handling of job id and poll exit code

14593f4

jq calls exit code 1 rather than null

9a56efa

Improve logging for unexpected situations

97f2316

Update readme

18b8959

Improve input descriptions

2e8fee7

Remove provision timeout

8a5ac03

Clean up provision timeout leftovers from action

d64d1b9

jpm-canonical force-pushed the retry-submit branch from ad8fa38 to d64d1b9 June 17, 2026 10:25

Add whitespace for readability

bffe33a

jpm-canonical added 8 commits June 18, 2026 15:24

Rename input to retries and make applicable to setup and provision

5750899

Merge remote-tracking branch 'origin/main' into retry-submit

ee6a62f

* origin/main:
  fix: retry control_host commands (canonical#1156)
  fix bug with running in parallel (canonical#1145)

Rename step

4f512b2

Fix: treat failed results fetch as phase failure to allow retry

eb13a5d

Instead of hard-exiting when 'testflinger results' fails (e.g. transient network error), emit a warning and set the status to 1 so the retry loop can handle it like any other phase failure.

More copilot review fixes

7fca9a0

Revert input description

679b480

jpm-canonical wants to merge 18 commits into canonical:main from jpm-canonical:retry-submit
