action/submit: Retry if provision fails #1143
|
Hello, thank you for your feedback with respect to these systemic issues for the lab. Please note that the GH actions were all updated to ensure that env vars are used for sensitive data and there appear to be conflicts that need to be addressed as a result of that merging in first. Using the Given that the timeout is not the best solution for this problem, I would ask that we consider your other changes separately from the addition of a timeout. |
ad8fa38 to d64d1b9
|
This branch has been rebased on Three tests were run using the current head, and listed in the PR description. Something I noticed from the test output is that we use "retries", so there will be N+1 tries. Technically this is correct, but this might be misinterpreted. Should we perhaps change the input variable name to |
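To make the N+1 point concrete, here is a minimal shell sketch of how a `retries` input maps to attempts. The names `attempt_provision` and `run_with_retries` are hypothetical and not taken from the action, and the stand-in attempt fails until the last allowed try purely to exercise the loop.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for one submit-and-provision attempt.
# It succeeds only on the last allowed attempt so that every retry is used.
attempt_provision() {
  local attempt="$1" max="$2"
  [ "$attempt" -eq "$max" ]
}

# Run the initial try plus up to N retries, i.e. N+1 attempts in total.
run_with_retries() {
  local retries="$1"
  local max_attempts=$((retries + 1))
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    echo "attempt $attempt of $max_attempts"
    if attempt_provision "$attempt" "$max_attempts"; then
      return 0
    fi
    attempt=$((attempt + 1))
  done
  return 1
}

run_with_retries 3
```

With `retries` set to 3 the loop logs four attempts, which is the behaviour that could be misread if the input were understood as a total number of tries.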
|
I believe it is important to distinguish between re-queuing the work before agent selection (which is what this appears to do) and retrying provisioning (an action taken on a given agent). Re-queuing may be helpful in a system with non-working assets left online, and thus may be useful today as a stop-gap, but the true problem of accurate resource health and availability still needs to be solved. That said, "retries" is my vote over anything mentioning the word "provision". |
* origin/main:
  fix: retry control_host commands (canonical#1156)
  fix bug with running in parallel (canonical#1145)
- Replace ((++VAR)) arithmetic with VAR=$((VAR + 1)) to avoid set -e pitfalls with arithmetic expressions evaluating to zero
- Add input validation for 'retries' to catch non-numeric values early with a clear error message
- Rename duplicate group names to 'Retrieve setup phase exit status' and 'Retrieve provision phase exit status' for clarity in logs
- Guard testflinger results calls with error handling so network/server failures produce a meaningful error instead of a silent abort
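The first two points in this commit message can be illustrated with a short shell sketch. It is not the action's code: `survives_set_e` and `validate_retries` are hypothetical helpers, and the check runs each snippet in a fresh `bash` process so that `set -e` applies there in full.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: run a snippet in a fresh bash process under 'set -e'
# and report whether execution got past it.
survives_set_e() {
  if bash -c "set -e; $1; echo ok" >/dev/null 2>&1; then
    echo yes
  else
    echo no
  fi
}

# Hypothetical helper: accept only non-negative integers for 'retries'.
validate_retries() {
  case "$1" in
    ''|*[!0-9]*)
      echo "error: 'retries' must be a non-negative integer, got '$1'" >&2
      return 1
      ;;
    *)
      return 0
      ;;
  esac
}

# An arithmetic command whose expression evaluates to 0 returns exit status 1,
# which 'set -e' treats as a failure.
echo "((count++)) from 0:        $(survives_set_e 'count=0; ((count++))')"
echo "((++count)) from -1:       $(survives_set_e 'count=-1; ((++count))')"
# A plain assignment with arithmetic expansion always returns status 0.
echo "count=\$((count + 1)) from 0: $(survives_set_e 'count=0; count=$((count + 1))')"

validate_retries 3 && echo "retries=3 accepted"
validate_retries abc 2>/dev/null || echo "retries=abc rejected"
```

The assignment form is safe regardless of the counter's value, whereas the `((...))` forms abort only for particular values, which makes the failure easy to miss in testing.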
Instead of hard-exiting when 'testflinger results' fails (e.g. transient network error), emit a warning and set the status to 1 so the retry loop can handle it like any other phase failure.
Restore original behaviour where a failed 'testflinger results' call (e.g. due to networking issues) exits the step immediately rather than being swallowed and treated as a phase failure. Keep the jq '// 1' fallback for null/missing fields which is a separate concern.
Description
This PR addresses two common problems we face:
2. We also often see provisioning take a very long time and then fail. On average, a successful provisioning takes <20 minutes, so whenever it takes longer, we already know it will fail. It is currently not possible to cancel and rerun a specific run in a GitHub workflow job matrix. Removed in favour of alternative fix in maas2 connector (feature/dev-maas-more-detail).
Resolved issues
This PR solves these two issues by introducing:
2. Add a timeout for the provisioning step. If the configured timeout is reached, the testflinger job is cancelled, so the retry logic can retry much sooner.
Documentation
Action README is updated.
Web service API changes
none
Tests
Manually tested:
With retries set to 3, we see:
