fix: exit with non-zero code on unhandled exceptions to enable supervisord autorestart by Copilot · Pull Request #1124 · canonical/testflinger

Copilot · 2026-05-29T17:43:42Z

Description

When the agent's broad exception handler in cmd.py caught an unhandled exception (e.g. an OSError from a missing certifi CA bundle during a virtualenv race condition), it logged the error but did not call sys.exit(1). Python therefore exited with code 0, so supervisord's default autorestart=unexpected policy treated it as a clean exit and did not restart the process, leaving the agent permanently dead.

Fix:

cmd.py: Add sys.exit(1) to the broad exception handler so any unhandled exception causes a non-zero exit, allowing supervisord to detect the crash and restart the agent.
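As a rough illustration of the change described above, the pattern might look like the following minimal sketch. It is not the actual contents of cmd.py: `start_agent` and `main` are hypothetical names, and the stand-in always raises the OSError from the linked issue so the handler path is exercised.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def start_agent():
    """Hypothetical stand-in for the agent's main loop; it always fails here."""
    raise OSError("Could not find a suitable TLS CA certificate bundle")


def main():
    """Run the agent and turn any unhandled exception into exit status 1."""
    try:
        start_agent()
    except Exception:
        # Log the traceback, then exit non-zero so a supervisor sees a crash.
        # Without sys.exit(1), main() would return normally and the
        # interpreter would exit with status 0.
        logger.exception("Unhandled exception, exiting")
        sys.exit(1)
```

Calling `main()` from an `if __name__ == "__main__":` guard would make the process terminate with status 1 in this sketch, because `sys.exit(1)` raises `SystemExit`, which `except Exception` does not catch.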

When the agent is restarted by supervisord, a transient environment issue (e.g. certifi briefly absent during a venv update) is likely resolved by the time the process comes back up. For permanent environment corruption, supervisord will put the process into FATAL state after exhausting startretries, making it visible to operators who can trigger the update_testflinger_action charm action to rebuild the virtualenv.
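The restart decision this paragraph relies on can be modelled roughly as below. This is a simplified sketch of the documented `autorestart` semantics, not supervisord's own code; the function name `should_autorestart` is hypothetical, and the default `exitcodes=(0,)` assumes the configuration does not override `exitcodes`.

```python
def should_autorestart(exit_code, autorestart="unexpected", exitcodes=(0,)):
    """Simplified model of whether an exited process would be restarted."""
    if autorestart == "true":
        # Always restart, whatever the exit code.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    # "unexpected": restart only when the exit code is not an expected one.
    return exit_code not in exitcodes
```

Under this model an exit status of 0 is treated as a clean exit and the process stays down, while an exit status of 1 triggers a restart. As far as the Supervisor documentation describes it, `startretries` bounds consecutive failed start attempts (the process exiting before `startsecs` has elapsed), so the FATAL state would apply when the agent crashes during startup rather than after it has been running for a while.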

Resolved issues

Documentation

No additional documentation changes required.

Web service API changes

No API changes.

Tests

Existing unit tests in test_client.py continue to pass unchanged.


ajzobro · 2026-05-29T18:02:47Z

@copilot if the Python code crashes and produces a traceback of any kind, the application needs to be restarted (crash = no longer running). Depending on the specifics of the crash (e.g. a corrupted library installation, as in this case), the supervisor needs to restart the application at an appropriate point so that the environment will be re-established (again, specific to this case).

Therefore the changes made to client.py don't seem to address the issue so much as the changes to cmd.py do.

Now, even though cmd.py will yield a non-zero exit code, we still need to ensure that actions are taken outside the agent to set the environment up again.


Copilot · 2026-05-29T18:10:25Z


Good point — the client.py OSError additions were counterproductive: they caused the agent to silently swallow the error and keep looping in a degraded state, rather than crashing and letting supervisord restart it cleanly.

I've reverted all the OSError additions from client.py (and removed the associated tests). The only surviving change is the sys.exit(1) in cmd.py, which ensures any unhandled exception — including an OSError from a missing CA bundle — produces a non-zero exit code. Supervisord's autorestart=unexpected policy then restarts the agent; for a transient race condition during a venv update the environment should be intact by the time it comes back up, and for permanent corruption the agent enters the FATAL state after startretries attempts, making it visible to operators who can trigger the update_testflinger_action charm action to rebuild the venv.

rene-oromtz · 2026-05-29T19:24:18Z

@ajzobro I feel like this could be a valid solution if the issue is really transient. From the supervisor docs, it seems the only difference is that supervisord will try to start the agent 3 times (the default, as we don't specify it) and then set the status to FATAL (instead of the current EXITED).

We may still need validation of the package installation at charm level, as currently we log failures but do nothing with the status code. I will probably leave this for an enhancement outside of Copilot so we can assess the most appropriate solution. I'm thinking:

If package installation failed, maybe we should block the unit and allow for manual recovery via update-testflinger
Or maybe add some retries before marking the unit as blocked and still require the manual fix.

What do you think?
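The second option above (retry, then report failure so the unit can be blocked) might be sketched as follows. This is an illustrative assumption rather than the charm's code: `install_with_retries`, its parameters, and the idea of passing the install command as an argument list are all hypothetical.

```python
import subprocess
import sys
import time


def install_with_retries(command, attempts=3, delay_seconds=0.0):
    """Run command up to `attempts` times; return True on the first zero exit status."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Report the failure and its status code instead of ignoring it.
        print(
            f"attempt {attempt}/{attempts} failed with exit status {result.returncode}",
            file=sys.stderr,
        )
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

A caller could mark the unit as blocked when this returns False and leave recovery to the manual update action, which keeps the retry policy separate from the decision about unit status.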

Initial plan

c37ae47

Copilot AI assigned Copilot and ajzobro May 29, 2026

Copilot started work on behalf of ajzobro May 29, 2026 17:43

Copilot AI linked an issue May 29, 2026 that may be closed by this pull request

bug: Could not find a suitable TLS CA certificate bundle #1121

Open

Copilot AI added 2 commits May 29, 2026 17:54

fix: handle OSError from missing TLS CA certificate bundle and exit with code 1 on unhandled exceptions

fb7d87b

fix: improve test naming and extract common error fixture

ceaf507

Copilot AI changed the title ~~[WIP] Fix TLS CA certificate bundle error in agent~~ fix: handle OSError from missing TLS CA certificate bundle May 29, 2026

Copilot finished work on behalf of ajzobro May 29, 2026 17:56

Copilot AI requested a review from ajzobro May 29, 2026 17:56

Copilot started work on behalf of ajzobro May 29, 2026 18:02

revert: remove OSError handling from client.py; let errors propagate to cmd.py

7e297f4

Copilot AI changed the title ~~fix: handle OSError from missing TLS CA certificate bundle~~ fix: exit with non-zero code on unhandled exceptions to enable supervisord autorestart May 29, 2026

Copilot finished work on behalf of ajzobro May 29, 2026 18:11

ajzobro requested a review from rene-oromtz May 29, 2026 18:27

Copilot wants to merge 4 commits into main from copilot/bugfix-certificate-error
