fix: exit with non-zero code on unhandled exceptions to enable supervisord autorestart #1124

Draft: Copilot wants to merge 4 commits into main from copilot/bugfix-certificate-error

Copilot AI commented May 29, 2026

Description

When the agent's broad exception handler in cmd.py caught an unhandled exception (e.g. an OSError from a missing certifi CA bundle during a virtualenv race condition), it logged the error but did not call sys.exit(1). Python therefore exited with code 0, so supervisord's default autorestart=unexpected policy treated it as a clean exit and did not restart the process, leaving the agent permanently dead.
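The restart decision described above can be illustrated with a small model. This is a simplified sketch of the documented `autorestart` semantics, not supervisord's actual code: the function name `should_restart` is hypothetical, and the default of `exitcodes=(0,)` assumes a recent Supervisor release (older releases also treated 2 as expected).

```python
def should_restart(exit_code, autorestart="unexpected", exitcodes=(0,)):
    """Simplified model of supervisord's restart decision for an exited process.

    Hypothetical helper for illustration only; it is not part of supervisord.
    """
    if autorestart == "true":
        # Always restart, regardless of exit code.
        return True
    if autorestart == "false":
        # Never restart automatically.
        return False
    # autorestart=unexpected: restart only if the exit code is NOT one of
    # the codes listed as expected in `exitcodes`.
    return exit_code not in exitcodes


# Exit code 0 looks like a clean exit, so the process is left stopped.
print(should_restart(0))
# A non-zero exit code is unexpected, so the process is restarted.
print(should_restart(1))
```

Under this model, a handler that logs the exception and returns normally yields exit code 0 and is never restarted, which matches the behaviour reported in the linked issue.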

Fix:

  • cmd.py: Add sys.exit(1) to the broad exception handler so any unhandled exception causes a non-zero exit, allowing supervisord to detect the crash and restart the agent.
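A minimal sketch of the pattern, assuming a wrapper shaped roughly like the one below. The names `run_agent` and `start` are hypothetical, and the real handler in cmd.py may be structured differently.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_agent(start):
    """Run `start` and turn any unhandled exception into exit code 1.

    `start` is a hypothetical stand-in for the agent's main loop.
    """
    try:
        start()
    except Exception as exc:
        # Log the full traceback so the cause of the crash is preserved.
        logger.exception(exc)
        # Without this call the function would return normally and the
        # interpreter would exit with code 0, hiding the crash from the
        # process supervisor.
        sys.exit(1)
```

`sys.exit(1)` raises `SystemExit`, so the exit code is only reported to the supervisor if nothing further up the call stack catches that exception.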

When the agent is restarted by supervisord, a transient environment issue (e.g. certifi briefly absent during a venv update) is likely resolved by the time the process comes back up. For permanent environment corruption, supervisord will put the process into the FATAL state after exhausting startretries (this applies when the process keeps failing during startup, i.e. before startsecs elapses), making it visible to operators, who can trigger the update_testflinger_action charm action to rebuild the virtualenv.

Resolved issues

Documentation

No additional documentation changes required.

Web service API changes

No API changes.

Tests

Existing unit tests in test_client.py continue to pass unchanged.

Copilot AI changed the title from "[WIP] Fix TLS CA certificate bundle error in agent" to "fix: handle OSError from missing TLS CA certificate bundle" May 29, 2026
Copilot AI requested a review from ajzobro May 29, 2026 17:56

ajzobro commented May 29, 2026

Collaborator

@copilot If the Python code crashes and produces a traceback of any kind, the application needs to be restarted (crash = no longer running). Depending on the specifics of the crash (e.g. a corrupted library installation, as in this case), the supervisor needs to restart the application at an appropriate point, such that the environment will have been re-established (again, specific to this case).

Therefore, the changes made to client.py don't seem to address the issue as much as the changes to cmd.py do.

Now, just because cmd.py will yield a non-zero exit code, we still need to ensure that actions are taken externally to set the environment up again.

Copilot AI commented May 29, 2026

Author

@copilot if the python code crashes and produces a traceback of any kind, the application needs to be restarted (crash = no longer running); now depending on the specifics of the crash (e.g. corrupted...

Good point — the client.py OSError additions were counterproductive: they caused the agent to silently swallow the error and keep looping in a degraded state, rather than crashing and letting supervisord restart it cleanly.

I've reverted all the OSError additions from client.py (and removed the associated tests). The only surviving change is the sys.exit(1) in cmd.py, which ensures that any unhandled exception, including an OSError from a missing CA bundle, produces a non-zero exit code. Supervisord's autorestart=unexpected policy then restarts the agent. For a transient race condition during a venv update, the environment should be intact by the time the agent comes back up. For permanent corruption, the agent enters the FATAL state after startretries attempts, making it visible to operators, who can trigger the update_testflinger_action charm action to rebuild the venv.

Copilot AI changed the title from "fix: handle OSError from missing TLS CA certificate bundle" to "fix: exit with non-zero code on unhandled exceptions to enable supervisord autorestart" May 29, 2026
ajzobro requested a review from rene-oromtz May 29, 2026 18:27
@rene-oromtz

Contributor

@ajzobro I feel like this could be a valid solution if the issue were really transient. From the supervisor docs, it seems that the only difference with this change is that supervisord will try to start the agent 3 times (the default, as we don't specify startretries) and then set the status to FATAL (instead of the current EXITED).

We may still need to validate package installation at the charm level, as we currently log failures but do nothing with the status code. I will probably leave this as an enhancement outside of Copilot so we can assess the most appropriate solution. I'm thinking:

  1. If package installation failed, maybe we should block the unit and allow for manual recovery via update-testflinger
  2. Or maybe add some retries before marking the unit as blocked and still require the manual fix.

What do you think?
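For illustration, the second option could look roughly like the sketch below. It is an assumption-laden example, not the charm's code: `install_with_retries`, its parameters, and the injected `run` callable are all hypothetical, and the caller would decide how to mark the unit as blocked when it returns False.

```python
import subprocess
import time


def install_with_retries(cmd, retries=3, delay=5.0, run=subprocess.run):
    """Return True if `cmd` succeeds within `retries` attempts, else False.

    Hypothetical helper: `cmd` is the package installation command as a
    list of arguments, and `run` is injectable so the logic can be tested
    without invoking a real installer.
    """
    for attempt in range(1, retries + 1):
        # check=False: inspect the return code instead of raising.
        result = run(cmd, check=False)
        if result.returncode == 0:
            return True
        if attempt < retries:
            # Wait before the next attempt in case the failure is transient.
            time.sleep(delay)
    # All attempts failed; the caller can block the unit and require the
    # manual fix.
    return False
```

Returning a boolean keeps the retry policy separate from the status handling, so the first option (block immediately) is the same call with `retries=1`.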

Development

Successfully merging this pull request may close these issues.

bug: Could not find a suitable TLS CA certificate bundle