fix: exit with non-zero code on unhandled exceptions to enable supervisord autorestart #1124
Copilot wants to merge 4 commits into
…ith code 1 on unhandled exceptions
@copilot if the Python code crashes and produces a traceback of any kind, the application needs to be restarted (crash = no longer running). Depending on the specifics of the crash (e.g. a corrupted library installation, as in this case), the supervisor needs to restart the application at an appropriate point such that the environment will be re-established (again, specific to this case). Therefore the changes made to Now, just because
Good point: the I've reverted all the
@ajzobro I feel like this could be a valid solution if the issue really were transient. From the supervisor docs, it seems the only difference is that supervisord will try to start the agent 3 times (the default, since we don't specify it) and then set the status to `FATAL`. We may still need validation of the package installation at the charm level, as currently we log failures but do nothing with the status code. I will probably leave this as an enhancement outside of Copilot so we can assess the most appropriate solution. I'm thinking:
What do you think?
Description
When the agent's broad exception handler in `cmd.py` caught an unhandled exception (e.g. an `OSError` from a missing `certifi` CA bundle during a virtualenv race condition), it logged the error but did not call `sys.exit(1)`. Python therefore exited with code `0`, so supervisord's default `autorestart=unexpected` policy treated it as a clean exit and did not restart the process, leaving the agent permanently dead.

Fix:
- `cmd.py`: Add `sys.exit(1)` to the broad exception handler so any unhandled exception causes a non-zero exit, allowing supervisord to detect the crash and restart the agent.

When the agent is restarted by supervisord, a transient environment issue (e.g. `certifi` briefly absent during a venv update) is likely resolved by the time the process comes back up. For permanent environment corruption, supervisord will put the process into `FATAL` state after exhausting `startretries`, making it visible to operators who can trigger the `update_testflinger_action` charm action to rebuild the virtualenv.

Resolved issues
Documentation
No additional documentation changes required.
Web service API changes
No API changes.
Tests
Existing unit tests in `test_client.py` continue to pass unchanged.
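
For reviewers who want to check the exit-code behaviour in isolation, here is a minimal sketch of the pattern described above. It is not the actual contents of `cmd.py`: the names `run_agent`, `main`, and `exit_code_of` are hypothetical and exist only for illustration.

```python
import logging
import sys

logger = logging.getLogger(__name__)


def run_agent():
    """Hypothetical stand-in for the agent's main loop (not the real code)."""
    # Simulate the kind of failure described in this PR.
    raise OSError("Could not find a suitable TLS CA certificate bundle")


def main(run=run_agent):
    """Run the agent and exit non-zero on any unhandled exception."""
    try:
        run()
    except Exception:
        # Log the traceback, then exit with a non-zero code so that a
        # process supervisor sees the exit as a crash, not a clean stop.
        logger.exception("Unhandled exception, exiting")
        sys.exit(1)


def exit_code_of(func):
    """Return the exit code func would produce (0 if it returns normally)."""
    try:
        func()
    except SystemExit as exc:
        return exc.code
    return 0
```

With this shape, `exit_code_of(main)` yields `1` for the failing stand-in, while a run that returns normally yields `0`. Because `SystemExit` derives from `BaseException` rather than `Exception`, the `sys.exit(1)` call inside the handler is not swallowed by the same `except Exception` clause.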