This morning at roughly 00:33, bridge started failing health checks. This continued until about 11:10, when I rebooted the EC2 instance because it had become completely unreachable. This mirrors a similar outage on March 10, which was also only resolved by rebooting the instance.
As far as I can tell, this was not due to the bridge container or app crashing, since it also affected the host machine (unreachable via SSM).
As far as I can tell it was not OOM. I did not see a single kernel log for any of:
- Out of memory
- Killed process ...
- oom-killer
Nor did I see service restarts due to processes being OOM-killed.
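For anyone who wants to repeat the OOM check, here is a minimal sketch of the search. The function name `find_oom_lines` is my own, and the sample log lines are made up for illustration; they are not taken from the instance.

```python
import re

# The three strings searched for in the kernel log (from the note above).
OOM_PATTERN = re.compile(r"Out of memory|Killed process|oom-killer")


def find_oom_lines(lines):
    """Return every log line that mentions one of the OOM markers."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


# Hypothetical sample input, only to show the behaviour.
sample = [
    "kernel: ens5: link is up\n",
    "kernel: Out of memory: Killed process 1234 (example)\n",
    "kernel: example invoked oom-killer: gfp_mask=0x0\n",
]
matches = find_oom_lines(sample)
```

On the instance, the input could come from something like `journalctl -k`, fed in line by line. An empty result is consistent with "not OOM", but only if the kernel log for the outage window was actually retained.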
What I did see was:
- 2026-03-22 23:41:34 UTC: early warning from systemd-networkd: "ens5: Could not set route: Connection timed out"
- IMDS was still working after that
- last clearly good IMDS access was 2026-03-23 00:17:42 UTC
- first clear failure was 2026-03-23 00:17:58 UTC: "connect: network is unreachable" when trying to reach 169.254.169.254
So the likely issue was loss of route/connectivity to IMDS on ens5, and once that happened the box couldn’t repair itself.
I don't know what caused this or how to fix it, and I don't know why the instance did not recover gracefully.
What I’ve done:
- added a small watchdog under ops/watchdog/
- it checks IMDS and http://127.0.0.1/ every minute
- after 5 failed minutes it restarts systemd-networkd
- after 10 failed minutes it reboots the instance
The reason for this two-tier restart system is that I don't know whether restarting systemd-networkd is sufficient to restore functionality, since I couldn't test this (the instance was unreachable).
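For reference, the two-tier logic can be sketched roughly like this. This is not the actual code under ops/watchdog/, just an illustration of the escalation. Assumptions that are mine: a minute counts as failed if either check fails, the counter resets on the first healthy minute, systemd-networkd is restarted once at exactly 5 failures, IMDS is probed through the IMDSv2 token endpoint, and all function names are hypothetical.

```python
import http.client
import subprocess
import time
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"
LOCAL_URL = "http://127.0.0.1/"
RESTART_AFTER = 5   # consecutive failed minutes before restarting systemd-networkd
REBOOT_AFTER = 10   # consecutive failed minutes before rebooting the instance


def http_ok(request, timeout=2.0):
    """True if the request gets a 2xx/3xx response within the timeout."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (OSError, http.client.HTTPException):
        # Covers unreachable network, refused connections, timeouts, HTTP errors.
        return False


def checks_pass():
    """Assumption: the minute is healthy only if BOTH checks succeed."""
    imds = urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    return http_ok(imds) and http_ok(LOCAL_URL)


def decide_action(failures):
    """Map the consecutive-failure count to an action name."""
    if failures >= REBOOT_AFTER:
        return "reboot"
    if failures == RESTART_AFTER:
        return "restart-networkd"
    return "none"


def step(failures, healthy):
    """Advance the counter by one minute and return (new_count, action)."""
    failures = 0 if healthy else failures + 1
    return failures, decide_action(failures)


def main():
    """Run one check per minute and escalate; needs root for systemctl."""
    failures = 0
    while True:
        failures, action = step(failures, checks_pass())
        if action == "restart-networkd":
            subprocess.run(["systemctl", "restart", "systemd-networkd"], check=False)
        elif action == "reboot":
            subprocess.run(["systemctl", "reboot"], check=False)
        time.sleep(60)
```

Keeping `decide_action` and `step` free of I/O means the escalation can be tested without touching the network. Calling `main()` would start the loop.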
I also installed and tested it manually on the instance today.
We will just have to see if it happens again and if so what the logs will tell us.