Bridge network issues on EC2

This morning at roughly 00:33 the bridge started failing health checks. This continued until 11:10 or so, when I rebooted the EC2 instance because it had become completely unreachable. The reboot resolved it. This mirrors a similar outage on March 10 that was also only resolved by rebooting the instance.

As far as I can tell, this was not due to the bridge container or app crashing, since it also affected the host machine (unreachable via SSM).

As far as I can tell it was not an OOM either. I did not see a single kernel log line containing any of:

- `Out of memory`
- `Killed process ...`
- `oom-killer`

Nor did I see any service restarts caused by processes being OOM-killed.
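
For anyone who wants to repeat the OOM check, here is a minimal Python sketch that filters log lines for the three kernel OOM markers mentioned above. It is not the exact command I ran. The function names (`find_oom_lines`, `main`) are made up for illustration, and it assumes the kernel log is piped in on stdin.

```python
import sys

# Substrings the Linux kernel logs when the OOM killer fires.
OOM_MARKERS = ("out of memory", "killed process", "oom-killer")


def find_oom_lines(lines):
    """Return the lines that contain any OOM marker (case-insensitive)."""
    hits = []
    for line in lines:
        lowered = line.lower()
        if any(marker in lowered for marker in OOM_MARKERS):
            hits.append(line.rstrip("\n"))
    return hits


def main():
    # Read kernel log lines from stdin and print only the OOM-related ones.
    hits = find_oom_lines(sys.stdin)
    for line in hits:
        print(line)
    print(f"{len(hits)} OOM-related line(s) found", file=sys.stderr)
```

To use it, save it under a name of your choice, add a call to `main()` at the bottom, and pipe in the kernel log of the previous boot, e.g. the output of `journalctl -k -b -1`. That only works if the journal is persisted across reboots.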

What I did see was that:

- `2026-03-22 23:41:34 UTC`: early warning from `systemd-networkd`
  - `ens5: Could not set route: Connection timed out`
- IMDS was still working after that
- last clearly good IMDS access was `2026-03-23 00:17:42 UTC`
- first clear failure was `2026-03-23 00:17:58 UTC`
  - `connect: network is unreachable` when trying to reach `169.254.169.254`

So the likely issue was loss of route/connectivity to IMDS on `ens5`, and once that happened the box couldn’t repair itself.

I don’t know what caused this or how to fix it, and I don’t know why the instance did not recover gracefully.

What I’ve done:
- added a small watchdog under `ops/watchdog/`
- it checks IMDS and `http://127.0.0.1/` every minute
- after 5 failed minutes it restarts `systemd-networkd`
- after 10 failed minutes it reboots the instance
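
The escalation logic above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual contents of `ops/watchdog/`. The function names are hypothetical, and it assumes that a minute counts as failed if either check fails, that IMDS is probed via the IMDSv2 token endpoint, and that the `systemd-networkd` restart fires once (at exactly 5 failed minutes) rather than on every subsequent failed minute.

```python
import subprocess
import time
import urllib.request

IMDS_TOKEN_URL = "http://169.254.169.254/latest/api/token"  # IMDSv2 token endpoint
LOCAL_URL = "http://127.0.0.1/"
RESTART_AFTER = 5   # consecutive failed minutes before restarting systemd-networkd
REBOOT_AFTER = 10   # consecutive failed minutes before rebooting the instance


def imds_ok(timeout=3.0):
    """True if an IMDSv2 session token can be fetched."""
    req = urllib.request.Request(
        IMDS_TOKEN_URL,
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # Any error (unreachable network, timeout, HTTP error) counts as a failure.
        return False


def local_ok(timeout=3.0):
    """True if the local HTTP endpoint answers with a 2xx/3xx status."""
    try:
        with urllib.request.urlopen(LOCAL_URL, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except Exception:
        return False


def update_failures(previous, healthy):
    """Reset the counter on a healthy check, otherwise count one more failed minute."""
    return 0 if healthy else previous + 1


def next_action(consecutive_failures):
    """Map the failure count to an action: 'none', 'restart-networkd' or 'reboot'."""
    if consecutive_failures >= REBOOT_AFTER:
        return "reboot"
    if consecutive_failures == RESTART_AFTER:
        return "restart-networkd"
    return "none"


def run_forever(interval_seconds=60):
    """Check once per interval and escalate according to next_action()."""
    failures = 0
    while True:
        failures = update_failures(failures, imds_ok() and local_ok())
        action = next_action(failures)
        if action == "restart-networkd":
            subprocess.run(["systemctl", "restart", "systemd-networkd"], check=False)
        elif action == "reboot":
            subprocess.run(["systemctl", "reboot"], check=False)
        time.sleep(interval_seconds)
```

Keeping the decision in a pure function (`next_action`) means the thresholds can be tested without touching the network. A long-running loop like `run_forever()` keeps the counter in memory. If the check were run from a timer instead, the counter would need to be persisted between runs.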

The reason for this two-tier restart system is that I don’t know whether restarting `systemd-networkd` is sufficient to restore functionality, since I couldn’t test this (the instance was unreachable).

I also installed and tested it manually on the instance today.

We will just have to see if it happens again and if so what the logs will tell us.

