Bridge network issues on ec2 #46

@alasdairwilson

This morning at roughly 00:33, bridge started failing health checks. This continued until around 11:10, when I rebooted the EC2 instance because it was completely unreachable, and the reboot resolved it. This mirrors a similar outage on March 10, which was also only resolved by rebooting the instance.

As far as I can tell, this was not caused by the bridge container or app crashing, since it also affected the host machine (unreachable via SSM).

As far as I can tell it was not OOM either. I did not see a single kernel log entry for any of:

  • Out of memory
  • Killed process ...
  • oom-killer

Nor did I see any service restarts caused by processes being OOM-killed.
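
A check like this can be scripted so it is repeatable after the next incident. The following is a minimal sketch, not something from this repository: `find_oom_lines` is a hypothetical helper that filters kernel log lines for the three markers listed above, matched case-insensitively.

```python
import re

# The three OOM markers named above, matched case-insensitively.
OOM_PATTERN = re.compile(r"out of memory|killed process|oom-killer", re.IGNORECASE)


def find_oom_lines(lines):
    """Return every log line that contains at least one OOM marker."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]
```

One way to use it is to save the kernel log of the previous boot with `journalctl -k -b -1` (this only works if the journal is persistent, which is an assumption about this instance) and pass the open file to `find_oom_lines`. An empty result is consistent with the "not OOM" reading but does not prove it, since logs written just before a hard hang may never reach disk.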

What I did see was that:

  • 2026-03-22 23:41:34 UTC: early warning from systemd-networkd
    • ens5: Could not set route: Connection timed out
  • IMDS was still working after that
  • last clearly good IMDS access was 2026-03-23 00:17:42 UTC
  • first clear failure was 2026-03-23 00:17:58 UTC
    • connect: network is unreachable when trying to reach 169.254.169.254

So the likely issue was loss of route/connectivity to IMDS on ens5, and once that happened the box couldn’t repair itself.
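
The IMDS check behind the timeline above can be reproduced with a short probe. This is a minimal sketch and not the check that produced those log lines: `probe_imds` is a hypothetical helper, it assumes IMDSv2 (session token obtained via PUT) is enabled on the instance, and it deliberately ignores any configured HTTP proxy because the address is link-local.

```python
import urllib.request

IMDS_BASE = "http://169.254.169.254"  # link-local IMDS address from the log above

# Empty ProxyHandler: never route IMDS traffic through an HTTP proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe_imds(base_url=IMDS_BASE, timeout=2.0):
    """Return (ok, detail) for one IMDSv2 round trip."""
    try:
        # Step 1: request a short-lived session token.
        token_req = urllib.request.Request(
            f"{base_url}/latest/api/token",
            method="PUT",
            headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
        )
        with _opener.open(token_req, timeout=timeout) as resp:
            token = resp.read().decode()
        # Step 2: read a small metadata key using that token.
        data_req = urllib.request.Request(
            f"{base_url}/latest/meta-data/instance-id",
            headers={"X-aws-ec2-metadata-token": token},
        )
        with _opener.open(data_req, timeout=timeout) as resp:
            return True, f"HTTP {resp.status}"
    except OSError as exc:
        # URLError, HTTPError, timeouts and "network is unreachable" are all OSError.
        return False, str(exc)
```

Logging the `detail` string on every failure would distinguish a missing route ("network is unreachable") from a timeout or an HTTP-level error the next time this happens.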

I have no idea what caused this or how to fix it, and I have no idea why the instance did not recover gracefully.
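
To narrow down the cause next time, it may help to record the routing state of ens5 whenever a check fails. The sketch below parses the text of `/proc/net/route` and reports whether an active default route exists on an interface. It is an illustration only: `parse_routes` and `has_default_route` are hypothetical names, the hex decoding assumes a little-endian host, and the idea that IMDS is reached via the default route on this instance is an unverified assumption.

```python
import socket
import struct

RTF_UP = 0x0001  # kernel flag: route is usable


def _hex_to_ip(field):
    """Decode a /proc/net/route hex field (little-endian host) to dotted quad."""
    return socket.inet_ntoa(struct.pack("<L", int(field, 16)))


def parse_routes(text, iface):
    """Return (destination, gateway, mask) for each active route on iface."""
    routes = []
    for line in text.splitlines()[1:]:  # first line is the column header
        cols = line.split()
        if len(cols) < 8 or cols[0] != iface:
            continue
        if not int(cols[3], 16) & RTF_UP:
            continue
        routes.append((_hex_to_ip(cols[1]), _hex_to_ip(cols[2]), _hex_to_ip(cols[7])))
    return routes


def has_default_route(text, iface):
    """True if iface has an active 0.0.0.0/0 route."""
    return any(
        dest == "0.0.0.0" and mask == "0.0.0.0"
        for dest, _gateway, mask in parse_routes(text, iface)
    )
```

For example, `has_default_route(open("/proc/net/route").read(), "ens5")` could be logged by the watchdog alongside each failed check, so the logs show whether the route table itself changed around the time of the failure.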

What I’ve done:

  • added a small watchdog under ops/watchdog/
  • it checks IMDS and http://127.0.0.1/ every minute
  • after 5 failed minutes it restarts systemd-networkd
  • after 10 failed minutes it reboots the instance

The reason for this two-tier restart system is that I don't know whether restarting systemd-networkd is sufficient to restore functionality; I couldn't test this because the instance was unreachable.
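
For readers who have not looked at ops/watchdog/, the escalation logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in that directory: all function names are hypothetical, a minute counts as failed if either check fails, any HTTP response below 500 (including a 401 from IMDS) counts as reachable, and systemd-networkd is restarted only once, at the fifth consecutive failure.

```python
import subprocess
import time
import urllib.error
import urllib.request

IMDS_URL = "http://169.254.169.254/latest/meta-data/"
APP_URL = "http://127.0.0.1/"
RESTART_NETWORKD_AFTER = 5  # consecutive failed minutes
REBOOT_AFTER = 10           # consecutive failed minutes

# Empty ProxyHandler: health checks must not go through an HTTP proxy.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def reachable(url, timeout=3.0):
    """True if the endpoint answered at all (assumption: any status < 500 counts)."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as exc:
        return exc.code < 500
    except OSError:
        return False


def all_healthy():
    """Assumption: a minute is healthy only if both endpoints are reachable."""
    return reachable(IMDS_URL) and reachable(APP_URL)


def next_action(failed_minutes):
    """Map the current failure streak to an escalation step."""
    if failed_minutes >= REBOOT_AFTER:
        return "reboot"
    if failed_minutes == RESTART_NETWORKD_AFTER:
        return "restart-networkd"
    return "none"


def perform(action):
    """Execute an escalation step; 'none' does nothing."""
    if action == "restart-networkd":
        subprocess.run(["systemctl", "restart", "systemd-networkd"], check=False)
    elif action == "reboot":
        subprocess.run(["systemctl", "reboot"], check=False)


def run(check=all_healthy, act=perform, sleep=time.sleep, interval=60, max_ticks=None):
    """Check once per interval; a healthy check resets the failure streak."""
    failures = 0
    tick = 0
    while max_ticks is None or tick < max_ticks:
        failures = 0 if check() else failures + 1
        act(next_action(failures))
        sleep(interval)
        tick += 1
    return failures
```

Keeping `next_action` separate from the side effects means the thresholds can be tested without touching the network or rebooting anything, for example by calling `run` with a fake `check`, a list's `append` as `act`, and a no-op `sleep`.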

I also installed and tested it manually on the instance today.

We will just have to see whether it happens again and, if so, what the logs tell us.

Labels: bug
