
MACsec sessions never recover after a dirty macsecmgrd restart #4571

@senthil-nexthop


Summary

When the macsec Docker container is killed before macsecmgrd can issue disableMACsec on every port (SIGKILL, TimeoutStopSec exceeded during systemctl stop, or the per-port serial shutdown running out of time on a high-port-count box), orchagent's MACsecOrch in-memory state and the SAI MACsec objects in ASIC_DB both survive the restart. After the restart, wpa_supplicant negotiates a fresh SAK with the peer and writes it to APPL_DB, but MACsecOrch fails to propagate that SAK down to SAI.

Result: the ASIC keeps encrypting and decrypting with the previous cycle's SAK while userspace believes a rekey has happened. ICV validation fails on every received frame, so LACPDUs and LLDPDUs are silently dropped, LACP falls back to defaulted, and PortChannel members are deselected. SAI counters show SAI_MACSEC_SA_STAT_IN_PKTS_NOT_VALID climbing into the hundreds of millions with IN_PKTS_OK near zero.

Recovery requires forcing a fresh SA install on every affected port (config macsec port del followed by config macsec port add).

Environment

  • SONiC version: 202511
  • Affected Components: orchagent/macsecorch.cpp
  • Devices: Any SONiC device with MACsec enabled

Steps to reproduce

  1. Bring up SONiC with ≥2 MACsec-protected ports across an LACP portchannel, with the peer also MACsec-enabled, MKA converged, and traffic flowing.
  2. Confirm baseline: show interfaces portchannel shows members (S), IN_PKTS_OK advancing on both ends, APPL_DB SAK matches ASIC_DB SAK.
  3. Hard-kill the macsec container to bypass macsecmgrd's per-port disable loop:
    sudo docker kill -s 9 macsec
    
  4. Wait ~30–45 s for systemd to respawn the container, the new wpa_supplicant instances to come up, and MKA to re-negotiate.
  5. Check counters on any affected port. On a port that hit the bug, IN_PKTS_NOT_VALID keeps climbing while IN_PKTS_OK stays near zero:
    show macsec <port> | grep -E 'IN_PKTS_(OK|NOT_VALID)|CURRENT_XPN'
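
The counter check in step 5 can be scripted by comparing two snapshots, since the counters are cumulative and a single reading cannot tell a stale SAK from an old burst of errors. The following is a minimal sketch, not part of SONiC: the helper names counter and macsec_verdict are hypothetical, and it assumes each counter line in the show macsec output ends with its integer value. Adjust the parsing if your build prints a different layout.

```shell
# counter NAME FILE
# Sum the last field of every line in FILE that contains NAME.
# ASSUMPTION: the counter value is the last whitespace-separated field.
counter() {
    awk -v name="$1" 'index($0, name) { sum += $NF }
                      END { printf "%.0f\n", sum }' "$2"
}

# macsec_verdict BEFORE_FILE AFTER_FILE
# Flag the port when IN_PKTS_NOT_VALID grew faster than IN_PKTS_OK
# between the two snapshots (a heuristic, not a definitive test).
macsec_verdict() {
    ok_delta=$(( $(counter IN_PKTS_OK "$2") - $(counter IN_PKTS_OK "$1") ))
    bad_delta=$(( $(counter IN_PKTS_NOT_VALID "$2") - $(counter IN_PKTS_NOT_VALID "$1") ))
    if [ "$bad_delta" -gt "$ok_delta" ]; then
        echo "STALE_SAK_SUSPECTED"
    else
        echo "OK"
    fi
}

# Example use on the switch (port name and interval are placeholders):
#   show macsec Ethernet0 > /tmp/before; sleep 10
#   show macsec Ethernet0 > /tmp/after
#   macsec_verdict /tmp/before /tmp/after
```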
    

Root cause

After a dirty macsec container restart:

  • The old wpa_supplicant processes are killed without sending a CTRL-EVENT-DISCONNECTED reason=3 locally_generated=1 MKPDU to peers.
  • macsecmgrd does not get to issue per-port disableMACsec, so the APPL_DB MACSEC_*_SA_TABLE entries and the SAI MACsec objects (PORT / SC / SA / FLOW) persist.
  • orchagent is not restarted, so its MACsecOrch::m_macsec_ports / m_ingress_scs / m_sa_ids maps continue to reference the existing SAI OIDs.
  • The new wpa_supplicant instances negotiate a fresh SAK and write it to APPL_DB. MACsecOrch::doTask sees a SET on an SA / SC its in-memory state says already exists, and takes one of three buggy paths instead of delete-then-recreate.

SAI_MACSEC_SA_ATTR_SAK is CREATE_ONLY per SAI spec, so without a delete the SAK in hardware is immutable; the only way to install a new SAK is to remove and re-create the SAI_OBJECT_TYPE_MACSEC_SA object.
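
To illustrate why a CREATE_ONLY key forces delete-then-recreate, here is a toy model in shell. It is not orchagent or SAI code: each "SA object" is just a file whose content stands in for the SAK, and the names create_sa, remove_sa, and apply_sak are hypothetical.

```shell
# Toy store: one file per SA object, file content = SAK.
SA_DIR=$(mktemp -d)

# create_sa NAME SAK
# Refuses to touch an existing object, mimicking a CREATE_ONLY attribute.
create_sa() {
    if [ -e "$SA_DIR/$1" ]; then
        return 1                      # object exists: its SAK cannot be changed
    fi
    printf '%s\n' "$2" > "$SA_DIR/$1"
}

# remove_sa NAME
remove_sa() {
    rm -f "$SA_DIR/$1"
}

# apply_sak NAME SAK
# Handles a SET for an SA that may already exist.
apply_sak() {
    if [ -e "$SA_DIR/$1" ]; then
        if [ "$(cat "$SA_DIR/$1")" = "$2" ]; then
            return 0                  # same key already installed: nothing to do
        fi
        remove_sa "$1"                # stale key: remove the object first
    fi
    create_sa "$1" "$2"               # then create it with the new key
}
```

In this model, calling create_sa again with a new key leaves the old key in place, which mirrors the symptom described above; apply_sak is the delete-then-recreate path that replaces it.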

Workaround

For each affected port:

sudo config macsec port del Ethernet<N>
sudo config macsec port add Ethernet<N> <profile>

This forces orchagent to issue remove_macsec_sa (freeing the SAI OID), then create_macsec_sa with the current MKA-distributed SAK from APPL_DB. ICV validation starts succeeding, LACP re-converges, and LLDP recovers. Expect brief packet loss on each port while it is re-keyed.
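
On a box with many affected ports, the del/add pair can be wrapped in a loop. This is a rough sketch: the function name reinstall_macsec and the DRY_RUN switch are hypothetical conveniences, and the profile name and port list are placeholders for your own values. Only the two config macsec commands come from the workaround above.

```shell
# reinstall_macsec PROFILE PORT...
# Runs "config macsec port del" then "config macsec port add" for each port.
# With DRY_RUN=1 the commands are printed instead of executed.
reinstall_macsec() {
    profile=$1
    shift
    for port in "$@"; do
        for cmd in "config macsec port del $port" \
                   "config macsec port add $port $profile"; do
            if [ "${DRY_RUN:-0}" = 1 ]; then
                echo "sudo $cmd"
            else
                # $cmd is intentionally unquoted so it splits into arguments.
                sudo $cmd || return 1   # stop at the first failure
            fi
        done
    done
}

# Preview first, then run for real:
#   DRY_RUN=1 reinstall_macsec <profile> Ethernet0 Ethernet4
#   reinstall_macsec <profile> Ethernet0 Ethernet4
```

Previewing with DRY_RUN=1 is worthwhile because each port drops traffic briefly while it is re-keyed.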
