Summary
When the macsec Docker container is killed (SIGKILL, TimeoutStopSec exceeded during systemctl stop, or simply the per-port serial shutdown running out of time on a high-port-count box) without giving macsecmgrd time to issue disableMACsec on every port, orchagent's MACsecOrch in-memory state and the SAI MACsec objects in ASIC_DB both survive the restart. The post-restart wpa_supplicant then negotiates a fresh SAK with the peer and writes it to APPL_DB, but MACsecOrch fails to propagate that SAK down to SAI.
Result: the ASIC keeps encrypting and decrypting with the previous cycle's SAK while userspace believes a rekey has happened. ICV fails on every received frame; LACPDUs and LLDPDUs are silently dropped; LACP goes defaulted and PortChannel members are deselected; SAI counters show SAI_MACSEC_SA_STAT_IN_PKTS_NOT_VALID climbing into the hundreds of millions with IN_PKTS_OK near zero.
Recovery requires forcing a fresh SA install on every affected port (config macsec port del followed by config macsec port add).
Environment
- SONiC version: 202511
- Affected Components: orchagent/macsecorch.cpp
- Devices: Any SONiC device with MACsec enabled
Steps to reproduce
- Bring up SONiC with ≥2 MACsec-protected ports in an LACP portchannel, with the peer also MACsec-enabled, MKA converged, and traffic flowing.
- Confirm the baseline: show interfaces portchannel lists the members as selected (S), IN_PKTS_OK is advancing on both ends, and the APPL_DB SAK matches the ASIC_DB SAK.
- Hard-kill the macsec container to bypass macsecmgrd's per-port disable loop:
sudo docker kill -s 9 macsec
- Wait ~30–45 s for systemd to respawn the container, the new wpa_supplicant instances to come up, and MKA to re-negotiate.
- Check counters on any affected port:
show macsec <port> | grep -E 'IN_PKTS_(OK|NOT_VALID)|CURRENT_XPN'
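The counter check in the last step can be turned into a pass/fail verdict. The sketch below is only an illustration: the function name classify_macsec_counters and the "invalid frames outnumber valid ones by 100x" threshold are hypothetical, and it assumes show macsec prints each counter as a name followed by whitespace and a value on a single line.

```shell
#!/usr/bin/env bash
# Classify one port from `show macsec <port>` output read on stdin.
# ASSUMPTION: each counter appears as "<NAME> <whitespace> <VALUE>" on one line.
classify_macsec_counters() {
    awk '
        # Sum over all ingress SAs reported for the port.
        $1 ~ /IN_PKTS_OK$/        { ok  += $NF }
        $1 ~ /IN_PKTS_NOT_VALID$/ { bad += $NF }
        END {
            # Hypothetical heuristic: ICV failures present and dwarfing valid frames.
            verdict = (bad > 0 && bad > 100 * ok) ? "STALE_SAK_SUSPECTED" : "OK"
            printf "%s ok=%.0f bad=%.0f\n", verdict, ok, bad
        }'
}
```

On an affected switch this would be fed from the command in the step above, for example: show macsec Ethernet0 | classify_macsec_counters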
Root cause
After a dirty macsec container restart:
- The old wpa_supplicant processes are killed without sending a CTRL-EVENT-DISCONNECTED reason=3 locally_generated=1 MKPDU to peers.
- macsecmgrd does not get to issue per-port disableMACsec, so the APPL_DB MACSEC_*_SA_TABLE entries and the SAI MACsec objects (PORT / SC / SA / FLOW) persist.
- orchagent is not restarted, so its MACsecOrch::m_macsec_ports / m_ingress_scs / m_sa_ids maps continue to reference the existing SAI OIDs.
- The new wpa_supplicant instances negotiate a fresh SAK and write it to APPL_DB. MACsecOrch::doTask sees a SET on an SA / SC its in-memory state says already exists, and takes one of three buggy paths instead of delete-then-recreate.
SAI_MACSEC_SA_ATTR_SAK is CREATE_ONLY per SAI spec, so without a delete the SAK in hardware is immutable; the only way to install a new SAK is to remove and re-create the SAI_OBJECT_TYPE_MACSEC_SA object.
Workaround
For each affected port:
sudo config macsec port del Ethernet<N>
sudo config macsec port add Ethernet<N> <profile>
This forces orchagent to issue remove_macsec_sa (freeing the SAI OID), then create_macsec_sa with the current MKA-distributed SAK from APPL_DB. ICV validation starts succeeding, LACP re-converges, and LLDP recovers. Expect brief packet loss during the per-port rekey.
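On a box with many affected ports, the two commands can be wrapped in a loop. This is a minimal sketch, not a tested recovery tool: the function name reinstall_macsec_ports and the DRY_RUN switch are hypothetical, and it assumes every listed port uses the same MACsec profile.

```shell
#!/usr/bin/env bash
# Delete and re-add MACsec on each listed port so the SAs are removed and re-created.
# Usage: reinstall_macsec_ports <profile> <port> [<port> ...]
# Set DRY_RUN=1 to print the commands instead of executing them.
reinstall_macsec_ports() {
    local profile=$1
    shift
    local port
    for port in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "sudo config macsec port del $port"
            echo "sudo config macsec port add $port $profile"
        else
            # Stop at the first failure so a half-configured port is noticed.
            sudo config macsec port del "$port" || return 1
            sudo config macsec port add "$port" "$profile" || return 1
        fi
    done
}
```

Running it once with DRY_RUN=1 shows exactly which commands would be issued; ports are handled one at a time, so the packet loss stays limited to the port currently being re-keyed.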