Skip to content

Memory corruption due to race condition introduced by https://github.com/sonic-net/sonic-swss/pull/4400 #4600

@venkit-nexthop

Description

@venkit-nexthop

@deepak-singhal0408 @prabhataravind @prsunny for viz

This problem is caused by the changes in PR #4400

This change is exposing a race condition that can lead to memory corruption. We noticed Orchagent's ZMQ thread crash when METADATA configuration is changed and Orchagent is sent SIGTERM.

Before this change, orchagent terminated via _exit() on SIGTERM and ~OrchDaemon never ran.

This is where main thread is:

  main()  main.cpp:1047
  └── shared_ptr<OrchDaemon>::~shared_ptr   shared_ptr.h:175
      └── ~OrchDaemon                       orchdaemon.cpp:139
          └── ~PortsOrch                    portsorch.h:151
              └── ~FlexCounterTaggedCachedManager / ~FlexCounterCachedManager
                  └── ~FlexCounterManager   flex_counter_manager.cpp:138
                      └── stopFlexCounterPolling   saihelper.cpp:1106
                          └── sai_switch_api->set_switch_attribute
                              └── sairedis::Sai/Server/ClientServerSai::set
                                  └── RedisRemoteSaiInterface::set / notifyCounterOperations
                                      └── RedisRemoteSaiInterface::waitForResponse   RedisRemoteSaiInterface.cpp:935
                                          └── ZeroMQChannel::wait   ZeroMQChannel.cpp:279
                                              └── zmq_poll          ← BLOCKED HERE
  1. ~FlexCounterManager makes blocking SAI calls during destruction, round-tripping through sairedis's ZMQ channel to syncd. Main thread parks in zmq_poll.
  2. Meanwhile, multiple ZMQ I/O threads remain alive for various contexts (sairedis channel, north-side ZmqServer, and any ZmqProducerStateTables owned by orchs that were already deleted earlier in the reverse-order loop).
  3. Orchs deleted earlier likely freed memory still referenced by in-flight ZMQ messages — those zmq_ctx_term calls never happen because owners go out of scope without explicit termination. The crashing thread then dereferences a freed message buffer.
  4. The new async swss recorder thread (SwSSRec::setAsync(true), started lazily via ensureAsyncWorkerLocked()) adds another draining background thread sharing state with destructed orchs — second-order risk.

The graceful-shutdown signal handlers were added for one narrow purpose: to drain the new async swss.rec worker's queue on SIGTERM/SIGINT so pending records aren't lost on shutdown. Without it, _exit() on SIGTERM would discard whatever the recorder worker hadn't yet flushed to disk.
The problem with the implementation: rather than signaling just the recorder worker to drain and then calling _exit(), the PR set gOrchShutdownRequested, which lets start() return and triggers the entire ~OrchDaemon destructor chain. That's far broader than needed for the goal — and it's the chain that exposes the latent shutdown-ordering bugs (FlexCounterManager doing SAI calls during destruction, ZMQ I/O threads still alive while orchs are torn down) that crashed orchagent.
A narrower fix would have been: signal handler → drain recorder queue → _exit(). No full destructor chain needed.

Signal handlers are always fraught and we should really quantify what we get out of this.

Metadata

Metadata

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

Status
Todo

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions