Skip to content

[featured] fix stale FEATURE entry after concurrent package uninstall#384

Open
DavidZagury wants to merge 3 commits into
sonic-net:masterfrom
DavidZagury:masster_stale_appextension
Open

[featured] fix stale FEATURE entry after concurrent package uninstall#384
DavidZagury wants to merge 3 commits into
sonic-net:masterfrom
DavidZagury:masster_stale_appextension

Conversation

@DavidZagury

Copy link
Copy Markdown

Why I did it

After uninstalling a SONiC application package, the feature could reappear in show feature status with an incomplete entry (state: disabled, missing auto_restart, set_owner, etc.), even though the package was fully removed. This was caused by a race between featured and sonic-package-manager during uninstall.

sonic-package-manager unblocks as soon as featured writes disabled to STATE_DB, then immediately runs deregister() which deletes the FEATURE entry from CONFIG_DB. At that point featured is still inside its handler and has not finished writing back to CONFIG_DB. Two code paths independently recreate the deleted entry with partial data:

  1. sync_feature_scope — called after a successful disable_feature(), it writes has_per_asic_scope and has_global_scope back to CONFIG_DB. disable_feature() writes disabled to STATE_DB as its last step, which immediately unblocks spm; by the time sync_feature_scope runs, spm may have already deleted the entry. mod_entry on a missing Redis key recreates the hash with only those two fields — no state key. featured then receives a notification for this partial entry, reads state=None, and logs ERR featured: Unexpected state value 'None' for feature .
  2. resync_feature_state — called when update_feature_state() fails (e.g. systemctl start fails because the service file was already removed by the concurrent uninstall). The enable path is triggered by config feature state enabled followed immediately by sonic-package-manager uninstall. Spm unblocks on the earlier STATE_DB disabled written during install processing, removes the service file, and deletes the FEATURE entry — all while featured is still waiting for systemctl start to complete. When start fails, resync_feature_state is called and finds either a fully absent entry or the partial entry left by (1). In both cases current_feature_state is None, _feature_state_is_template(None) evaluates True, and mod_entry({state: disabled}) is called — completing or creating a ghost FEATURE record in CONFIG_DB that persists after the package is gone.

How I did it

Three guards added to featured:

  • sync_feature_scope — post-write check: after both mod_entry calls, re-read the entry. If it exists but has no state key, the write raced with deregister and recreated a deleted key; delete the partial entry immediately.
  • resync_feature_state — deleted-entry check: if get_entry() returns an empty dict, return without touching CONFIG_DB.
  • resync_feature_state — partial-entry check: if the entry is non-empty but has no state key, it is the artifact from sync_feature_scope; delete it rather than completing the ghost record by writing the state field back.

Additionally, the ERR log for state=None in update_feature_state is suppressed since state=None means the FEATURE entry was deleted by a concurrent uninstall while the notification was queued — not a programming error. Any other unexpected non-None state still logs ERR.

When sonic-package-manager uninstalls a package it deletes the FEATURE
entry from CONFIG_DB.  Because featured unblocks spm (via STATE_DB) before
finishing its own handler, the entry can be gone by the time featured tries
to write back to it.  Two code paths independently recreate the deleted
entry, leaving a ghost record that survives the uninstall and appears in
"show feature status":

1. sync_feature_scope(): called after a successful disable_feature(), it
   calls mod_entry({has_per_asic_scope, has_global_scope}) on the already-
   deleted key.  Redis HSET recreates the hash with only those two fields
   (no 'state'), so the next notification arrives with state=None and
   featured logs "ERR: Unexpected state value 'None'".
   There is also a TOCTOU gap: spm can delete the entry between the
   mod_entry calls and the post-write read, producing the same partial
   entry.

2. resync_feature_state(): called when update_feature_state() fails (e.g.
   systemctl start fails because the service file was already removed by
   the concurrent uninstall).  get_entry() returns {} for a deleted key,
   _feature_state_is_template(None) evaluates True, and mod_entry writes
   {state: disabled} back - recreating the entry with no other fields.
   The same function also completes partial entries left by (1): a non-
   empty entry with no 'state' key passes the deleted-entry guard and
   receives a state field, producing a 3-field ghost record.

Fix all paths:

- sync_feature_scope: after the two mod_entry writes, re-read the entry.
  If it exists but has no 'state' key, the write raced with deregister;
  delete the partial entry immediately.

- resync_feature_state: return early if the entry is fully absent.  If
  the entry exists but has no 'state' key, delete it rather than
  completing the ghost record.

- update_feature_state: suppress the ERR log when state is None, since
  that is a normal consequence of the race rather than a programming error.

Signed-off-by: david.zagury <davidza@nvidia.com>
test_feature_resync: the sub-case where get_entry returns None previously
asserted that mod_entry would be called with the rendered state.  That
behaviour was the root of the bug -- resync_feature_state was recreating a
FEATURE row that had been deleted by a concurrent deregister.  Update the
assertion to assert_not_called(), matching the new guard that returns early
when the entry is absent.  Add a further sub-case for a partial entry (row
exists but has no 'state' key, the TOCTOU artifact left by sync_feature_scope
racing with deregister): resync_feature_state must call set_entry(None) to
remove the ghost record rather than completing it with mod_entry.

test_sync_feature_scope_toctou: new test for the post-write TOCTOU guard in
sync_feature_scope.  Uses side_effect to return a live entry on the pre-write
check and a partial entry (no 'state' key) on the post-write check, simulating
the window where the FEATURE row is deleted and recreated by mod_entry during
the race.  sync_feature_scope must call set_entry(None) to clean it up.

Signed-off-by: david.zagury <davidza@nvidia.com>
@DavidZagury DavidZagury force-pushed the masster_stale_appextension branch from dd9a46e to b10757c Compare May 12, 2026 08:31
@DavidZagury DavidZagury requested a review from vivekrnv May 12, 2026 08:32
@DavidZagury DavidZagury changed the title Masster stale appextension [featured] fix stale FEATURE entry after concurrent package uninstall May 12, 2026
… entry.

Signed-off-by: david.zagury <davidza@nvidia.com>
@mssonicbld

Copy link
Copy Markdown

/azp run

@azure-pipelines

Copy link
Copy Markdown
Azure Pipelines successfully started running 1 pipeline(s).

@Sourabh-Kumar7

Copy link
Copy Markdown
Member

@saiarcot895 could you please help review? Thanks.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a race between featured and sonic-package-manager during package uninstall that could recreate a deleted FEATURE row in CONFIG_DB with partial fields (“ghost” entry), leading to inconsistent show feature status output and spurious error logs.

Changes:

  • Add post-write validation in sync_feature_scope() to detect and immediately delete a partially recreated FEATURE entry (missing state).
  • Harden resync_feature_state() to avoid recreating deleted entries and to delete partial entries missing state instead of “completing” them.
  • Update unit tests to cover missing/partial FEATURE row scenarios and the sync_feature_scope() TOCTOU cleanup behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
scripts/featured Adds guards/cleanup to prevent recreating deleted FEATURE entries during concurrent uninstall and suppresses erroneous log noise for state=None.
tests/featured/featured_test.py Extends test coverage for deleted/partial FEATURE entries and validates the new cleanup behavior.

Comment on lines +339 to +342
def test_sync_feature_scope_toctou(self):
"""sync_feature_scope must delete a partial entry it accidentally creates when
the FEATURE row is deleted by a concurrent deregister between the mod_entry
calls and the post-write existence check (TOCTOU)."""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants