Skip to content

Remove client-side ug- profile merge once storage server-side merge (storage#319) is deployed #830

@matthyx

Description

@matthyx

Summary

Once kubescape/storage#319 (server-side merge of user-managed ug- ApplicationProfile / NetworkNeighborhood into the consolidated ContainerProfile) is deployed, node-agent's client-side ug- fetch + projection in containerprofilecache becomes redundant. storage#319 serves the already-merged CP on GET, so node-agent re-fetching and re-projecting the same ug- overlay is a duplicate of work the server now does.

This issue tracks removing that client-side ug- merge — gated on storage#319 being deployed — and the constraints that make it non-trivial.

Context: this is the node-agent half of the consolidation introduced by storage#319; the client-side merge was added in #788 (port of projection.go).

Current behavior (post-#788)

ContainerProfileCacheImpl.tryPopulateEntry (pkg/objectcache/containerprofilecache/containerprofilecache.go) performs two distinct overlay passes:

  1. User-managed (ug-) pass — the one storage#319 subsumes.

    • Fetches userManagedAP / userManagedNN by the well-known name UserApplicationProfilePrefix + workloadName (ug-<workload>), ~lines 331–369.
    • Applied via projectUserProfiles(cp, userManagedAP, userManagedNN, …) (~line 462), the "user-managed projection pass".
    • Covered by Test_12_MergingProfilesTest / Test_13_MergingNetworkNeighborhoodTest.
  2. User-defined overlay pass — NOT covered by storage#319.

    • Fetches userAP / userNN by the name in pod label UserDefinedProfileMetadataKeyoverlayName (~lines 393–422).
    • Applied via buildEntry(cp, userAP, userNN, …) (~line 470).

storage#319 keys only off ug-<workloadSlug>, so it replaces pass (1) only.

Why a double-merge is currently harmless (so this is cleanup, not a fix)

With storage#319 live, node-agent GETs an already-merged CP and then re-projects ug- on top of it. I checked projection.go: capabilities/execs/opens/syscalls are raw-appended, but the effective semantics downstream are allow-list / set-union / user-wins, so duplicate entries don't change any enforcement decision. Net effect of the double-merge is redundant RPCs (two extra GETs per container populate/refresh tick) and minor bloat — not a correctness regression. Removing pass (1) eliminates that redundant work.

Proposed change

Remove only the user-managed (ug-) pass:

  • Drop the ug-<workload> userManagedAP / userManagedNN fetches in tryPopulateEntry and the corresponding refresh in reconciler.go.
  • Drop the projectUserProfiles(...) user-managed projection call; keep buildEntry(cp, userAP, userNN, …).
  • Keep the user-defined overlay pass (2) intact — it is a separate, still-client-side feature.
  • Clean up now-dead surface: UserManagedAPRV / UserManagedNNRV tracking fields, shouldLogOptionalUserManagedFetchError, user-managed branches of emitOverlayMetrics, the userManagedAP/NN arms of the "need SOMETHING to cache" / synthetic-CP fallback, and the reconciler's ug--RV-change re-projection trigger.
  • projection.go / projectUserProfiles itself likely still serves pass (2) — confirm before deleting; remove only if pass (2) doesn't use it.

Constraints / gotchas

  1. Order dependency (release gate). Removing pass (1) while a cluster still runs pre-Feature/path #319 storage means the merged CP isn't served and ug- exceptions silently vanish → enforcement gap. This change must only ship after storage#319 is deployed. Recommend a minimum-storage-version floor (or a feature flag defaulting off until the storage version is confirmed) rather than relying on rollout ordering alone.
  2. Don't remove the user-defined (label) overlay (pass 2). Open question for storage: is the label-driven UserDefinedProfileMetadataKey overlay also meant to move server-side eventually? If yes, that's a separate follow-up; if no, pass (2) stays in node-agent indefinitely.
  3. storage#319 merges on GET only, not List/Watch. node-agent is safe because its CP cache is fully GET-driven (container callbacks + reconciler GetContainerProfile; no CP informer/watch). But this is the load-bearing assumption: if node-agent ever moves CP reads to a list-watch informer, the server-side merge would be bypassed and ug- overlays would disappear. Worth a code comment near the CP GET noting the dependency. (Separately flagged to the storage side as a request to make List/Watch merged-consistent or document the GET-only contract.)

Validation

  • Test_12_MergingProfilesTest / Test_13_MergingNetworkNeighborhoodTest should pass with ug- enforcement now coming from the server-merged CP (point them at a storage build with Feature/path #319, or stub a merged CP).
  • Confirm rules still alert on events absent from the merged base+ug- profile.
  • Confirm the user-defined (label) overlay path still works unchanged.
  • Confirm the redundant ug- GETs are gone (RPC-count / metrics).

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    Status
    Accepted

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions