fix: add config option to exclude language packages with file ownership overlap #4905

Open
kimjune01 wants to merge 4 commits into anchore:main from kimjune01:fix/exclude-language-overlap-4760
@kimjune01

Closes #4760.

Cause

When an OS package (deb, apk, rpm) installs a language package (Python, Ruby, npm, etc.) through the system package manager, syft catalogs both the OS package and the language package. A file ownership overlap relationship already exists between the two, but nothing used it to deduplicate them.

The existing exclude-binary-overlap-by-ownership option handles binary packages, but an equivalent for language packages was missing.

Fix

Adds exclude-language-overlap-by-ownership config option that removes language packages when they have a file ownership overlap relationship with an OS package. This mirrors the existing binary exclusion logic. The option defaults to false to avoid changing existing behavior.

The implementation follows the same pattern as ExcludeBinaryPackagesByFileOwnershipOverlap: iterate relationships, identify OS-parent/language-child pairs, and delete the child package.
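
For readers who have not seen the binary variant, the pattern above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code in this PR: `Package`, `Relationship`, `osTypes`, `languageTypes`, and `excludeLanguageOverlap` are hypothetical stand-ins for syft's real types and helpers, and the type sets are deliberately tiny.

```go
package main

import "fmt"

// Package and Relationship are hypothetical stand-ins for syft's types.
type Package struct {
	ID   string
	Name string
	Type string
}

// Relationship models an ownership overlap: From owns files that To also claims.
type Relationship struct {
	From, To string // package IDs
}

// Illustrative type sets only; the real lists live in syft.
var osTypes = map[string]bool{"deb": true, "apk": true, "rpm": true}
var languageTypes = map[string]bool{"python": true, "npm": true, "gem": true}

// excludeLanguageOverlap returns the packages that remain after removing every
// language package that is the child of an OS package in an overlap relationship.
func excludeLanguageOverlap(pkgs map[string]Package, rels []Relationship) map[string]Package {
	remove := map[string]bool{}
	for _, r := range rels {
		parent, okParent := pkgs[r.From]
		child, okChild := pkgs[r.To]
		if !okParent || !okChild {
			continue // dangling relationship: nothing to decide
		}
		if osTypes[parent.Type] && languageTypes[child.Type] {
			remove[child.ID] = true
		}
	}
	kept := make(map[string]Package, len(pkgs))
	for id, p := range pkgs {
		if !remove[id] {
			kept[id] = p
		}
	}
	return kept
}

func main() {
	pkgs := map[string]Package{
		"a": {ID: "a", Name: "python3-django", Type: "deb"},
		"b": {ID: "b", Name: "django", Type: "python"},
		"c": {ID: "c", Name: "libc6", Type: "deb"},
	}
	rels := []Relationship{{From: "a", To: "b"}, {From: "a", To: "c"}}
	kept := excludeLanguageOverlap(pkgs, rels)
	fmt.Println(len(kept)) // 2: the deb packages stay, the python child is dropped
}
```

Only the OS-parent/language-child pair triggers a deletion; the deb→deb pair is left alone, which matches the "OS → OS overlap, both kept" case in the tests below.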

Tests

Unit tests cover:

  • OS → language package overlap (deb→python, apk→npm, rpm→ruby) — child removed
  • Binary → language package overlap — both kept
  • OS → OS overlap — both kept
  • Language → language overlap — both kept

Signed-off-by: June Kim <kimjune01@gmail.com>

@kimjune01

Author

Pushed two test-only follow-ups (b4844d0, 64376c2): the original tests built pkg.Package literals without SetID(), so c.Package(r.From.ID()) returned nil and the assertion compared "" == "" — the suite passed without exercising the code under test. Added SetID() on the test packages and an assert.NotEqual(t, "", result) guard so the no-op can't recur. Verified by injecting a panic in the function body: pre-fix tests pass (proof of no-op), post-fix tests fail with the panic (proof of exercise).
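
To make the failure mode concrete, here is a reduced sketch of how the vacuous assertion arises. It does not use syft's types: `pkgT`, `collection`, and `childToExclude` are hypothetical, and the collection is assumed to hold packages under IDs the test literals never received.

```go
package main

import "fmt"

// pkgT and collection are hypothetical stand-ins for pkg.Package and the
// package collection; only the ID handling matters here.
type pkgT struct{ id, name string }

type collection struct{ byID map[string]pkgT }

func (c collection) lookup(id string) (pkgT, bool) {
	p, ok := c.byID[id]
	return p, ok
}

// childToExclude returns the child's ID, or "" when either lookup fails.
func childToExclude(c collection, fromID, toID string) string {
	if _, ok := c.lookup(fromID); !ok {
		return "" // early exit: the interesting logic is never reached
	}
	child, ok := c.lookup(toID)
	if !ok {
		return ""
	}
	return child.id
}

func main() {
	c := collection{byID: map[string]pkgT{
		"p1": {id: "p1", name: "python3-django"},
		"c1": {id: "c1", name: "django"},
	}}

	// Literals without IDs, as in the original tests.
	parent := pkgT{name: "python3-django"}
	child := pkgT{name: "django"}

	got := childToExclude(c, parent.id, child.id)
	fmt.Println(got == child.id) // true, but only because "" == ""
	fmt.Println(got != "")       // false: the non-empty guard exposes the no-op

	// With IDs set, the same assertion actually exercises the function.
	parent.id, child.id = "p1", "c1"
	got = childToExclude(c, parent.id, child.id)
	fmt.Println(got == child.id, got != "") // true true
}
```

The non-empty guard is what turns a silent pass into a visible failure when the IDs are missing.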

kimjune01 added 3 commits May 18, 2026 12:29
…ip overlap

Adds a new configuration option `exclude-language-overlap-by-ownership` that
allows users to exclude language packages (Python, NPM, Ruby, etc.) from the
SBOM when they overlap with OS packages (deb, rpm, apk).

This prevents duplicate entries for packages installed via system package
managers that are also detected by language-specific catalogers.

Example: python3-django deb package vs. django Python package

The feature is disabled by default to maintain backward compatibility.

Resolves anchore#4760

Signed-off-by: June Kim <kimjune01@gmail.com>
Previously the table-driven test built pkg.Package literals without
calling SetID(), leaving every package ID as the empty string. The
collection generated its own ID on Add() but the original literals
remained empty, so the relationship From/To IDs were all '' and
c.Package('') returned nil. The function exited at the nil check
without ever evaluating identifyOverlappingLanguageRelationship, and
the assertion compared two empty strings — the test passed
unconditionally.

Injecting panic('reached') before the return left the tests passing
(proof of no-op). With SetID() called and the dynamic logic-mirror
dropped from the assertion, the injected panic now fires, so the
function is actually exercised.

Signed-off-by: June Kim <kimjune01@gmail.com>
Same root cause as the prior commit: pkg.Package literals had no IDs,
so child.ID() == "" and the assertion compared empty to empty when
the function correctly returned "" for the no-match case AND when it
should have returned the child ID — both branches passed trivially.
Verified by panic injection: prior to this fix the test passed even
when the function panicked at entry.

Signed-off-by: June Kim <kimjune01@gmail.com>
@kimjune01 kimjune01 force-pushed the fix/exclude-language-overlap-4760 branch from 64376c2 to a0b55b8 Compare May 18, 2026 19:42
@Dashtid

Dashtid commented Jun 11, 2026

Contributor

The languageCatalogerTypes set covers 8 types but omits some commonly-distro-packaged language types — notably JavaPkg, GoModulePkg, RustPkg, and SwiftPkg. The same OS↔language overlap exists for these:

  • Debian's golang-github-* packages install Go source that the Go cataloger also picks up
  • Debian's rust-* packages install Rust source that the Rust cataloger also picks up
  • libreoffice-java-common and similar install JARs that the Java cataloger reports

This could be intentional — e.g. Java/Go/Rust have different overlap semantics worth handling separately — or just an early-iteration scoping. If the latter, it might be worth either expanding the list to cover all language types, or codifying a criterion for inclusion (pkg.AllPkgs minus OS + binary + non-language types like GithubActionPkg/TerraformPkg/WordpressPluginPkg). Either way, a short comment on the slice explaining the inclusion rule would make the intent legible for future cataloger PRs that add new Types.

…types

languageCatalogerTypes was hand-maintained and missed installed-package language
types that an OS package can subsume on file-ownership overlap: cocoapods, conan,
dart-pub, hackage, hex, opam, php-pear, swift, swipl.

Define the inclusion rule explicitly and enforce it with a test that derives the
expected set from cataloger capabilities, so a new language cataloger fails CI
until it is classified.

Exclude types whose catalogers extract many components from a single OS-owned
binary or fat archive (go-module, rust-crate, dotnet, graalvm-native-image,
java-archive): OS ownership of the container does not make the embedded
components redundant, so deleting them would drop distinct packages from the SBOM.

The rule keys on pkg.Type, a coarse proxy; a follow-up could key on per-package
installed-vs-declared evidence rather than type.

Signed-off-by: June Kim <kimjune01@gmail.com>
@kimjune01

kimjune01 commented Jun 12, 2026

Author

I derived the set from syft's own language-tagged catalogers instead of hand-listing. That adds the installed-package types below — but also excludes the binary/archive extractors, since deleting those on file-overlap loses real SBOM data. A guard test re-derives the set from capabilities.yaml and fails CI when a new language cataloger isn't classified.

| types | decision | why |
| --- | --- | --- |
| cocoapods, conan, dart-pub, hackage, hex, opam, php-pear, swift, swipl | added | OS overlap here means the distro repackaged the same unit, so dedup is safe |
| java-archive, go-module, rust-crate, dotnet, graalvm-native-image | excluded | these extract many components from one OS-owned binary/archive: the OS owns the container, not the components, so deleting drops distinct packages (an rpm owning a Go binary would erase its built-in modules) |
| php-pecl | kept (exception) | not language-tagged anymore (deprecated cataloger), but its packages still overlap until v2.0 |
Tradeoff: excluding whole types means safe go.mod/Cargo.lock overlaps won't be deduped either — better to under-delete than lose a real component.

Limits: the rule keys on pkg.Type, a coarse proxy. Some included types are lockfile-derived (cocoapods → Podfile.lock, swift → Package.resolved), so an OS package owning one of those manifests could over-delete declared deps — same issue, rarer. The precise fix is per-package installed-vs-declared evidence. Only unit-tested, not validated against real images for the new ecosystems. #4974 tracks the evidence-vs-type question plus a few tagging inconsistencies (php-pecl on a deprecated/untagged cataloger, nix as both OS and language, dotnet-deps-binary missing the binary selector go/rust carry).
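
For anyone skimming, the guard described in this comment has roughly the following shape. This is a simplified sketch, not the test in this PR: the `classified` table and `unclassified` helper are hypothetical, and the language-tagged list is hardcoded here, whereas the real test derives it from capabilities.yaml.

```go
package main

import (
	"fmt"
	"sort"
)

type decision int

const (
	dedupe decision = iota // safe to drop on OS ownership overlap
	keep                   // extractor types: never drop on overlap
)

// classified is a hypothetical, abbreviated classification table.
var classified = map[string]decision{
	"swift":        dedupe,
	"hex":          dedupe,
	"go-module":    keep,
	"java-archive": keep,
}

// unclassified returns the language-tagged types with no explicit decision,
// sorted so a failure message is stable.
func unclassified(languageTagged []string) []string {
	var missing []string
	for _, t := range languageTagged {
		if _, ok := classified[t]; !ok {
			missing = append(missing, t)
		}
	}
	sort.Strings(missing)
	return missing
}

func main() {
	// "new-lang" stands in for a type added by a future cataloger PR.
	tagged := []string{"swift", "go-module", "new-lang"}
	fmt.Println(unclassified(tagged)) // [new-lang]
}
```

A test asserting that `unclassified` returns nothing would fail as soon as a new language-tagged type shows up without a decision, which is the CI behaviour described above.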

@Dashtid

Dashtid commented Jun 13, 2026

Contributor

The extractor-vs-installed split is the right framing; my initial cut missed it. Deriving the set from capabilities.yaml plus a guard test that catches unclassified additions is cleaner than the hand-listed alternative.

Spot-checked one #4974 item locally: LanguageByName at syft/pkg/language.go:71 returns UnknownLanguage for string(Rpkg) ("R-package"), string(LuaRocksPkg) ("lua-rocks"), and string(PhpPeclPkg) ("php-pecl") — the switch matches "r"/TypeCran, string(Lua)/TypeLuaRocks, and the PHP cases without PhpPeclPkg, so the const-value strings fall through. TestLanguageByName covers the lowercase input "r" but not string(Rpkg). Worth a narrow PR independent of the evidence-vs-type work.
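
A reduced illustration of the fall-through, for context. This is not syft's `LanguageByName`: `languageByName` below is a hypothetical stand-in whose case labels are placeholders, and only the three type strings ("R-package", "lua-rocks", "php-pecl") come from the spot-check above.

```go
package main

import "fmt"

const unknownLanguage = ""

// languageByName is a simplified stand-in: it matches short language names
// but has no case for the package-type constant strings.
func languageByName(name string) string {
	switch name {
	case "r": // placeholder label
		return "R"
	case "lua": // placeholder label
		return "lua"
	case "php": // placeholder label
		return "php"
	default:
		return unknownLanguage
	}
}

func main() {
	// Type strings as reported above; each falls through to the default.
	for _, typ := range []string{"R-package", "lua-rocks", "php-pecl"} {
		fmt.Printf("%q unknown=%v\n", typ, languageByName(typ) == unknownLanguage)
	}
	fmt.Println(languageByName("r") != unknownLanguage) // true: lowercase name matches
}
```

A table-driven test that feeds the constant values themselves, not just the short names, would catch this class of gap.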


Development

Successfully merging this pull request may close these issues.

OS package (deb) components duplicated as pypi components

2 participants