feat(collect): add --experimental-strategy three-phase collection pipeline#219

Open
sdavtaker wants to merge 6 commits into main from feat/ospo-689-experimental-two-phase-strategy

@sdavtaker sdavtaker commented Jun 22, 2026

Member

Summary

  • Introduces --experimental-strategy flag for generate-sbom that separates dependency discovery from metadata extraction into three phases:
    • Phase 0 (pre-finders, once): GitHubSbomMetadataCollectionStrategy — already performs full transitive closure via GitHub's API; must not iterate or it re-fetches each dep's entire dep tree
    • Phase 1 (finder fixpoint loop, up to 5 iterations): PypiMetadataCollectionStrategy, GoPkgMetadataCollectionStrategy, NpmMetadataCollectionStrategy — discover one level at a time and benefit from iteration
    • Phase 2 (enricher cascade, once): ScanCodeToolkitMetadataCollectionStrategy, GitHubRepositoryMetadataCollectionStrategy, CleanupCopyrightMetadataStrategy
  • Adds DependencyFinderStrategy and MetadataEnricherStrategy sub-ABCs for future experimental strategy classes
  • Adds ThreePhaseMetadataCollector with pre_finders, finders, and enrichers parameters
  • When --experimental-strategy is combined with --ecosystem X, only the ecosystem-relevant finder is active by default; --no-* flags still override
  • Existing behavior (without --experimental-strategy) is completely unchanged
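
The three phases described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: strategies are modeled as plain functions over a dict of packages, and `collect_three_phase`, `toy_finder`, `toy_enricher`, and `GRAPH` are hypothetical names invented for the example. Only the phase order, the `pre_finders`/`finders`/`enrichers` parameter names, and the cap of 5 iterations come from the description.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins: a "strategy" here is just a function from a
# package map to a package map. The real strategy classes have a richer API.
Packages = Dict[str, dict]
Strategy = Callable[[Packages], Packages]

MAX_ITERATIONS = 5  # the PR caps the finder loop at 5 iterations


def collect_three_phase(
    seed: Packages,
    pre_finders: List[Strategy],
    finders: List[Strategy],
    enrichers: List[Strategy],
) -> Packages:
    packages = dict(seed)

    # Phase 0: run each pre-finder exactly once on the root seed.
    for strategy in pre_finders:
        packages = strategy(packages)

    # Phase 1: repeat the finders until no new package appears (fixpoint)
    # or the iteration cap is reached.
    for _ in range(MAX_ITERATIONS):
        known = set(packages)
        for strategy in finders:
            packages = strategy(packages)
        if set(packages) == known:
            break

    # Phase 2: run each enricher once on the closed dependency set.
    for strategy in enrichers:
        packages = strategy(packages)
    return packages


# Toy dependency graph used only for illustration.
GRAPH = {"root": ["a"], "a": ["b"], "b": []}


def toy_finder(packages: Packages) -> Packages:
    # Discovers one level of dependencies per call.
    result = dict(packages)
    for name in list(packages):
        for dep in GRAPH.get(name, []):
            result.setdefault(dep, {})
    return result


def toy_enricher(packages: Packages) -> Packages:
    # Attaches placeholder metadata to every known package.
    return {name: {**meta, "license": "UNKNOWN"} for name, meta in packages.items()}


result = collect_three_phase({"root": {}}, [], [toy_finder], [toy_enricher])
print(sorted(result))
```

In this toy run the finder needs two iterations to reach `b` and a third to confirm that nothing new appeared, after which the enricher sees all three packages.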

Motivation

Fixes the structural limitation in OSPO-689: in the classic single-pass cascade, a dependency discovered by strategy N is never seen by strategies 1..N-1. The new architecture closes the transitive dependency set before running enrichers.
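
To make the limitation concrete, here is a small toy model of a single-pass cascade. The two strategies (`license_lookup`, `dependency_discovery`) and the `"MIT"` default are invented for illustration and do not correspond to any real strategy in the repository.

```python
from typing import Dict, Optional

Packages = Dict[str, Optional[str]]  # package name -> license (None if unknown)


def license_lookup(packages: Packages) -> Packages:
    # Strategy 1 (hypothetical): fills in a license for every package it sees.
    return {name: lic or "MIT" for name, lic in packages.items()}


def dependency_discovery(packages: Packages) -> Packages:
    # Strategy 2 (hypothetical): discovers one new dependency.
    return {**packages, "late-dep": None}


def single_pass(packages: Packages) -> Packages:
    # Classic cascade: each strategy runs exactly once, in order.
    for strategy in (license_lookup, dependency_discovery):
        packages = strategy(packages)
    return packages


def discover_then_enrich(packages: Packages) -> Packages:
    # Reordered: close the dependency set first, then enrich once.
    return license_lookup(dependency_discovery(packages))


classic = single_pass({"root": None})
reordered = discover_then_enrich({"root": None})
print(classic)
print(reordered)
```

In the single-pass run, `late-dep` is added after the license strategy has already finished, so its license stays unknown; running discovery to completion first lets the enricher see it.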

The pre_finders phase was added after experimentation revealed that placing GitHubSbomMetadataCollectionStrategy in the fixpoint loop caused an explosion: re-running it on scancode-toolkit fetched that repo's full dep tree, pulling in mapbox-common-ios and 2400+ unrelated iOS/npm packages (2558 total vs 109 expected). Moving it to phase 0 fixed the output — experimental and classic now produce identical results for DataDog/dd-license-attribution and apigentools with no timing regression (~23s vs ~22s).

Test plan

  • 485 unit tests pass: pipenv run pytest tests/unit/
  • 95% coverage: pipenv run pytest --cov=src/dd_license_attribution --cov-fail-under=95
  • mypy clean: pipenv run mypy src/ tests/
  • Formatting: pipenv run isort --check-only src/ tests/ && pipenv run black --check src/ tests/
  • Smoke — classic vs experimental on a GitHub repo produce identical output:
    GITHUB_TOKEN=$(gh auth token) ddla generate-sbom https://github.com/DataDog/dd-license-attribution > /tmp/classic.csv
    GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --experimental-strategy https://github.com/DataDog/dd-license-attribution > /tmp/exp.csv
    diff /tmp/classic.csv /tmp/exp.csv  # expect: empty
  • Smoke — ecosystem mode produces identical output:
    GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --ecosystem python apigentools > /tmp/classic_pypi.csv
    GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --ecosystem python --experimental-strategy apigentools > /tmp/exp_pypi.csv
    diff /tmp/classic_pypi.csv /tmp/exp_pypi.csv  # expect: empty
  • CI smoke steps validate --experimental-strategy against apigentools (GitHub URL and --ecosystem python)

Separates dependency discovery (finder fixpoint loop, up to 5
iterations until stable) from metadata extraction (enricher cascade,
runs once on the complete set). Gates the new architecture behind
--experimental-strategy; existing behavior is unchanged without it.

When combined with --ecosystem, only the ecosystem-relevant finder is
enabled by default; --no-* flags still override.

Adds DependencyFinderStrategy and MetadataEnricherStrategy sub-ABCs,
TwoPhaseMetadataCollector, and 478 unit tests at 95% coverage.

Fixes class name mismatch in _FINDER_STRATEGY_NAMES that caused Go and
PyPI strategies to be misclassified as enrichers.
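
The contents of _FINDER_STRATEGY_NAMES are not shown in this conversation, so the sketch below uses made-up class names to illustrate how a name-based lookup can silently misclassify a strategy, and one cheap guard against it. `FINDER_NAMES`, `classify`, `missing_names`, and the three classes are all hypothetical.

```python
from typing import Iterable, List

# Hypothetical set of finder class names. "PyPIStrategy" is deliberately
# misspelled relative to the class below to reproduce a name mismatch.
FINDER_NAMES = frozenset({"PyPIStrategy", "NpmStrategy"})


class PypiStrategy:
    pass


class NpmStrategy:
    pass


class ScanCodeStrategy:
    pass


def classify(strategy: object) -> str:
    # Anything whose class name is not in the set falls through to "enricher".
    return "finder" if type(strategy).__name__ in FINDER_NAMES else "enricher"


def missing_names(names: Iterable[str], classes: Iterable[type]) -> List[str]:
    """Return names in the set that match no known class (a cheap guard test)."""
    known = {cls.__name__ for cls in classes}
    return sorted(set(names) - known)


print(classify(PypiStrategy()))  # misclassified because of the casing mismatch
print(missing_names(FINDER_NAMES, [PypiStrategy, NpmStrategy, ScanCodeStrategy]))
```

A unit test asserting that `missing_names(...)` is empty would catch this kind of drift; classifying by `isinstance` against the new sub-ABCs would avoid string matching altogether.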
…n explosion

GitHub's SBOM API already performs full transitive closure for a repository.
Placing it in the fixpoint loop caused it to be re-invoked on every discovered
dependency, fetching each dep's full dep tree and producing an unbounded closure
(2558 packages instead of 109 for dd-license-attribution itself).

Introduces a pre_finders parameter to TwoPhaseMetadataCollector: strategies
here run once on the root seed and their output feeds into the fixpoint loop.
GitHubSbomMetadataCollectionStrategy is routed to pre_finders; ecosystem
finders (PyPI, GoPkg, npm) remain in the fixpoint loop.

Verified: experimental output now matches classic output exactly for
DataDog/dd-license-attribution.
@sdavtaker sdavtaker requested a review from a team as a code owner June 22, 2026 15:36
…dataCollector

The collector has three distinct phases (Phase 0 pre-finders, Phase 1
finder fixpoint loop, Phase 2 enricher cascade), so the name
TwoPhaseMetadataCollector was misleading. Rename class, file, exports,
CLI references, help text, tests, CHANGELOG, and README to match.

No behaviour change.

Two end-to-end tests for ThreePhaseMetadataCollector via the CLI:
- DataDog/apigentools in GitHub repo mode
- apigentools in --ecosystem python mode

Each test runs both the classic and experimental strategies and asserts
that both exit cleanly, that both return at least a minimum number of
packages, and that experimental does not drop any package the classic found.

Uses subprocess (not CliRunner) so stderr log lines don't corrupt the
CSV stdout. Placed in tests/integration/ alongside future full-CLI tests.
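
A rough sketch of the comparison these tests describe, under stated assumptions: the CSV has a header row and the first column identifies the package (the real column layout is not shown here), and `run_cli`, `package_names`, and `dropped_packages` are hypothetical helper names. The demo command at the end only spawns a Python interpreter to show that stdout and stderr are captured separately; it does not invoke the real CLI.

```python
import csv
import io
import subprocess
import sys
from typing import List, Set


def run_cli(args: List[str]) -> str:
    """Run a command and return only its stdout; stderr is captured separately."""
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout


def package_names(csv_text: str, column: int = 0) -> Set[str]:
    # Assumes the first row is a header and `column` holds the package identifier.
    rows = list(csv.reader(io.StringIO(csv_text)))
    return {row[column] for row in rows[1:] if row}


def dropped_packages(classic_csv: str, experimental_csv: str) -> Set[str]:
    """Packages present in the classic output but missing from the experimental one."""
    return package_names(classic_csv) - package_names(experimental_csv)


# Demo: a child process writes a CSV header to stdout and a log line to stderr.
demo = run_cli(
    [sys.executable, "-c", "import sys; print('name,license'); print('log line', file=sys.stderr)"]
)
print(repr(demo))
```

An integration test built on these helpers would pass the two CLI outputs to `dropped_packages` and assert that the result is empty.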
…flow

Replaces the pytest-based tests/integration/ with two workflow steps
that follow the same pattern as the existing integration checks:
- GitHub repo mode against DataDog/apigentools
- --ecosystem python mode against apigentools

Both steps validate exit code, CSV header, and a minimum package count.
@sdavtaker sdavtaker requested a review from arapulido June 24, 2026 13:54
@arapulido

Collaborator

@codex review

@sdavtaker sdavtaker changed the title from "feat(collect): add --experimental-strategy two-phase collection pipeline" to "feat(collect): add --experimental-strategy three-phase collection pipeline" Jun 24, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a76159e31


…erations

OverrideCollectionStrategy.__init__ assigned override_rules and unused_rules
to the same list object. Any call to unused_rules.remove() therefore also
removed the rule from override_rules, so a REMOVE rule silently stopped
firing after its first match.

In the classic single-pass cascade this never mattered — each rule matched
at most once per run. In the ThreePhaseMetadataCollector fixpoint loop,
finders may re-add a dep on a later iteration while the REMOVE rule is
already gone, leaving the dep in the final SBOM despite the override spec
explicitly suppressing it.

Fix: make unused_rules a shallow copy (list(override_rules)) so tracking
removals never mutate the active rules list. Guard the unused_rules.remove()
calls with "if rule in self.unused_rules" so re-firing rules do not crash.
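
The aliasing problem and the fix can be reproduced in isolation. The two classes below are simplified stand-ins, not the real OverrideCollectionStrategy: rules are plain strings and `fire` is a hypothetical method that models a rule matching.

```python
from typing import List


class AliasedRules:
    """Reproduces the bug: both attributes point at the same list."""

    def __init__(self, override_rules: List[str]) -> None:
        self.override_rules = override_rules
        self.unused_rules = override_rules  # same object, not a copy

    def fire(self, rule: str) -> bool:
        if rule not in self.override_rules:
            return False
        self.unused_rules.remove(rule)  # also removes it from override_rules
        return True


class CopiedRules:
    """Applies the fix: track unused rules in a shallow copy."""

    def __init__(self, override_rules: List[str]) -> None:
        self.override_rules = override_rules
        self.unused_rules = list(override_rules)  # independent list

    def fire(self, rule: str) -> bool:
        if rule not in self.override_rules:
            return False
        # Guard so a rule that fires again does not raise ValueError.
        if rule in self.unused_rules:
            self.unused_rules.remove(rule)
        return True


aliased = AliasedRules(["remove-dep"])
aliased_results = [aliased.fire("remove-dep"), aliased.fire("remove-dep")]

copied = CopiedRules(["remove-dep"])
copied_results = [copied.fire("remove-dep"), copied.fire("remove-dep")]

print(aliased_results, copied_results, copied.unused_rules)
```

With the aliased list, the second match finds no rule left to apply; with the copy, the rule keeps firing on every iteration while still being recorded as used.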

2 participants