feat(collect): add --experimental-strategy three-phase collection pipeline by sdavtaker · Pull Request #219 · DataDog/dd-license-attribution

sdavtaker · 2026-06-22T15:36:27Z

Summary

- Introduces the --experimental-strategy flag for generate-sbom, which separates dependency discovery from metadata extraction into three phases:
  - Phase 0 (pre-finders, once): GitHubSbomMetadataCollectionStrategy. It already performs full transitive closure via GitHub's API, so it must not iterate; otherwise it re-fetches each dependency's entire dependency tree
  - Phase 1 (finder fixpoint loop, up to 5 iterations): PypiMetadataCollectionStrategy, GoPkgMetadataCollectionStrategy, NpmMetadataCollectionStrategy. These discover one level at a time and benefit from iteration
  - Phase 2 (enricher cascade, once): ScanCodeToolkitMetadataCollectionStrategy, GitHubRepositoryMetadataCollectionStrategy, CleanupCopyrightMetadataStrategy
- Adds DependencyFinderStrategy and MetadataEnricherStrategy sub-ABCs for future experimental strategy classes
- Adds ThreePhaseMetadataCollector with pre_finders, finders, and enrichers parameters
- When --experimental-strategy is combined with --ecosystem X, only the ecosystem-relevant finder is active by default; --no-* flags still override
- Existing behavior (without --experimental-strategy) is completely unchanged
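For readers who want the control flow at a glance, here is a minimal sketch of the three phases described above. It is not the PR's implementation: the `augment` method, the `Packages` map, `ThreePhaseSketch`, `OneLevelFinder`, and `PlaceholderLicenseEnricher` are hypothetical names chosen for illustration, and the two sub-ABCs are modeled as plain marker classes.

```python
from abc import ABC, abstractmethod

# Hypothetical model: package name -> metadata fields.
Packages = dict[str, dict[str, str]]


class Strategy(ABC):
    @abstractmethod
    def augment(self, packages: Packages) -> Packages:
        """Return an updated package map."""


class DependencyFinderStrategy(Strategy):
    """Assumed marker base: strategies that may add packages."""


class MetadataEnricherStrategy(Strategy):
    """Assumed marker base: strategies that only fill in fields."""


class ThreePhaseSketch:
    def __init__(self, pre_finders, finders, enrichers, max_iterations=5):
        self.pre_finders = pre_finders
        self.finders = finders
        self.enrichers = enrichers
        self.max_iterations = max_iterations

    def collect(self, root: str) -> Packages:
        packages: Packages = {root: {}}
        # Phase 0: pre-finders run exactly once on the root seed.
        for strategy in self.pre_finders:
            packages = strategy.augment(packages)
        # Phase 1: repeat the finders until no new package appears,
        # or until the iteration cap is reached.
        for _ in range(self.max_iterations):
            known_before = set(packages)
            for strategy in self.finders:
                packages = strategy.augment(packages)
            if set(packages) == known_before:
                break
        # Phase 2: enrichers run once on the closed set.
        for strategy in self.enrichers:
            packages = strategy.augment(packages)
        return packages


class OneLevelFinder(DependencyFinderStrategy):
    """Toy finder: adds only the direct dependencies of known packages."""

    def __init__(self, graph):
        self.graph = graph

    def augment(self, packages):
        updated = dict(packages)
        for name in packages:
            for dep in self.graph.get(name, []):
                updated.setdefault(dep, {})
        return updated


class PlaceholderLicenseEnricher(MetadataEnricherStrategy):
    """Toy enricher: sets a placeholder license on every package."""

    def augment(self, packages):
        return {name: {**fields, "license": "UNKNOWN"}
                for name, fields in packages.items()}


collector = ThreePhaseSketch(
    pre_finders=[],
    finders=[OneLevelFinder({"app": ["a"], "a": ["b"], "b": ["c"]})],
    enrichers=[PlaceholderLicenseEnricher()],
)
result = collector.collect("app")
```

In this toy run the finder needs three iterations to reach `c` and a fourth to confirm stability, which stays inside the cap of five; the enricher then sees all four packages at once.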

Motivation

Fixes the structural limitation in OSPO-689: in the classic single-pass cascade, a dependency discovered by strategy N is never seen by strategies 1..N-1. The new architecture closes the transitive dependency set before running enrichers.
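The ordering limitation can be reproduced with a toy model. In the sketch below, each strategy is a hypothetical function mapping one known package to the dependencies it can discover; `pypi_like`, `go_like`, and the package names are invented for illustration and do not correspond to the real strategy classes.

```python
def one_pass(strategies, known):
    """Run every strategy once, in order, over the packages known so far."""
    known = set(known)
    for discover in strategies:
        # sorted() takes a snapshot, so each strategy only sees
        # packages that were known when its turn started.
        for pkg in sorted(known):
            known |= discover(pkg)
    return known


def fixpoint(strategies, seed, max_iterations=5):
    """Repeat one_pass until the set of known packages stops growing."""
    known = {seed}
    for _ in range(max_iterations):
        updated = one_pass(strategies, known)
        if updated == known:
            break
        known = updated
    return known


def pypi_like(pkg):
    # Strategy 1 is the only one that knows lib-b depends on lib-c.
    return {"root": {"lib-a"}, "lib-b": {"lib-c"}}.get(pkg, set())


def go_like(pkg):
    # Strategy 2 discovers lib-b, but only after strategy 1 has run.
    return {"lib-a": {"lib-b"}}.get(pkg, set())


strategies = [pypi_like, go_like]
single = one_pass(strategies, {"root"})
closed = fixpoint(strategies, "root")
```

Here `single` lacks `lib-c`, because the first strategy had already run by the time the second one discovered `lib-b`, while `closed` contains all four packages.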

The pre_finders phase was added after experimentation revealed that placing GitHubSbomMetadataCollectionStrategy in the fixpoint loop caused an explosion: re-running it on scancode-toolkit fetched that repo's full dep tree, pulling in mapbox-common-ios and 2400+ unrelated iOS/npm packages (2558 total vs 109 expected). Moving it to phase 0 fixed the output — experimental and classic now produce identical results for DataDog/dd-license-attribution and apigentools with no timing regression (~23s vs ~22s).

Test plan

485 unit tests pass: pipenv run pytest tests/unit/
95% coverage: pipenv run pytest --cov=src/dd_license_attribution --cov-fail-under=95
mypy clean: pipenv run mypy src/ tests/
Formatting: pipenv run isort --check-only src/ tests/ && pipenv run black --check src/ tests/

Smoke — classic vs experimental on a GitHub repo produce identical output:

GITHUB_TOKEN=$(gh auth token) ddla generate-sbom https://github.com/DataDog/dd-license-attribution > /tmp/classic.csv
GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --experimental-strategy https://github.com/DataDog/dd-license-attribution > /tmp/exp.csv
diff /tmp/classic.csv /tmp/exp.csv  # expect: empty

Smoke — ecosystem mode produces identical output:

GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --ecosystem python apigentools > /tmp/classic_pypi.csv
GITHUB_TOKEN=$(gh auth token) ddla generate-sbom --ecosystem python --experimental-strategy apigentools > /tmp/exp_pypi.csv
diff /tmp/classic_pypi.csv /tmp/exp_pypi.csv  # expect: empty

CI smoke steps validate --experimental-strategy against apigentools (GitHub URL and --ecosystem python)

Separates dependency discovery (finder fixpoint loop, up to 5 iterations until stable) from metadata extraction (enricher cascade, runs once on the complete set). Gates the new architecture behind --experimental-strategy; existing behavior is unchanged without it. When combined with --ecosystem, only the ecosystem-relevant finder is enabled by default; --no-* flags still override. Adds DependencyFinderStrategy and MetadataEnricherStrategy sub-ABCs, TwoPhaseMetadataCollector, and 478 unit tests at 95% coverage. Fixes class name mismatch in _FINDER_STRATEGY_NAMES that caused Go and PyPI strategies to be misclassified as enrichers.

…n explosion GitHub's SBOM API already performs full transitive closure for a repository. Placing it in the fixpoint loop caused it to be re-invoked on every discovered dependency, fetching each dep's full dep tree and producing an unbounded closure (2558 packages instead of 109 for dd-license-attribution itself). Introduces a pre_finders parameter to TwoPhaseMetadataCollector: strategies here run once on the root seed and their output feeds into the fixpoint loop. GitHubSbomMetadataCollectionStrategy is routed to pre_finders; ecosystem finders (PyPI, GoPkg, npm) remain in the fixpoint loop. Verified: experimental output now matches classic output exactly for DataDog/dd-license-attribution.

…dataCollector The collector has three distinct phases (Phase 0 pre-finders, Phase 1 finder fixpoint loop, Phase 2 enricher cascade), so the name TwoPhaseMetadataCollector was misleading. Rename class, file, exports, CLI references, help text, tests, CHANGELOG, and README to match. No behaviour change.

Two end-to-end tests for ThreePhaseMetadataCollector via the CLI:
- DataDog/apigentools in GitHub repo mode
- apigentools in --ecosystem python mode

Each test runs both the classic and experimental strategies and asserts that both exit cleanly, that each returns at least a minimum number of packages, and that the experimental strategy does not drop any package the classic one found. Uses subprocess (not CliRunner) so stderr log lines don't corrupt the CSV stdout. Placed in tests/integration/ alongside future full-CLI tests.
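A rough sketch of the "experimental does not drop any package" check follows. It assumes the first CSV column identifies the package and that the first row is a header; the helper names `run_cli`, `package_names`, and `dropped_packages` are hypothetical and not taken from the PR.

```python
import csv
import io
import subprocess


def run_cli(args):
    """Run a command and return stdout only.

    stderr is captured separately, so log lines cannot mix into the CSV.
    check=True raises CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


def package_names(csv_text):
    """Return the set of values in the first column, skipping the header."""
    rows = csv.reader(io.StringIO(csv_text))
    next(rows, None)  # assumed header row
    return {row[0] for row in rows if row}


def dropped_packages(classic_csv, experimental_csv):
    """Packages present in the classic output but missing from experimental."""
    return package_names(classic_csv) - package_names(experimental_csv)


# Illustrative data only; real output would come from run_cli(...).
classic_csv = "component,license\na,MIT\nb,Apache-2.0\n"
experimental_csv = "component,license\na,MIT\nb,Apache-2.0\nc,BSD-3-Clause\n"
missing = dropped_packages(classic_csv, experimental_csv)
```

In a real test, the two CSV strings would be produced by calls such as `run_cli(["ddla", "generate-sbom", "--experimental-strategy", url])`, using the commands from the test plan, and the test would assert that `missing` is empty.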

…flow Replaces the pytest-based tests/integration/ with two workflow steps that follow the same pattern as the existing integration checks:
- GitHub repo mode against DataDog/apigentools
- --ecosystem python mode against apigentools

Both steps validate exit code, CSV header, and a minimum package count.

arapulido · 2026-06-24T14:23:12Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1a76159e31

…erations OverrideCollectionStrategy.__init__ assigned override_rules and unused_rules to the same list object. Any call to unused_rules.remove() therefore also removed the rule from override_rules, so a REMOVE rule silently stopped firing after its first match. In the classic single-pass cascade this never mattered — each rule matched at most once per run. In the ThreePhaseMetadataCollector fixpoint loop, finders may re-add a dep on a later iteration while the REMOVE rule is already gone, leaving the dep in the final SBOM despite the override spec explicitly suppressing it. Fix: make unused_rules a shallow copy (list(override_rules)) so tracking removals never mutate the active rules list. Guard the unused_rules.remove() calls with "if rule in self.unused_rules" so re-firing rules do not crash.
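The aliasing problem is easy to reproduce in isolation. The sketch below uses two hypothetical stand-in classes (`AliasedRules`, `CopiedRules`) and a `mark_used` helper rather than the real OverrideCollectionStrategy, and represents rules as plain strings.

```python
class AliasedRules:
    """Buggy shape: both attributes point at the same list object."""

    def __init__(self, override_rules):
        self.override_rules = override_rules
        self.unused_rules = override_rules


class CopiedRules:
    """Fixed shape: unused_rules is a shallow copy used only for tracking."""

    def __init__(self, override_rules):
        self.override_rules = override_rules
        self.unused_rules = list(override_rules)


def mark_used(tracker, rule):
    # The membership guard keeps a rule that fires again from raising ValueError.
    if rule in tracker.unused_rules:
        tracker.unused_rules.remove(rule)


buggy = AliasedRules(["remove-foo"])
mark_used(buggy, "remove-foo")
# The active rule list has lost the rule as a side effect.
buggy_active = list(buggy.override_rules)

fixed = CopiedRules(["remove-foo"])
mark_used(fixed, "remove-foo")
mark_used(fixed, "remove-foo")  # second match on a later iteration: no crash
fixed_active = list(fixed.override_rules)
```

With the copy in place, the rule stays active across iterations while `unused_rules` still reports which rules never matched.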

sdavtaker added 2 commits June 19, 2026 16:21

sdavtaker requested a review from a team as a code owner June 22, 2026 15:36

sdavtaker added 3 commits June 22, 2026 15:15

sdavtaker requested a review from arapulido June 24, 2026 13:54

sdavtaker changed the title ~~feat(collect): add --experimental-strategy two-phase collection pipeline~~ feat(collect): add --experimental-strategy three-phase collection pipeline Jun 24, 2026

chatgpt-codex-connector Bot reviewed Jun 24, 2026

Comment thread src/dd_license_attribution/metadata_collector/three_phase_metadata_collector.py

sdavtaker wants to merge 6 commits into main from feat/ospo-689-experimental-two-phase-strategy

2 participants
