feat(collect): add --experimental-strategy three-phase collection pipeline#219
Open
sdavtaker wants to merge 6 commits into
Open
feat(collect): add --experimental-strategy three-phase collection pipeline#219sdavtaker wants to merge 6 commits into
sdavtaker wants to merge 6 commits into
Conversation
Separates dependency discovery (finder fixpoint loop, up to 5 iterations until stable) from metadata extraction (enricher cascade, runs once on the complete set). Gates the new architecture behind --experimental-strategy; existing behavior is unchanged without it. When combined with --ecosystem, only the ecosystem-relevant finder is enabled by default; --no-* flags still override. Adds DependencyFinderStrategy and MetadataEnricherStrategy sub-ABCs, TwoPhaseMetadataCollector, and 478 unit tests at 95% coverage. Fixes class name mismatch in _FINDER_STRATEGY_NAMES that caused Go and PyPI strategies to be misclassified as enrichers.
…n explosion GitHub's SBOM API already performs full transitive closure for a repository. Placing it in the fixpoint loop caused it to be re-invoked on every discovered dependency, fetching each dep's full dep tree and producing an unbounded closure (2558 packages instead of 109 for dd-license-attribution itself). Introduces a pre_finders parameter to TwoPhaseMetadataCollector: strategies here run once on the root seed and their output feeds into the fixpoint loop. GitHubSbomMetadataCollectionStrategy is routed to pre_finders; ecosystem finders (PyPI, GoPkg, npm) remain in the fixpoint loop. Verified: experimental output now matches classic output exactly for DataDog/dd-license-attribution.
…dataCollector The collector has three distinct phases (Phase 0 pre-finders, Phase 1 finder fixpoint loop, Phase 2 enricher cascade), so the name TwoPhaseMetadataCollector was misleading. Rename class, file, exports, CLI references, help text, tests, CHANGELOG, and README to match. No behaviour change.
Two end-to-end tests for ThreePhaseMetadataCollector via the CLI: - DataDog/apigentools in GitHub repo mode - apigentools in --ecosystem python mode Each test runs both classic and experimental strategies, asserts both exit cleanly, return at least a minimum number of packages, and that experimental does not drop any package the classic found. Uses subprocess (not CliRunner) so stderr log lines don't corrupt the CSV stdout. Placed in tests/integration/ alongside future full-CLI tests.
…flow Replaces the pytest-based tests/integration/ with two workflow steps that follow the same pattern as the existing integration checks: - GitHub repo mode against DataDog/apigentools - --ecosystem python mode against apigentools Both steps validate exit code, CSV header, and a minimum package count.
Collaborator
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1a76159e31
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…erations OverrideCollectionStrategy.__init__ assigned override_rules and unused_rules to the same list object. Any call to unused_rules.remove() therefore also removed the rule from override_rules, so a REMOVE rule silently stopped firing after its first match. In the classic single-pass cascade this never mattered — each rule matched at most once per run. In the ThreePhaseMetadataCollector fixpoint loop, finders may re-add a dep on a later iteration while the REMOVE rule is already gone, leaving the dep in the final SBOM despite the override spec explicitly suppressing it. Fix: make unused_rules a shallow copy (list(override_rules)) so tracking removals never mutate the active rules list. Guard the unused_rules.remove() calls with "if rule in self.unused_rules" so re-firing rules do not crash.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
--experimental-strategyflag forgenerate-sbomthat separates dependency discovery from metadata extraction into three phases:GitHubSbomMetadataCollectionStrategy— already performs full transitive closure via GitHub's API; must not iterate or it re-fetches each dep's entire dep treePypiMetadataCollectionStrategy,GoPkgMetadataCollectionStrategy,NpmMetadataCollectionStrategy— discover one level at a time and benefit from iterationScanCodeToolkitMetadataCollectionStrategy,GitHubRepositoryMetadataCollectionStrategy,CleanupCopyrightMetadataStrategyDependencyFinderStrategyandMetadataEnricherStrategysub-ABCs for future experimental strategy classesThreePhaseMetadataCollectorwithpre_finders,finders, andenrichersparameters--experimental-strategy+--ecosystem X, only the ecosystem-relevant finder is active by default;--no-*flags still override--experimental-strategy) is completely unchangedMotivation
Fixes the structural limitation in OSPO-689: in the classic single-pass cascade, a dependency discovered by strategy N is never seen by strategies 1..N-1. The new architecture closes the transitive dependency set before running enrichers.
The
pre_findersphase was added after experimentation revealed that placingGitHubSbomMetadataCollectionStrategyin the fixpoint loop caused an explosion: re-running it onscancode-toolkitfetched that repo's full dep tree, pulling inmapbox-common-iosand 2400+ unrelated iOS/npm packages (2558 total vs 109 expected). Moving it to phase 0 fixed the output — experimental and classic now produce identical results forDataDog/dd-license-attributionandapigentoolswith no timing regression (~23s vs ~22s).Test plan
pipenv run pytest tests/unit/pipenv run pytest --cov=src/dd_license_attribution --cov-fail-under=95pipenv run mypy src/ tests/pipenv run isort --check-only src/ tests/ && pipenv run black --check src/ tests/--experimental-strategyagainstapigentools(GitHub URL and--ecosystem python)