
Unify the scan pipeline behind one interface in git-tidy-core #117

Description

@jakebromberg

Problem

Every tool that scans repos repeats the same five-step shape in its scan.rs:

  1. Discover repos
  2. (Optionally) parallel-fetch
  3. Dispatch a per-repo closure via parallel_classify
  4. The closure calls a tool-specific classify_* and assembles a tool-specific result struct
  5. Aggregate counts and warnings into a ScanResult

The result is one shape, seven copies. The supposedly shared git-tidy-core/src/scan.rs is only 41 lines — it holds rayon glue, not the pipeline. The orchestration of discovery → fetch → parallelism → progress → aggregation has leaked into each tool.

Evidence

| Tool | scan.rs lines |
| --- | --- |
| git-repo-tidy | 579 |
| git-branch-tidy | 563 |
| git-lfs-tidy | 450 |
| git-stash-tidy | 437 |
| git-tag-tidy | 386 |
| git-worktree-tidy | 336 |
| git-remote-tidy | 311 |
| Shared git-tidy-core/src/scan.rs | 41 |

The real per-tool variation is the classifier: classify_branch, classify_tag, classify_remote_branch, etc. Everything else — fetch sequencing, progress bars, parallelism, error collection — is bookkeeping that doesn't change between tools.

Desired end state

A deep ScanPipeline module in git-tidy-core owns discover, fetch, parallel dispatch, progress reporting, and result aggregation. Each tool plugs in a narrow Classifier (or equivalent) port that returns a tool-specific classified item. Per-tool scan.rs files shrink to the classifier impl and a thin call into the pipeline.

The interface across the seam becomes "give me a way to classify one repo's items, I'll give you the aggregated ScanResult<T>."
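As a rough illustration of that seam, here is a minimal sketch. Every name in it (`Classifier`, `classify_repo`, `run_scan`, the `ScanResult` fields, `DirNameClassifier`) is an assumption made for the example, not the existing git-tidy-core API. Fetch, parallelism, and progress are left out so that only the shape of the port is visible.

```rust
use std::path::{Path, PathBuf};

/// Aggregated output handed back to the tool (hypothetical fields).
#[derive(Debug)]
pub struct ScanResult<T> {
    pub items: Vec<T>,
    pub warnings: Vec<String>,
    pub repos_scanned: usize,
}

/// The narrow port: the only thing a tool would have to implement.
pub trait Classifier {
    type Item;
    /// Classify everything of interest in one repo, or explain why it failed.
    fn classify_repo(&self, repo: &Path) -> Result<Vec<Self::Item>, String>;
}

/// Sequential stand-in for the pipeline: it owns the loop, the error
/// collection, and the aggregation, so no tool has to repeat them.
pub fn run_scan<C: Classifier>(repos: &[PathBuf], classifier: &C) -> ScanResult<C::Item> {
    let mut result = ScanResult {
        items: Vec::new(),
        warnings: Vec::new(),
        repos_scanned: 0,
    };
    for repo in repos {
        result.repos_scanned += 1;
        match classifier.classify_repo(repo) {
            Ok(items) => result.items.extend(items),
            Err(e) => result.warnings.push(format!("{}: {}", repo.display(), e)),
        }
    }
    result
}

/// Toy classifier used only for illustration: it reports the repo's
/// directory name and fails when the path has none.
pub struct DirNameClassifier;

impl Classifier for DirNameClassifier {
    type Item = String;
    fn classify_repo(&self, repo: &Path) -> Result<Vec<String>, String> {
        match repo.file_name() {
            Some(name) => Ok(vec![name.to_string_lossy().into_owned()]),
            None => Err("path has no directory name".to_string()),
        }
    }
}
```

Under this sketch a tool's scan.rs would hold one `impl Classifier` and a single `run_scan` call. A failing repo turns into a warning rather than aborting the scan.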

Where

  • crates/git-tidy-core/src/scan.rs — currently a 41-line rayon glue layer; expand into the deep pipeline
  • crates/git-tidy-core/src/{fetch.rs, discovery.rs, progress.rs} — likely consumed internally by the new pipeline
  • crates/git-tidy-core/src/classification.rs — defines Classification + classify_branch + classify_worktree; the port lives near these
  • crates/git-{branch,tag,stash,remote,repo,lfs,worktree}-tidy/src/scan.rs — each collapses to a classifier impl + call into core
  • crates/git-config-tidy is the exception (lint-not-scan) and stays as-is

Constraints

  • --match substring filter (worktree-tidy) and discovery override flags must remain wired through the pipeline's options.
  • Fetch ordering: worktree-tidy and branch-tidy currently fetch before classification (via parallel_fetch); the pipeline must let tools opt in to fetch.
  • Landed-detection caching is currently per-repo, per-branch — keep that locality. The cache lives behind the pipeline, not above it.
  • Progress bar behaviour (terminal vs JSON vs porcelain) must not regress.
  • All existing integration tests must pass without behaviour change.
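The first two constraints could be carried by the pipeline's options roughly as follows. This is a sketch under stated assumptions: the field names are invented, the rule that offline mode overrides an opt-in fetch is a guess, and treating --match as a case-sensitive substring test on an item name is also a guess about current behaviour.

```rust
/// Hypothetical option set; none of these field names come from the codebase.
#[derive(Debug, Clone, Default)]
pub struct ScanOptions {
    /// Set by tools that want fetch-before-classify (opt-in, off by default).
    pub fetch: bool,
    /// Assumed to suppress all network access.
    pub offline: bool,
    /// Value of a `--match` style substring filter, if the tool exposes one.
    pub name_match: Option<String>,
}

impl ScanOptions {
    /// Assumption: offline always wins over an opt-in fetch.
    pub fn should_fetch(&self) -> bool {
        self.fetch && !self.offline
    }

    /// True when no filter is set, or when the name contains the filter text.
    pub fn matches(&self, name: &str) -> bool {
        match &self.name_match {
            Some(needle) => name.contains(needle.as_str()),
            None => true,
        }
    }
}
```

Because `ScanOptions::default()` leaves `fetch` off, a tool that never fetched before would keep that behaviour without setting anything.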

Suggested approach

  1. Identify the union of options across the seven scanners — fetch flag, behind threshold, age threshold, offline flag, repo filter — and design a ScanOptions once.
  2. Design Classifier (or ScanWorker) around: input = repo + options; output = Vec<TidyItem> + per-repo warnings (where TidyItem is the trait introduced in Implement git-worktree-tidy CLI #1).
  3. Land the pipeline behind tests in core.
  4. Migrate tools one at a time, deleting per-tool orchestration as each lands.
  5. Update CLAUDE.md "Core patterns" to describe the ScanPipeline seam.
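Step 2 can be sketched as below: a worker takes a repo plus options and returns items plus per-repo warnings, and the pipeline fans the repos out and merges the outputs in input order. The names `ScanWorker`, `RepoOutput`, `scan_all`, and `DirNameWorker` are hypothetical, `ScanOptions` is cut down to one illustrative field, and `std::thread::scope` with one thread per repo stands in for the rayon-based `parallel_classify` only to keep the example dependency-free.

```rust
use std::path::{Path, PathBuf};
use std::thread;

/// Cut-down options for this sketch (hypothetical field).
pub struct ScanOptions {
    pub offline: bool,
}

/// What one repo contributes: classified items plus its own warnings.
pub struct RepoOutput<T> {
    pub items: Vec<T>,
    pub warnings: Vec<String>,
}

/// Hypothetical worker port; `Sync` so one instance can be shared across threads.
pub trait ScanWorker: Sync {
    type Item: Send + 'static;
    fn scan_repo(&self, repo: &Path, opts: &ScanOptions) -> RepoOutput<Self::Item>;
}

/// Run the worker over every repo in parallel and merge the outputs.
/// Handles are joined in spawn order, so results follow the input order.
pub fn scan_all<W: ScanWorker>(
    repos: &[PathBuf],
    opts: &ScanOptions,
    worker: &W,
) -> RepoOutput<W::Item> {
    let per_repo: Vec<RepoOutput<W::Item>> = thread::scope(|s| {
        let handles: Vec<_> = repos
            .iter()
            .map(|repo| s.spawn(move || worker.scan_repo(repo, opts)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("scan worker panicked"))
            .collect()
    });

    let mut total = RepoOutput { items: Vec::new(), warnings: Vec::new() };
    for out in per_repo {
        total.items.extend(out.items);
        total.warnings.extend(out.warnings);
    }
    total
}

/// Toy worker for illustration: emits the repo's directory name,
/// or a warning when the path has none.
pub struct DirNameWorker;

impl ScanWorker for DirNameWorker {
    type Item = String;
    fn scan_repo(&self, repo: &Path, _opts: &ScanOptions) -> RepoOutput<String> {
        match repo.file_name() {
            Some(name) => RepoOutput {
                items: vec![name.to_string_lossy().into_owned()],
                warnings: Vec::new(),
            },
            None => RepoOutput {
                items: Vec::new(),
                warnings: vec![format!("{}: no directory name", repo.display())],
            },
        }
    }
}
```

Returning warnings next to items, instead of a `Result`, lets a repo yield partial results and still report what went wrong. Whether the real port should work that way is a design decision for step 2.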

Acceptance criteria

  • One ScanPipeline implementation in git-tidy-core runs discover → fetch → parallelise → classify → aggregate for every scan-shaped tool.
  • Each tool's scan.rs is ≤ ~100 lines (classifier impl + thin pipeline call).
  • All existing integration tests pass; no snapshot diff beyond test-suite consolidation.
  • Total per-tool scan.rs LOC drops by ≥ 60%, i.e. from the current 3,062 lines across the seven tools to roughly 1,225 or fewer.
  • git-tidy-core/src/scan.rs is no longer "just rayon glue" — it carries the pipeline.
  • CLAUDE.md "Core patterns" section updated.

Notes for implementer

  • This issue should land after Collapse per-tool output formatters into one deep formatter #116, so that the TidyItem interface is settled when the pipeline starts producing rows. Without Implement git-worktree-tidy CLI #1, the pipeline either invents a parallel row type or has to be rewritten later.
  • This issue is a deepening, not a rewrite. The pipeline is where complexity concentrates. The narrow classifier port is the interface across which the seven existing classifiers already differ.
  • The deletion test: if you delete the seven scan.rs files, the orchestration must reappear somewhere — in one place, in core, behind one interface. The leverage is high: bug fixes to fetch sequencing or progress reporting land once and pay back across seven tools.
  • Once this lands, Narrow the GitOps interface; hide sequencing behind a Classification port #119 becomes much cheaper: narrowing the GitOps surface only touches one consolidated scan callsite, not seven.

Related

Part of the architectural review:
