code-audit-pipeline

A recipe for converting a codebase into actionable refactor recommendations. AST extractors emit a canonical type/function catalog; jq queries cluster the rhymes — duplicate types, parallel protocols, name-without-shape collisions, missed abstractions; an agent reads each cluster and proposes a concrete refactor with grounded rationale. The code-audit binary glues these stages into one command line.

The principle

Deterministic extraction, agentic synthesis. A 200-line AST extractor will reproducibly enumerate every type in your repo. An LLM won't. Reserve agents for the judgment step at the end — "should these three duplicates be extracted to a common package, or are they a PAT-shaped pair with one differing type slot?" — and use ordinary tools for everything upstream.

The litmus test before fan-out: can the question be answered by clustering structured rows? If yes, write the extractor. If no, agents earn their keep.

The deliverable, in two layers

Two distinct things must work for the pipeline to be useful, and the project measures them separately:

Input layer — does the substrate find the rhymes? The extractors and cluster queries surface every structurally-parallel pair, every duplicated shape, every potential missed abstraction. Measured by plant-recall: inject synthetic rhymes into a real codebase, count how many surface in the right cluster. Validated by the V6 Swift substrate experiment at 19/20 plants on a 350-file Swift codebase across 22 packages.
Output layer — does the agent turn cluster rows into actionable recommendations? For each cluster row, the agent emits a structured refactor recommendation (category + specifics + grounded rationale + alternative). Measured by recommendation correctness plus restraint (no false-positive recommendations on intentional duplication). The V7 refactor-recommendation experiment is the methodology for this.

Both layers are necessary; neither is sufficient on its own.
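The plant-recall measure for the input layer can be sketched in a few lines of Python. This is a minimal illustration, not the project's scoring harness: the plant ids and the mapping shapes are hypothetical, and only the 19-of-20 ratio comes from the V6 figure above.

```python
def plant_recall(planted: dict[str, str], surfaced: dict[str, set[str]]) -> float:
    """Fraction of planted rhymes that surfaced in their expected cluster query.

    planted  maps a plant id to the query expected to catch it.
    surfaced maps a query name to the plant ids found in that query's output.
    """
    if not planted:
        raise ValueError("no plants to score")
    hits = sum(
        1 for plant, query in planted.items()
        if plant in surfaced.get(query, set())
    )
    return hits / len(planted)


# Hypothetical plant set: 20 plants, one of which never surfaces.
planted = {f"plant-{i:02d}": "exact-duplicates" for i in range(20)}
surfaced = {"exact-duplicates": {f"plant-{i:02d}" for i in range(19)}}
print(plant_recall(planted, surfaced))  # 0.95
```

A plant that surfaces in the wrong cluster counts as a miss here, which matches the "in the right cluster" wording.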

Install

brew install jakebromberg/tap/code-audit       # macOS / Linux
go install github.com/jakebromberg/code-audit-pipeline/cmd/code-audit@latest

Or download a tarball from Releases.

The binary embeds the full pipeline/queries/*.jq set AND the extractor source. Each extractor's runtime (Node, Swift toolchain, future Python) stays external — extractor source is laid down to ~/.config/audit/extractors/<name>/ on first use, and any per-extractor bootstrap (e.g., npm install) runs automatically. No code-audit init step is required for the brew flow. All three install paths (brew, go install, tarball) ship the same embedded query + extractor set.

Quick start

# No init required — first `extract` auto-extracts source from the binary
# and runs the extractor's [runtime].bootstrap (e.g., `npm install`).
code-audit extract typescript --root /path/to/your/repo
# First call takes ~30s on a fresh install (npm install for the TS extractor);
# subsequent calls are fast.

# Inspect cached state and the resolved query/extractor sources.
code-audit status

# Run an individual query interactively (text-mode output, ergonomic for humans).
code-audit query exact-duplicates
code-audit query near-duplicates --arg threshold=0.7

# Run every applicable query and write a single markdown report.
code-audit report
# → .audit/reports/findings-2026-05-30.md

The full subcommand surface:

| Subcommand | Purpose |
| --- | --- |
| `code-audit extract <name>` | Run an extractor; caches its catalog under `.audit/catalogs/`. |
| `code-audit query <name>` | Evaluate a query against cached catalogs (or `--catalog <path>` override). |
| `code-audit status` | Show `.audit/` state, resolved query/extractor sources, and staleness. |
| `code-audit report` | Run every runnable query and write a markdown report to `.audit/reports/`. |
| `code-audit init` | Lay down `~/.config/audit/` (extractors + queries) explicitly. Optional, since `extract` auto-bootstraps. Useful for contributors editing extractor source: `init --from <local-checkout>` points at a live tree. |
| `code-audit version` | Print binary version. |

Per-command flags follow the long-form GNU convention (--root, --queries-dir, --catalog, etc.); run any subcommand with --help for the full list.

How discovery works

Both queries and extractors are resolved via a lookup-order chain (ADR-0006):

1. Explicit --queries-dir / --extractors-dir flag.
2. pipeline/queries/ and extractors/ rooted at the audit cwd (when present).
3. $AUDIT_HOME if set.
4. Fallback: embedded queries (always present); ~/.config/audit/extractors/ (populated by code-audit init, or by the first code-audit extract).

code-audit status always prints the resolved source for both. The cwd-relative path makes contributor edits live without rebuilding the binary; the embedded fallback means a brewed binary works against any pre-existing catalog with no install steps.
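The lookup order for extractors can be sketched as a first-match chain. This is an illustrative Python model, not the binary's Go implementation: the function name is hypothetical, and treating tier 3 as an extractors/ subdirectory of $AUDIT_HOME is an assumption based on the "sibling layout" wording.

```python
from pathlib import Path
from typing import Mapping, Optional


def resolve_extractors_dir(
    flag_dir: Optional[str],
    cwd: Path,
    env: Mapping[str, str],
    home: Path,
) -> tuple[int, Path]:
    """Return (tier, path) for the first tier of the lookup chain that applies."""
    # Tier 1: an explicit --extractors-dir flag always wins.
    if flag_dir:
        return 1, Path(flag_dir)
    # Tier 2: extractors/ rooted at the audit cwd, only when it is present.
    local = cwd / "extractors"
    if local.is_dir():
        return 2, local
    # Tier 3: $AUDIT_HOME if set (the extractors/ subdirectory is an assumption).
    audit_home = env.get("AUDIT_HOME")
    if audit_home:
        return 3, Path(audit_home) / "extractors"
    # Tier 4: the default fallback under ~/.config/audit/.
    return 4, home / ".config" / "audit" / "extractors"
```

Returning the tier number alongside the path mirrors what code-audit status reports: which source was resolved, not just where it lives.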

For tier 4 specifically (the ~/.config/audit/extractors/ fallback), code-audit extract <name> auto-extracts the binary's embedded extractor source and runs the manifest's [runtime].bootstrap argv on first use. Per-extractor concurrency uses an flock(2) at <extractor>/.audit-init/lock; outcomes (ok / failed / pending / n-a) persist in ~/.config/audit/.audit-init/state.json and surface in code-audit status. See ADR-0008 for the protocol.
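A rough Python sketch of the lock-then-bootstrap step follows. ADR-0008 is the authority on the real protocol; here the function name is hypothetical, the state file is assumed to be a flat extractor-name-to-outcome map, and only the ok and failed outcomes are modelled.

```python
import fcntl
import json
from pathlib import Path
from typing import Callable


def bootstrap_once(extractor_dir: Path, state_path: Path,
                   bootstrap: Callable[[], None]) -> str:
    """Run `bootstrap` at most once per extractor, serialized by an flock(2)."""
    name = extractor_dir.name
    lock_path = extractor_dir / ".audit-init" / "lock"
    lock_path.parent.mkdir(parents=True, exist_ok=True)
    # Closing the file at the end of the `with` block releases the lock.
    with open(lock_path, "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)  # blocks while another process holds it
        # Assumed layout: {"<extractor name>": "<outcome>"}.
        state = json.loads(state_path.read_text()) if state_path.exists() else {}
        if state.get(name) == "ok":
            return "ok"  # an earlier run already finished the bootstrap
        try:
            bootstrap()
            outcome = "ok"
        except Exception:
            outcome = "failed"
        state[name] = outcome
        state_path.parent.mkdir(parents=True, exist_ok=True)
        state_path.write_text(json.dumps(state, indent=2))
        return outcome
```

Because the lock is per extractor while the state file is shared, a production version would also need to guard concurrent writes to the state file; the sketch leaves that out.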

Environment variables

| Variable | Purpose |
| --- | --- |
| `AUDIT_HOME` | Overrides tier 3 in the discovery chain: point it at `~/.config/audit` (or any sibling layout) to use a non-default location. Unset to fall through to the default tier-4 path. |
| `XDG_CONFIG_HOME` | Used by `code-audit init`'s default destination (`$XDG_CONFIG_HOME/audit` falls back to `~/.config/audit`). |
| `HOME` | Determines `~/.config/audit/extractors` (tier 4) when `XDG_CONFIG_HOME` is unset. |

The auto-extract path adds zero environment variables — every knob above already existed.
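The XDG_CONFIG_HOME fallback described in the table amounts to the following. The helper name is hypothetical, and the sketch only covers the default destination of code-audit init, not the AUDIT_HOME override.

```python
from pathlib import Path
from typing import Mapping


def default_audit_dir(env: Mapping[str, str]) -> Path:
    """Default init destination: $XDG_CONFIG_HOME/audit, else ~/.config/audit."""
    xdg = env.get("XDG_CONFIG_HOME")
    if xdg:
        return Path(xdg) / "audit"
    # HOME decides the location when XDG_CONFIG_HOME is unset or empty.
    return Path(env["HOME"]) / ".config" / "audit"


print(default_audit_dir({"HOME": "/home/dev"}))
print(default_audit_dir({"HOME": "/home/dev", "XDG_CONFIG_HOME": "/srv/cfg"}))
```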

What the catalog contains

Top-level JSON is {schema_version: "1.1", extractor: {...}, entries: [...]} — one record per declared type inside entries. The contract is in docs/pipeline-contract.md. Core fields every extractor emits:

| Field | Meaning |
| --- | --- |
| `name` | declared identifier |
| `kind` | `interface` / `type-alias-object` / `type-alias-union` / `zod-object` / `drizzle-table` / language-specific variants |
| `package` | which root the file came from (e.g., `main`, `shared`) |
| `file`, `line` | relative-to-package-root path and 1-indexed line |
| `fields` | sorted `name:type` list, or `null` for non-shape types |
| `shape_sig` | the `fields` list joined into a single string |
| `touched_in_window` | true if `file` appears in the `--touched` JSON list |
| `generated` | true for `.d.ts` or files under `generated/` |
| `is_test` | true if `file` matches test/fixture path patterns (see contract for the normative set) |
| `exported` | from-file export status |
| `extends` | sorted array of direct supertype names; empty if the declaration has no heritage |
| `references` | sorted array of `{name, kind: "type-ref"}`: names referenced in the declaration body, type-parameter-scoped, deny-listed against built-ins |
| `references_count` | `references \| length`; derived, but emitted explicitly so queries don't pay the inline length call |

Sibling artifacts: references.json (inverted edge list, --emit-references-graph), files.json (per-file import edges, --emit-files), function-catalog.json (signature + body data, separate extractor), file-hashes.json (raw + normalized content hashes).
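To make the field table concrete, here is a hypothetical entry plus a small consistency check written in Python. The entry's names and paths are invented for illustration, `shape_sig` is left out because its exact encoding belongs to the contract, and sorting `references` by `name` is an assumption; docs/pipeline-contract.md remains the normative source.

```python
def check_entry(entry: dict) -> list[str]:
    """Return contract violations for one catalog entry (empty list if clean)."""
    problems = []
    fields = entry.get("fields")
    if fields is not None and fields != sorted(fields):
        problems.append("fields must be sorted")
    extends = entry.get("extends", [])
    if extends != sorted(extends):
        problems.append("extends must be sorted")
    refs = entry.get("references", [])
    names = [r["name"] for r in refs]
    if names != sorted(names):  # assumption: sorted by name
        problems.append("references must be sorted")
    if entry.get("references_count") != len(refs):
        problems.append("references_count must equal len(references)")
    if entry.get("line", 1) < 1:
        problems.append("line is 1-indexed")
    return problems


# Hypothetical entry; every value here is made up for illustration.
entry = {
    "name": "Track",
    "kind": "interface",
    "package": "main",
    "file": "src/models/track.ts",
    "line": 12,
    "fields": ["id:string", "title:string"],
    "touched_in_window": False,
    "generated": False,
    "is_test": False,
    "exported": True,
    "extends": [],
    "references": [{"name": "Artist", "kind": "type-ref"}],
    "references_count": 1,
}
print(check_entry(entry))  # []
```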

Cluster queries

All queries operate on the JSON catalog and emit human-readable text by default, or JSONL when OUTPUT_FORMAT=jsonl is set (the format the report path consumes). Each .jq file carries a #! shape: cluster|pair|metric front-matter line per ADR-0003; code-audit report dispatches every JSONL row through one of three shape renderers.
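The front-matter dispatch can be pictured as follows. This Python sketch only illustrates the idea: the exact front-matter grammar is defined by ADR-0003, and the row keys used by the three renderers (`members`, `a`, `b`, `name`, `value`) are hypothetical, not the real envelope.

```python
import json
import re

SHAPE_RE = re.compile(r"^#!\s*shape:\s*(cluster|pair|metric)\s*$")


def query_shape(jq_source: str) -> str:
    """Read the shape from a query's front-matter line."""
    for line in jq_source.splitlines():
        stripped = line.strip()
        match = SHAPE_RE.match(stripped)
        if match:
            return match.group(1)
        if stripped and not stripped.startswith("#"):
            break  # reached jq code without finding front-matter
    raise ValueError("query has no '#! shape:' front-matter line")


# Hypothetical renderers; the row keys are illustrative only.
RENDERERS = {
    "cluster": lambda row: f"cluster of {len(row['members'])}: {', '.join(row['members'])}",
    "pair": lambda row: f"{row['a']} <-> {row['b']}",
    "metric": lambda row: f"{row['name']} = {row['value']}",
}


def render_jsonl(shape: str, jsonl: str) -> list[str]:
    """Dispatch every JSONL row through the renderer for the query's shape."""
    render = RENDERERS[shape]
    return [render(json.loads(line)) for line in jsonl.splitlines() if line.strip()]
```

Keeping the shape in the query file means a new query needs no registration step beyond its own first line, which is the point of the hybrid-registration decision.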

| Query | What it finds | Catalog |
| --- | --- | --- |
| `exact-duplicates.jq` | Same `shape_sig` across ≥2 declarations | type |
| `name-collisions.jq` | Same `name` across multiple files | type |
| `cross-package-shadows.jq` | Type in `main` whose name exists in `shared` | type |
| `near-duplicates.jq` | Pairs with Jaccard ≥ threshold on field-name sets (default `0.7`) | type |
| `subset-pairs.jq` | Pairs (A, B) where A's field-name set is a strict subset of B's | type |
| `cross-package-shape-near-duplicates.jq` | main↔shared pairs with different names but Jaccard ≥ threshold | type |
| `function-duplicates.jq` | Exact body-hash clusters + pairwise Jaccard near-duplicates on function bodies | function |
| `file-duplicates.jq` | Exact byte-equal files + whitespace-normalized-only matches | file-hash |
| `copied-from-header.jq` | Files whose top comment self-confesses as a fork (`// Copied from X`, `// Fork of X`, etc.); requires `file-hashes --scan-header` | file-hash |
| `cross-catalog-name-collisions.jq` | Type names declared in TWO catalogs (cross-repo, cross-language) | type, two-catalog |
| `migration-progress.jq` | Counts decls on old vs new `shape_sig`, computes % migrated, lists touched-in-window stragglers | type |
| `shape-sig-frequency.jq` | Lists `shape_sig` values by count desc with sample names | type |
| `versioned-type-pairs.jq` | Groups declarations sharing a base name after stripping `(?i)V?<n>` suffix; a stalled-migration signal (`Track`/`TrackV2`, `Episode`/`EpisodeV2`/`EpisodeV3`) | type |
| `generic-arity-drift.jq` | Declarations sharing a name but differing in type-parameter arity | type |
| `generic-convention-bound.jq` | Declarations whose field types reference a type-parameter-shaped identifier not bound by `generics` | type |
| `touched-window-debt-summary.jq` | PR-time meta-query: for each cluster type, fraction with ≥1 touched-in-window member | type |
| `orphan-infer-model.jq` | Drizzle tables nothing in the catalog derives a TS type from | type |
| `test-prod-drift.jq` | Near-duplicate pairs where exactly one side is in a test path | type |
| `dead-code.jq` | Exported, non-generated declarations with zero resolved incoming references | type + references |
| `public-api-leaks.jq` | Exported functions whose param or return types reference a non-exported same-package type | function + type |
| `cross-package-backward-imports.jq` | `shared/` files importing from `main/` (a layering violation) | files |
| `coverage.jq` | Cross-repo scope report: covered, missing, stale, errored repos against the substrate's `index.json` | substrate index |
| `preflight-versions.jq` | Refuse cross-repo merge on extractor major-version skew or missing/malformed extractor metadata | substrate index |
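The three shape-based queries at the top of the table reduce to simple set operations. The Python below is a sketch of those definitions, not a port of the .jq files: the real queries may filter or rank differently (for instance, whether a Jaccard of 1.0 is reported as a near-duplicate), and the `shape_sig` values in the sample are opaque placeholders.

```python
from collections import defaultdict
from itertools import combinations


def field_names(entry: dict) -> frozenset:
    """Field-name set: the part of each `name:type` item before the first colon."""
    return frozenset(f.split(":", 1)[0] for f in entry["fields"] or [])


def exact_duplicates(entries: list) -> list:
    """Group declarations that share a shape_sig; keep groups of two or more."""
    groups = defaultdict(list)
    for e in entries:
        if e["fields"] is not None:
            groups[e["shape_sig"]].append(e["name"])
    return [sorted(names) for names in groups.values() if len(names) >= 2]


def jaccard(a: frozenset, b: frozenset) -> float:
    """|A ∩ B| / |A ∪ B|, taken as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def near_duplicates(entries: list, threshold: float = 0.7) -> list:
    """Unordered pairs whose field-name sets reach the Jaccard threshold."""
    shaped = [e for e in entries if e["fields"]]
    return [(x["name"], y["name"]) for x, y in combinations(shaped, 2)
            if jaccard(field_names(x), field_names(y)) >= threshold]


def subset_pairs(entries: list) -> list:
    """Ordered pairs (A, B) where A's field names are a strict subset of B's."""
    shaped = [e for e in entries if e["fields"]]
    return [(a["name"], b["name"]) for a in shaped for b in shaped
            if field_names(a) < field_names(b)]


# Hypothetical catalog rows; shape_sig values are placeholders.
entries = [
    {"name": "Track", "shape_sig": "s1",
     "fields": ["id:string", "title:string", "year:number"]},
    {"name": "TrackDTO", "shape_sig": "s1",
     "fields": ["id:string", "title:string", "year:number"]},
    {"name": "Release", "shape_sig": "s2",
     "fields": ["id:string", "label:string", "title:string", "year:number"]},
    {"name": "Opaque", "shape_sig": None, "fields": None},
]
print(exact_duplicates(entries))  # [['Track', 'TrackDTO']]
print(subset_pairs(entries))      # [('Track', 'Release'), ('TrackDTO', 'Release')]
```

In this sample `Track` and `Release` share three of four field names, so their Jaccard score is 0.75 and the pair clears the default 0.7 threshold.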

For cross-repo queries that merge catalogs across many repos via the substrate (docs/substrate.md), the canonical entry point is pipeline/run-cross-repo-query.sh. It composes fetch → preflight → coverage → query and prepends the coverage header to the query's output, so a consumer can always read its scope and trust the merge-safety. See pipeline-contract.md § Cross-repo substrate guardrails.
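The preflight guard in that chain can be approximated as below. This is a sketch under assumptions: the catalog's `extractor` object is taken to carry `name` and a dotted `version` string, which this README does not spell out, and the function name is hypothetical.

```python
def preflight_versions(catalogs: dict) -> None:
    """Refuse a cross-repo merge on major-version skew or bad extractor metadata.

    `catalogs` maps a repo name to its parsed catalog. Reading
    catalog["extractor"]["name"] and ["version"] is an assumption.
    """
    majors_by_extractor = {}
    for repo, catalog in catalogs.items():
        meta = catalog.get("extractor")
        try:
            name = meta["name"]
            major = int(str(meta["version"]).split(".")[0])
        except (TypeError, KeyError, ValueError):
            raise ValueError(
                f"{repo}: missing or malformed extractor metadata"
            ) from None
        majors_by_extractor.setdefault(name, set()).add(major)
    skewed = {n: sorted(m) for n, m in majors_by_extractor.items() if len(m) > 1}
    if skewed:
        raise ValueError(f"extractor major-version skew: {skewed}")
```

Raising before any merge happens is what lets the later query step assume every catalog was produced under a compatible contract.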

Adding a new extractor

Any language with an AST library works. Each extractor must:

1. Accept --root <path>, optional --shared <path>, optional --touched <json-file>, optional --output <path>.
2. Walk source files under each root, skipping node_modules/dist/.git/etc.
3. For each type-equivalent declaration, emit one JSON record matching the contract.
4. Print summary stats to stderr; the JSON catalog to stdout (or --output).

Drop a manifest.toml in the extractor directory (ADR-0002) so code-audit extract <name> knows the invocation. The contract doc has the full schema. The TypeScript extractor (~280 lines, uses typescript) is the reference. Suggested next:

- Python: ast (stdlib). Feasibility study: docs/python-extractor-design-notes.md.
- Rust: syn crate, or tree-sitter-rust.
- Go: go/ast + go/parser (stdlib).
- Swift: SwiftSyntax. Feasibility study: docs/swift-extractor-design-notes.md.
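As a rough illustration of the four requirements, here is what a bare-bones Python extractor built on the stdlib ast module might look like. It is a sketch, not the design from the feasibility notes: the `class` kind, the `python-sketch` extractor metadata, and the skip list are assumptions, only a subset of the core fields is emitted, and a real extractor would also need a manifest.toml.

```python
import argparse
import ast
import json
import os
import sys
from pathlib import Path

SKIP_DIRS = {"node_modules", "dist", ".git", "__pycache__", ".venv"}


def class_fields(node: ast.ClassDef) -> list:
    """Sorted `name:type` list from annotated class-level assignments."""
    return sorted(
        f"{stmt.target.id}:{ast.unparse(stmt.annotation)}"
        for stmt in node.body
        if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)
    )


def extract(root: Path, package: str, touched: set) -> list:
    """Walk .py files under root and emit one record per class declaration."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk never descends into them.
        dirnames[:] = sorted(d for d in dirnames if d not in SKIP_DIRS)
        for filename in sorted(filenames):
            if not filename.endswith(".py"):
                continue
            path = Path(dirpath) / filename
            rel = path.relative_to(root).as_posix()
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=rel)
            for node in ast.walk(tree):
                if isinstance(node, ast.ClassDef):
                    entries.append({
                        "name": node.name,
                        "kind": "class",  # hypothetical language-specific kind
                        "package": package,
                        "file": rel,
                        "line": node.lineno,  # ast line numbers are 1-indexed
                        "fields": class_fields(node) or None,
                        "touched_in_window": rel in touched,
                    })
    return entries


def main(argv=None) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--root", required=True)
    parser.add_argument("--shared")
    parser.add_argument("--touched")
    parser.add_argument("--output")
    args = parser.parse_args(argv)
    touched = set(json.loads(Path(args.touched).read_text())) if args.touched else set()
    entries = extract(Path(args.root), "main", touched)
    if args.shared:
        entries += extract(Path(args.shared), "shared", touched)
    catalog = {
        "schema_version": "1.1",
        "extractor": {"name": "python-sketch", "version": "0.0.0"},  # assumed keys
        "entries": entries,
    }
    text = json.dumps(catalog, indent=2)
    if args.output:
        Path(args.output).write_text(text)
    else:
        print(text)
    print(f"{len(entries)} declarations", file=sys.stderr)  # summary on stderr
    return 0
```

Calling `main(["--root", "/path/to/repo", "--output", "catalog.json"])` writes the catalog to a file; without --output the JSON goes to stdout and only the summary line goes to stderr.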

Hand-run mode

The binary delegates to the same extractors and queries that have always lived under extractors/ and pipeline/queries/. For development work, one-off audits without installing the binary, or pipelines that need bespoke composition, the bash recipe still works end-to-end:

# 1. Manifest: every PR merged in the last 5 weeks.
# `date -v-5w` is BSD/macOS syntax; with GNU date use `date -d '5 weeks ago' +%Y-%m-%d`.
gh pr list --state merged --search "merged:>=$(date -v-5w +%Y-%m-%d)" --limit 300 \
  --json number,title,mergedAt,author,headRefName,files,closingIssuesReferences,labels \
  > prs.json

# 2. Classify PRs by file-path signal (adapt path patterns to your repo)
jq -f pipeline/classify.jq prs.json > prs-classified.json

# 3. Enumerate candidate .ts files touched by code-touching PRs
jq -s '
  .[0] as $cls | .[1] as $prs
  | ($cls | map(select(.primary == "code-touching" or .primary == "code")) | map(.number)) as $nums
  | $prs | map(select(.number as $n | $nums | index($n)))
  | map(.files[].path) | unique
  | map(select(test("\\.(ts|mts|cts)$")))
  | map(select(test("\\.(test|spec)\\.(ts|mts|cts)$") | not))
' prs-classified.json prs.json > candidates.json

# 4. Run the catalog (npm install inside extractors/typescript first)
cd extractors/typescript && npm install
node type-catalog.mjs \
  --root /path/to/your/repo \
  --shared /path/to/sibling/shared-package \
  --touched ../../candidates.json \
  --output ../../catalog.json
cd ../..   # return to the repo root; step 5 uses root-relative paths

# 5. Cluster (queries emit multi-line strings — use -r for readable output).
# Queries `import "_canonical" as canonical;` for the shared cluster helper,
# so `-L pipeline/queries` is required so jq can resolve the import path.
jq -L pipeline/queries -rf pipeline/queries/exact-duplicates.jq catalog.json
jq -L pipeline/queries -r --argjson threshold 0.7 -f pipeline/queries/near-duplicates.jq catalog.json

The binary path produces the same artifacts under .audit/catalogs/ and accepts JSONL on every query (OUTPUT_FORMAT=jsonl for the bash recipe, --format jsonl for code-audit query).

Provenance

Extracted from a 5-week type-duplication audit of a TypeScript monorepo (179 source files, 595 type declarations indexed, 10 exact-dupe clusters and 15 near-dupe clusters found). The full origin story — what the audit found, why agent fan-out was the wrong reach, what to build next — is in docs/case-study.md.

Experiment series

The project's validation track. Each experiment doc records its setup, plant set, results, and what changed about the methodology.

| Experiment | Layer | Question | Doc |
| --- | --- | --- | --- |
| V2 | input | Does broader substrate (function bodies, file hashes, cross-package shapes) catch what V1 missed? | V2 results |
| V3 | input | Does plant-recall hold up under synthetic ground-truth methodology? | V3 results |
| V4 | input | Does V3's recall hold up after contamination vectors are removed? | V4 results |
| V5 | input | Do the four V4-flagged substrate gaps close? | V5 results |
| V6 | input | Does the substrate transfer to Swift (wxyc-ios-64, 350 files, 22 packages)? | V6 results |
| V7 | output | Does the substrate's cluster output feed actionable refactor recommendations, by category? | V7 methodology |

V2–V6 validate the input layer. V7 is the first experiment on the output layer.

Architecture decisions

The binary's design is captured in seven ADRs under docs/adr/:

1. .audit/: per-repo cached state directory.
2. Hybrid registration: front-matter for queries, manifest.toml for extractors.
3. Cluster envelope: three shape renderers (cluster, pair, metric).
4. Router architecture: subcommand dispatch.
5. Go binary + gojq engine: embedded jq vs. system shell-out.
6. Bundling + discovery: bundle queries, leave extractors external, lookup-order chain.
7. Reconciliation with snapshot family: catalog envelope vs. cluster envelope.

Future directions

A ranked map of where this project could grow — temporal indexing, broader extractor kinds, queryable substrate, an evolved agent layer, and what to keep out — is in docs/future-directions.md.

License

Anti-Capitalist Software License v1.4. See LICENSE for the full text. Use is permitted for individuals, non-profits, educational institutions, and worker-owned cooperatives; not permitted for capitalist organizations, law enforcement, or military.
