feat: add corpus validation testing against large real-world repos #522

## Background

The [tree-sitter-al v2 rewrite](https://sshadows.dk/blog/tree-sitter-al-v2-rewrite/) validated every grammar change against 15,358 production files across eight phases, catching regressions immediately. The related [al-corpus tool](https://sshadows.dk/blog/one-parser-six-tools/) walks the typed AST to extract structured training data from real codebases at scale.

We currently test with small, hand-crafted fixtures. This is good for unit testing specific patterns but misses emergent issues that only appear in real-world code — like the 427 orphan Method nodes the [gdotv blog](https://gdotv.com/blog/codebase-rag-knowledge-graph-analysis-part-2/) found when indexing the Soufflé C++ codebase.

## Problem

Without corpus validation:
- Parser changes might introduce regressions we don't catch until users report them
- We can't measure graph quality improvements objectively
- The determinism fix (#515) can't be validated against the actual repos that exhibited non-determinism
- We have no baseline metrics for graph completeness per language

## Proposed approach

### 1. Select benchmark repos

Choose well-known, stable open-source repos that exercise each supported language's features. Candidates:

| Language | Repo | Why |
|----------|------|-----|
| Python | [flask](https://github.com/pallets/flask) | Decorators, blueprints, class inheritance |
| JavaScript | [axios](https://github.com/axios/axios) | Async patterns, closures, module system |
| TypeScript | [zod](https://github.com/colinhacks/zod) | Generics, type inference, complex types |
| Rust | [bat](https://github.com/sharkdp/bat) | Traits, modules, cross-file references |
| Java | [gson](https://github.com/google/gson) | Interfaces, generics, inner classes |
| C++ | [souffle](https://github.com/souffle-lang/souffle) | Templates, namespaces, cross-TU methods |
| Go | [cobra](https://github.com/spf13/cobra) | Interfaces, packages, struct methods |
| Lua | [kong](https://github.com/Kong/kong) | Metatables, OOP patterns, modules |

### 2. Define tracked metrics per repo

For each benchmark run, record:
- **Total nodes** by label (Function, Class, Method, Module, etc.)
- **Total relationships** by type (CALLS, DEFINES, IMPORTS, etc.)
- **Orphan node count** — nodes with zero relationships
- **Parse error count** — files that failed to parse
- **Graph digest** — deterministic hash of sorted node/edge lists
- **Processing time** — wall clock for full indexing
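A minimal sketch of how the graph-derived metrics above could be computed. The input shapes (`nodes` as `(node_id, label)` tuples, `edges` as `(src_id, rel_type, dst_id)` tuples) and the function name `compute_metrics` are assumptions for illustration, not the project's actual data model. Parse error count and processing time are left out because they come from the indexing run rather than from the graph itself.

```python
import hashlib
import json
from collections import Counter


def compute_metrics(nodes, edges):
    """nodes: list of (node_id, label); edges: list of (src_id, rel_type, dst_id)."""
    # A node is an orphan if it appears on neither end of any relationship.
    connected = set()
    for src, _rel, dst in edges:
        connected.add(src)
        connected.add(dst)
    orphans = sum(1 for node_id, _label in nodes if node_id not in connected)

    # Sort before hashing so the digest does not depend on traversal order.
    canonical = json.dumps(
        {"nodes": sorted(nodes), "edges": sorted(edges)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return {
        "nodes_by_label": dict(Counter(label for _id, label in nodes)),
        "relationships_by_type": dict(Counter(rel for _s, rel, _d in edges)),
        "orphan_nodes": orphans,
        "graph_digest": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

Hashing a canonical, sorted serialization means two runs that produce the same graph in a different order still yield the same digest, which is the property the determinism check relies on.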

### 3. Golden file testing

Store expected metrics as golden files (JSON). On each CI run:
1. Shallow-clone each benchmark repo at a pinned commit SHA
2. Run the parser and collect metrics
3. Compare against golden file
4. Fail CI if any metric regresses relative to the golden file; node and relationship counts get a tolerance so that improvements do not fail the run
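Step 1 needs some care: `git clone --depth 1` accepts a branch or tag, not an arbitrary commit. One way to get a depth-1 checkout of a pinned SHA is to initialise an empty repo and fetch the commit directly. The sketch below assumes the host allows fetching by full 40-character SHA (GitHub generally does); the helper names `pinned_fetch_commands` and `fetch_pinned` are hypothetical.

```python
import subprocess


def pinned_fetch_commands(repo_url, sha, dest):
    """Return the git commands that check out exactly `sha` with depth 1."""
    return [
        ["git", "init", "--quiet", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo_url],
        # Fetch only the pinned commit, without history.
        ["git", "-C", dest, "fetch", "--quiet", "--depth", "1", "origin", sha],
        ["git", "-C", dest, "checkout", "--quiet", "--detach", "FETCH_HEAD"],
    ]


def fetch_pinned(repo_url, sha, dest):
    """Run the commands in order; raises CalledProcessError on any failure."""
    for cmd in pinned_fetch_commands(repo_url, sha, dest):
        subprocess.run(cmd, check=True)
```

Pinning by SHA rather than by tag keeps the golden file meaningful even if upstream moves or deletes a tag.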

### 4. Regression detection

Flag these as regressions:
- Orphan node count increases
- Total node count decreases (we lost information)
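- Graph digest differs between runs on the same pinned commit, or changes while node and relationship counts are unchanged (non-determinism reintroduced)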
- Parse error count increases

Flag these as improvements (info-only, don't fail):
- Orphan node count decreases
- Total node count increases (we captured more)
- Parse error count decreases
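The rules above can be sketched as a comparison between a golden metrics dict and the current one. The key names (`orphan_nodes`, `parse_errors`, `nodes_by_label`, `relationships_by_type`, `graph_digest`) and the function `classify` are assumptions for illustration. One further assumption, not stated in the issue: a digest change is reported as a regression only when the per-label and per-type counts are identical, since any intentional change in counts also changes the digest. Count tolerances are not modelled here.

```python
def classify(golden, current):
    """Compare two metric dicts; return (regressions, improvements) as lists of strings."""
    regressions, improvements = [], []

    def total_nodes(metrics):
        return sum(metrics["nodes_by_label"].values())

    # (metric name, golden value, current value, True if lower is better)
    checks = [
        ("orphan_nodes", golden["orphan_nodes"], current["orphan_nodes"], True),
        ("parse_errors", golden["parse_errors"], current["parse_errors"], True),
        ("total_nodes", total_nodes(golden), total_nodes(current), False),
    ]
    for name, old, new, lower_is_better in checks:
        if new == old:
            continue
        got_better = (new < old) == lower_is_better
        target = improvements if got_better else regressions
        target.append(f"{name}: {old} -> {new}")

    # Assumption: same counts but a different digest points at non-determinism.
    counts_unchanged = all(
        golden[key] == current[key]
        for key in ("nodes_by_label", "relationships_by_type")
    )
    if counts_unchanged and golden["graph_digest"] != current["graph_digest"]:
        regressions.append("graph_digest changed with identical counts")

    return regressions, improvements
```

A CI wrapper would fail the job when `regressions` is non-empty and only report `improvements`, leaving the golden file update as a deliberate follow-up commit.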

### 5. CI integration

Run corpus validation as a separate CI job (not blocking unit tests):
- Weekly scheduled run against all benchmark repos
- On-demand trigger for parser/query changes
- Results posted as CI artifacts or summary comments
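For the reporting bullet, a small sketch of turning per-repo results into a Markdown summary. It assumes results arrive as `{repo_name: (regressions, improvements)}` with each value a pair of string lists, and that the job runs on GitHub Actions, where the `GITHUB_STEP_SUMMARY` environment variable points at a file whose contents are shown on the run page. The function names are hypothetical.

```python
import os


def render_summary(results):
    """results: {repo_name: (regressions, improvements)} -> Markdown text."""
    lines = ["## Corpus validation", ""]
    for repo, (regressions, improvements) in sorted(results.items()):
        status = "FAIL" if regressions else "ok"
        lines.append(f"### {repo}: {status}")
        lines += [f"- regression: {item}" for item in regressions]
        lines += [f"- improvement: {item}" for item in improvements]
        lines.append("")
    return "\n".join(lines)


def publish(results):
    """Write the summary and return a process exit code (1 if any regression)."""
    text = render_summary(results)
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        # Append, since other steps in the job may also write to the summary.
        with open(path, "a", encoding="utf-8") as fh:
            fh.write(text + "\n")
    else:
        print(text)
    return 1 if any(regressions for regressions, _ in results.values()) else 0
```

Returning the exit code instead of calling `sys.exit` directly keeps the function easy to unit test; the scheduled job's entry point can pass the result to `sys.exit`.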

## Impact

- Objective measurement of graph quality across releases
- Catch regressions before users hit them
- Validate improvements like the C++ orphan fix (#496) and determinism fix (#515) against real code
- Build confidence for parser changes

## References

- [tree-sitter-al v2 rewrite](https://sshadows.dk/blog/tree-sitter-al-v2-rewrite/) — 15,358 file validation approach
- [One Parser, Six Tools — al-corpus](https://sshadows.dk/blog/one-parser-six-tools/) — structured extraction at scale
- [gdotv blog Part 2](https://gdotv.com/blog/codebase-rag-knowledge-graph-analysis-part-2/) — orphan nodes and schema issues found in Soufflé
- Issue #515 — non-deterministic output for JS/TS, Rust, Java repos
