Background
The tree-sitter-al v2 rewrite validated every grammar change against 15,358 production files across eight phases, catching regressions immediately. Their al-corpus tool walks the typed AST to extract structured training data from real codebases at scale.
We currently test with small, hand-crafted fixtures. That works for unit-testing specific patterns but misses emergent issues that only appear in real-world code, such as the 427 orphan Method nodes the gdotv blog found when indexing the Soufflé C++ codebase.
Problem
Without corpus validation:
Proposed approach
1. Select benchmark repos
Choose well-known, stable open-source repos that exercise each supported language's features. Candidates:
| Language | Repo | Why |
|---|---|---|
| Python | flask | Decorators, blueprints, class inheritance |
| JavaScript | axios | Async patterns, closures, module system |
| TypeScript | zod | Generics, type inference, complex types |
| Rust | bat | Traits, modules, cross-file references |
| Java | gson | Interfaces, generics, inner classes |
| C++ | souffle | Templates, namespaces, cross-TU methods |
| Go | cobra | Interfaces, packages, struct methods |
| Lua | kong | Metatables, OOP patterns, modules |
2. Define tracked metrics per repo
For each benchmark run, record:
- Total nodes by label (Function, Class, Method, Module, etc.)
- Total relationships by type (CALLS, DEFINES, IMPORTS, etc.)
- Orphan node count — nodes with zero relationships
- Parse error count — files that failed to parse
- Graph digest — deterministic hash of sorted node/edge lists
- Processing time — wall clock for full indexing
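A minimal sketch of how these metrics could be computed, assuming the indexer can export nodes as `(node_id, label)` pairs and edges as `(source_id, relationship_type, target_id)` triples. The function name `compute_metrics`, the input shapes, and the output keys are all hypothetical and not part of the existing codebase.

```python
import hashlib
import json
from collections import Counter


def compute_metrics(nodes, edges, parse_errors, elapsed_seconds):
    """Summarise one benchmark run.

    nodes: iterable of (node_id, label) pairs (assumed export format)
    edges: iterable of (source_id, rel_type, target_id) triples (assumed)
    """
    nodes = [tuple(n) for n in nodes]
    edges = [tuple(e) for e in edges]

    # A node is an orphan if it appears on neither end of any edge.
    connected = set()
    for source_id, _rel_type, target_id in edges:
        connected.add(source_id)
        connected.add(target_id)
    orphan_count = sum(1 for node_id, _label in nodes if node_id not in connected)

    # Sort before hashing so the digest does not depend on traversal order.
    canonical = json.dumps(
        {"nodes": sorted(nodes), "edges": sorted(edges)},
        sort_keys=True,
        separators=(",", ":"),
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    return {
        "nodes_by_label": dict(Counter(label for _id, label in nodes)),
        "relationships_by_type": dict(Counter(rel for _s, rel, _t in edges)),
        "orphan_nodes": orphan_count,
        "parse_errors": parse_errors,
        "graph_digest": digest,
        "processing_seconds": elapsed_seconds,
    }
```

Processing time is recorded but is probably best kept out of any pass/fail comparison, since wall-clock time varies between CI runners.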
3. Golden file testing
Store expected metrics as golden files (JSON). On each CI run:
- Shallow-clone each benchmark repo at a pinned commit SHA
- Run the parser and collect metrics
- Compare against golden file
- Fail CI on any unexpected metric change, allowing a tolerance on node/edge counts so that improvements (see step 4) do not fail the run
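One practical detail for the first step: `git clone --depth 1` accepts a branch or tag but not an arbitrary commit SHA, so pinning to a SHA usually means fetching that commit directly. The sketch below builds the command sequence; `shallow_fetch_commands` and `checkout_pinned` are hypothetical helper names, and it assumes the hosting server permits fetching a commit by SHA.

```python
import subprocess


def shallow_fetch_commands(url, sha, dest):
    """Return the git commands that fetch exactly one commit into dest."""
    dest = str(dest)
    return [
        ["git", "init", "--quiet", dest],
        ["git", "-C", dest, "remote", "add", "origin", url],
        # Fetch only the pinned commit, with no history behind it.
        ["git", "-C", dest, "fetch", "--quiet", "--depth", "1", "origin", sha],
        ["git", "-C", dest, "checkout", "--quiet", "FETCH_HEAD"],
    ]


def checkout_pinned(url, sha, dest):
    """Run the commands above, raising CalledProcessError on any failure."""
    for command in shallow_fetch_commands(url, sha, dest):
        subprocess.run(command, check=True)
```

Keeping the command list separate from its execution makes the pinning logic testable without network access.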
4. Regression detection
Flag these as regressions:
- Orphan node count increases
- Total node count decreases (we lost information)
- Graph digest differs between repeated runs of the same build on the same commit (non-determinism reintroduced)
- Parse error count increases
Flag these as improvements (info-only, don't fail):
- Orphan node count decreases
- Total node count increases (we captured more)
- Parse error count decreases
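The rules above could be sketched as a small classifier. This assumes each metrics record is a dictionary with `orphan_nodes`, `parse_errors`, `nodes_by_label`, and `graph_digest` keys; that layout and the name `classify` are hypothetical. The digest check here compares two runs of the same build (the optional `rerun` argument), which is one possible way to detect non-determinism rather than a settled design.

```python
def classify(golden, current, rerun=None):
    """Split metric changes into (regressions, improvements) message lists."""
    regressions, improvements = [], []

    def record(name, old, new, higher_is_worse):
        if new == old:
            return
        message = f"{name}: {old} -> {new}"
        worse = (new > old) if higher_is_worse else (new < old)
        (regressions if worse else improvements).append(message)

    record("orphan_nodes", golden["orphan_nodes"], current["orphan_nodes"], True)
    record("parse_errors", golden["parse_errors"], current["parse_errors"], True)
    record(
        "total_nodes",
        sum(golden["nodes_by_label"].values()),
        sum(current["nodes_by_label"].values()),
        False,  # fewer nodes means information was lost
    )

    # Assumption: rerun holds metrics from a second run of the same build.
    if rerun is not None and rerun["graph_digest"] != current["graph_digest"]:
        regressions.append("graph_digest: differs between two runs of the same build")

    return regressions, improvements
```

A CI job would fail only when the first list is non-empty and report the second list as informational output.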
5. CI integration
Run corpus validation as a separate CI job (not blocking unit tests):
- Weekly scheduled run against all benchmark repos
- On-demand trigger for parser/query changes
- Results posted as CI artifacts or summary comments
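For the reporting step, a rough sketch of turning per-repo results into a plain-text summary plus an exit code for the job. The input shape (repo name mapped to `regressions` and `improvements` lists) and the name `render_summary` are assumptions for illustration; how the text is attached as an artifact or comment depends on the CI system in use.

```python
def render_summary(results):
    """Return (summary_text, exit_code) for a corpus validation run.

    results: {repo_name: {"regressions": [...], "improvements": [...]}}
    """
    lines = []
    failed = False
    for repo in sorted(results):
        outcome = results[repo]
        if outcome["regressions"]:
            failed = True
            lines.append(f"{repo}: FAIL")
        else:
            lines.append(f"{repo}: ok")
        lines.extend(f"  regression: {m}" for m in outcome["regressions"])
        lines.extend(f"  improvement: {m}" for m in outcome["improvements"])
    return "\n".join(lines), (1 if failed else 0)
```

Improvements are listed but never affect the exit code, matching the info-only treatment described in step 4.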
Impact
References