
feat: add corpus validation testing against large real-world repos #522

Description

@vitali87

Background

The tree-sitter-al v2 rewrite validated every grammar change against 15,358 production files across eight phases, catching regressions immediately. Their al-corpus tool walks the typed AST to extract structured training data from real codebases at scale.

We currently test with small, hand-crafted fixtures. These work well for unit-testing specific patterns but miss emergent issues that only appear in real-world code, such as the 427 orphan Method nodes the gdotv blog found when indexing the Soufflé C++ codebase.

Problem

Without corpus validation:

Proposed approach

1. Select benchmark repos

Choose well-known, stable open-source repos that exercise each supported language's features. Candidates:

| Language   | Repo    | Why                                       |
|------------|---------|-------------------------------------------|
| Python     | flask   | Decorators, blueprints, class inheritance |
| JavaScript | axios   | Async patterns, closures, module system   |
| TypeScript | zod     | Generics, type inference, complex types   |
| Rust       | bat     | Traits, modules, cross-file references    |
| Java       | gson    | Interfaces, generics, inner classes       |
| C++        | souffle | Templates, namespaces, cross-TU methods   |
| Go         | cobra   | Interfaces, packages, struct methods      |
| Lua        | kong    | Metatables, OOP patterns, modules         |

2. Define tracked metrics per repo

For each benchmark run, record:

  • Total nodes by label (Function, Class, Method, Module, etc.)
  • Total relationships by type (CALLS, DEFINES, IMPORTS, etc.)
  • Orphan node count — nodes with zero relationships
  • Parse error count — files that failed to parse
  • Graph digest — deterministic hash of sorted node/edge lists
  • Processing time — wall clock for full indexing
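To make the metric list concrete, here is a minimal sketch of how one run could be summarised. It assumes the indexer can hand back nodes as `(node_id, label)` pairs and relationships as `(src_id, rel_type, dst_id)` triples; that shape, the function name `collect_metrics`, and the output keys are illustrative, not the project's actual API.

```python
import hashlib
import json
from collections import Counter


def collect_metrics(nodes, edges, parse_errors=0, seconds=0.0):
    """Summarise one benchmark run.

    nodes: list of (node_id, label) pairs (assumed shape).
    edges: list of (src_id, rel_type, dst_id) triples (assumed shape).
    """
    # A node is an orphan if it appears at neither end of any relationship.
    connected = {src for src, _, _ in edges} | {dst for _, _, dst in edges}
    orphans = sum(1 for node_id, _ in nodes if node_id not in connected)

    # Sort before hashing so the digest does not depend on traversal order.
    canonical = json.dumps(
        {"nodes": sorted(nodes), "edges": sorted(edges)},
        sort_keys=True,
        separators=(",", ":"),
    )

    return {
        "nodes_by_label": dict(Counter(label for _, label in nodes)),
        "relationships_by_type": dict(Counter(rel for _, rel, _ in edges)),
        "orphan_nodes": orphans,
        "parse_errors": parse_errors,
        "graph_digest": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "processing_seconds": seconds,
    }
```

Because the lists are sorted before hashing, two runs that emit the same graph in a different order produce the same digest, which is the property the digest metric relies on.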

3. Golden file testing

Store expected metrics as golden files (JSON). On each CI run:

  1. Shallow-clone each benchmark repo at a pinned commit SHA
  2. Run the parser and collect metrics
  3. Compare against golden file
  4. Fail CI if any metric changes unexpectedly (with tolerance for node/edge counts to allow for improvements)
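The steps above could be sketched roughly as follows. The helper names, the 2% default tolerance, and the golden-file layout (a JSON object of label-to-count mappings) are assumptions for illustration. The git commands fetch a single pinned commit without history, which relies on the remote allowing fetches by commit SHA.

```python
import json
from pathlib import Path


def shallow_fetch_commands(url, sha, dest):
    """Git commands that fetch exactly one pinned commit, without history."""
    return [
        ["git", "init", "--quiet", dest],
        ["git", "-C", dest, "remote", "add", "origin", url],
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", sha],
        ["git", "-C", dest, "checkout", "--quiet", "FETCH_HEAD"],
    ]


def load_golden(path):
    """Read a golden metrics file stored as JSON."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def count_drift(golden, actual, tolerance=0.02):
    """Return {key: (expected, got)} for counts outside the relative tolerance.

    golden and actual map a label or relationship type to a count.
    The 2% default is a placeholder, not a recommended value.
    """
    drift = {}
    for key in sorted(set(golden) | set(actual)):
        expected, got = golden.get(key, 0), actual.get(key, 0)
        # max(expected, 1) keeps the check meaningful for labels that start at 0.
        if abs(got - expected) > tolerance * max(expected, 1):
            drift[key] = (expected, got)
    return drift
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. A CI step would then fail when `count_drift` returns a non-empty dict for either the node counts or the relationship counts.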

4. Regression detection

Flag these as regressions:

  • Orphan node count increases
  • Total node count decreases (we lost information)
  • Graph digest changes (non-determinism reintroduced)
  • Parse error count increases

Flag these as improvements (info-only, don't fail):

  • Orphan node count decreases
  • Total node count increases (we captured more)
  • Parse error count decreases
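As a rough illustration of these rules, the sketch below splits the differences between a golden file and a fresh run into regressions and improvements. The flat metric keys (`orphan_nodes`, `parse_errors`, `total_nodes`, `graph_digest`) and the function name are hypothetical.

```python
def classify_changes(golden, actual):
    """Split metric changes into (regressions, improvements)."""
    regressions, improvements = [], []

    def compare(name, worse_when_higher):
        before, after = golden[name], actual[name]
        if before == after:
            return
        # A rise is bad for error-like metrics, a fall is bad for coverage-like ones.
        got_worse = (after > before) == worse_when_higher
        target = regressions if got_worse else improvements
        target.append(f"{name}: {before} -> {after}")

    compare("orphan_nodes", worse_when_higher=True)
    compare("parse_errors", worse_when_higher=True)
    compare("total_nodes", worse_when_higher=False)

    # Any digest change is reported as a regression, following the list above.
    if golden["graph_digest"] != actual["graph_digest"]:
        regressions.append("graph_digest changed")

    return regressions, improvements
```

One thing this makes visible: a change that captures more nodes also changes the digest, so under these rules it shows up as both an improvement and a regression. How the two should be reconciled is left open here.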

5. CI integration

Run corpus validation as a separate CI job (not blocking unit tests):

  • Weekly scheduled run against all benchmark repos
  • On-demand trigger for parser/query changes
  • Results posted as CI artifacts or summary comments
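For the reporting step, one possible shape is a small script that renders per-repo results as Markdown and appends them to the job summary. This assumes the job runs on GitHub Actions, which exposes the summary file path in the `GITHUB_STEP_SUMMARY` environment variable. The function names and the input shape are illustrative.

```python
import os


def render_summary(results):
    """results: {repo_name: (regressions, improvements)} -> Markdown text."""
    lines = ["## Corpus validation", ""]
    for repo in sorted(results):
        regressions, improvements = results[repo]
        status = "FAIL" if regressions else "ok"
        lines.append(f"### {repo}: {status}")
        lines += [f"- regression: {item}" for item in regressions]
        lines += [f"- improvement: {item}" for item in improvements]
        lines.append("")
    return "\n".join(lines)


def publish_summary(text):
    """Append to the Actions job summary, or print when run locally."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if path:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(text + "\n")
    else:
        print(text)
```

Falling back to stdout keeps the same script usable for the on-demand runs on a developer machine.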

Impact

References
