
[Router] Back-compat conformance corpus (golden policy → Decision tests) #2425

Description

@ramkrishna2910

Goal

Mechanically guarantee the milestone's hard back-compat rule: a future server must never break a policy authored against an earlier schema major. Today that rule is held by discipline (the "never redefine a shipped field / never delete a major's schema" principles documented in src/cpp/resources/schemas/README.md). This issue makes it a CI gate via a frozen conformance corpus of versioned policies + their expected Decisions, replayed against every server build.

Why

Schema validation proves an old policy still parses; it does not prove it still routes the same way. The deterministic tiers (keyword/regex/char rules, the routing.router desugaring, band/first-match logic) are pure lemonade engine logic and MUST stay behavior-stable across versions — exactly the thing a golden corpus locks down. (Model-backed scores carry inherent backend/model numerical wobble; see Scope for how the corpus handles that.)
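To make the parse-versus-route distinction concrete, here is a minimal sketch of a first-match-wins evaluator. The policy shape (the `rules`, `keyword`, `regex` and `name` keys) and the `route` function are illustrative assumptions, not the actual collection.router schema or the engine's API. Only the Decision field names (route_to, matched_rule, default_used) come from this issue, and `outputs` is left out for brevity.

```python
import re

# Hypothetical policy shape, for illustration only (not the real schema).
POLICY = {
    "default_model": "small-model",
    "rules": [
        {"name": "code", "keyword": "traceback", "route_to": "coder-model"},
        {"name": "math", "regex": r"\d+\s*[-+*/]\s*\d+", "route_to": "math-model"},
    ],
}


def rule_matches(rule, text):
    """Return True if a single keyword or regex rule matches the input."""
    if "keyword" in rule:
        return rule["keyword"].lower() in text.lower()
    if "regex" in rule:
        return re.search(rule["regex"], text) is not None
    return False


def route(policy, text):
    """First-match-wins: the earliest matching rule decides the route."""
    for rule in policy["rules"]:
        if rule_matches(rule, text):
            return {
                "route_to": rule["route_to"],
                "matched_rule": rule["name"],
                "default_used": False,
            }
    # No rule matched, so fall back to default_model.
    return {
        "route_to": policy["default_model"],
        "matched_rule": None,
        "default_used": True,
    }
```

With this sketch, the input "traceback at 3 + 4" matches both rules and routes to coder-model because the keyword rule comes first. If a later build evaluated rules in a different order, the same policy would still pass schema validation but would route that input to math-model. A golden case pinning the expected Decision is what would catch that.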

Scope

  • Corpus layout: test/conformance/routing/<schema_major>/<case>/, with each case directory holding policy.json (a versioned collection.router policy) + cases.jsonl (input → expected Decision). Seed v1 from the existing test/cpp/fixtures/routing/ L0a–L3 examples.
  • Deterministic cases (exact): L1 keyword/regex/char rules, any/all/not, first-match-wins, default_model fallback, and routing.router desugaring — assert the full Decision (route_to, matched_rule, default_used, outputs) byte-for-byte.
  • Model-backed cases (stubbed): L2/L3/L0a run against a pinned fake ClassifierServices (fixed embeddings / scores / chat reply) so the assertion tests the engine's threshold + selection logic, not the backend's floats. The fake fixture is committed alongside the case. (Live-backend tolerance bands are explicitly out of scope here — that's a separate, non-gating perf/accuracy concern.)
  • Runner: a CTest target (C++, reusing the foundation RoutingPolicyEngine + fake services) and/or a Python harness under test/, run in CI on every PR.
  • Append-only discipline: when a new schema major ships, its corpus dir is added; existing major dirs are immutable. A PR that has to edit a frozen case to get CI green is the signal that the change broke back-compat.
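The stubbed model-backed cases could look roughly like the sketch below: a fake service returns scores committed with the case, so the assertion only exercises threshold and selection logic. The `FakeClassifierServices.score` method, the `bands` structure and `select_route` are hypothetical names chosen for illustration; the real ClassifierServices interface and band semantics may differ.

```python
from dataclasses import dataclass, field


@dataclass
class FakeClassifierServices:
    """Stand-in for a model backend that returns scores pinned in the fixture.

    Hypothetical interface; the real ClassifierServices may differ.
    """

    pinned_scores: dict = field(default_factory=dict)

    def score(self, text):
        # Unknown inputs raise KeyError, so a case can never silently
        # fall through to numbers that were not committed with it.
        return self.pinned_scores[text]


def select_route(services, text, bands, default_model):
    """Pick the first band whose label score meets its threshold."""
    scores = services.score(text)
    for band in bands:
        if scores.get(band["label"], 0.0) >= band["min_score"]:
            return {"route_to": band["route_to"], "default_used": False}
    # No band cleared its threshold, so use the default model.
    return {"route_to": default_model, "default_used": True}
```

Because the scores are fixed in the fixture, the expected Decision can be asserted exactly, the same way as for the deterministic cases, without depending on any backend's floating-point output.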

Out of scope

  • Live-backend numerical reproducibility / tolerance bands for model-backed scores.
  • Component-resolution compatibility (an old policy naming a delisted model) — that's a model-registry concern, tracked separately.
  • The migration shims themselves (per-major load-time upgraders) — this issue is the test that would guard them; the shim machinery lands when the second major is introduced.

Acceptance

  • A committed v1 corpus covering the deterministic paths + stubbed model-backed paths, seeded from the L0a–L3 fixtures.
  • A CI-wired runner that replays every case and diffs the produced Decision against the expected one; any mismatch fails the build.
  • A short doc note (in the schemas README) pointing to the corpus as the enforcement mechanism behind the "never redefine" rule.
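A Python harness for the replay-and-diff step might be sketched as follows. This assumes each cases.jsonl line is an object with hypothetical `input` and `expected` keys, and that the engine is reachable through a caller-supplied `route_fn(policy, input)`; neither is specified by this issue. Comparing canonical JSON is used here as an approximation of the byte-for-byte requirement.

```python
import json
from pathlib import Path


def load_cases(case_dir):
    """Read policy.json and cases.jsonl from one corpus case directory."""
    policy = json.loads((case_dir / "policy.json").read_text(encoding="utf-8"))
    cases = []
    for line in (case_dir / "cases.jsonl").read_text(encoding="utf-8").splitlines():
        if line.strip():  # tolerate blank lines between cases
            cases.append(json.loads(line))
    return policy, cases


def canonical(decision):
    """Serialize a Decision deterministically so the comparison is exact."""
    return json.dumps(decision, sort_keys=True, separators=(",", ":"))


def replay_corpus(corpus_root, route_fn):
    """Replay every case under <schema_major>/<case>/ and collect mismatches."""
    failures = []
    for case_dir in sorted(Path(corpus_root).glob("*/*")):
        if not (case_dir / "policy.json").is_file():
            continue  # skip anything that is not a case directory
        policy, cases = load_cases(case_dir)
        for index, case in enumerate(cases):
            # "input" and "expected" are assumed key names for illustration.
            actual = route_fn(policy, case["input"])
            if canonical(actual) != canonical(case["expected"]):
                failures.append((str(case_dir), index, case["expected"], actual))
    return failures
```

A CI entry point could call `replay_corpus` with the real engine binding, print each failure tuple, and exit non-zero when the list is non-empty, which gives the "any mismatch fails the build" behavior.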

Dependencies

Needs the engine actually producing Decisions: the foundation interfaces (#2407), the evaluator/registry/conditions/classifiers, and the engine assembly (#2382). Best filled once route() is implemented; the corpus fixtures can be authored earlier. Non-gating for the rest of the milestone — it's the guard rail, not a build step on the critical path.


Labels: area::ci, enhancement
