Goal
Mechanically guarantee the milestone's hard back-compat rule: a future server must never break a policy authored against an earlier schema major. Today that rule is held by discipline (the "never redefine a shipped field / never delete a major's schema" principles documented in src/cpp/resources/schemas/README.md). This issue makes it a CI gate via a frozen conformance corpus of versioned policies + their expected Decisions, replayed against every server build.
Why
Schema validation proves an old policy still parses; it does not prove it still routes the same way. The deterministic tiers (keyword/regex/char rules, the routing.router desugaring, band/first-match logic) are pure lemonade engine logic and MUST stay behavior-stable across versions — exactly the thing a golden corpus locks down. (Model-backed scores carry inherent backend/model numerical wobble; see Scope for how the corpus handles that.)
Scope
- Corpus layout: test/conformance/routing/<schema_major>/<case>/, each holding policy.json (a versioned collection.router policy) + cases.jsonl (input → expected Decision). Seed v1 from the existing test/cpp/fixtures/routing/ L0a–L3 examples.
- Deterministic cases (exact): L1 keyword/regex/char rules, any/all/not, first-match-wins, default_model fallback, and routing.router desugaring. Assert the full Decision (route_to, matched_rule, default_used, outputs) byte-for-byte.
- Model-backed cases (stubbed): L2/L3/L0a run against a pinned fake ClassifierServices (fixed embeddings / scores / chat reply) so the assertion tests the engine's threshold + selection logic, not the backend's floats. The fake fixture is committed alongside the case. (Live-backend tolerance bands are explicitly out of scope here; that's a separate, non-gating perf/accuracy concern.)
- Runner: a CTest target (C++, reusing the foundation RoutingPolicyEngine + fake services) and/or a Python harness under test/, run in CI on every PR.
- Append-only discipline: when a new schema major ships, its corpus dir is added; existing major dirs are immutable — editing a frozen case is the CI signal that a change broke back-compat.
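One possible way to make the append-only rule itself mechanical is sketched below: record a digest for every file in a shipped major's corpus dir, then have CI report any file that was later modified or deleted. This is an illustration only. The helper names (digest_tree, frozen_violations) and the idea of a recorded digest map are assumptions, not part of this issue; a plain git diff against the base branch would serve the same purpose. The sketch tolerates newly added files, which is also an assumption and can be tightened if a shipped major should accept no new cases.

```python
import hashlib
from pathlib import Path


def digest_tree(major_dir: Path) -> dict[str, str]:
    """Map each file's POSIX-style relative path to its SHA-256 hex digest."""
    return {
        p.relative_to(major_dir).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(major_dir.rglob("*"))
        if p.is_file()
    }


def frozen_violations(recorded: dict[str, str], current: dict[str, str]) -> list[str]:
    """List recorded files that were edited or deleted since the digests were taken.

    Files present only in `current` (new cases) are not reported; that leniency
    is an assumption of this sketch.
    """
    problems = []
    for rel, digest in sorted(recorded.items()):
        if rel not in current:
            problems.append(f"deleted: {rel}")
        elif current[rel] != digest:
            problems.append(f"modified: {rel}")
    return problems
```

In CI, `recorded` would be loaded from wherever the digests are committed (hypothetical), `current` computed from the checkout, and a non-empty result would fail the job.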
Out of scope
- Live-backend numerical reproducibility / tolerance bands for model-backed scores.
- Component-resolution compatibility (an old policy naming a delisted model) — that's a model-registry concern, tracked separately.
- The migration shims themselves (per-major load-time upgraders) — this issue is the test that would guard them; the shim machinery lands when the second major is introduced.
Acceptance
- A committed v1 corpus covering the deterministic paths + stubbed model-backed paths, seeded from the L0a–L3 fixtures.
- A CI-wired runner that replays every case and diffs the produced Decision against the expected one; any mismatch fails the build.
- A short doc note (in the schemas README) pointing to the corpus as the enforcement mechanism behind the "never redefine" rule.
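As a rough illustration of the replay-and-diff runner, here is a minimal Python sketch. It assumes each cases.jsonl line is an object with "input" and "expected" keys, and that the engine is reachable through a `route(policy, input)` callable returning a dict; both are hypothetical, since this issue does not pin down the case-file schema or the binding to RoutingPolicyEngine. Only the four Decision field names come from the Scope section. Byte-for-byte comparison is approximated by comparing canonical JSON serializations.

```python
import json
from pathlib import Path
from typing import Callable, Iterator

# Decision fields named in this issue; everything else here is illustrative.
DECISION_FIELDS = ("route_to", "matched_rule", "default_used", "outputs")


def canonical(decision: dict) -> str:
    """Serialize a Decision deterministically so string equality is exact."""
    picked = {k: decision.get(k) for k in DECISION_FIELDS}
    return json.dumps(picked, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def replay_corpus(root: Path, route: Callable[[dict, dict], dict]) -> Iterator[str]:
    """Yield one message per mismatch across <schema_major>/<case>/ directories."""
    for policy_path in sorted(root.glob("*/*/policy.json")):
        case_dir = policy_path.parent
        policy = json.loads(policy_path.read_text(encoding="utf-8"))
        lines = (case_dir / "cases.jsonl").read_text(encoding="utf-8").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            case = json.loads(line)  # hypothetical shape: {"input": ..., "expected": ...}
            got = canonical(route(policy, case["input"]))
            want = canonical(case["expected"])
            if got != want:
                where = case_dir.relative_to(root).as_posix()
                yield f"{where}:{lineno}: expected {want}, got {got}"
```

A CI entry point could collect `list(replay_corpus(Path("test/conformance/routing"), route))` and exit non-zero when the list is non-empty, printing each message so the failing case and line are visible in the build log.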
Dependencies
Needs the engine actually producing Decisions: the foundation interfaces (#2407), the evaluator/registry/conditions/classifiers, and the engine assembly (#2382). Best filled once route() is implemented; the corpus fixtures can be authored earlier. Non-gating for the rest of the milestone — it's the guard rail, not a build step on the critical path.