Goal
Create the initial benchmark format for evaluating OVK on agent-authored PR verification tasks.
Scope
- Define benchmark item schema.
- Add 5 seed fixtures based on the current examples.
- In each fixture, include the expected intents, expected backend class, expected evidence status, and expected merge decision.
- Add a simple scoring script placeholder.
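As a starting point for discussion, the schema could be sketched as a small dataclass with a loader that rejects incomplete fixtures. This is only a sketch under assumptions: the field names (`item_id`, `expected_intents`, `expected_backend_class`, `expected_evidence_status`, `expected_merge_decision`) and the `load_item` helper are hypothetical, and the allowed values for each field are left open because this issue does not define them yet.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class BenchmarkItem:
    """One benchmark fixture. All field names are hypothetical."""
    item_id: str
    expected_intents: tuple[str, ...]
    expected_backend_class: str
    expected_evidence_status: str
    expected_merge_decision: str


# Fields that must hold a non-empty string.
STRING_FIELDS = (
    "item_id",
    "expected_backend_class",
    "expected_evidence_status",
    "expected_merge_decision",
)


def load_item(raw: dict[str, Any]) -> BenchmarkItem:
    """Validate a parsed fixture (e.g. from JSON) and build a BenchmarkItem."""
    required = STRING_FIELDS + ("expected_intents",)
    missing = [name for name in required if name not in raw]
    if missing:
        raise ValueError(f"fixture is missing fields: {', '.join(missing)}")

    for name in STRING_FIELDS:
        if not isinstance(raw[name], str) or not raw[name]:
            raise ValueError(f"{name} must be a non-empty string")

    intents = raw["expected_intents"]
    if (
        not isinstance(intents, list)
        or not intents
        or not all(isinstance(i, str) and i for i in intents)
    ):
        raise ValueError("expected_intents must be a non-empty list of strings")

    return BenchmarkItem(
        item_id=raw["item_id"],
        expected_intents=tuple(intents),
        expected_backend_class=raw["expected_backend_class"],
        expected_evidence_status=raw["expected_evidence_status"],
        expected_merge_decision=raw["expected_merge_decision"],
    )
```

Keeping validation in one function means each seed fixture can be checked the same way, whichever file format the fixtures end up in.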
Acceptance criteria
- Benchmark fixtures can be loaded and validated.
- Each fixture states the expected verification intents and the expected merge decision.
- Scoring dimensions include intent recall, backend selection, evidence honesty, counterexample usefulness, and merge decision.
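The scoring dimensions listed above could be stubbed roughly as follows. This is a placeholder sketch, not a settled design: the dictionary keys and the `score_item` and `intent_recall` names are hypothetical, backend selection, evidence honesty, and merge decision are assumed to be exact-match comparisons against the fixture, and counterexample usefulness is left unscored (`None`) because no rubric for it has been defined here.

```python
from typing import Any, Optional


def intent_recall(expected: list[str], predicted: list[str]) -> float:
    """Fraction of expected intents that appear among the predicted intents."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0  # nothing to recall; treated as a perfect score by assumption
    return len(expected_set & set(predicted)) / len(expected_set)


def score_item(
    expected: dict[str, Any], actual: dict[str, Any]
) -> dict[str, Optional[float]]:
    """Score one run against one fixture on the five listed dimensions."""
    return {
        "intent_recall": intent_recall(
            expected["expected_intents"], actual.get("intents", [])
        ),
        # Exact match is an assumption; partial credit may be wanted later.
        "backend_selection": float(
            actual.get("backend_class") == expected["expected_backend_class"]
        ),
        "evidence_honesty": float(
            actual.get("evidence_status") == expected["expected_evidence_status"]
        ),
        # Placeholder: needs a rubric or human judgment before it can be scored.
        "counterexample_usefulness": None,
        "merge_decision": float(
            actual.get("merge_decision") == expected["expected_merge_decision"]
        ),
    }
```

Returning one value per dimension, rather than a single aggregate, keeps the dimensions separately reportable until a weighting is agreed on.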