Vera is built on the principle that AI features should be tested with the same rigor as traditional software, but with techniques adapted for the probabilistic nature of Large Language Models (LLMs).
While unit tests are great for deterministic code, they often struggle with natural language generation or complex reasoning tasks. Vera combines two layers of tests/evaluation:
- Static Checks: Programmatic verification using Python code. This is used for things that
can be determined definitively, like SQL syntax validity, regex matching, or ensuring certain
forbidden strings (like
DROP TABLE) are absent. - LLM-as-a-Judge: Using a high-reasoning model (like Gemini 1.5 Pro) to test the output against human-readable specifications. This allows for nuanced grading of business logic, tone, and safety that a regex simply cannot capture.
A key innovation in Vera is the use of structured Markdown "Specs" to guide the LLM Judge. These specs are defined per feature and provide the context needed for an accurate evaluation.
The Rubric is the most important spec. It defines the specific metrics (e.g., Business Logic, Efficiency, Accuracy) and a scale (typically 1-5) for each.
- Why?: It ensures the LLM Judge evaluates the feature consistently across different test cases and runs.
This file lists the "red lines" the feature must not cross.
- Why?: If a safety constraint is violated, the model should fail the test immediately, regardless of how good the rest of the output is.
Defines domain-specific terms and business logic.
- Why?: To help the Judge understand the context. For example, what exactly qualifies as an " Active User" in your specific database schema?
Defines expectations for the "look and feel" of the output.
- Why?: Checks for tone, markdown usage, or specific formatting requirements (e.g., "Always use snake_case for column aliases").
Any other relevant information that doesn't fit into the other categories, such as technical constraints or user personas.
Provides examples of "perfect" inputs and outputs.
- Why?: LLMs perform much better at evaluation when they have "few-shot" examples of what success looks like.
- Execution: The feature is run with the provided test case input.
- Static Tests: The plugin's Python code runs programmatic checks on the output.
- LLM Evaluation:
- Vera gathers all the Spec files.
- It constructs a comprehensive prompt for the Gemini Judge, including the specs, the test case input, and the actual feature output.
- The Judge returns a structured evaluation (scores and reasoning) based on the specs.
- Reporting: All results are aggregated into a single row in a CSV report.