-
Notifications
You must be signed in to change notification settings - Fork 262
Add skill validator CLI improvements and create-skill-evaluation meta-skill #376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
javiercn
wants to merge
3
commits into
dotnet:main
Choose a base branch
from
javiercn:javiercn/skill-validator-improvements
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+365
−16
Open
Changes from all commits
Commits
Show all changes
3 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,242 @@ | ||
| --- | ||
| name: create-skill-evaluation | ||
| description: > | ||
| Write eval.yaml files that measure whether a SKILL.md improves agent output. | ||
| USE when creating or improving tests/<plugin>/<skill>/eval.yaml. | ||
| DO NOT USE for writing SKILL.md content itself or for running the skill-validator CLI. | ||
| --- | ||
|
|
||
| # Create Skill Evaluation | ||
|
|
||
| Produce an `eval.yaml` that the skill-validator can run to prove a skill adds measurable value. | ||
| The eval file lives at `tests/<plugin>/<skill-name>/eval.yaml`. | ||
|
|
||
| ## Inputs | ||
|
|
||
| Before writing an eval, you need: | ||
|
|
||
| 1. The **SKILL.md** being evaluated — read it to catalog every teaching | ||
| 2. The **plugin domain** — determines setup approach (project scaffolding, fixtures, inline files) | ||
| 3. The **CONTRIBUTING.md** and **eng/skill-validator/README.md** — reference for assertion types, setup options, and CLI flags | ||
|
|
||
| ## Eval design principles | ||
|
|
||
| ### Scenarios describe goals, not techniques | ||
|
|
||
| The prompt must read like a real user request: "I need a component that does X." | ||
| It must never mention concepts the skill teaches. If the skill teaches retry policies | ||
| with Polly, the prompt says "make the HTTP calls resilient" — not "add a Polly retry policy." | ||
| When the prompt contains the skill's teachings, the baseline gets the same guidance | ||
| and the eval cannot measure the skill's contribution. | ||
|
|
||
| ### Scenario domains must not overlap with SKILL.md examples | ||
|
|
||
| If the skill teaches caching with a "product catalog" example, do not create an | ||
| eval scenario about product catalogs. The agent may memorize and replay the example | ||
| rather than generalize. Choose unrelated domains — weather data, appointment scheduling, | ||
| audit logging — so the eval measures whether the agent internalized the pattern. | ||
|
|
||
| ### 5 scenarios is the sweet spot | ||
|
|
||
| Enough diversity to avoid overfitting to one problem shape, few enough to finish | ||
| in reasonable time. Each scenario should exercise a different subset of the skill's | ||
| teachings so no single pattern carries the entire score. | ||
|
|
||
| ### Baselines should be achievable but imperfect | ||
|
|
||
| If the baseline already scores 5/5 on a scenario, the skill cannot show improvement. | ||
| Pick problems where an unskilled agent produces working but non-idiomatic code — the | ||
| kind of gap a skill is designed to close. | ||
|
|
||
| ## eval.yaml structure | ||
|
|
||
| ### Setup | ||
|
|
||
| Choose the setup approach based on what the scenario needs: | ||
|
|
||
| | Approach | When to use | Example | | ||
| |----------|-------------|---------| | ||
| | `commands:` | Agent needs a scaffolded project | `dotnet new webapi --no-https` | | ||
| | `copy_test_files: true` | Fixtures live alongside eval.yaml | Legacy migration scenarios | | ||
| | `files:` (inline) | Specific file contents required | Broken code that needs fixing | | ||
| | No setup | Agent creates everything from scratch | Pure authoring tasks | | ||
|
|
||
| Use YAML anchors when all scenarios share the same setup: | ||
|
|
||
| ```yaml | ||
| scenarios: | ||
| - name: "First scenario" | ||
| setup: &api_project | ||
| commands: | ||
| - "dotnet new webapi --no-https" | ||
| # ... | ||
|
|
||
| - name: "Second scenario" | ||
| setup: *api_project | ||
| # ... | ||
| ``` | ||
|
|
||
| ### Prompts | ||
|
|
||
| Write prompts as a user would — describe the goal with bulleted requirements: | ||
|
|
||
| ```yaml | ||
| prompt: | | ||
| I need a `RetryHandler` for my ASP.NET Core app. | ||
|
|
||
| Requirements: | ||
| - Wraps HttpClient calls with automatic retry on transient failures | ||
| - Configurable max retry count and base delay via DI options | ||
| - Uses exponential backoff with jitter | ||
| - Logs each retry attempt with the attempt number and delay | ||
| - After exhausting retries, throws the original exception | ||
| - Registered as a DelegatingHandler in the DI container | ||
| ``` | ||
|
|
||
| Do not include architecture decisions, pattern names, or implementation hints | ||
| that come from the skill. The prompt describes *what*, the skill provides *how*. | ||
|
|
||
| ### Assertions | ||
|
|
||
| Use 3–4 assertions per scenario as structural gates. Assertions verify necessary | ||
| conditions — the rubric handles quality. | ||
|
|
||
| ```yaml | ||
| assertions: | ||
| - type: "output_contains" | ||
| value: "DelegatingHandler" | ||
| - type: "output_contains" | ||
| value: "ILogger" | ||
| - type: "output_matches" | ||
| pattern: "(AddHttpMessageHandler|AddTransientHttpErrorPolicy)" | ||
| ``` | ||
|
|
||
| Available assertion types: | ||
|
|
||
| | Type | Purpose | | ||
| |------|---------| | ||
| | `output_contains` | Key type/method name appears in agent output | | ||
| | `output_not_contains` | Banned pattern is absent (e.g., hardcoded secrets) | | ||
| | `output_matches` | Regex for alternative valid patterns | | ||
| | `file_exists` | Expected file was created | | ||
| | `file_contains` | File has specific content | | ||
| | `exit_success` | Agent produced non-empty output | | ||
|
|
||
| ### Rubric items | ||
|
|
||
| Write 10–13 items per scenario. Each item is one observable behavior scored 1–5 | ||
| by the LLM judge. | ||
|
|
||
| **Format**: "Does [correct thing] — not [common mistake]" | ||
|
|
||
| ```yaml | ||
| rubric: | ||
| - "Implements retry as a DelegatingHandler — not as inline try/catch in every call site" | ||
| - "Uses exponential backoff with jitter — not fixed-delay retry" | ||
| - "Registers via AddHttpMessageHandler in DI — not manually wrapping HttpClient" | ||
| - "Logs each retry with attempt number and computed delay — not silent retries" | ||
| - "Configures retry options through IOptions<T> pattern — not hardcoded constants" | ||
| - "Only retries on transient HTTP errors (408, 429, 5xx, network failures) — not on 4xx client errors" | ||
| - "Throws the original exception after exhausting retries — not a wrapped AggregateException" | ||
| - "Disposes no unmanaged resources and does not implement IDisposable unnecessarily" | ||
| ``` | ||
|
|
||
| Rules: | ||
| - One behavior per item — the judge must be able to score it independently | ||
| - Name specific APIs and methods — "uses `AddHttpMessageHandler`" not "registers correctly" | ||
| - Include anti-pattern phrasing — tells the judge what penalties to apply | ||
| - Don't rubric things the baseline already does perfectly — wastes discriminative power | ||
| - Add items for patterns the skill explicitly warns against — these are frequently missed | ||
|
|
||
| ### Timeout | ||
|
|
||
| | Scenario type | Timeout | | ||
| |---------------|---------| | ||
| | Analysis only (no code generation) | 120s | | ||
| | Code generation, simple scaffolding | 180–300s | | ||
| | Code generation with project setup + multiple files | 300–600s | | ||
|
|
||
| If rubric count exceeds 10 items per scenario, use `--judge-timeout 600` when running | ||
| the validator. | ||
|
|
||
| ## Iteration workflow | ||
|
|
||
| ### 1. Fast feedback (1-run) | ||
|
|
||
| ```bash | ||
| dotnet run --no-launch-profile --project eng/skill-validator/src/SkillValidator.csproj \ | ||
| -- <skill-path> --tests-dir tests/<plugin> --runs 1 --verbose \ | ||
| --reporter json --reporter markdown --no-overfitting-check \ | ||
| --keep-work-dirs --work-dir artifacts/work-dirs | ||
| ``` | ||
|
|
||
| ### 2. Check baseline scores | ||
|
|
||
| For each rubric item, look at the baseline score. If baseline scores 5/5 on an | ||
| item, that item adds no signal — remove or replace it with something the baseline | ||
| misses. | ||
|
|
||
| ### 3. Check for equal failures | ||
|
|
||
| If both baseline and skilled variants fail a rubric item equally, either: | ||
| - The skill doesn't teach that pattern (skill gap — fix the SKILL.md) | ||
| - The rubric is unfair (asks for something unreasonable — rewrite the item) | ||
|
|
||
| ### 4. Compare structural diversity across variants | ||
|
|
||
| After a 1-run eval, read the baseline and skilled output for each scenario. | ||
| If the generated code has the same file structure, class hierarchy, and API | ||
| surface across both variants, the scenario is too constrained or the skill | ||
| is not changing how the agent approaches the problem. Good scenarios show | ||
| structural differences: the skilled variant creates a wrapper class the baseline | ||
| does not, uses a different disposal pattern, or organizes files differently. | ||
| If 3+ scenarios produce near-identical baseline vs skilled code, diversify the | ||
| scenario domains or loosen the prompts. | ||
|
|
||
| ### 5. Read judge reasoning | ||
|
|
||
| The results JSON contains per-rubric `reasoning` from the judge. Read it to | ||
| understand why specific items scored low — the judge often reveals whether the | ||
| agent missed a pattern or the rubric was ambiguous. | ||
|
|
||
| ### 6. Final validation (5-run) | ||
|
|
||
| ```bash | ||
| dotnet run --no-launch-profile --project eng/skill-validator/src/SkillValidator.csproj \ | ||
| -- <skill-path> --tests-dir tests/<plugin> --runs 5 --verbose \ | ||
| --reporter json --reporter markdown --no-overfitting-check \ | ||
| --judge-timeout 600 | ||
| ``` | ||
|
|
||
| The eval passes when overall score is ≥ +10% and all scenarios show improvement. | ||
|
|
||
| ## Common mistakes | ||
|
|
||
| | Mistake | Why it hurts | | ||
| |---------|-------------| | ||
| | Prompt leaks skill teachings | Mentioning "use the Options pattern" or "create a handler class" in the prompt gives the baseline the same guidance the skill provides — eval cannot measure improvement | | ||
| | All scenarios exercise the same 2–3 teachings | The eval appears to pass but only validates a fraction of the skill | | ||
| | Scenario domains overlap with SKILL.md examples | Agent replays memorized examples instead of generalizing | | ||
| | Baseline and skilled outputs are structurally identical | Scenarios are too narrow or prompts too prescriptive — the skill has no room to change the agent's approach | | ||
| | Rubric tests vocabulary instead of outcomes | "Uses the word 'resilience'" vs "Retries on transient failures" — the first is overfitting, the second measures behavior | | ||
| | Compound rubric items | "Does A and B and C" — the judge cannot score partially; split into separate items | | ||
| | Assertions check things unrelated to the skill | A `file_exists` check for `README.md` when the skill teaches error handling — noise | | ||
| | Timeout too short | Agent gets killed mid-generation; scored as failure when it was simply slow | | ||
| | Missing anti-pattern rubrics | The skill warns "don't do X" but the rubric never penalizes X — the eval misses a key signal | | ||
|
|
||
| ## Checklist | ||
|
|
||
| Before submitting the eval for review: | ||
|
|
||
| - [ ] 5 scenarios with diverse problem domains | ||
| - [ ] No domain overlap between eval scenarios and SKILL.md examples | ||
| - [ ] Prompts describe goals — no skill concepts or pattern names mentioned | ||
| - [ ] 3–4 assertions per scenario (structural gates) | ||
| - [ ] 10–13 rubric items per scenario (quality bar) | ||
| - [ ] Each rubric item is a single observable behavior | ||
| - [ ] Anti-pattern phrasing on items that catch common mistakes | ||
| - [ ] Timeout appropriate for setup complexity | ||
| - [ ] YAML anchors for shared setup | ||
| - [ ] 1-run eval: baseline vs skilled outputs are structurally different | ||
| - [ ] 1-run eval: no rubric item scores 5/5 on baseline | ||
| - [ ] 5-run eval passes with ≥ +10% overall |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other options make sense but I don't see much value in this one. Is this for some kind of deterministic runs?
We intentionally chose a GUID not just for randomness of the dir name but because the LLM draw meaning from the folder name. So anything that is not a GUID impacts the decision tree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so while you are iterating over the scenarios you can quickly and easily inspect the outputs per scenario. I think the LLM inferring stuff is a bit of a stretch, and this is mainly for development purposes, when running on CI, this is never passed, so it runs with the temp folder and the hashed folder.
Note that you could add an explicit instruction to not make assumptions based on the path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We saw the inference of the path in validation runs - this isn't theoretical. You can give it any kind of instructions but it can always decide to just ignore them. Avoiding to encode any meaning in the path is the best we can do here, even for dev purposes. The results between a local and a CI run shouldn't differ.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It makes it super hard to see the actual output from the tool. I don't understand why this is such a big problem, given that it's not a default.
I'm not saying it's not a real thing, I'm suggesting that its impact is limited, and that during development people need to look at the output from skills, they can run without the flag at the end locally too. But being able to easily review the different outputs and also have the LLM separately compare (despite the scores) the outputs against the skill contents is an important quality to be able to iterate quickly.
So, my point is:
I don't see an obvious drawback here, but without it, the development experience is not good. You either don't look at the actual outputs, or you spend a significant amount of time trying to dig them out.
Having the outputs be so complex to look at and compare doing development encourages people to not review the actual outputs and just take the judgement results, which in my experience are not reliable.
The judgement results help when they point out issues, but they don't help when the issues don't surface, and the only way to make the issues surface is to be able to look at the outputs.
Ultimately people need to be reviewing the output from skills to ensure some level of quality on the actual results. People is ultimately responsible for the code we ship, not machines, so we should create an environment where it's easy for people to do the right thing. After all, a doctor would not perform surgery just because a machine tells it that there is an issue, nor we would ship features without actually testing them.
We are paid for shipping working software, so we should have a process that makes it easy for people to do the right thing.