Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
242 changes: 242 additions & 0 deletions .agents/skills/create-skill-evaluation/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,242 @@
---
name: create-skill-evaluation
description: >
Write eval.yaml files that measure whether a SKILL.md improves agent output.
USE when creating or improving tests/<plugin>/<skill>/eval.yaml.
DO NOT USE for writing SKILL.md content itself or for running the skill-validator CLI.
---

# Create Skill Evaluation

Produce an `eval.yaml` that the skill-validator can run to prove a skill adds measurable value.
The eval file lives at `tests/<plugin>/<skill-name>/eval.yaml`.

## Inputs

Before writing an eval, you need:

1. The **SKILL.md** being evaluated — read it to catalog every teaching
2. The **plugin domain** — determines setup approach (project scaffolding, fixtures, inline files)
3. The **CONTRIBUTING.md** and **eng/skill-validator/README.md** — reference for assertion types, setup options, and CLI flags

## Eval design principles

### Scenarios describe goals, not techniques

The prompt must read like a real user request: "I need a component that does X."
It must never mention concepts the skill teaches. If the skill teaches retry policies
with Polly, the prompt says "make the HTTP calls resilient" — not "add a Polly retry policy."
When the prompt contains the skill's teachings, the baseline gets the same guidance
and the eval cannot measure the skill's contribution.

### Scenario domains must not overlap with SKILL.md examples

If the skill teaches caching with a "product catalog" example, do not create an
eval scenario about product catalogs. The agent may memorize and replay the example
rather than generalize. Choose unrelated domains — weather data, appointment scheduling,
audit logging — so the eval measures whether the agent internalized the pattern.

### 5 scenarios is the sweet spot

Enough diversity to avoid overfitting to one problem shape, few enough to finish
in reasonable time. Each scenario should exercise a different subset of the skill's
teachings so no single pattern carries the entire score.

### Baselines should be achievable but imperfect

If the baseline already scores 5/5 on a scenario, the skill cannot show improvement.
Pick problems where an unskilled agent produces working but non-idiomatic code — the
kind of gap a skill is designed to close.

## eval.yaml structure

### Setup

Choose the setup approach based on what the scenario needs:

| Approach | When to use | Example |
|----------|-------------|---------|
| `commands:` | Agent needs a scaffolded project | `dotnet new webapi --no-https` |
| `copy_test_files: true` | Fixtures live alongside eval.yaml | Legacy migration scenarios |
| `files:` (inline) | Specific file contents required | Broken code that needs fixing |
| No setup | Agent creates everything from scratch | Pure authoring tasks |

Use YAML anchors when all scenarios share the same setup:

```yaml
scenarios:
- name: "First scenario"
setup: &api_project
commands:
- "dotnet new webapi --no-https"
# ...

- name: "Second scenario"
setup: *api_project
# ...
```

### Prompts

Write prompts as a user would — describe the goal with bulleted requirements:

```yaml
prompt: |
I need a `RetryHandler` for my ASP.NET Core app.

Requirements:
- Wraps HttpClient calls with automatic retry on transient failures
- Configurable max retry count and base delay via DI options
- Uses exponential backoff with jitter
- Logs each retry attempt with the attempt number and delay
- After exhausting retries, throws the original exception
- Registered as a DelegatingHandler in the DI container
```

Do not include architecture decisions, pattern names, or implementation hints
that come from the skill. The prompt describes *what*, the skill provides *how*.

### Assertions

Use 3–4 assertions per scenario as structural gates. Assertions verify necessary
conditions — the rubric handles quality.

```yaml
assertions:
- type: "output_contains"
value: "DelegatingHandler"
- type: "output_contains"
value: "ILogger"
- type: "output_matches"
pattern: "(AddHttpMessageHandler|AddTransientHttpErrorPolicy)"
```

Available assertion types:

| Type | Purpose |
|------|---------|
| `output_contains` | Key type/method name appears in agent output |
| `output_not_contains` | Banned pattern is absent (e.g., hardcoded secrets) |
| `output_matches` | Regex for alternative valid patterns |
| `file_exists` | Expected file was created |
| `file_contains` | File has specific content |
| `exit_success` | Agent produced non-empty output |

### Rubric items

Write 10–13 items per scenario. Each item is one observable behavior scored 1–5
by the LLM judge.

**Format**: "Does [correct thing] — not [common mistake]"

```yaml
rubric:
- "Implements retry as a DelegatingHandler — not as inline try/catch in every call site"
- "Uses exponential backoff with jitter — not fixed-delay retry"
- "Registers via AddHttpMessageHandler in DI — not manually wrapping HttpClient"
- "Logs each retry with attempt number and computed delay — not silent retries"
- "Configures retry options through IOptions<T> pattern — not hardcoded constants"
- "Only retries on transient HTTP errors (408, 429, 5xx, network failures) — not on 4xx client errors"
- "Throws the original exception after exhausting retries — not a wrapped AggregateException"
- "Disposes no unmanaged resources and does not implement IDisposable unnecessarily"
```

Rules:
- One behavior per item — the judge must be able to score it independently
- Name specific APIs and methods — "uses `AddHttpMessageHandler`" not "registers correctly"
- Include anti-pattern phrasing — tells the judge what penalties to apply
- Don't rubric things the baseline already does perfectly — wastes discriminative power
- Add items for patterns the skill explicitly warns against — these are frequently missed

### Timeout

| Scenario type | Timeout |
|---------------|---------|
| Analysis only (no code generation) | 120s |
| Code generation, simple scaffolding | 180–300s |
| Code generation with project setup + multiple files | 300–600s |

If rubric count exceeds 10 items per scenario, use `--judge-timeout 600` when running
the validator.

## Iteration workflow

### 1. Fast feedback (1-run)

```bash
dotnet run --no-launch-profile --project eng/skill-validator/src/SkillValidator.csproj \
-- <skill-path> --tests-dir tests/<plugin> --runs 1 --verbose \
--reporter json --reporter markdown --no-overfitting-check \
--keep-work-dirs --work-dir artifacts/work-dirs
```

### 2. Check baseline scores

For each rubric item, look at the baseline score. If baseline scores 5/5 on an
item, that item adds no signal — remove or replace it with something the baseline
misses.

### 3. Check for equal failures

If both baseline and skilled variants fail a rubric item equally, either:
- The skill doesn't teach that pattern (skill gap — fix the SKILL.md)
- The rubric is unfair (asks for something unreasonable — rewrite the item)

### 4. Compare structural diversity across variants

After a 1-run eval, read the baseline and skilled output for each scenario.
If the generated code has the same file structure, class hierarchy, and API
surface across both variants, the scenario is too constrained or the skill
is not changing how the agent approaches the problem. Good scenarios show
structural differences: the skilled variant creates a wrapper class the baseline
does not, uses a different disposal pattern, or organizes files differently.
If 3+ scenarios produce near-identical baseline vs skilled code, diversify the
scenario domains or loosen the prompts.

### 5. Read judge reasoning

The results JSON contains per-rubric `reasoning` from the judge. Read it to
understand why specific items scored low — the judge often reveals whether the
agent missed a pattern or the rubric was ambiguous.

### 6. Final validation (5-run)

```bash
dotnet run --no-launch-profile --project eng/skill-validator/src/SkillValidator.csproj \
-- <skill-path> --tests-dir tests/<plugin> --runs 5 --verbose \
--reporter json --reporter markdown --no-overfitting-check \
--judge-timeout 600
```

The eval passes when overall score is ≥ +10% and all scenarios show improvement.

## Common mistakes

| Mistake | Why it hurts |
|---------|-------------|
| Prompt leaks skill teachings | Mentioning "use the Options pattern" or "create a handler class" in the prompt gives the baseline the same guidance the skill provides — eval cannot measure improvement |
| All scenarios exercise the same 2–3 teachings | The eval appears to pass but only validates a fraction of the skill |
| Scenario domains overlap with SKILL.md examples | Agent replays memorized examples instead of generalizing |
| Baseline and skilled outputs are structurally identical | Scenarios are too narrow or prompts too prescriptive — the skill has no room to change the agent's approach |
| Rubric tests vocabulary instead of outcomes | "Uses the word 'resilience'" vs "Retries on transient failures" — the first is overfitting, the second measures behavior |
| Compound rubric items | "Does A and B and C" — the judge cannot score partially; split into separate items |
| Assertions check things unrelated to the skill | A `file_exists` check for `README.md` when the skill teaches error handling — noise |
| Timeout too short | Agent gets killed mid-generation; scored as failure when it was simply slow |
| Missing anti-pattern rubrics | The skill warns "don't do X" but the rubric never penalizes X — the eval misses a key signal |

## Checklist

Before submitting the eval for review:

- [ ] 5 scenarios with diverse problem domains
- [ ] No domain overlap between eval scenarios and SKILL.md examples
- [ ] Prompts describe goals — no skill concepts or pattern names mentioned
- [ ] 3–4 assertions per scenario (structural gates)
- [ ] 10–13 rubric items per scenario (quality bar)
- [ ] Each rubric item is a single observable behavior
- [ ] Anti-pattern phrasing on items that catch common mistakes
- [ ] Timeout appropriate for setup complexity
- [ ] YAML anchors for shared setup
- [ ] 1-run eval: baseline vs skilled outputs are structurally different
- [ ] 1-run eval: no rubric item scores 5/5 on baseline
- [ ] 5-run eval passes with ≥ +10% overall
65 changes: 58 additions & 7 deletions eng/skill-validator/src/Commands/ValidateCommand.cs
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,11 @@ public static RootCommand Create()
var noiseSkillsDirOpt = new Option<string?>("--noise-skills-dir") { Description = "Directory containing skills to load as noise. Enables the noise test: re-runs scenarios with all noise skills loaded and measures degradation." };
var noiseMaxDegradationOpt = new Option<double>("--noise-max-degradation") { Description = "Maximum acceptable average quality degradation (0-1) in noise test (only positive degradations count)", DefaultValueFactory = _ => 0.2 };
var noiseMaxScenarioDegradationOpt = new Option<double>("--noise-max-scenario-degradation") { Description = "Maximum acceptable quality degradation (0-1) for any single noise-test scenario", DefaultValueFactory = _ => 0.4 };
var keepWorkDirsOpt = new Option<bool>("--keep-work-dirs") { Description = "Preserve temporary working directories after the run (paths printed when --verbose is set)" };
var workDirBaseOpt = new Option<string?>("--work-dir") { Description = "Base directory for temporary working directories (defaults to system temp)" };
var readableWorkDirsOpt = new Option<bool>("--readable-work-dirs") { Description = "Use human-readable directory names (<scenario>/<run-N>/<variant>) instead of GUIDs" };

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other options make sense but I don't see much value in this one. Is this for some kind of deterministic runs?

We intentionally chose a GUID not just for randomness of the dir name but because the LLM draw meaning from the folder name. So anything that is not a GUID impacts the decision tree.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so while you are iterating over the scenarios you can quickly and easily inspect the outputs per scenario. I think the LLM inferring stuff is a bit of a stretch, and this is mainly for development purposes, when running on CI, this is never passed, so it runs with the temp folder and the hashed folder.

Note that you could add an explicit instruction to not make assumptions based on the path.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that you could add an explicit instruction to not make assumptions based on the path.

We saw the inference of the path in validation runs - this isn't theoretical. You can give it any kind of instructions but it can always decide to just ignore them. Avoiding to encode any meaning in the path is the best we can do here, even for dev purposes. The results between a local and a CI run shouldn't differ.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes it super hard to see the actual output from the tool. I don't understand why this is such a big problem, given that it's not a default.

I'm not saying it's not a real thing, I'm suggesting that its impact is limited, and that during development people need to look at the output from skills, they can run without the flag at the end locally too. But being able to easily review the different outputs and also have the LLM separately compare (despite the scores) the outputs against the skill contents is an important quality to be able to iterate quickly.

So, my point is:

  • It's not on by default
  • You can run without it during development and observe the delta.
  • The CI runs without it and won't have this potential bias built in.

I don't see an obvious drawback here, but without it, the development experience is not good. You either don't look at the actual outputs, or you spend a significant amount of time trying to dig them out.

Having the outputs be so complex to look at and compare doing development encourages people to not review the actual outputs and just take the judgement results, which in my experience are not reliable.

The judgement results help when they point out issues, but they don't help when the issues don't surface, and the only way to make the issues surface is to be able to look at the outputs.

Ultimately people need to be reviewing the output from skills to ensure some level of quality on the actual results. People is ultimately responsible for the code we ship, not machines, so we should create an environment where it's easy for people to do the right thing. After all, a doctor would not perform surgery just because a machine tells it that there is an issue, nor we would ship features without actually testing them.

We are paid for shipping working software, so we should have a process that makes it easy for people to do the right thing.

var scenarioOpt = new Option<string[]>("--scenario") { Description = "Run only scenarios whose name contains this substring (case-insensitive). Can be repeated.", AllowMultipleArgumentsPerToken = true };
var environmentOpt = new Option<string?>("--environment") { Description = "Run only scenarios matching this environment (e.g. 'ci', 'local'). Scenarios without an environment always run." };

var command = new RootCommand("Validate that agent skills meaningfully improve agent performance")
{
Expand All @@ -60,6 +65,11 @@ public static RootCommand Create()
noiseSkillsDirOpt,
noiseMaxDegradationOpt,
noiseMaxScenarioDegradationOpt,
keepWorkDirsOpt,
workDirBaseOpt,
readableWorkDirsOpt,
scenarioOpt,
environmentOpt,
};

command.SetAction(async (parseResult, _) =>
Expand Down Expand Up @@ -108,6 +118,11 @@ public static RootCommand Create()
NoiseSkillsDir = parseResult.GetValue(noiseSkillsDirOpt),
NoiseDegradationLimit = parseResult.GetValue(noiseMaxDegradationOpt),
NoiseMaxScenarioDegradation = parseResult.GetValue(noiseMaxScenarioDegradationOpt),
KeepWorkDirs = parseResult.GetValue(keepWorkDirsOpt),
WorkDirBase = parseResult.GetValue(workDirBaseOpt),
ReadableWorkDirs = parseResult.GetValue(readableWorkDirsOpt),
ScenarioFilters = parseResult.GetValue(scenarioOpt) ?? [],
Environment = parseResult.GetValue(environmentOpt),
};

return await Run(config);
Expand All @@ -127,6 +142,9 @@ public static RootCommand Create()

public static async Task<int> Run(ValidatorConfig config)
{
AgentRunner.SetWorkDirBase(config.WorkDirBase);
AgentRunner.SetReadableWorkDirs(config.ReadableWorkDirs);

// Validate model early
try
{
Expand Down Expand Up @@ -327,7 +345,10 @@ await Reporter.ReportResults(verdicts, config.Reporters, config.Verbose,
}

await AgentRunner.StopAllClients();
await AgentRunner.CleanupWorkDirs();
if (config.KeepWorkDirs)
Console.WriteLine($"\x1b[33m📂 Keeping {AgentRunner.WorkDirCount} work dir(s). Use --verbose to see paths during the run.\x1b[0m");
else
await AgentRunner.CleanupWorkDirs();

// Always fail on execution errors, even in --verdict-warn-only mode
if (rejectionMessages.Count > 0) return 1;
Expand Down Expand Up @@ -469,7 +490,37 @@ internal static List<string> CheckAggregateDescriptionLimits(IReadOnlyList<Skill
bool singleScenario = skill.EvalConfig!.Scenarios.Count == 1;
using var scenarioLimit = new ConcurrencyLimiter(config.ParallelScenarios);

var scenarioTasks = skill.EvalConfig.Scenarios.Select(scenario =>
var scenarios = (IReadOnlyList<EvalScenario>)skill.EvalConfig.Scenarios;

// Filter by --environment: scenarios with no environment always run;
// scenarios with an environment run only when it matches.
if (config.Environment is not null)
{
scenarios = scenarios.Where(s =>
s.Environment is null ||
s.Environment.Equals(config.Environment, StringComparison.OrdinalIgnoreCase)).ToList();
}

// Filter by --scenario name substring
if (config.ScenarioFilters.Count > 0)
{
scenarios = scenarios.Where(s =>
config.ScenarioFilters.Any(f => s.Name.Contains(f, StringComparison.OrdinalIgnoreCase))).ToList();
}

if (scenarios.Count == 0)
{
var filters = new List<string>();
if (config.Environment is not null) filters.Add($"--environment {config.Environment}");
if (config.ScenarioFilters.Count > 0) filters.Add($"--scenario {string.Join(", ", config.ScenarioFilters)}");
log($"⏭ No scenarios match filter(s): {string.Join("; ", filters)}");
return null;
}

if (scenarios.Count < skill.EvalConfig.Scenarios.Count)
log($"🔍 Running {scenarios.Count}/{skill.EvalConfig.Scenarios.Count} scenario(s) matching filter");

var scenarioTasks = scenarios.Select(scenario =>
scenarioLimit.RunAsync(() => ExecuteScenario(scenario, skill, config, usePairwise, singleScenario, spinner)));
var comparisons = (await Task.WhenAll(scenarioTasks)).ToList();

Expand Down Expand Up @@ -697,13 +748,13 @@ private static async Task<RunExecutionResult> ExecuteRun(
var agentTasks = await Task.WhenAll(
// 1. Baseline: no plugin, no skills — vanilla agent
AgentRunner.RunAgent(new RunOptions(scenario, null, skill.EvalPath, config.Model, config.Verbose,
PluginRoot: null, Log: runLog)),
PluginRoot: null, Log: runLog, RunIndex: runIndex)),
// 2. Skilled-isolated: single skill only (current behavior)
AgentRunner.RunAgent(new RunOptions(scenario, skill, skill.EvalPath, config.Model, config.Verbose,
PluginRoot: null, Log: runLog)),
PluginRoot: null, Log: runLog, RunIndex: runIndex)),
// 3. Skilled-plugin: load entire plugin from plugin root directory
AgentRunner.RunAgent(new RunOptions(scenario, skill, skill.EvalPath, config.Model, config.Verbose,
PluginRoot: pluginRoot, Log: runLog)));
PluginRoot: pluginRoot, Log: runLog, RunIndex: runIndex)));
var baselineMetrics = agentTasks[0];
var isolatedMetrics = agentTasks[1];
var pluginMetrics = agentTasks[2];
Expand Down Expand Up @@ -914,11 +965,11 @@ private static async Task<NoiseTestResult> ExecuteNoiseTest(
{
// Run with target skill only
var skillOnlyMetrics = await AgentRunner.RunAgent(new RunOptions(
scenario, targetSkill, targetSkill.EvalPath, config.Model, config.Verbose, Log: scenarioLog));
scenario, targetSkill, targetSkill.EvalPath, config.Model, config.Verbose, Log: scenarioLog, RunIndex: runIndex));

// Run with all skills loaded
var allSkillsMetrics = await AgentRunner.RunAgent(new RunOptions(
scenario, targetSkill, targetSkill.EvalPath, config.Model, config.Verbose, Log: scenarioLog, AdditionalSkills: otherSkills));
scenario, targetSkill, targetSkill.EvalPath, config.Model, config.Verbose, Log: scenarioLog, AdditionalSkills: otherSkills, RunIndex: runIndex));

// Evaluate assertions on both
if (scenario.Assertions is { Count: > 0 })
Expand Down
8 changes: 7 additions & 1 deletion eng/skill-validator/src/Models/Models.cs
Original file line number Diff line number Diff line change
Expand Up @@ -68,7 +68,8 @@ public sealed record EvalScenario(
IReadOnlyList<string>? RejectTools = null,
int? MaxTurns = null,
int? MaxTokens = null,
bool ExpectActivation = true);
bool ExpectActivation = true,
string? Environment = null);

public sealed record EvalConfig(IReadOnlyList<EvalScenario> Scenarios);

Expand Down Expand Up @@ -378,6 +379,11 @@ public sealed record ValidatorConfig
public string? NoiseSkillsDir { get; init; }
public double NoiseDegradationLimit { get; init; } = 0.2;
public double NoiseMaxScenarioDegradation { get; init; } = 0.4;
public bool KeepWorkDirs { get; init; }
public string? WorkDirBase { get; init; }
public bool ReadableWorkDirs { get; init; }
public IReadOnlyList<string> ScenarioFilters { get; init; } = [];
public string? Environment { get; init; }
}

public static class DefaultWeights
Expand Down
Loading