Add post: Evaluation-Driven Agent Readiness in Copilot Studio by KarimaKT · Pull Request #331 · microsoft/mcscatblog

KarimaKT · 2026-06-18T05:53:32Z

Here's a Draft for Adi. Screenshots and some tweaks missing.

Summary

New maker-focused post: Evaluation-Driven Agent Readiness in Copilot Studio
Walks through scoping a narrow agent, bucketing coverage, generating realistic test sets, stacking graders, reading failures, fixing the design, and rerunning to prove it
Uses a Kcontoso shipping-support agent as the running example; links to MS Learn agent-evaluation + triage/remediation docs

Still to do

Add the 8 screenshots (header + 01-07) under assets/posts/evaluation-driven-agent-readiness/
Minor copy tweaks
Flip published: false to true when ready

Checklist

Ran /review-post and addressed feedback
Local server renders correctly (./tools/run.sh)
All images have alt text and captions
2-3 internal links to related posts
Tags chosen for Chirpy "Further Reading" overlap

…ots, mermaid diagrams

Maker-focused walkthrough of evaluation-driven agent readiness: scope a narrow agent, bucket coverage, generate realistic tests, stack graders, read failures, fix the design, and rerun. Draft pending screenshots and minor tweaks.

github-actions · 2026-06-18T05:55:21Z

Blog compiled, but preview generation failed for 356ec70.

The Jekyll build completed, but one or more HTML or screenshot previews could not be generated. See the workflow run or the result artifact.

Missing previews:

\_posts/2026\-06\-01\-evaluation\-driven\-agent\-readiness\-copilot\-studio\.md: Expected generated page was not found at _site/mcscatblog/posts/evaluation-driven-agent-readiness-copilot-studio/index.html.

adilei

my comments inline. Thanks Karima!

adilei · 2026-06-20T06:33:10Z

+
+## A better question than "is it good?"
+
+"Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good?


The problematic question is "is my agent good enough"

The enough is the problematic part

adilei · 2026-06-20T06:50:16Z

+
+"Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good?
+
+And a generative agent is statistical, so "perfect" was never the bar. "Is it good?" really means "is it good enough to ship?", and that splits into two questions you *can* answer: did I build the right things, the right tools, knowledge, and instructions, and have I tested them against the cases that matter, including the edge cases and the ones that should fail?


We should flesh out the problem a little bit more here.

The reason why focusing on a threshold (e.g. 80% success rate) is problematic is because thresholds do not have fixed meaning. Thresholds are meaningful only in relation to your test set size, coverage %, and how the different test cases are weighted. The same 80% can be noise on 20 rows or signal on 2,000, and most customers do not have the luxury of having data science teams defining their eval strategy for each and every agent.

adilei · 2026-06-20T07:02:27Z

+A better question is narrower, and really two in one:
+
+> Does this agent cover what I expect, with quality I can explain? And is this version better than the last, in exactly which scenarios?


This section is the heart of the post, but it's underdeveloped and the "narrower / two in one" framing undersells it. The point isn't that the question gets narrower. It's that the single threshold question should be replaced by a few distinct questions, each targeting something a threshold can't answer on its own:

Does this version improve on the most recent one, and in exactly which scenarios?

Does my test set have enough coverage to trust the result? (size, representativeness, and weighting are what give any number its meaning)

When a case fails, why did it fail? Is it a real bug, an acceptable miss, or correct behavior like a guardrail refusal? Interpret failures by cause, not just as a flat pass/fail.

Let's flesh this out and make it the spine of the article: state these three questions explicitly here, then have the rest of the post visibly answer them. Generation + coverage answers #2, 'Compare with' answers #1, and reading grader explanations answers #3. Those ideas already exist later in the post, but they're scattered as techniques. Pulling them up into this section as the organizing frame would make everything downstream feel like it's answering a question we posed, instead of a list of features.

adilei · 2026-06-20T07:06:51Z

+
+```mermaid
+flowchart LR
+    A[Define a narrow agent] --> B[Generate targeted tests]
+    B --> C[Run the evaluation]
+    C --> D[Read the failures]
+    D --> E[Improve the design]
+    E --> F[Rerun the same tests]
+    F -->|new failure pattern| B
+    F -->|better, with known gaps| G[Explainable readiness]
+```


Minor flow note: this diagram interrupts the momentum a little. We've just declared what should replace the threshold, so the natural next move is to dig into those alternatives in more detail rather than pause on a high-level loop. Might read more smoothly if we go straight into them, and bring back a visual later only if it earns its place.

adilei · 2026-06-20T07:28:52Z

+```
+_The evaluation-driven design loop. You leave it when coverage is understood, agreed, and explainable, not when a number is high._
+
+## Set the stage: a narrow shipping agent


Worth separating two problems that are tangled together here, because the post actually addresses both:

Agreement — do makers and business stakeholders ever agree what "good"/"ready" means? (a process gap)

Method — even when they try, is "threshold + test cases" a defensible way to answer it? (a measurement gap)

We can touch on both, but better not at length — just be explicit that there are two. Right now they're blended, so the reader doesn't notice.

If we're using this example to set the scene, let's be explicit about what it's actually doing: before we can talk about the evaluation method at all, there has to be agreement on what the agent does and doesn't do. That agreement is the foundation, and being clear about it here pays off later — coverage, guardrails, and reading failures by cause all build on it. As written, the example reads as a standalone "narrow your agent" tip rather than step one of that foundation.

adilei · 2026-06-20T07:53:44Z

+> The handful of requests you wrote during use-case scoping are a great starting point. Use them to seed the larger generated sets, so every evaluation run feeds rapid, evidence-based improvement instead of guesswork.
+{: .prompt-tip }
+
+#### Combine generated rows with must-work rows


This section speaks about two separate things, and it might read more clearly if we tease them apart: what should be tested (must-haves vs. broad coverage) and how we generate the cases (hand-curated vs. LLM-generated). Right now they're a bit interleaved, so the mapping between them is easy to miss.

One way to make it flow:

What you need: must-haves and broad coverage. (Ties back to the Q2 coverage point — must-haves carry high weight, broad coverage fills the long tail.)

How to produce each in the product: hand-write the must-haves manually, and lean on generated questions to scale broad coverage.

This also lines up nicely with the data-science doc — "Manual for precision, generated for coverage." So we could land it as one clear line: generate must-haves manually, use generated questions for coverage.

adilei · 2026-06-20T07:59:35Z

+| status of shippment New York to Tokyo | In Transit |
+| Give me the status and days in transit for the Chicago to London shipment. | In Transit; 4 days in transit |
+| Is the LA to London package on time or running late? | It is delayed |
+


Two things here:

Up to now we've used weight to mean importance (how much a failure matters). This section reuses the same word for density of test cases (how many rows a bucket gets). Same term, different meaning — that's confusing. Worth picking one.

Is it actually true that must-haves get more rows? If anything I'd expect fewer: a must-have that fails should arguably fail the whole test (that's the point of a must-have), so it doesn't need volume to gate. The nuance is you may still want a handful of phrasings to prove it holds under real-world wording. Either way, row-count feels like the wrong axis here — must-haves are about gating weight, not density.

adilei · 2026-06-20T08:02:28Z

+A test row is only as good as the grader judging it. Copilot Studio's graders fall into two families, and the difference drives how you author your test content.
+
+| Family | Graders | Needs an expected output? |
+| --- | --- | --- |


"Deterministic (path and fact checks)" — "fact checks" is misleading. These graders (Exact match, Keyword match, Tool use) match the response against an expected value, keyword, or tool you supply. They confirm the answer conforms to a known key; they don't verify whether a claim is actually true. A reasonable reader will read "fact checking" as "the grader determines if the fact is correct," which isn't what's happening — the "fact" only counts as right because you already encoded the expected answer. Worth relabeling, e.g. "exact-value / path matching," to avoid implying truth-verification.

adilei · 2026-06-20T08:05:48Z

+| --- | --- | --- |
+| **Deterministic** (path and fact checks) | Exact match, Keyword match, Tool use | Usually yes (the expected value, keyword, or tool) |
+| **AI** (quality and meaning checks) | General quality, Compare meaning, Text similarity, Custom | Mixed: Compare and Text similarity need one; General quality does not |
+


Is this saying grade must-haves deterministically? If so, there's a tradeoff worth naming: Exact match fails a correct answer that's just phrased differently ("In Transit" vs. "The shipment is currently in transit") — a false negative that punishes good behavior.

The data-science doc actually steers the other way for known-good answers: it recommends Compare meaning (tunable pass score), because "an answer can be phrased in different correct ways but the meaning still needs to come through." It never recommends Exact match. So I'd reserve Exact match for the truly fixed cases you already list (compliance line, ID, identity), and lean on Compare meaning for must-haves where the answer matters but wording can vary.

adilei · 2026-06-20T08:08:18Z

+Match the grader to the kind of answer, too. Transactional answers, a status, a date, a number, check cleanly with **Exact match** or **Keyword match**. Analytical or generated answers, a summary or an explanation, need meaning-based checks like **Compare meaning** or **General quality**, and sometimes a structural check on length or format. The shipping agent is mostly transactional, which is why its core sets lean deterministic, but the moment an agent starts summarizing or reasoning, the grader mix has to shift with it.
+
+Analytical answers are where a single grader is most likely to mislead you. A knowledge-grounded summary can read perfectly and still be wrong, or right for the wrong reason, so check both the path the agent took and the answer it produced:
+


Quick clar: are we suggesting stacking graders on every test case, or just the must-haves? It's ambiguous — line 190 says "stack on any test… per use case and per bucket" (sounds like everywhere), but the info-box just below says "one path + one answer grader per set, add more only when needed" (sounds selective). Those pull in different directions.

I'd make it explicit that stacking is selective: it pays off where both the path and the answer can fail independently — knowledge-grounded / tool-using answers, i.e. your high-value must-haves. For a simple guardrail refusal or a plain transactional lookup, one well-chosen grader is enough. That keeps stacking concentrated where it earns its cost instead of implying you double-grade everything.

KarimaKT added 6 commits May 6, 2026 17:28

Draft: end-to-end evaluations post (title TBD)

32688dc

Set title and align filename slug

0e91947

Build out evaluating-copilot-studio-agents post: full draft, screensh…

0eb3597

…ots, mermaid diagrams

Review fixes: typos, stub subsection, mermaid label, H4 consistency

ff9d3e5

Refine evaluation strategy framing and statistical thresholds

429b410

KarimaKT mentioned this pull request Jun 18, 2026

Add post: Evaluating Copilot Studio Agents (Scope, Target, Deliver) #294

Closed

6 tasks

KarimaKT added 3 commits June 18, 2026 01:57

Add agent_edition: both to front matter

0646232

Refine test-generation, grader, comparison, and readiness sections

f91414d

Rename run section heading to focus on grader justifications

356ec70

adilei reviewed Jun 20, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Add post: Evaluation-Driven Agent Readiness in Copilot Studio#331

Add post: Evaluation-Driven Agent Readiness in Copilot Studio#331
KarimaKT wants to merge 9 commits into
microsoft:mainfrom
KarimaKT:post/evaluation-driven-agent-readiness

KarimaKT commented Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 18, 2026 •

edited

Loading

Uh oh!

adilei left a comment

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026 •

edited

Loading

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

adilei Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		## A better question than "is it good?"

		"Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good?


		"Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good?

		And a generative agent is statistical, so "perfect" was never the bar. "Is it good?" really means "is it good enough to ship?", and that splits into two questions you can answer: did I build the right things, the right tools, knowledge, and instructions, and have I tested them against the cases that matter, including the edge cases and the ones that should fail?

		A better question is narrower, and really two in one:

		> Does this agent cover what I expect, with quality I can explain? And is this version better than the last, in exactly which scenarios?

		Match the grader to the kind of answer, too. Transactional answers, a status, a date, a number, check cleanly with Exact match or Keyword match. Analytical or generated answers, a summary or an explanation, need meaning-based checks like Compare meaning or General quality, and sometimes a structural check on length or format. The shipping agent is mostly transactional, which is why its core sets lean deterministic, but the moment an agent starts summarizing or reasoning, the grader mix has to shift with it.

		Analytical answers are where a single grader is most likely to mislead you. A knowledge-grounded summary can read perfectly and still be wrong, or right for the wrong reason, so check both the path the agent took and the answer it produced:

Uh oh!

Conversation

KarimaKT commented Jun 18, 2026

Summary

Still to do

Checklist

Uh oh!

github-actions Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adilei left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adilei Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

github-actions Bot commented Jun 18, 2026 •

edited

Loading

adilei Jun 20, 2026 •

edited

Loading