Add post: Evaluation-Driven Agent Readiness in Copilot Studio#331
Add post: Evaluation-Driven Agent Readiness in Copilot Studio#331KarimaKT wants to merge 9 commits into
Conversation
…ots, mermaid diagrams
Maker-focused walkthrough of evaluation-driven agent readiness: scope a narrow agent, bucket coverage, generate realistic tests, stack graders, read failures, fix the design, and rerun. Draft pending screenshots and minor tweaks.
|
Blog compiled, but preview generation failed for The Jekyll build completed, but one or more HTML or screenshot previews could not be generated. See the workflow run or the result artifact. Missing previews:
|
adilei
left a comment
There was a problem hiding this comment.
my comments inline. Thanks Karima!
|
|
||
| ## A better question than "is it good?" | ||
|
|
||
| "Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good? |
There was a problem hiding this comment.
The problematic question is "is my agent good enough"
The enough is the problematic part
|
|
||
| "Is my agent good?" feels like the right question, but it has no answer. A generative agent reasons over tools, knowledge, instructions, and an open-ended range of phrasings, so any single overall score hides more than it shows. Tell a stakeholder the agent scored 78% and you'll get a fair-sounding question back that nobody can answer: is that good? | ||
|
|
||
| And a generative agent is statistical, so "perfect" was never the bar. "Is it good?" really means "is it good enough to ship?", and that splits into two questions you *can* answer: did I build the right things, the right tools, knowledge, and instructions, and have I tested them against the cases that matter, including the edge cases and the ones that should fail? |
There was a problem hiding this comment.
We should flesh out the problem a little bit more here.
The reason why focusing on a threshold (e.g. 80% success rate) is problematic is because thresholds do not have fixed meaning. Thresholds are meaningful only in relation to your test set size, coverage %, and how the different test cases are weighted. The same 80% can be noise on 20 rows or signal on 2,000, and most customers do not have the luxury of having data science teams defining their eval strategy for each and every agent.
| A better question is narrower, and really two in one: | ||
|
|
||
| > Does this agent cover what I expect, with quality I can explain? And is this version better than the last, in exactly which scenarios? |
There was a problem hiding this comment.
This section is the heart of the post, but it's underdeveloped and the "narrower / two in one" framing undersells it. The point isn't that the question gets narrower. It's that the single threshold question should be replaced by a few distinct questions, each targeting something a threshold can't answer on its own:
- Does this version improve on the most recent one, and in exactly which scenarios?
- Does my test set have enough coverage to trust the result? (size, representativeness, and weighting are what give any number its meaning)
- When a case fails, why did it fail? Is it a real bug, an acceptable miss, or correct behavior like a guardrail refusal? Interpret failures by cause, not just as a flat pass/fail.
Let's flesh this out and make it the spine of the article: state these three questions explicitly here, then have the rest of the post visibly answer them. Generation + coverage answers #2, 'Compare with' answers #1, and reading grader explanations answers #3. Those ideas already exist later in the post, but they're scattered as techniques. Pulling them up into this section as the organizing frame would make everything downstream feel like it's answering a question we posed, instead of a list of features.
|
|
||
| ```mermaid | ||
| flowchart LR | ||
| A[Define a narrow agent] --> B[Generate targeted tests] | ||
| B --> C[Run the evaluation] | ||
| C --> D[Read the failures] | ||
| D --> E[Improve the design] | ||
| E --> F[Rerun the same tests] | ||
| F -->|new failure pattern| B | ||
| F -->|better, with known gaps| G[Explainable readiness] | ||
| ``` |
There was a problem hiding this comment.
Minor flow note: this diagram interrupts the momentum a little. We've just declared what should replace the threshold, so the natural next move is to dig into those alternatives in more detail rather than pause on a high-level loop. Might read more smoothly if we go straight into them, and bring back a visual later only if it earns its place.
| ``` | ||
| _The evaluation-driven design loop. You leave it when coverage is understood, agreed, and explainable, not when a number is high._ | ||
|
|
||
| ## Set the stage: a narrow shipping agent |
There was a problem hiding this comment.
Worth separating two problems that are tangled together here, because the post actually addresses both:
- Agreement — do makers and business stakeholders ever agree what "good"/"ready" means? (a process gap)
- Method — even when they try, is "threshold + test cases" a defensible way to answer it? (a measurement gap)
We can touch on both, but better not at length — just be explicit that there are two. Right now they're blended, so the reader doesn't notice.
If we're using this example to set the scene, let's be explicit about what it's actually doing: before we can talk about the evaluation method at all, there has to be agreement on what the agent does and doesn't do. That agreement is the foundation, and being clear about it here pays off later — coverage, guardrails, and reading failures by cause all build on it. As written, the example reads as a standalone "narrow your agent" tip rather than step one of that foundation.
| > The handful of requests you wrote during use-case scoping are a great starting point. Use them to seed the larger generated sets, so every evaluation run feeds rapid, evidence-based improvement instead of guesswork. | ||
| {: .prompt-tip } | ||
|
|
||
| #### Combine generated rows with must-work rows |
There was a problem hiding this comment.
This section speaks about two separate things, and it might read more clearly if we tease them apart: what should be tested (must-haves vs. broad coverage) and how we generate the cases (hand-curated vs. LLM-generated). Right now they're a bit interleaved, so the mapping between them is easy to miss.
One way to make it flow:
- What you need: must-haves and broad coverage. (Ties back to the Q2 coverage point — must-haves carry high weight, broad coverage fills the long tail.)
- How to produce each in the product: hand-write the must-haves manually, and lean on generated questions to scale broad coverage.
This also lines up nicely with the data-science doc — "Manual for precision, generated for coverage." So we could land it as one clear line: generate must-haves manually, use generated questions for coverage.
| | status of shippment New York to Tokyo | In Transit | | ||
| | Give me the status and days in transit for the Chicago to London shipment. | In Transit; 4 days in transit | | ||
| | Is the LA to London package on time or running late? | It is delayed | | ||
|
|
There was a problem hiding this comment.
Two things here:
- Up to now we've used weight to mean importance (how much a failure matters). This section reuses the same word for density of test cases (how many rows a bucket gets). Same term, different meaning — that's confusing. Worth picking one.
- Is it actually true that must-haves get more rows? If anything I'd expect fewer: a must-have that fails should arguably fail the whole test (that's the point of a must-have), so it doesn't need volume to gate. The nuance is you may still want a handful of phrasings to prove it holds under real-world wording. Either way, row-count feels like the wrong axis here — must-haves are about gating weight, not density.
| A test row is only as good as the grader judging it. Copilot Studio's graders fall into two families, and the difference drives how you author your test content. | ||
|
|
||
| | Family | Graders | Needs an expected output? | | ||
| | --- | --- | --- | |
There was a problem hiding this comment.
"Deterministic (path and fact checks)" — "fact checks" is misleading. These graders (Exact match, Keyword match, Tool use) match the response against an expected value, keyword, or tool you supply. They confirm the answer conforms to a known key; they don't verify whether a claim is actually true. A reasonable reader will read "fact checking" as "the grader determines if the fact is correct," which isn't what's happening — the "fact" only counts as right because you already encoded the expected answer. Worth relabeling, e.g. "exact-value / path matching," to avoid implying truth-verification.
| | --- | --- | --- | | ||
| | **Deterministic** (path and fact checks) | Exact match, Keyword match, Tool use | Usually yes (the expected value, keyword, or tool) | | ||
| | **AI** (quality and meaning checks) | General quality, Compare meaning, Text similarity, Custom | Mixed: Compare and Text similarity need one; General quality does not | | ||
|
|
There was a problem hiding this comment.
Is this saying grade must-haves deterministically? If so, there's a tradeoff worth naming: Exact match fails a correct answer that's just phrased differently ("In Transit" vs. "The shipment is currently in transit") — a false negative that punishes good behavior.
The data-science doc actually steers the other way for known-good answers: it recommends Compare meaning (tunable pass score), because "an answer can be phrased in different correct ways but the meaning still needs to come through." It never recommends Exact match. So I'd reserve Exact match for the truly fixed cases you already list (compliance line, ID, identity), and lean on Compare meaning for must-haves where the answer matters but wording can vary.
| Match the grader to the kind of answer, too. Transactional answers, a status, a date, a number, check cleanly with **Exact match** or **Keyword match**. Analytical or generated answers, a summary or an explanation, need meaning-based checks like **Compare meaning** or **General quality**, and sometimes a structural check on length or format. The shipping agent is mostly transactional, which is why its core sets lean deterministic, but the moment an agent starts summarizing or reasoning, the grader mix has to shift with it. | ||
|
|
||
| Analytical answers are where a single grader is most likely to mislead you. A knowledge-grounded summary can read perfectly and still be wrong, or right for the wrong reason, so check both the path the agent took and the answer it produced: | ||
|
|
There was a problem hiding this comment.
Quick clar: are we suggesting stacking graders on every test case, or just the must-haves? It's ambiguous — line 190 says "stack on any test… per use case and per bucket" (sounds like everywhere), but the info-box just below says "one path + one answer grader per set, add more only when needed" (sounds selective). Those pull in different directions.
I'd make it explicit that stacking is selective: it pays off where both the path and the answer can fail independently — knowledge-grounded / tool-using answers, i.e. your high-value must-haves. For a simple guardrail refusal or a plain transactional lookup, one well-chosen grader is enough. That keeps stacking concentrated where it earns its cost instead of implying you double-grade everything.
Here's a Draft for Adi. Screenshots and some tweaks missing.
Summary
Still to do
assets/posts/evaluation-driven-agent-readiness/published: falsetotruewhen readyChecklist
/review-postand addressed feedback./tools/run.sh)