Synthetic deployment-readiness update. All figures are generated by the local eval system on a fabricated dataset and point to a named artifact; nothing here is a production, regulatory, partner, or model-safety claim. As of 2026-06-05.
NOT READY FOR PILOT — local synthetic vertical slice only. One workflow (Financial Links), 12 adversarial cases, deterministic loop closed; one credentialed LLM comparison captured; semantic copy-safety gap open.
M9 proved the human-approval suspension mechanism credential-free: a
separate synthetic harness (app/action_suspension.py) runs a real LangGraph
that interrupts before HumanApprovalNode, so a synthetic side-effecting action
is suspended before execution, never executes on reject or missing approval
(fail-closed), and executes exactly once when approved
(tests/test_action_suspension.py; traces under
traces/local/action_suspension/). It is infrastructure on a separate harness —
the live Financial Links loop stays draft_only and the gate is not wired into a
production action path. M9 does not change the posture; the gating blocker is now
M7 alone.
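The fail-closed behavior described above can be illustrated with a minimal sketch in plain Python. It is not the code in app/action_suspension.py and does not use LangGraph; the `SuspendedAction` class, the `resume` method, and the `"approve"` string are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SuspendedAction:
    """Hypothetical stand-in for a side-effecting action held behind approval."""
    run: Callable[[], str]
    executed: bool = False
    result: Optional[str] = None

    def resume(self, approval: Optional[str]) -> Optional[str]:
        # Fail closed: anything other than an explicit "approve", including a
        # missing decision (None), leaves the action unexecuted.
        if approval != "approve":
            return None
        # Exactly once: a repeated approval must not re-run the side effect.
        if not self.executed:
            self.result = self.run()
            self.executed = True
        return self.result


calls = []  # records every time the synthetic side effect actually fires


def send_draft() -> str:
    calls.append("sent")
    return "ok"


rejected = SuspendedAction(send_draft)
rejected.resume("reject")      # rejected: never executes

missing = SuspendedAction(send_draft)
missing.resume(None)           # no approval recorded: never executes

approved = SuspendedAction(send_draft)
approved.resume("approve")     # executes
approved.resume("approve")     # already executed: no second side effect

print(len(calls))  # 1
```

The three cases mirror the properties the memo attributes to the harness: no execution on reject, no execution on missing approval, and a single execution on approval.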
M7b wired the full opt-in pipeline to run the semantic gate against the
expanded adversarial_v2 LLM candidate drafts: credentialed eval + model/NLI
decision targets (raw artifacts gitignored), an on-disk aggregate summary, and a
credential-free semantic-gate-adversarial-v2-llm Make target that re-keys the
candidate's audited verdicts under the deterministic improved_v0 vehicle so
the gate runs with no model call and no token spend.
M7 has since been executed (one credentialed run) and the semantic gate
BLOCKED. The deterministic LLM comparison improved (v0 20/24 → v1 24/24,
reports/llm_adversarial_v2_candidate_v1_vs_v0_card.md), but the model/NLI
semantic audit flagged 14 semantic-only UNSAFE_CUSTOMER_COMMS drafts (8 in
v0, 6 in v1; reports/llm_adversarial_v2_semantic_audit_summary.md) that the
lexical grader cleared — a lexical blind spot, on a larger slice, that confirms
the v1 finding. The acceptance bar is sustained zero semantic-only flags across
multiple runs; one run produced 14, so M7 stays open and the posture is
unchanged. The 14 are now pinned as pending_review regression seeds with a
credential-free replay (make regression-replay-adversarial-v2-semantic fires
the offline grader on all 14, no model call). Raw reports, traces, and model/NLI
decisions stay gitignored; only the aggregate summary + redacted card are public.
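The acceptance bar above (sustained zero semantic-only flags across multiple runs) can be sketched as a small check. This is an illustrative assumption, not the project's gate script: the `Run` shape, the function names, the placeholder case ids, and the `min_runs=3` threshold are all hypothetical, since the memo says "multiple runs" without fixing a number.

```python
from typing import Dict, List, Set

# Hypothetical shape: the case ids each grader flagged in one run.
Run = Dict[str, Set[str]]


def semantic_only_flags(run: Run) -> Set[str]:
    """Case ids the semantic audit flagged but the lexical grader cleared."""
    return run["semantic"] - run["lexical"]


def sustained_zero(runs: List[Run], min_runs: int = 3) -> bool:
    """True only when at least `min_runs` runs each have zero semantic-only flags."""
    if len(runs) < min_runs:
        return False  # too few runs to call the result "sustained"
    return all(not semantic_only_flags(run) for run in runs)


# One run with 14 semantic-only flags (placeholder ids, not real case ids).
blocked_run: Run = {"lexical": set(), "semantic": {f"seed_{i:02d}" for i in range(14)}}
clean_run: Run = {"lexical": set(), "semantic": set()}

print(sustained_zero([blocked_run]))                       # False
print(sustained_zero([clean_run]))                         # False
print(sustained_zero([clean_run, clean_run, clean_run]))   # True
```

Under this reading, a single blocked run is enough to keep M7 open, and a single clean run would not be enough to close it.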
The blocker is now translated into a public-safe failure analysis + remediation
plan (reports/llm_adversarial_v2_semantic_failure_analysis.md; make semantic-failure-analysis-adversarial-v2, credential-free): it decomposes the 14
findings by profile/risk/category and by the judge's flag reasons
(cross-sentence trap, paraphrased overpromise, missing-info hallucination),
flags the 2 designed-safe calibration cases as ambiguous (candidate failure vs.
grader false positive — triage before tuning), and sets the acceptance gates and
sustained-zero evidence to close M7. A follow-on adjudication pass
(reports/llm_adversarial_v2_semantic_adjudication.md; make semantic-adjudication-adversarial-v2, credential-free) then triaged all 14 — by
review of the private raw drafts, recording only public-safe labels: 9
candidate_actionable, 4 grader_calibration_review (the model/NLI judge
appears to over-flag some safe copy — pending calibration review — including one
case where it flagged the agent correctly enforcing the consent gate), 1
needs_human_review; the two calibration cases are resolved/preserved
respectively. Net: roughly two-thirds of the blocker looks like a genuine
candidate-copy problem and about a third is a grader-calibration question —
useful scoping before funding a candidate-v2 pass. All 14 stay pending_review.
No prompt tuning or rerun was done; the next decision is whether to fund that
candidate-v2 remediation (and a parallel grader-calibration review). That
remediation is now wired but not run: an opt-in llm_candidate_v2 prompt
encodes a control per candidate_actionable reason code plus the structural
controls (credentialed Make targets gated on check-llm-env, raw outputs
gitignored), and credential-free grader-calibration fixtures already prove the
offline semantic lane clears the 4 over-flags as non-claims. v0/v1/default
behavior is unchanged and case_fl_adv_v2_024 stays open. The credentialed
candidate-v2 run has since been executed once (one diagnostic capture, $0.50;
the sustained-zero multi-run was short-circuited after run 1 blocked). Result:
candidate-v2 halved the semantic-only flags (v1 6 → v2 3) and cleared 7/8
candidate-actionable + all 4 over-flag cases — clear evidence the remediation
works — but the gate still BLOCKED on 3 residuals, so M7 stays OPEN. Those 3
are adjudicated (public-safe) into 1 genuine candidate fix (case_017, a
small candidate-v2.1 control), 1 grader-calibration item (case_006, where
the draft-only judge over-flagged a true tool-verified consent statement), and
1 open human-review item (case_024). Both residual routes are now built
(credential-free, not run): an opt-in llm_candidate_v2_1 prompt tightens only
the missing-metadata control for case_017, and the grader-calibration fixtures
now cover 5 cases (the 4 originals + case_006), with the offline lane proven to
clear all 5. **candidate-v2.1 was then run once ($0.50):** it cleared its target
case_017, but the gate still blocked on 3 cases that, on review, are the same
affirmative-timing-on-a-closed-gate failure on gate types the narrow fix didn't
cover. The control has now been generalized (llm_candidate_v2_2, wired/not
run) to every closed-gate state. The loop is converging — each ~$0.50 run closes
a slice and pinpoints the next; the remaining spend is the credentialed v2.2
re-run. M7 stays OPEN, NOT READY FOR PILOT.
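The proportions quoted in the adjudication summary ("roughly two-thirds" and "about a third") and the v1 to v2 flag reduction follow from the counts stated above. A short worked check, using only those counts; the variable names are illustrative:

```python
from collections import Counter

# Adjudication labels for the 14 semantic-only findings, as reported above.
triage = Counter(
    candidate_actionable=9,
    grader_calibration_review=4,
    needs_human_review=1,
)
total = sum(triage.values())
shares = {label: round(100 * n / total, 1) for label, n in triage.items()}

print(total)   # 14
print(shares)  # {'candidate_actionable': 64.3, 'grader_calibration_review': 28.6, 'needs_human_review': 7.1}

# Semantic-only flags in the candidate: v1 had 6, v2 had 3.
v1_flags, v2_flags = 6, 3
reduction = (v1_flags - v2_flags) / v1_flags
print(reduction)  # 0.5
```

So 9 of 14 (about 64%) are candidate-actionable, 4 of 14 (about 29%) are grader-calibration questions, and the v2 prompt removed half of the v1 semantic-only flags.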
Two deterministic, credential-free chunks landed since the last update. M8
added a broader 24-case adversarial slice
(case_studies/financial_links_reliability/evals/adversarial_v2.jsonl;
reports/adversarial_v2_eval_card.md): improved_v0 24/24, baseline_v0
15/24 across three labels. M7a promoted the offline
unsupported_claim_semantic grader into a reusable blocking gate
(scripts/check_semantic_gate.py): it fails closed when the semantic lane is
absent and, as a tracked negative control
(make semantic-gate-adversarial-v1-regressions), correctly blocks the 3
known-bad semantic regression seeds. This is gate infrastructure only — it
is not wired into the default eval, and M7 is not complete until the gate
runs clean on a larger credentialed semantic audit.
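The two gate properties named above (fail closed when the semantic lane is absent, and block known-bad seeds as a negative control) can be sketched as follows. This is a hedged illustration, not the contents of scripts/check_semantic_gate.py: the report schema, the `semantic_flags` key, the function names, and the seed ids are assumptions.

```python
from typing import List, Mapping, Optional

# Hypothetical report schema: "semantic_flags" lists flagged case ids, and the
# key is missing (or None) when the semantic lane did not run at all.
Report = Mapping[str, Optional[List[str]]]


def semantic_gate(report: Report) -> str:
    """Return "PASS" or "BLOCK" for one eval report."""
    flags = report.get("semantic_flags")
    if flags is None:
        # Fail closed: no semantic lane means no evidence of safety.
        return "BLOCK"
    return "BLOCK" if flags else "PASS"


def negative_control_ok(known_bad: Report) -> bool:
    """The negative control succeeds only if the gate blocks known-bad input."""
    return semantic_gate(known_bad) == "BLOCK"


print(semantic_gate({}))                                  # BLOCK (lane absent)
print(semantic_gate({"semantic_flags": []}))              # PASS (lane ran, zero flags)

# Placeholder ids standing in for the 3 known-bad regression seeds.
seeds = {"semantic_flags": ["seed_a", "seed_b", "seed_c"]}
print(negative_control_ok(seeds))                         # True
```

Treating a missing lane differently from an empty flag list is the design point: an absent audit must not be read as a clean audit.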
The adversarial v1 LLM evidence pack
(evidence_packs/financial_links_llm_adversarial_v1/) now ships, alongside the
redacted candidate comparison, an aggregate-only model/NLI semantic audit
plus 3 semantic regression seeds and a credential-free replay fixture
under regressions/. The semantic audit
(reports/llm_adversarial_v1_semantic_audit_summary.md) was executed once over
the drafts already on disk; the seeds are pending_review and replay through the
offline semantic grader with no credentials
(make regression-replay-adversarial-v1-semantic).
- Deterministic loop closed (positive): reports/adversarial_v1_eval_card.md shows baseline_v0 4/12 → improved_v0 12/12; required_tool_use 0.42 → 1.00, unsupported_claim 0.58 → 1.00; all planted failure labels to 0; evaluator_catch_rate 12/12.
- Semantic audit reveals a blind spot (negative): reports/llm_adversarial_v1_semantic_audit_summary.md shows the lexical grader flagged 0/12, but the model/NLI grader flagged 3 UNSAFE_CUSTOMER_COMMS (1 L3, 2 L1) it missed. The "improved" llm_candidate_v1 carries more semantic flags than v0, so the deterministic 12/12 did not reduce semantic overpromising.
- LLM stability is noisy (caution): reports/llm_adversarial_v1_repeat_summary.md shows passed counts ranged 7–12 / 12 across 10 runs; latency means L1/L2/L3 8023/8866/9428 ms; cost $0.607305 / 10 runs (list-price estimate).
deployment/risk_register.md R2 — "EvaluatorNode misses an unsupported
customer-facing claim before send" — now has a concrete instance: in an
offline audit of drafts already on disk, the lexical grader and runtime
evaluator both cleared an L3 draft (case_fl_adv_v1_010) that the model/NLI
grader flagged UNSAFE_CUSTOMER_COMMS. That is the R2 failure shape demonstrated
offline — no live send occurred (every boundary is draft_only). R7
(synthetic-data false confidence) compounds it: a 12-case slice cannot ground a
readiness claim.
Approve the scope and sequencing to close the semantic-grader gap before any
pilot conversation: (1) the unsupported_claim_semantic blocking gate is now
wired credential-free (M7a done) and the dataset is expanded to 24 cases
(M8 done); (2) the remaining step is to fund a credentialed semantic
audit of the expanded adversarial_v2 candidate drafts and run the gate on
it; (3) fund repeat credentialed runs to characterize variance. No request to
pilot is being made.
Hold at NOT READY FOR PILOT. Do not pilot. The deterministic loop is a clean
proof of the eval machinery, not evidence of customer-comms safety. Invest in
semantic-grader hardening and dataset expansion; re-review when the
Pilot Only With Constraints bar in deployment/pilot_readiness_review.md is in
reach.
Execute the now-wired M7b pipeline with a key and record the outcome:
make eval-card-adversarial-v2-llm → make semantic-model-decisions-adversarial-v2-llm-v0 -v1 → make semantic-audit-summary-adversarial-v2-llm → make semantic-gate-adversarial-v2-llm.
Both the rails (M7b) and the gate itself (M7a, negative control passes) are in place; the only remaining step is the credentialed run. M7 closes only if that audit sustains 0 semantic-only UNSAFE_CUSTOMER_COMMS across multiple runs, which is the entry condition for moving from DO NOT PILOT toward PILOT WITH CONSTRAINTS in deployment/acceptance_criteria.md. If the gate blocks, the flagged drafts become pending_review regression seeds (the adversarial v1 pattern), not a prompt tweak in this chunk. Owners and gates: deployment/delivery_plan.md.