Executive Update — Financial Links Reliability (synthetic)

Synthetic deployment-readiness update. All figures are generated by the local eval system on a fabricated dataset and point to a named artifact; nothing here is a production, regulatory, partner, or model-safety claim. As of 2026-06-05.

Status

NOT READY FOR PILOT — local synthetic vertical slice only. One workflow (Financial Links), 12 adversarial cases, deterministic loop closed; one credentialed LLM comparison captured; semantic copy-safety gap open.

What Changed

M9 proved the human-approval suspension mechanism credential-free: a separate synthetic harness (app/action_suspension.py) runs a real LangGraph that interrupts before HumanApprovalNode, so a synthetic side-effecting action is suspended before execution, never executes on reject or missing approval (fail-closed), and executes exactly once when approved (tests/test_action_suspension.py; traces under traces/local/action_suspension/). It is infrastructure on a separate harness — the live Financial Links loop stays draft_only and the gate is not wired into a production action path. M9 does not change the posture; the gating blocker is now M7 alone.
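The three behaviors described above (blocked on reject, blocked on missing approval, executed exactly once on approve) can be illustrated with a small, framework-free sketch. This is not the code in app/action_suspension.py and it does not use LangGraph; `SuspendedAction`, `Decision`, and the returned status strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"


@dataclass
class SuspendedAction:
    """A side-effecting action parked until a human decision arrives (hypothetical)."""
    name: str
    effect: Callable[[], None]
    executed: bool = False

    def resume(self, decision: Optional[Decision]) -> str:
        # Fail closed: a reject or a missing decision never runs the effect.
        if decision is not Decision.APPROVE:
            return "blocked"
        # Exactly once: a repeated approval must not run the effect again.
        if self.executed:
            return "already_executed"
        self.effect()
        self.executed = True
        return "executed"


# Synthetic side effect: record the call instead of sending anything.
calls = []
action = SuspendedAction("send_customer_update", lambda: calls.append("sent"))
results = [
    action.resume(None),              # missing approval
    action.resume(Decision.REJECT),   # explicit reject
    action.resume(Decision.APPROVE),  # first approval
    action.resume(Decision.APPROVE),  # duplicate approval
]
print(results, calls)
```

The design choice worth noting is that the gate tests for the one allowed value rather than for the disallowed ones, so any unexpected input falls on the blocked side.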

M7b wired the full opt-in pipeline to run the semantic gate against the expanded adversarial_v2 LLM candidate drafts: credentialed eval + model/NLI decision targets (raw artifacts gitignored), an on-disk aggregate summary, and a credential-free semantic-gate-adversarial-v2-llm target that re-keys the candidate's audited verdicts under the deterministic improved_v0 vehicle so the gate runs with no model call and no token spend.

M7 has since been executed (one credentialed run) and the semantic gate BLOCKED. The deterministic LLM comparison improved (v0 20/24 → v1 24/24, reports/llm_adversarial_v2_candidate_v1_vs_v0_card.md), but the model/NLI semantic audit flagged 14 semantic-only UNSAFE_CUSTOMER_COMMS drafts (8 in v0, 6 in v1; reports/llm_adversarial_v2_semantic_audit_summary.md) that the lexical grader cleared: a lexical blind spot, on a larger slice, that confirms the v1 finding. The acceptance bar is sustained zero semantic-only flags across multiple runs; one run produced 14, so M7 stays open and the posture is unchanged. The 14 are now pinned as pending_review regression seeds with a credential-free replay (make regression-replay-adversarial-v2-semantic fires the offline grader on all 14, no model call). Raw reports, traces, and model/NLI decisions stay gitignored; only the aggregate summary + redacted card are public.

The blocker is now translated into a public-safe failure analysis + remediation plan (reports/llm_adversarial_v2_semantic_failure_analysis.md; make semantic-failure-analysis-adversarial-v2, credential-free): it decomposes the 14 findings by profile/risk/category and by the judge's flag reasons (cross-sentence trap, paraphrased overpromise, missing-info hallucination), flags the 2 designed-safe calibration cases as ambiguous (candidate failure vs. grader false positive; triage before tuning), and sets the acceptance gates and sustained-zero evidence to close M7. A follow-on adjudication pass (reports/llm_adversarial_v2_semantic_adjudication.md; make semantic-adjudication-adversarial-v2, credential-free) then triaged all 14 by review of the private raw drafts, recording only public-safe labels: 9 candidate_actionable, 4 grader_calibration_review (the model/NLI judge appears to over-flag some safe copy, pending calibration review, including one case where it flagged the agent correctly enforcing the consent gate), 1 needs_human_review; the two calibration cases are resolved/preserved respectively. Net: roughly two-thirds of the blocker looks like a genuine candidate-copy problem and about a third is a grader-calibration question, which is useful scoping before funding a candidate-v2 pass. All 14 stay pending_review. No prompt tuning or rerun was done; the next decision is whether to fund that candidate-v2 remediation (and a parallel grader-calibration review).

That remediation is now wired but not run: an opt-in llm_candidate_v2 prompt encodes a control per candidate_actionable reason code plus the structural controls (credentialed Make targets gated on check-llm-env, raw outputs gitignored), and credential-free grader-calibration fixtures already prove the offline semantic lane clears the 4 over-flags as non-claims. v0/v1/default behavior is unchanged and case_fl_adv_v2_024 stays open. The credentialed candidate-v2 run has since been executed once (one diagnostic capture, $0.50; the sustained-zero multi-run was short-circuited after run 1 blocked). Result: candidate-v2 halved the semantic-only flags (v1 6 → v2 3) and cleared 7/8 candidate-actionable + all 4 over-flag cases, clear evidence the remediation works, but the gate still BLOCKED on 3 residuals, so M7 stays OPEN.

Those 3 are adjudicated (public-safe) into 1 genuine candidate fix (case_017, a small candidate-v2.1 control), 1 grader-calibration item (case_006, where the draft-only judge over-flagged a true tool-verified consent statement), and 1 open human-review item (case_024). Both residual routes are now built (credential-free, not run): an opt-in llm_candidate_v2_1 prompt tightens only the missing-metadata control for case_017, and the grader-calibration fixtures now cover 5 cases (the 4 originals + case_006), with the offline lane proven to clear all 5.

candidate-v2.1 was then run once ($0.50): it cleared its target case_017, but the gate still blocked on 3 cases that, on review, are the same affirmative-timing-on-a-closed-gate failure on gate types the narrow fix didn't cover. The control has now been generalized (llm_candidate_v2_2, wired/not run) to every closed-gate state. The loop is converging: each ~$0.50 run closes a slice and pinpoints the next; the remaining spend is the credentialed v2.2 re-run. M7 stays OPEN, NOT READY FOR PILOT.
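As a rough illustration of the acceptance bar and the adjudication split, the sketch below checks a "sustained zero" condition over per-run flag counts and computes the share of each adjudication label from the 9/4/1 figures reported above. The function name `sustained_zero` and the `min_runs=3` threshold are assumptions; the source says only "multiple runs" and does not state a number.

```python
from collections import Counter


def sustained_zero(flags_per_run, min_runs=3):
    """True only if at least min_runs runs exist and every run has 0 semantic-only flags.

    min_runs=3 is an illustrative assumption, not a documented threshold.
    """
    return len(flags_per_run) >= min_runs and all(n == 0 for n in flags_per_run)


# One credentialed run that produced 14 flags cannot satisfy the bar,
# and neither can a single clean run.
one_blocked_run = sustained_zero([14])
one_clean_run = sustained_zero([0])
three_clean_runs = sustained_zero([0, 0, 0])

# Adjudication labels for the 14 findings, using the counts in the text.
labels = Counter(
    candidate_actionable=9,
    grader_calibration_review=4,
    needs_human_review=1,
)
total = sum(labels.values())
shares = {name: round(count / total, 2) for name, count in labels.items()}
print(one_blocked_run, one_clean_run, three_clean_runs, total, shares)
```

With these counts, candidate_actionable comes to 0.64 and grader_calibration_review to 0.29, which matches the "roughly two-thirds" and "about a third" wording.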

Two deterministic, credential-free chunks landed since the last update. M8 added a broader 24-case adversarial slice (case_studies/financial_links_reliability/evals/adversarial_v2.jsonl; reports/adversarial_v2_eval_card.md): improved_v0 24/24, baseline_v0 15/24 across three labels. M7a promoted the offline unsupported_claim_semantic grader into a reusable blocking gate (scripts/check_semantic_gate.py): it fails closed when the semantic lane is absent and, as a tracked negative control (make semantic-gate-adversarial-v1-regressions), correctly blocks the 3 known-bad semantic regression seeds. This is gate infrastructure only — it is not wired into the default eval, and M7 is not complete until the gate runs clean on a larger credentialed semantic audit.
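A minimal sketch of what "fails closed when the semantic lane is absent" could look like is shown below. It is not the logic of scripts/check_semantic_gate.py; the row schema (`case_id`, `lexical_label`, `semantic_label`), the `SAFE` label, and the function name `semantic_gate` are assumptions made for illustration.

```python
from typing import List, Mapping, Optional, Sequence, Tuple

UNSAFE = "UNSAFE_CUSTOMER_COMMS"


def semantic_gate(
    rows: Sequence[Mapping[str, Optional[str]]],
) -> Tuple[bool, List[str]]:
    """Return (passed, reasons) for a batch of graded drafts (hypothetical schema)."""
    reasons: List[str] = []
    # Fail closed: an empty audit is treated as a block, not a pass.
    if not rows:
        reasons.append("no rows: nothing was audited")
    for row in rows:
        semantic = row.get("semantic_label")
        if semantic is None:
            # Fail closed: a missing semantic verdict blocks the gate.
            reasons.append(f"{row['case_id']}: semantic lane absent")
        elif semantic == UNSAFE and row.get("lexical_label") != UNSAFE:
            # Semantic-only flag: the lexical grader cleared it, the semantic lane did not.
            reasons.append(f"{row['case_id']}: semantic-only {UNSAFE}")
    return (not reasons, reasons)


clean = [{"case_id": "c1", "lexical_label": "SAFE", "semantic_label": "SAFE"}]
lane_absent = [{"case_id": "c2", "lexical_label": "SAFE", "semantic_label": None}]
blind_spot = [{"case_id": "c3", "lexical_label": "SAFE", "semantic_label": UNSAFE}]

for batch in (clean, lane_absent, blind_spot, []):
    print(semantic_gate(batch))
```

A negative control in this style would feed known-bad seeds through the gate and assert that it blocks, which is the role the tracked regression target plays in the text.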

The adversarial v1 LLM evidence pack (evidence_packs/financial_links_llm_adversarial_v1/) now ships, alongside the redacted candidate comparison, an aggregate-only model/NLI semantic audit plus 3 semantic regression seeds and a credential-free replay fixture under regressions/. The semantic audit (reports/llm_adversarial_v1_semantic_audit_summary.md) was executed once over the drafts already on disk; the seeds are pending_review and replay through the offline semantic grader with no credentials (make regression-replay-adversarial-v1-semantic).

Top Metric Movement

  • Deterministic loop closed (positive): reports/adversarial_v1_eval_card.md. baseline_v0 4/12 → improved_v0 12/12; required_tool_use 0.42 → 1.00, unsupported_claim 0.58 → 1.00; all planted failure labels to 0; evaluator_catch_rate 12/12.
  • Semantic audit reveals a blind spot (negative): reports/llm_adversarial_v1_semantic_audit_summary.md. The lexical grader flagged 0/12, but the model/NLI grader flagged 3 UNSAFE_CUSTOMER_COMMS drafts (1 L3, 2 L1) that the lexical grader missed. The "improved" llm_candidate_v1 carries more semantic flags than v0, so the deterministic 12/12 did not reduce semantic overpromising.
  • LLM stability is noisy (caution): reports/llm_adversarial_v1_repeat_summary.md — passed counts ranged 7–12 / 12 across 10 runs; latency means L1/L2/L3 8023/8866/9428 ms; cost $0.607305 / 10 runs (list-price estimate).
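The stability figures in the last bullet reduce to simple arithmetic. The sketch below derives the per-run cost and the pass-rate range from the numbers quoted above; the variable names are illustrative, and the cost remains a list-price estimate as stated in the source.

```python
# Figures quoted in the repeat summary bullet.
total_cost_usd = 0.607305   # list-price estimate for all runs
runs = 10
cases = 12
min_passed, max_passed = 7, 12

cost_per_run = round(total_cost_usd / runs, 4)
pass_rate_low = round(min_passed / cases, 2)
pass_rate_high = round(max_passed / cases, 2)

print(f"cost per run: ${cost_per_run}")
print(f"pass rate range: {pass_rate_low} to {pass_rate_high}")
```

A spread from 0.58 to 1.0 on the same 12 cases is why a single passing run is not treated as evidence of stability.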

Top Unresolved Risk

deployment/risk_register.md R2 — "EvaluatorNode misses an unsupported customer-facing claim before send" — now has a concrete instance: in an offline audit of drafts already on disk, the lexical grader and runtime evaluator both cleared an L3 draft (case_fl_adv_v1_010) that the model/NLI grader flagged UNSAFE_CUSTOMER_COMMS. That is the R2 failure shape demonstrated offline — no live send occurred (every boundary is draft_only). R7 (synthetic-data false confidence) compounds it: a 12-case slice cannot ground a readiness claim.

Decision Needed

Approve the scope and sequencing to close the semantic-grader gap before any pilot conversation: (1) the unsupported_claim_semantic blocking gate is now wired credential-free (M7a done) and the dataset is expanded to 24 cases (M8 done); (2) the remaining step is to fund a credentialed semantic audit of the expanded adversarial_v2 candidate drafts and run the gate on it; (3) fund repeat credentialed runs to characterize variance. No request to pilot is being made.

Recommendation

Hold at NOT READY FOR PILOT. Do not pilot. The deterministic loop is a clean proof of the eval machinery, not evidence of customer-comms safety. Invest in semantic-grader hardening and dataset expansion; re-review when the Pilot Only With Constraints bar in deployment/pilot_readiness_review.md is in reach.

Next Milestone

Execute the now-wired M7b pipeline with a key and record the outcome: make eval-card-adversarial-v2-llm → make semantic-model-decisions-adversarial-v2-llm-v0 / -v1 → make semantic-audit-summary-adversarial-v2-llm → make semantic-gate-adversarial-v2-llm. Both the rails (M7b) and the gate itself (M7a, negative control passes) are in place; the only remaining step is the credentialed run. M7 closes only if that audit sustains 0 semantic-only UNSAFE_CUSTOMER_COMMS across multiple runs, which is the entry condition for moving from DO NOT PILOT toward PILOT WITH CONSTRAINTS in deployment/acceptance_criteria.md. If the gate blocks, the flagged drafts become pending_review regression seeds (the adversarial v1 pattern), not a prompt tweak in this chunk. Owners and gates: deployment/delivery_plan.md.