spec 020: deterministic claim caching + planning reference-only verification by jeremymanning · Pull Request #275 · ContextLab/llmXive

jeremymanning · 2026-06-05T02:52:57Z

Summary

Re-architects the claim-verification stack (specs 016→019) to fix two
maintainer-observed failures, reusing existing machinery rather than
duplicating it:

Claim waffling — a verified value was made → flagged → corrected → then
overwritten next round → re-flagged, never stabilizing.
Planning thrash — planning docs asserted low-level empirical values
(e.g. the prime-knot count), which were fully fetched/grounded and, when
wrong, drove kickbacks that exhausted the convergence cap toward human
escalation (the PROJ-552 "49 vs 9,988" stall).

Accuracy stays paramount: references are verified everywhere, and paper-stage
verification is made stronger (deterministic + immutable). Only the
location of low-level-claim verification changes — out of planning.

Full speckit pipeline: plan → tasks → analyze (7 findings, all fixed) → implement → verify.

Part A — planning = references-only

Stage signal (claims/stage.py): is_planning_stage SSoT; threaded from the
speckit commands via claim_stage_label() and the convergence reviser's
self-consistency pass (so planning revisions are gated too — plan_cmd/
paper_plan_cmd can't be told apart by slash_command_name).
Strip/smooth (claims/smooth.py): a detected low-level claim is replaced by
a higher-level statement (LLM rewrite → deterministic re-detect guard →
clause-removal fallback). References still fail-closed via the existing path.
Recall + GUARANTEE: the LLM extractor is tuned for research claims and has
low recall on planning scope/metadata, so a planning-recall prompt addendum
raises it (0→6 on the real plan) and a deterministic final pass
(claims/planning_scan.py) removes every high-confidence empirical value
(comma-grouped counts, percentages, timed quantities) regardless of LLM recall,
while preserving structural numbers (scope bounds, indices, ranges, versions,
dates, hashes). This closes SC-001 deterministically.
Templates/prompts (FR-006): spec/plan templates + speckit SKILLs + the
data-resources panel defer empirical specifics; the 20 stale per-project
template copies are re-synced and a gate-enforced sync test prevents future drift.

Part B — deterministic frozen cache

Freeze (FR-009/010/011): reuse keyed by the value-independent subject_key;
a VERIFIED (kind, subject_key) record is adopted without re-resolution and is
never re-opened by a transient failure or a pending re-extraction.
Durable placeholder (FR-007/008/SC-007): the canonical stored doc carries a
durable {{claim:id}} for each verified claim (never the baked value, so it is
never re-extracted); render_view/render_artifact_view substitute values for
the human/published view. The proven-good value-correction path is preserved
(render()'s default mode is unchanged; citation repair re-anchors on the
placeholder). Convergence operates on the placeholder form (per the clarification).
Value-independent cache key (FR-012): the fill verdict key drops the asserted
value and keys on subject_key, so "49 at 13" and "9,988 at 13" share one entry.
Asserted-value SSoT: unified the two duplicated asserted-value
implementations into one pointer.select_asserted_token/asserted_value (dup
regex removed) and added copula-following disambiguation, ordered after the
comma rule so every grouped case is byte-identical to before (zero regression)
and only comma-less prose like "…13 is 27635"→27635 is fixed.

Verification

Offline gate: 2295 passed / 0 failed (baseline 2165 → +130, no regression).
Real-call (free Dartmouth models): planning strip on the real PROJ-552 plan
(anti-stall + 27,635/2,000/95% stripped, structural numbers preserved); fabricated
DOI still blocks; paper-stage freeze (FR-011) through the real pipeline; SC-005
no-regression on the proven-good suite (10/10); PROJ-552 stall regression (SC-006).
ruff + mypy clean throughout.

Honest note

The LLM extractor is non-deterministic (one run found 6 empirical claims on the
real plan, another found 0). Improving its prompt raises recall but cannot
guarantee removal; the deterministic planning_scan pass is what provides the
guarantee. Tests reflect this: a deterministic test pins the addendum wiring + the
strip behavior, and the real-plan test passes even on a run where the LLM found 0.

Review guidance

Touches the claim layer that runs on every artifact across 600+ projects. Highest-
risk areas to review: the durable-placeholder render boundary (claims/pointer.py,
claims/service.py) and the subject_key cache key + asserted-value disambiguation
(claims/pointer.py, claims/canonical.py, fill/service.py).

🤖 Generated with Claude Code

…fication Phase 0/1 planning artifacts for spec 020, grounded in the actual claim-layer code (verified this session): - plan.md: Technical Context + Constitution Check (no violations) + structure - research.md: D1-D7 design decisions (stage signal, strip/smooth, freeze, durable placeholder, value-independent cache key, templates, testing) - data-model.md: Claim identity/lifecycle, frozen store, placeholder, stage class, strip/smooth transform - contracts/claim-layer-contracts.md: C1-C10 internal interface contracts - quickstart.md: offline + real-call verification matrix (SC->evidence map) - CLAUDE.md: agent-context plan pointer -> spec 020 Part A = planning (specify/clarify/plan/tasks) verifies references only and strips/smooths low-level claims; Part B = deterministic frozen cache keyed by value-independent subject_key. Reuses existing machinery (FR-015); no new deps. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- tasks.md: 35 tasks organized by user story (US1/US2 P1, US3 P2), TDD, with dependency graph + parallel examples + MVP strategy (US1). - analyze remediations (all severities): * I1: spec FR-002/Key-Entities -> planning low-level class = all non-citation kinds (adds RESULT), consistent with US1 references-only. * C1: record per-project template-copy duplication in plan Complexity Tracking with justification + de-dup follow-up (Principle I honesty). * G1: T014 asserts distinct-subject non-collision (FR-009 edge case). * G2: T032 adds FR-015 reuse/no-duplication audit. * A1: T012 verifies actual emitted stage_label strings. * A2: T029 specifies the low-level-detector assertion mechanism. * X1: quickstart B7 row for the US3 doc-scope real-call test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…003/004) Foundational + US1 (the MVP unblocker), TDD: - claims/stage.py: is_planning_stage() SSoT (planning = spec/clarify/plan/tasks). - claims/smooth.py: strip_and_smooth() — LLM rewrite -> deterministic re-detect guard (reuses canonical._asserted_value/pointer token logic) -> deterministic clause-removal fallback; idempotent + claim-free, citations preserved. - claims/service.py: process_document(stage_label=...) planning branch — extract to FIND low-level claims, strip/smooth them, NO resolve/fill/ground/marker/ kickback; references untouched (verified by the separate F-18 path). - fill: channels_for/fill_claim stage_label gate (defense-in-depth; returns []/ blocked for low-level kinds in planning). - speckit: SlashCommandContext threads stage via claim_stage_label() (NOT derivable from slash_command_name — plan_cmd & paper_plan_cmd share "speckit.plan"); specify/clarify->"spec", plan->"plan", tasks->"tasks"; paper/* inherit None=full. Tests: 30 offline (stage truth table, strip/smooth idempotence+fallback, planning skip+no-marker+no-block, fill-boundary gate) + T007 real-call (gated). Baseline 2165 unaffected (claim/fill suites green). ruff+mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…12/013) - state/claims.py: load_verified_by_subject() — the frozen (kind, subject_key) index over VERIFIED records (single value-independent source of truth, FR-013). - claims/service.py: resolve_registered_claims adopts a VERIFIED subject-twin WITHOUT re-resolving — closes the gap where a claim left PENDING in the registry (a prior round's transient failure / re-extraction) was re-resolved every round and waffled (FR-009/010/011). process_document step-2 reuse now calls the helper (SSoT, no inline duplicate). - fill/service.py: drop resolved_value from the verdict cache key (FR-012) so the pre- and post-resolution lookups of one fact share an entry. Scoped note in the docstring: the stronger "different asserted values collide" needs hardening the shared canonical._asserted_value disambiguation primitive (broad blast radius); the practical waffling is already closed by the freeze + git-tracked store. Tests: test_frozen_claim_cache (4: index, twin-adopt-no-resolve, no-reopen-on- transient, distinct-subject-non-collision) + test_value_independent_cache_key (3). No regression: claim suites 86 green, fill+grounding 58 green. ruff+mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

US3 — templates/prompts defer empirical specifics to implementation: - .specify/templates/{spec,plan}-template.md: success criteria / technical context reframed to "what is measured + reference", deferral note added. - .claude/skills/speckit-{specify,clarify,plan,tasks}/SKILL.md: RQ+method+ references scope note; defer specific empirical values. - agents/prompts/panels/panel_plan_data_resources.md: cite source for any quantity; do not kick back a plan for omitting a specific empirical number. - PROJ-552 per-project template copies synced byte-identical (FR-006). Real-call tests (gated LLMXIVE_REAL_TESTS, free models, no mocks): - test_planning_references_only (US1: lowlevel stripped + fabricated DOI blocks) - test_paper_stage_freeze (US2: exact-count verify + freeze, no waffle — SC-005/003) - test_proj552_planning_no_kickback (SC-006 headline regression) - test_planning_doc_scope (US3: reference-anchored doc has no low-level claims) FR-013 verified: state/claims/ git-tracked, state/grounding-cache/ gitignored. checklist updated with pipeline status + the open FR-007/008/SC-007 design fork. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

US1+US2+US3 done & offline-verified (2198 passed, no regression vs 2165 baseline); FR-007/008/SC-007 durable placeholder surfaced as an open design fork. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ture - All 4 spec-020 real-call tests use the free model qwen.qwen3.5-122b (the paid qwen-2.5-72b-instruct was rejected by the Principle-IV free-only guard). - test_fabricated_doi_blocks: use a real sha256 artifact_hash and a schema-valid project id/path (citations store requires ^projects/PROJ-\d{3,}-[a-z0-9-]+/.+). US1 real-call GREEN end-to-end: low-level "49" stripped in a planning stage (SC-001), fabricated DOI still blocks (FR-004/SC-002), PROJ-552 stall gone (SC-006). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…determinism The first real-call run exposed two FLAKY tests (not code bugs — the offline gate 2198/0 confirms all logic, incl. test_exact_count_no_regress): - test_planning_doc_scope: the extractor (qwen) is non-deterministic on short docs, so an extraction-based "BAD doc trips the detector" check is unreliable. US3's real deliverable is the FR-006 template/skill/panel guidance — now pinned DETERMINISTICALLY (offline): each producing file carries the deferral guidance and the PROJ-552 per-project copies match the shared templates byte-for-byte. - test_paper_stage_freeze: depended on flaky real OEIS resolution + extraction of 9,988. Restructured to the robust FR-011 freeze assertion: seed a VERIFIED record, push a WRONG value (27,635) through the REAL pipeline, assert the frozen 9,988 is never re-opened/overwritten. PASSES (real-call, 36s). Proven-good initial verification (SC-005) is covered by the existing specs 016-019 real-call suite. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Final verification: offline gate 2206 passed / 0 failed (baseline 2165, +41, no regression); mypy clean on all 11 changed source files; real-call sign-off green (US1 3/3, US2 freeze 1/1, US3 8/8 deterministic, SC-005 existing suite 10/10). Remaining: T016/T021/T022 = the durable-placeholder design fork (FR-007/008/SC-007), surfaced for maintainer decision (see checklist + notes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The canonical STORED form now carries a durable {{claim:id}} placeholder for each VERIFIED claim instead of baking the value into prose — so a verified value is never re-extractable round-to-round (SC-007), killing the waffling at its root. The proven-good value-correction path is preserved (render()'s default mode is unchanged → all existing render tests stay green). - pointer.py: render(placeholder_verified=True) keeps the {{claim:id}}; new render_view() substitutes verified values for the human/review/publish view (FR-008); pointer_ids() + drop_orphan_pointers() for the round-trip. - gate.py: strip_claim_artifacts(preserve_pointers=True) keeps durable placeholders (only transient [UNRESOLVED-CLAIM:] markers are stripped). - service.py: process_document full path renders placeholders, carries over prior-round placeholders from the registry, drops orphans; the no-new-claims early return does the same. New render_artifact_view() = the FR-008 read-time projection from the frozen state/claims store. - citation_repair.py: anchor on the {{claim:id}} placeholder (unambiguous) when present, else the value — so the citation is still corrected in the stored form. Tests: test_durable_placeholder (5) + the round-trip is a fixed point. Full offline gate 2211 passed / 0 failed (baseline 2165, +46, no regression). ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… (US1) Discovered while wiring FR-008: the convergence reviser's self-consistency claim pass (_self_consistency._verify_claims -> process_document) ran with NO stage_label, so a PLANNING-stage REVISION still did full verification (resolve/ ground/mark low-level claims) — bypassing US1 inside the convergence loop. Now run_with_self_consistency threads stage_label -> _clean_citations -> _verify_claims -> process_document, and the planning revisers pass it: spec_reviser="spec", plan_reviser="plan", tasks_reviser="tasks". flesh_out (idea stage, clarification Q1) and paper_*/implementer keep full verification (default None). Reviser suites green (27); ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

T016/T021/T022 done. Checklist + tracker updated: FR-007/008/SC-007 implemented (durable placeholder + render_view + reviser stage threading); proven-good paths preserved; offline gate 2211/0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…key cache key Hardens the shared asserted-value primitive (the follow-up): one source of truth in pointer.py, deterministic disambiguation, value-independent cache key. - pointer.select_asserted_token()/asserted_value(): the SINGLE primitive for "which number does this text assert". Disambiguation in priority order: thousands-separated magnitude -> copula-following answer (is/=/: ...) -> first. The copula rule is ordered AFTER the comma rule, so every comma-grouped case is byte-identical to before (zero regression) and copula only fixes comma-less multi-number prose like "... crossing number 13 is 27635" -> 27635 (not 13). - canonical.py: _asserted_value delegates to pointer.asserted_value; the duplicate _NUMBER_IN_TEXT_RE regex is removed (imports pointer's) — Principle I. - fill/service._cache_key_parts: keys on the value-EXCLUDED subject_key (now sound because asserted-value is robust), bare-number fallback to fact_fingerprint. So "49 at 13" and "9,988 at 13" (same subject, different asserted value) share one verdict entry (FR-012 strong form), the verdict carrying the correct value. Template de-dup (Principle I): re-synced 20 stale per-project .specify/templates copies to the shared templates (0 drift) + tests/test_template_sync enforces the invariant in the gate so future drift fails instead of silently diverging. Tests: strong cache-key tests (49==9988 same key; comma-less collides) + the real PROJ-552 plan.md real-call test (the actual "27,635 at crossing 13" is stripped in a planning stage). Unit suite 1781 passed / 0 failed; asserted-value blast radius 103 passed. ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ord recall gap Real-project testing on the ACTUAL PROJ-552 plan.md confirmed the headline: a planning stage on the real plan emits NO [UNRESOLVED-CLAIM:] marker, never blocks, registers nothing — the PROJ-552 stall is genuinely fixed end-to-end. It also surfaced a real limitation (now documented in the test): extract_claims is tuned for asserted research claims and has LOW recall on planning-doc scope/ metadata — it returned 0 claims for this plan, so an UNDETECTED value like "~27,635 at crossing number 13 (downloaded but not fully validated)" is not stripped. The test now asserts the anti-stall guarantee (gates progress) + that any DETECTED value IS smoothed, rather than over-asserting SC-001's "zero remain" which detection recall cannot guarantee on arbitrary planning prose. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…w-up) Real-project testing on the actual PROJ-552 plan found extract_claims returns 0 claims on planning scope/metadata (it is tuned for asserted research claims), so the planning strip/smooth had nothing to act on and low-level values survived. Per the maintainer's choice (improve the extractor, not add a separate deterministic detector): thread stage_label into extract_claims and, for a planning stage, append a PLANNING-RECALL addendum instructing it to ALSO flag specific empirical values stated as scope/metadata/goals (exact counts, dataset sizes, measured quantities/durations, percentages) while explicitly NOT extracting structural numbers (phase/section/version, dates, identifiers, scope bounds like "<=13 crossings"). Threaded service._process_planning_document -> extract_claims. Diagnostic on the real plan: recall 0 -> 6 verbatim empirical claims (incl. the wrong "27,635 at crossing 13", ">=95% completeness", "~2,000 prime knots", "within 15 minutes"); no structural over-extraction. HONEST LIMITATION (verified): extract_claims is an LLM call and is NON-DETERMINISTIC — a later run found 0 again. So the addendum raises recall a lot but cannot GUARANTEE SC-001 per run; only a deterministic detector could. Tests reflect this: test_planning_recall_prompt pins the addendum wiring deterministically (present for planning, absent for paper/impl); the real PROJ-552 plan test asserts the RELIABLE guarantees (anti-stall + no over-stripping), value-removal best-effort. Offline suites covering the change: 163 passed / 0 failed. ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…removal The LLM extractor's recall is non-deterministic, so it alone cannot guarantee SC-001 (a run found 0 claims on the real plan). claims/planning_scan adds a DETERMINISTIC final pass to the planning branch that removes every high-confidence empirical value regardless of LLM recall: STRIPPED comma-grouped counts (27,635; 1,701,936; 2,000), percentages (≥95%, 10%), timed/measured quantities (15 minutes, 60s, 1s) PRESERVED structural numbers — scope bounds (≤13 crossings), indices (crossing number 13), ranges (1-10), versions (1.0.0), dates (2026-05-29), identifiers (SHA-256, ds002800), bare decimal thresholds (0.7); markdown headers + fenced code untouched. Runs AFTER the LLM strip/smooth (which handles prose quality on what it detected) as the guarantee net. Idempotent. process_document(stage=planning) now strips the empirical value even when extract_claims returns [] (proven in test_planning_scan). Real PROJ-552 plan end-to-end: the wrong "27,635 at crossing 13", "~2,000", "≥95%", "15 minutes" are removed while "crossing number 13"/"Phase 1"/dates/SHA survive — and it passes even on a run where the LLM extractor found 0 claims. Offline gate 2295 passed / 0 failed; planning_scan suite 18/0. ruff + mypy clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…lips failing CI) The real-call gate's pre-flight (llmxive.checks.backends) probed list_models() single-shot, so a momentary non-JSON blip from Dartmouth's models endpoint ("Expecting value: line 2 column 1") failed the WHOLE gate even when the backend was healthy — the recurring cause of red real-call CI on PRs #275/#281/#282 (the job never reached the tests). Probe now retries 3x with backoff (~6s total); a transient error clears, a genuine sustained outage still fails fast. Test added (transient-clears / first-try / sustained-outage-still-fails). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

jeremymanning and others added 16 commits June 4, 2026 12:58

jeremymanning merged commit a375021 into main Jun 5, 2026
4 of 5 checks passed

This was referenced Jun 5, 2026

fix(claims): strip_claim_artifacts must not mangle aligned whitespace #281

Open

ci(real-call): split heavy suite into fast per-PR gate + nightly full run #282

Merged

jeremymanning mentioned this pull request Jun 7, 2026

docs: handoff/resume note — spec-020 follow-ups + real-call CI status #288

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

spec 020: deterministic claim caching + planning reference-only verification#275

spec 020: deterministic claim caching + planning reference-only verification#275
jeremymanning merged 16 commits into
mainfrom
020-deterministic-claim-caching

jeremymanning commented Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeremymanning commented Jun 5, 2026

Summary

Part A — planning = references-only

Part B — deterministic frozen cache

Verification

Honest note

Review guidance

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant