Skip to content

spec 020: deterministic claim caching + planning reference-only verification#275

Merged
jeremymanning merged 16 commits into
mainfrom
020-deterministic-claim-caching
Jun 5, 2026
Merged

spec 020: deterministic claim caching + planning reference-only verification#275
jeremymanning merged 16 commits into
mainfrom
020-deterministic-claim-caching

Conversation

@jeremymanning

Copy link
Copy Markdown
Member

Summary

Re-architects the claim-verification stack (specs 016→019) to fix two
maintainer-observed failures, reusing existing machinery rather than
duplicating it:

  1. Claim waffling — a verified value was made → flagged → corrected → then
    overwritten next round → re-flagged, never stabilizing.
  2. Planning thrash — planning docs asserted low-level empirical values
    (e.g. the prime-knot count), which were fully fetched/grounded and, when
    wrong, drove kickbacks that exhausted the convergence cap toward human
    escalation (the PROJ-552 "49 vs 9,988" stall).

Accuracy stays paramount: references are verified everywhere, and paper-stage
verification is made stronger (deterministic + immutable). Only the
location of low-level-claim verification changes — out of planning.

Full speckit pipeline: plan → tasks → analyze (7 findings, all fixed) → implement → verify.

Part A — planning = references-only

  • Stage signal (claims/stage.py): is_planning_stage SSoT; threaded from the
    speckit commands via claim_stage_label() and the convergence reviser's
    self-consistency pass (so planning revisions are gated too — plan_cmd/
    paper_plan_cmd can't be told apart by slash_command_name).
  • Strip/smooth (claims/smooth.py): a detected low-level claim is replaced by
    a higher-level statement (LLM rewrite → deterministic re-detect guard →
    clause-removal fallback). References still fail-closed via the existing path.
  • Recall + GUARANTEE: the LLM extractor is tuned for research claims and has
    low recall on planning scope/metadata, so a planning-recall prompt addendum
    raises it (0→6 on the real plan) and a deterministic final pass
    (claims/planning_scan.py) removes every high-confidence empirical value
    (comma-grouped counts, percentages, timed quantities) regardless of LLM recall,
    while preserving structural numbers (scope bounds, indices, ranges, versions,
    dates, hashes). This closes SC-001 deterministically.
  • Templates/prompts (FR-006): spec/plan templates + speckit SKILLs + the
    data-resources panel defer empirical specifics; the 20 stale per-project
    template copies are re-synced and a gate-enforced sync test prevents future drift.

Part B — deterministic frozen cache

  • Freeze (FR-009/010/011): reuse keyed by the value-independent subject_key;
    a VERIFIED (kind, subject_key) record is adopted without re-resolution and is
    never re-opened by a transient failure or a pending re-extraction.
  • Durable placeholder (FR-007/008/SC-007): the canonical stored doc carries a
    durable {{claim:id}} for each verified claim (never the baked value, so it is
    never re-extracted); render_view/render_artifact_view substitute values for
    the human/published view. The proven-good value-correction path is preserved
    (render()'s default mode is unchanged; citation repair re-anchors on the
    placeholder). Convergence operates on the placeholder form (per the clarification).
  • Value-independent cache key (FR-012): the fill verdict key drops the asserted
    value and keys on subject_key, so "49 at 13" and "9,988 at 13" share one entry.
  • Asserted-value SSoT: unified the two duplicated asserted-value
    implementations into one pointer.select_asserted_token/asserted_value (dup
    regex removed) and added copula-following disambiguation, ordered after the
    comma rule so every grouped case is byte-identical to before (zero regression)
    and only comma-less prose like "…13 is 27635"→27635 is fixed.

Verification

  • Offline gate: 2295 passed / 0 failed (baseline 2165 → +130, no regression).
  • Real-call (free Dartmouth models): planning strip on the real PROJ-552 plan
    (anti-stall + 27,635/2,000/95% stripped, structural numbers preserved); fabricated
    DOI still blocks; paper-stage freeze (FR-011) through the real pipeline; SC-005
    no-regression on the proven-good suite (10/10); PROJ-552 stall regression (SC-006).
  • ruff + mypy clean throughout.

Honest note

The LLM extractor is non-deterministic (one run found 6 empirical claims on the
real plan, another found 0). Improving its prompt raises recall but cannot
guarantee removal; the deterministic planning_scan pass is what provides the
guarantee. Tests reflect this: a deterministic test pins the addendum wiring + the
strip behavior, and the real-plan test passes even on a run where the LLM found 0.

Review guidance

Touches the claim layer that runs on every artifact across 600+ projects. Highest-
risk areas to review: the durable-placeholder render boundary (claims/pointer.py,
claims/service.py) and the subject_key cache key + asserted-value disambiguation
(claims/pointer.py, claims/canonical.py, fill/service.py).

🤖 Generated with Claude Code

jeremymanning and others added 16 commits June 4, 2026 12:58
…fication

Phase 0/1 planning artifacts for spec 020, grounded in the actual claim-layer
code (verified this session):
- plan.md: Technical Context + Constitution Check (no violations) + structure
- research.md: D1-D7 design decisions (stage signal, strip/smooth, freeze,
  durable placeholder, value-independent cache key, templates, testing)
- data-model.md: Claim identity/lifecycle, frozen store, placeholder, stage
  class, strip/smooth transform
- contracts/claim-layer-contracts.md: C1-C10 internal interface contracts
- quickstart.md: offline + real-call verification matrix (SC->evidence map)
- CLAUDE.md: agent-context plan pointer -> spec 020

Part A = planning (specify/clarify/plan/tasks) verifies references only and
strips/smooths low-level claims; Part B = deterministic frozen cache keyed by
value-independent subject_key. Reuses existing machinery (FR-015); no new deps.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- tasks.md: 35 tasks organized by user story (US1/US2 P1, US3 P2), TDD,
  with dependency graph + parallel examples + MVP strategy (US1).
- analyze remediations (all severities):
  * I1: spec FR-002/Key-Entities -> planning low-level class = all
    non-citation kinds (adds RESULT), consistent with US1 references-only.
  * C1: record per-project template-copy duplication in plan Complexity
    Tracking with justification + de-dup follow-up (Principle I honesty).
  * G1: T014 asserts distinct-subject non-collision (FR-009 edge case).
  * G2: T032 adds FR-015 reuse/no-duplication audit.
  * A1: T012 verifies actual emitted stage_label strings.
  * A2: T029 specifies the low-level-detector assertion mechanism.
  * X1: quickstart B7 row for the US3 doc-scope real-call test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…003/004)

Foundational + US1 (the MVP unblocker), TDD:
- claims/stage.py: is_planning_stage() SSoT (planning = spec/clarify/plan/tasks).
- claims/smooth.py: strip_and_smooth() — LLM rewrite -> deterministic re-detect
  guard (reuses canonical._asserted_value/pointer token logic) -> deterministic
  clause-removal fallback; idempotent + claim-free, citations preserved.
- claims/service.py: process_document(stage_label=...) planning branch — extract
  to FIND low-level claims, strip/smooth them, NO resolve/fill/ground/marker/
  kickback; references untouched (verified by the separate F-18 path).
- fill: channels_for/fill_claim stage_label gate (defense-in-depth; returns []/
  blocked for low-level kinds in planning).
- speckit: SlashCommandContext threads stage via claim_stage_label() (NOT derivable
  from slash_command_name — plan_cmd & paper_plan_cmd share "speckit.plan");
  specify/clarify->"spec", plan->"plan", tasks->"tasks"; paper/* inherit None=full.

Tests: 30 offline (stage truth table, strip/smooth idempotence+fallback, planning
skip+no-marker+no-block, fill-boundary gate) + T007 real-call (gated). Baseline
2165 unaffected (claim/fill suites green). ruff+mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…12/013)

- state/claims.py: load_verified_by_subject() — the frozen (kind, subject_key)
  index over VERIFIED records (single value-independent source of truth, FR-013).
- claims/service.py: resolve_registered_claims adopts a VERIFIED subject-twin
  WITHOUT re-resolving — closes the gap where a claim left PENDING in the registry
  (a prior round's transient failure / re-extraction) was re-resolved every round
  and waffled (FR-009/010/011). process_document step-2 reuse now calls the helper
  (SSoT, no inline duplicate).
- fill/service.py: drop resolved_value from the verdict cache key (FR-012) so the
  pre- and post-resolution lookups of one fact share an entry. Scoped note in the
  docstring: the stronger "different asserted values collide" needs hardening the
  shared canonical._asserted_value disambiguation primitive (broad blast radius);
  the practical waffling is already closed by the freeze + git-tracked store.

Tests: test_frozen_claim_cache (4: index, twin-adopt-no-resolve, no-reopen-on-
transient, distinct-subject-non-collision) + test_value_independent_cache_key (3).
No regression: claim suites 86 green, fill+grounding 58 green. ruff+mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
US3 — templates/prompts defer empirical specifics to implementation:
- .specify/templates/{spec,plan}-template.md: success criteria / technical
  context reframed to "what is measured + reference", deferral note added.
- .claude/skills/speckit-{specify,clarify,plan,tasks}/SKILL.md: RQ+method+
  references scope note; defer specific empirical values.
- agents/prompts/panels/panel_plan_data_resources.md: cite source for any
  quantity; do not kick back a plan for omitting a specific empirical number.
- PROJ-552 per-project template copies synced byte-identical (FR-006).

Real-call tests (gated LLMXIVE_REAL_TESTS, free models, no mocks):
- test_planning_references_only (US1: lowlevel stripped + fabricated DOI blocks)
- test_paper_stage_freeze (US2: exact-count verify + freeze, no waffle — SC-005/003)
- test_proj552_planning_no_kickback (SC-006 headline regression)
- test_planning_doc_scope (US3: reference-anchored doc has no low-level claims)

FR-013 verified: state/claims/ git-tracked, state/grounding-cache/ gitignored.
checklist updated with pipeline status + the open FR-007/008/SC-007 design fork.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
US1+US2+US3 done & offline-verified (2198 passed, no regression vs 2165 baseline);
FR-007/008/SC-007 durable placeholder surfaced as an open design fork.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ture

- All 4 spec-020 real-call tests use the free model qwen.qwen3.5-122b (the paid
  qwen-2.5-72b-instruct was rejected by the Principle-IV free-only guard).
- test_fabricated_doi_blocks: use a real sha256 artifact_hash and a schema-valid
  project id/path (citations store requires ^projects/PROJ-\d{3,}-[a-z0-9-]+/.+).

US1 real-call GREEN end-to-end: low-level "49" stripped in a planning stage (SC-001),
fabricated DOI still blocks (FR-004/SC-002), PROJ-552 stall gone (SC-006).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…determinism

The first real-call run exposed two FLAKY tests (not code bugs — the offline gate
2198/0 confirms all logic, incl. test_exact_count_no_regress):
- test_planning_doc_scope: the extractor (qwen) is non-deterministic on short docs,
  so an extraction-based "BAD doc trips the detector" check is unreliable. US3's
  real deliverable is the FR-006 template/skill/panel guidance — now pinned
  DETERMINISTICALLY (offline): each producing file carries the deferral guidance and
  the PROJ-552 per-project copies match the shared templates byte-for-byte.
- test_paper_stage_freeze: depended on flaky real OEIS resolution + extraction of
  9,988. Restructured to the robust FR-011 freeze assertion: seed a VERIFIED record,
  push a WRONG value (27,635) through the REAL pipeline, assert the frozen 9,988 is
  never re-opened/overwritten. PASSES (real-call, 36s). Proven-good initial
  verification (SC-005) is covered by the existing specs 016-019 real-call suite.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final verification: offline gate 2206 passed / 0 failed (baseline 2165, +41, no
regression); mypy clean on all 11 changed source files; real-call sign-off green
(US1 3/3, US2 freeze 1/1, US3 8/8 deterministic, SC-005 existing suite 10/10).
Remaining: T016/T021/T022 = the durable-placeholder design fork (FR-007/008/SC-007),
surfaced for maintainer decision (see checklist + notes).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The canonical STORED form now carries a durable {{claim:id}} placeholder for each
VERIFIED claim instead of baking the value into prose — so a verified value is
never re-extractable round-to-round (SC-007), killing the waffling at its root.
The proven-good value-correction path is preserved (render()'s default mode is
unchanged → all existing render tests stay green).

- pointer.py: render(placeholder_verified=True) keeps the {{claim:id}}; new
  render_view() substitutes verified values for the human/review/publish view
  (FR-008); pointer_ids() + drop_orphan_pointers() for the round-trip.
- gate.py: strip_claim_artifacts(preserve_pointers=True) keeps durable
  placeholders (only transient [UNRESOLVED-CLAIM:] markers are stripped).
- service.py: process_document full path renders placeholders, carries over
  prior-round placeholders from the registry, drops orphans; the no-new-claims
  early return does the same. New render_artifact_view() = the FR-008 read-time
  projection from the frozen state/claims store.
- citation_repair.py: anchor on the {{claim:id}} placeholder (unambiguous) when
  present, else the value — so the citation is still corrected in the stored form.

Tests: test_durable_placeholder (5) + the round-trip is a fixed point. Full offline
gate 2211 passed / 0 failed (baseline 2165, +46, no regression). ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (US1)

Discovered while wiring FR-008: the convergence reviser's self-consistency
claim pass (_self_consistency._verify_claims -> process_document) ran with NO
stage_label, so a PLANNING-stage REVISION still did full verification (resolve/
ground/mark low-level claims) — bypassing US1 inside the convergence loop. Now
run_with_self_consistency threads stage_label -> _clean_citations -> _verify_claims
-> process_document, and the planning revisers pass it: spec_reviser="spec",
plan_reviser="plan", tasks_reviser="tasks". flesh_out (idea stage, clarification
Q1) and paper_*/implementer keep full verification (default None).

Reviser suites green (27); ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T016/T021/T022 done. Checklist + tracker updated: FR-007/008/SC-007 implemented
(durable placeholder + render_view + reviser stage threading); proven-good paths
preserved; offline gate 2211/0.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…key cache key

Hardens the shared asserted-value primitive (the follow-up): one source of truth
in pointer.py, deterministic disambiguation, value-independent cache key.

- pointer.select_asserted_token()/asserted_value(): the SINGLE primitive for
  "which number does this text assert". Disambiguation in priority order:
  thousands-separated magnitude -> copula-following answer (is/=/: ...) -> first.
  The copula rule is ordered AFTER the comma rule, so every comma-grouped case is
  byte-identical to before (zero regression) and copula only fixes comma-less
  multi-number prose like "... crossing number 13 is 27635" -> 27635 (not 13).
- canonical.py: _asserted_value delegates to pointer.asserted_value; the duplicate
  _NUMBER_IN_TEXT_RE regex is removed (imports pointer's) — Principle I.
- fill/service._cache_key_parts: keys on the value-EXCLUDED subject_key (now sound
  because asserted-value is robust), bare-number fallback to fact_fingerprint. So
  "49 at 13" and "9,988 at 13" (same subject, different asserted value) share one
  verdict entry (FR-012 strong form), the verdict carrying the correct value.

Template de-dup (Principle I): re-synced 20 stale per-project .specify/templates
copies to the shared templates (0 drift) + tests/test_template_sync enforces the
invariant in the gate so future drift fails instead of silently diverging.

Tests: strong cache-key tests (49==9988 same key; comma-less collides) + the real
PROJ-552 plan.md real-call test (the actual "27,635 at crossing 13" is stripped in
a planning stage). Unit suite 1781 passed / 0 failed; asserted-value blast radius
103 passed. ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ord recall gap

Real-project testing on the ACTUAL PROJ-552 plan.md confirmed the headline:
a planning stage on the real plan emits NO [UNRESOLVED-CLAIM:] marker, never
blocks, registers nothing — the PROJ-552 stall is genuinely fixed end-to-end.

It also surfaced a real limitation (now documented in the test): extract_claims
is tuned for asserted research claims and has LOW recall on planning-doc scope/
metadata — it returned 0 claims for this plan, so an UNDETECTED value like
"~27,635 at crossing number 13 (downloaded but not fully validated)" is not
stripped. The test now asserts the anti-stall guarantee (gates progress) + that
any DETECTED value IS smoothed, rather than over-asserting SC-001's "zero remain"
which detection recall cannot guarantee on arbitrary planning prose.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…w-up)

Real-project testing on the actual PROJ-552 plan found extract_claims returns 0
claims on planning scope/metadata (it is tuned for asserted research claims), so
the planning strip/smooth had nothing to act on and low-level values survived.

Per the maintainer's choice (improve the extractor, not add a separate
deterministic detector): thread stage_label into extract_claims and, for a
planning stage, append a PLANNING-RECALL addendum instructing it to ALSO flag
specific empirical values stated as scope/metadata/goals (exact counts, dataset
sizes, measured quantities/durations, percentages) while explicitly NOT extracting
structural numbers (phase/section/version, dates, identifiers, scope bounds like
"<=13 crossings"). Threaded service._process_planning_document -> extract_claims.

Diagnostic on the real plan: recall 0 -> 6 verbatim empirical claims (incl. the
wrong "27,635 at crossing 13", ">=95% completeness", "~2,000 prime knots",
"within 15 minutes"); no structural over-extraction.

HONEST LIMITATION (verified): extract_claims is an LLM call and is
NON-DETERMINISTIC — a later run found 0 again. So the addendum raises recall a lot
but cannot GUARANTEE SC-001 per run; only a deterministic detector could. Tests
reflect this: test_planning_recall_prompt pins the addendum wiring deterministically
(present for planning, absent for paper/impl); the real PROJ-552 plan test asserts
the RELIABLE guarantees (anti-stall + no over-stripping), value-removal best-effort.

Offline suites covering the change: 163 passed / 0 failed. ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…removal

The LLM extractor's recall is non-deterministic, so it alone cannot guarantee
SC-001 (a run found 0 claims on the real plan). claims/planning_scan adds a
DETERMINISTIC final pass to the planning branch that removes every high-confidence
empirical value regardless of LLM recall:

  STRIPPED  comma-grouped counts (27,635; 1,701,936; 2,000), percentages (≥95%,
            10%), timed/measured quantities (15 minutes, 60s, 1s)
  PRESERVED structural numbers — scope bounds (≤13 crossings), indices (crossing
            number 13), ranges (1-10), versions (1.0.0), dates (2026-05-29),
            identifiers (SHA-256, ds002800), bare decimal thresholds (0.7);
            markdown headers + fenced code untouched.

Runs AFTER the LLM strip/smooth (which handles prose quality on what it detected)
as the guarantee net. Idempotent. process_document(stage=planning) now strips the
empirical value even when extract_claims returns [] (proven in test_planning_scan).

Real PROJ-552 plan end-to-end: the wrong "27,635 at crossing 13", "~2,000",
"≥95%", "15 minutes" are removed while "crossing number 13"/"Phase 1"/dates/SHA
survive — and it passes even on a run where the LLM extractor found 0 claims.

Offline gate 2295 passed / 0 failed; planning_scan suite 18/0. ruff + mypy clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jeremymanning jeremymanning merged commit a375021 into main Jun 5, 2026
4 of 5 checks passed
jeremymanning added a commit that referenced this pull request Jun 7, 2026
…lips failing CI)

The real-call gate's pre-flight (llmxive.checks.backends) probed list_models()
single-shot, so a momentary non-JSON blip from Dartmouth's models endpoint
("Expecting value: line 2 column 1") failed the WHOLE gate even when the backend
was healthy — the recurring cause of red real-call CI on PRs #275/#281/#282 (the
job never reached the tests).

Probe now retries 3x with backoff (~6s total); a transient error clears, a genuine
sustained outage still fails fast. Test added (transient-clears / first-try /
sustained-outage-still-fails).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant