spec 020: deterministic claim caching + planning reference-only verification #275
Merged
…fication
Phase 0/1 planning artifacts for spec 020, grounded in the actual claim-layer code (verified this session):
- plan.md: Technical Context + Constitution Check (no violations) + structure
- research.md: D1-D7 design decisions (stage signal, strip/smooth, freeze, durable placeholder, value-independent cache key, templates, testing)
- data-model.md: Claim identity/lifecycle, frozen store, placeholder, stage class, strip/smooth transform
- contracts/claim-layer-contracts.md: C1-C10 internal interface contracts
- quickstart.md: offline + real-call verification matrix (SC->evidence map)
- CLAUDE.md: agent-context plan pointer -> spec 020
Part A = planning (specify/clarify/plan/tasks) verifies references only and strips/smooths low-level claims; Part B = deterministic frozen cache keyed by value-independent subject_key. Reuses existing machinery (FR-015); no new deps.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- tasks.md: 35 tasks organized by user story (US1/US2 P1, US3 P2), TDD,
with dependency graph + parallel examples + MVP strategy (US1).
- analyze remediations (all severities):
* I1: spec FR-002/Key-Entities -> planning low-level class = all
non-citation kinds (adds RESULT), consistent with US1 references-only.
* C1: record per-project template-copy duplication in plan Complexity
Tracking with justification + de-dup follow-up (Principle I honesty).
* G1: T014 asserts distinct-subject non-collision (FR-009 edge case).
* G2: T032 adds FR-015 reuse/no-duplication audit.
* A1: T012 verifies actual emitted stage_label strings.
* A2: T029 specifies the low-level-detector assertion mechanism.
* X1: quickstart B7 row for the US3 doc-scope real-call test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…003/004)
Foundational + US1 (the MVP unblocker), TDD:
- claims/stage.py: is_planning_stage() SSoT (planning = spec/clarify/plan/tasks).
- claims/smooth.py: strip_and_smooth(): LLM rewrite -> deterministic re-detect guard (reuses canonical._asserted_value/pointer token logic) -> deterministic clause-removal fallback; idempotent + claim-free, citations preserved.
- claims/service.py: process_document(stage_label=...) planning branch: extract to FIND low-level claims, strip/smooth them, NO resolve/fill/ground/marker/kickback; references untouched (verified by the separate F-18 path).
- fill: channels_for/fill_claim stage_label gate (defense-in-depth; returns []/blocked for low-level kinds in planning).
- speckit: SlashCommandContext threads stage via claim_stage_label() (NOT derivable from slash_command_name, since plan_cmd & paper_plan_cmd share "speckit.plan"); specify/clarify->"spec", plan->"plan", tasks->"tasks"; paper/* inherit None=full.
Tests: 30 offline (stage truth table, strip/smooth idempotence+fallback, planning skip+no-marker+no-block, fill-boundary gate) + T007 real-call (gated). Baseline 2165 unaffected (claim/fill suites green). ruff+mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
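The stage truth table described in this commit can be sketched roughly as follows. This is a minimal illustration, not the contents of claims/stage.py: the label set is taken from the commit text, while `stage_label_for_command`, the `COMMAND_STAGE` map, and the command keys `specify_cmd`, `clarify_cmd`, and `tasks_cmd` are hypothetical names introduced here.

```python
from typing import Optional

# Assumption: planning labels mirror "planning = spec/clarify/plan/tasks".
PLANNING_STAGE_LABELS = frozenset({"spec", "clarify", "plan", "tasks"})


def is_planning_stage(stage_label: Optional[str]) -> bool:
    """True when low-level claims should be stripped rather than verified.

    None means no stage was supplied, which keeps full verification.
    """
    return stage_label is not None and stage_label in PLANNING_STAGE_LABELS


# Hypothetical map keyed by command identity, not by slash_command_name,
# because plan_cmd and paper_plan_cmd share the name "speckit.plan".
COMMAND_STAGE = {
    "specify_cmd": "spec",
    "clarify_cmd": "spec",
    "plan_cmd": "plan",
    "tasks_cmd": "tasks",
}


def stage_label_for_command(command_id: str) -> Optional[str]:
    """Paper-stage commands are absent from the map and inherit None (full)."""
    return COMMAND_STAGE.get(command_id)
```

Keying on the command identity rather than the slash-command name is what lets `paper_plan_cmd` fall through to full verification while `plan_cmd` is gated.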
…12/013)
- state/claims.py: load_verified_by_subject(): the frozen (kind, subject_key) index over VERIFIED records (single value-independent source of truth, FR-013).
- claims/service.py: resolve_registered_claims adopts a VERIFIED subject-twin WITHOUT re-resolving. This closes the gap where a claim left PENDING in the registry (a prior round's transient failure / re-extraction) was re-resolved every round and waffled (FR-009/010/011). process_document step-2 reuse now calls the helper (SSoT, no inline duplicate).
- fill/service.py: drop resolved_value from the verdict cache key (FR-012) so the pre- and post-resolution lookups of one fact share an entry. Scoped note in the docstring: the stronger "different asserted values collide" needs hardening the shared canonical._asserted_value disambiguation primitive (broad blast radius); the practical waffling is already closed by the freeze + git-tracked store.
Tests: test_frozen_claim_cache (4: index, twin-adopt-no-resolve, no-reopen-on-transient, distinct-subject-non-collision) + test_value_independent_cache_key (3). No regression: claim suites 86 green, fill+grounding 58 green. ruff+mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
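The freeze behaviour in this commit (index VERIFIED records by subject, then adopt a twin instead of re-resolving) might look something like the sketch below. The `ClaimRecord` shape, `index_verified_by_subject`, and `adopt_or_resolve` are hypothetical stand-ins; the real load_verified_by_subject() presumably reads the state/claims store and may have a different signature.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class ClaimRecord:
    # Hypothetical record shape; field names are assumptions for illustration.
    kind: str
    subject_key: str
    value: str
    status: str  # "VERIFIED" or "PENDING" in this sketch


SubjectIndex = Dict[Tuple[str, str], ClaimRecord]


def index_verified_by_subject(records: Iterable[ClaimRecord]) -> SubjectIndex:
    """Build the value-independent (kind, subject_key) index over VERIFIED records."""
    index: SubjectIndex = {}
    for rec in records:
        if rec.status == "VERIFIED":
            # The first verified record wins and is never overwritten (frozen).
            index.setdefault((rec.kind, rec.subject_key), rec)
    return index


def adopt_or_resolve(
    claim: ClaimRecord,
    index: SubjectIndex,
    resolve: Callable[[ClaimRecord], ClaimRecord],
) -> ClaimRecord:
    """Adopt a VERIFIED subject-twin without calling the resolver at all."""
    twin = index.get((claim.kind, claim.subject_key))
    if twin is not None:
        return twin
    return resolve(claim)
```

Because the lookup key excludes the value, a pending claim that asserts a different number for the same subject still lands on the frozen record, and distinct subjects never collide.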
US3 — templates/prompts defer empirical specifics to implementation:
- .specify/templates/{spec,plan}-template.md: success criteria / technical
context reframed to "what is measured + reference", deferral note added.
- .claude/skills/speckit-{specify,clarify,plan,tasks}/SKILL.md: RQ+method+
references scope note; defer specific empirical values.
- agents/prompts/panels/panel_plan_data_resources.md: cite source for any
quantity; do not kick back a plan for omitting a specific empirical number.
- PROJ-552 per-project template copies synced byte-identical (FR-006).
Real-call tests (gated LLMXIVE_REAL_TESTS, free models, no mocks):
- test_planning_references_only (US1: lowlevel stripped + fabricated DOI blocks)
- test_paper_stage_freeze (US2: exact-count verify + freeze, no waffle — SC-005/003)
- test_proj552_planning_no_kickback (SC-006 headline regression)
- test_planning_doc_scope (US3: reference-anchored doc has no low-level claims)
FR-013 verified: state/claims/ git-tracked, state/grounding-cache/ gitignored.
checklist updated with pipeline status + the open FR-007/008/SC-007 design fork.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
US1+US2+US3 done & offline-verified (2198 passed, no regression vs 2165 baseline); FR-007/008/SC-007 durable placeholder surfaced as an open design fork. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ture
- All 4 spec-020 real-call tests use the free model qwen.qwen3.5-122b (the paid
qwen-2.5-72b-instruct was rejected by the Principle-IV free-only guard).
- test_fabricated_doi_blocks: use a real sha256 artifact_hash and a schema-valid
project id/path (citations store requires ^projects/PROJ-\d{3,}-[a-z0-9-]+/.+).
US1 real-call GREEN end-to-end: low-level "49" stripped in a planning stage (SC-001),
fabricated DOI still blocks (FR-004/SC-002), PROJ-552 stall gone (SC-006).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…determinism
The first real-call run exposed two FLAKY tests (not code bugs; the offline gate 2198/0 confirms all logic, incl. test_exact_count_no_regress):
- test_planning_doc_scope: the extractor (qwen) is non-deterministic on short docs, so an extraction-based "BAD doc trips the detector" check is unreliable. US3's real deliverable is the FR-006 template/skill/panel guidance, now pinned DETERMINISTICALLY (offline): each producing file carries the deferral guidance and the PROJ-552 per-project copies match the shared templates byte-for-byte.
- test_paper_stage_freeze: depended on flaky real OEIS resolution + extraction of 9,988. Restructured to the robust FR-011 freeze assertion: seed a VERIFIED record, push a WRONG value (27,635) through the REAL pipeline, assert the frozen 9,988 is never re-opened/overwritten. PASSES (real-call, 36s).
Proven-good initial verification (SC-005) is covered by the existing specs 016-019 real-call suite.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Final verification: offline gate 2206 passed / 0 failed (baseline 2165, +41, no regression); mypy clean on all 11 changed source files; real-call sign-off green (US1 3/3, US2 freeze 1/1, US3 8/8 deterministic, SC-005 existing suite 10/10). Remaining: T016/T021/T022 = the durable-placeholder design fork (FR-007/008/SC-007), surfaced for maintainer decision (see checklist + notes). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The canonical STORED form now carries a durable {{claim:id}} placeholder for each
VERIFIED claim instead of baking the value into prose — so a verified value is
never re-extractable round-to-round (SC-007), killing the waffling at its root.
The proven-good value-correction path is preserved (render()'s default mode is
unchanged → all existing render tests stay green).
- pointer.py: render(placeholder_verified=True) keeps the {{claim:id}}; new
render_view() substitutes verified values for the human/review/publish view
(FR-008); pointer_ids() + drop_orphan_pointers() for the round-trip.
- gate.py: strip_claim_artifacts(preserve_pointers=True) keeps durable
placeholders (only transient [UNRESOLVED-CLAIM:] markers are stripped).
- service.py: process_document full path renders placeholders, carries over
prior-round placeholders from the registry, drops orphans; the no-new-claims
early return does the same. New render_artifact_view() = the FR-008 read-time
projection from the frozen state/claims store.
- citation_repair.py: anchor on the {{claim:id}} placeholder (unambiguous) when
present, else the value — so the citation is still corrected in the stored form.
Tests: test_durable_placeholder (5) + the round-trip is a fixed point. Full offline
gate 2211 passed / 0 failed (baseline 2165, +46, no regression). ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
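The stored-form versus view-form split described in this commit can be illustrated with a small sketch. The function names echo the ones listed above, but the signatures, the id character set, and the regex are assumptions made for this example rather than the code in pointer.py.

```python
import re
from typing import Dict, Set

# Assumption: claim ids consist of letters, digits, underscore, dot or hyphen.
_POINTER_RE = re.compile(r"\{\{claim:([A-Za-z0-9_.-]+)\}\}")


def pointer_ids(text: str) -> Set[str]:
    """Return the ids of all {{claim:id}} placeholders found in the text."""
    return set(_POINTER_RE.findall(text))


def render_view(text: str, verified: Dict[str, str]) -> str:
    """Substitute verified values for reading; unknown ids are left untouched."""
    return _POINTER_RE.sub(lambda m: verified.get(m.group(1), m.group(0)), text)


def drop_orphan_pointers(text: str, known_ids: Set[str]) -> str:
    """Remove placeholders whose id is no longer present in the claim store."""
    return _POINTER_RE.sub(
        lambda m: m.group(0) if m.group(1) in known_ids else "", text
    )
```

In this sketch the stored text never contains the number itself, so a later extraction pass has nothing to re-extract; only `render_view` produces prose with values, and it is applied at read time.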
… (US1)
Discovered while wiring FR-008: the convergence reviser's self-consistency claim pass (_self_consistency._verify_claims -> process_document) ran with NO stage_label, so a PLANNING-stage REVISION still did full verification (resolve/ground/mark low-level claims), bypassing US1 inside the convergence loop.
Now run_with_self_consistency threads stage_label -> _clean_citations -> _verify_claims -> process_document, and the planning revisers pass it: spec_reviser="spec", plan_reviser="plan", tasks_reviser="tasks". flesh_out (idea stage, clarification Q1) and paper_*/implementer keep full verification (default None).
Reviser suites green (27); ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
T016/T021/T022 done. Checklist + tracker updated: FR-007/008/SC-007 implemented (durable placeholder + render_view + reviser stage threading); proven-good paths preserved; offline gate 2211/0. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…key cache key
Hardens the shared asserted-value primitive (the follow-up): one source of truth in pointer.py, deterministic disambiguation, value-independent cache key.
- pointer.select_asserted_token()/asserted_value(): the SINGLE primitive for "which number does this text assert". Disambiguation in priority order: thousands-separated magnitude -> copula-following answer (is/=/: ...) -> first. The copula rule is ordered AFTER the comma rule, so every comma-grouped case is byte-identical to before (zero regression) and copula only fixes comma-less multi-number prose like "... crossing number 13 is 27635" -> 27635 (not 13).
- canonical.py: _asserted_value delegates to pointer.asserted_value; the duplicate _NUMBER_IN_TEXT_RE regex is removed (imports pointer's), per Principle I.
- fill/service._cache_key_parts: keys on the value-EXCLUDED subject_key (now sound because asserted-value is robust), bare-number fallback to fact_fingerprint. So "49 at 13" and "9,988 at 13" (same subject, different asserted value) share one verdict entry (FR-012 strong form), the verdict carrying the correct value.
Template de-dup (Principle I): re-synced 20 stale per-project .specify/templates copies to the shared templates (0 drift) + tests/test_template_sync enforces the invariant in the gate so future drift fails instead of silently diverging.
Tests: strong cache-key tests (49==9988 same key; comma-less collides) + the real PROJ-552 plan.md real-call test (the actual "27,635 at crossing 13" is stripped in a planning stage). Unit suite 1781 passed / 0 failed; asserted-value blast radius 103 passed. ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
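The priority order stated above (thousands-separated magnitude, then copula-following answer, then first number) can be sketched as below. The regexes are assumptions written for this illustration; the real pointer.select_asserted_token() may tokenize differently.

```python
import re
from typing import Optional

_GROUPED = r"\d{1,3}(?:,\d{3})+"   # e.g. 27,635 or 1,701,936
_PLAIN = r"\d+(?:\.\d+)?"          # e.g. 13 or 0.7

_GROUPED_RE = re.compile(rf"(?<![\d,]){_GROUPED}(?!\d)")
_COPULA_RE = re.compile(rf"(?:\bis\b|=|:)\s*({_GROUPED}|{_PLAIN})")
_NUMBER_RE = re.compile(rf"{_GROUPED}|{_PLAIN}")


def select_asserted_token(text: str) -> Optional[str]:
    """Pick the number a sentence asserts, using a fixed priority order."""
    grouped = _GROUPED_RE.search(text)
    if grouped:                       # rule 1: comma-grouped magnitude wins
        return grouped.group(0)
    copula = _COPULA_RE.search(text)
    if copula:                        # rule 2: number following is / = / :
        return copula.group(1)
    first = _NUMBER_RE.search(text)   # rule 3: fall back to the first number
    return first.group(0) if first else None


def asserted_value(text: str) -> Optional[str]:
    """Normalized form of the asserted token with grouping commas removed."""
    token = select_asserted_token(text)
    return token.replace(",", "") if token is not None else None
```

Because the comma rule is checked first, any sentence that already contained a grouped number resolves exactly as it would without the copula rule; the copula rule only changes the outcome for comma-less prose.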
…ord recall gap
Real-project testing on the ACTUAL PROJ-552 plan.md confirmed the headline: a planning stage on the real plan emits NO [UNRESOLVED-CLAIM:] marker, never blocks, registers nothing. The PROJ-552 stall is genuinely fixed end-to-end.
It also surfaced a real limitation (now documented in the test): extract_claims is tuned for asserted research claims and has LOW recall on planning-doc scope/metadata. It returned 0 claims for this plan, so an UNDETECTED value like "~27,635 at crossing number 13 (downloaded but not fully validated)" is not stripped. The test now asserts the anti-stall guarantee (gates progress) + that any DETECTED value IS smoothed, rather than over-asserting SC-001's "zero remain" which detection recall cannot guarantee on arbitrary planning prose.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…w-up)
Real-project testing on the actual PROJ-552 plan found extract_claims returns 0 claims on planning scope/metadata (it is tuned for asserted research claims), so the planning strip/smooth had nothing to act on and low-level values survived.
Per the maintainer's choice (improve the extractor, not add a separate deterministic detector): thread stage_label into extract_claims and, for a planning stage, append a PLANNING-RECALL addendum instructing it to ALSO flag specific empirical values stated as scope/metadata/goals (exact counts, dataset sizes, measured quantities/durations, percentages) while explicitly NOT extracting structural numbers (phase/section/version, dates, identifiers, scope bounds like "<=13 crossings"). Threaded service._process_planning_document -> extract_claims.
Diagnostic on the real plan: recall 0 -> 6 verbatim empirical claims (incl. the wrong "27,635 at crossing 13", ">=95% completeness", "~2,000 prime knots", "within 15 minutes"); no structural over-extraction.
HONEST LIMITATION (verified): extract_claims is an LLM call and is NON-DETERMINISTIC: a later run found 0 again. So the addendum raises recall a lot but cannot GUARANTEE SC-001 per run; only a deterministic detector could. Tests reflect this: test_planning_recall_prompt pins the addendum wiring deterministically (present for planning, absent for paper/impl); the real PROJ-552 plan test asserts the RELIABLE guarantees (anti-stall + no over-stripping), value-removal best-effort.
Offline suites covering the change: 163 passed / 0 failed. ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
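The addendum wiring that test_planning_recall_prompt is said to pin (present for planning, absent otherwise) could be sketched like this. `build_extraction_prompt` is a hypothetical helper, and the addendum wording below only paraphrases the commit message; it is not the real prompt text.

```python
from typing import Optional

# Assumption: the stage labels that count as planning.
PLANNING_STAGE_LABELS = frozenset({"spec", "plan", "tasks"})

# Placeholder wording paraphrased from the commit message above.
PLANNING_RECALL_ADDENDUM = (
    "PLANNING-RECALL: also flag specific empirical values stated as scope, "
    "metadata, or goals (exact counts, dataset sizes, measured quantities or "
    "durations, percentages). Do NOT extract structural numbers (phase, "
    "section, or version numbers, dates, identifiers, scope bounds)."
)


def build_extraction_prompt(base_prompt: str, stage_label: Optional[str]) -> str:
    """Append the recall addendum only when the stage is a planning stage."""
    if stage_label in PLANNING_STAGE_LABELS:
        return f"{base_prompt}\n\n{PLANNING_RECALL_ADDENDUM}"
    return base_prompt
```

Pinning this wiring is deterministic even though the extractor's output is not: the test only checks which prompt is built, never what the model returns.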
…removal
The LLM extractor's recall is non-deterministic, so it alone cannot guarantee
SC-001 (a run found 0 claims on the real plan). claims/planning_scan adds a
DETERMINISTIC final pass to the planning branch that removes every high-confidence
empirical value regardless of LLM recall:
STRIPPED comma-grouped counts (27,635; 1,701,936; 2,000), percentages (≥95%,
10%), timed/measured quantities (15 minutes, 60s, 1s)
PRESERVED structural numbers — scope bounds (≤13 crossings), indices (crossing
number 13), ranges (1-10), versions (1.0.0), dates (2026-05-29),
identifiers (SHA-256, ds002800), bare decimal thresholds (0.7);
markdown headers + fenced code untouched.
Runs AFTER the LLM strip/smooth (which handles prose quality on what it detected)
as the guarantee net. Idempotent. process_document(stage=planning) now strips the
empirical value even when extract_claims returns [] (proven in test_planning_scan).
Real PROJ-552 plan end-to-end: the wrong "27,635 at crossing 13", "~2,000",
"≥95%", "15 minutes" are removed while "crossing number 13"/"Phase 1"/dates/SHA
survive — and it passes even on a run where the LLM extractor found 0 claims.
Offline gate 2295 passed / 0 failed; planning_scan suite 18/0. ruff + mypy clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
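A deterministic pass with the STRIPPED/PRESERVED split listed above might be approximated by the sketch below. The regex, the unit list, and the choice to delete the token outright are assumptions for illustration; claims/planning_scan may classify and rewrite values differently.

```python
import re
from typing import List

FENCE = "`" * 3  # markdown code-fence marker

_EMPIRICAL_RE = re.compile(
    r"(?:[~≥≤<>]=?\s?)?"                  # optional qualifier such as ~ or ≥
    r"(?<![\w.,-])"                       # not inside an id, version, date or range
    r"(?:"
    r"\d{1,3}(?:,\d{3})+(?![\w,])"        # comma-grouped count: 27,635
    r"|\d+(?:\.\d+)?\s?%"                 # percentage: 95%
    r"|\d+(?:\.\d+)?\s?(?:ms|s|seconds?|minutes?|hours?)\b"  # timed quantity
    r")"
)


def strip_empirical_values(text: str) -> str:
    """Delete high-confidence empirical values; skip headers and fenced code."""
    out: List[str] = []
    in_fence = False
    for line in text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence       # toggle on every fence line
            out.append(line)
            continue
        if in_fence or stripped.startswith("#"):
            out.append(line)              # headers and code are left untouched
            continue
        cleaned = _EMPIRICAL_RE.sub("", line)
        # Collapse interior double spaces left behind by a removed token.
        out.append(re.sub(r"(?<=\S)[ \t]{2,}", " ", cleaned))
    return "\n".join(out)
```

Bare integers, decimals, versions, dates, ranges, and identifiers never match because none of them carries a grouping comma, a percent sign, or a time unit, which is what keeps "crossing number 13" and "1.0.0" intact in this sketch.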
This was referenced Jun 5, 2026
jeremymanning added a commit that referenced this pull request on Jun 7, 2026
…lips failing CI)
The real-call gate's pre-flight (llmxive.checks.backends) probed list_models()
single-shot, so a momentary non-JSON blip from Dartmouth's models endpoint
("Expecting value: line 2 column 1") failed the WHOLE gate even when the backend
was healthy — the recurring cause of red real-call CI on PRs #275/#281/#282 (the
job never reached the tests).
Probe now retries 3x with backoff (~6s total); a transient error clears, a genuine
sustained outage still fails fast. Test added (transient-clears / first-try /
sustained-outage-still-fails).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
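The retry behaviour described in this commit can be sketched as a small wrapper around the probe. `probe_with_retry` and its parameters are hypothetical; the delays of 2s and 4s are an assumption chosen only so that three attempts add up to roughly the 6s mentioned above, and the real llmxive.checks.backends code may catch a different set of errors.

```python
import json
import time
from typing import Callable, List, Optional, Sequence


def probe_with_retry(
    list_models: Callable[[], List[str]],
    delays: Sequence[float] = (2.0, 4.0),   # assumed backoff schedule
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call list_models, retrying transient failures with increasing waits."""
    last_error: Optional[Exception] = None
    for attempt in range(len(delays) + 1):
        try:
            return list_models()
        except (json.JSONDecodeError, ConnectionError) as exc:
            last_error = exc
            if attempt < len(delays):
                sleep(delays[attempt])      # back off before the next attempt
    # Every attempt failed: treat it as a genuine, sustained outage.
    raise RuntimeError("backend probe failed after retries") from last_error
```

Injecting `sleep` keeps a test of the three cases (first-try success, transient-clears, sustained outage) fast, since no real waiting is needed.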
Summary
Re-architects the claim-verification stack (specs 016→019) to fix two
maintainer-observed failures, reusing existing machinery rather than
duplicating it:
overwritten next round → re-flagged, never stabilizing.
(e.g. the prime-knot count), which were fully fetched/grounded and, when
wrong, drove kickbacks that exhausted the convergence cap toward human
escalation (the PROJ-552 "49 vs 9,988" stall).
Accuracy stays paramount: references are verified everywhere, and paper-stage
verification is made stronger (deterministic + immutable). Only the
location of low-level-claim verification changes — out of planning.
Full speckit pipeline: plan → tasks → analyze (7 findings, all fixed) → implement → verify.
Part A — planning = references-only
claims/stage.py): is_planning_stage SSoT; threaded from the speckit commands via claim_stage_label() and the convergence reviser's self-consistency pass (so planning revisions are gated too; plan_cmd/paper_plan_cmd can't be told apart by slash_command_name).
claims/smooth.py): a detected low-level claim is replaced by a higher-level statement (LLM rewrite → deterministic re-detect guard → clause-removal fallback). References still fail-closed via the existing path.
low recall on planning scope/metadata, so a planning-recall prompt addendum
raises it (0→6 on the real plan) and a deterministic final pass
(claims/planning_scan.py) removes every high-confidence empirical value (comma-grouped counts, percentages, timed quantities) regardless of LLM recall,
while preserving structural numbers (scope bounds, indices, ranges, versions,
dates, hashes). This closes SC-001 deterministically.
data-resources panel defer empirical specifics; the 20 stale per-project
template copies are re-synced and a gate-enforced sync test prevents future drift.
Part B — deterministic frozen cache
subject_key; a VERIFIED (kind, subject_key) record is adopted without re-resolution and is never re-opened by a transient failure or a pending re-extraction.
durable {{claim:id}} for each verified claim (never the baked value, so it is never re-extracted); render_view/render_artifact_view substitute values for the human/published view. The proven-good value-correction path is preserved (render()'s default mode is unchanged; citation repair re-anchors on the placeholder). Convergence operates on the placeholder form (per the clarification).
value and keys on subject_key, so "49 at 13" and "9,988 at 13" share one entry.
implementations into one pointer.select_asserted_token/asserted_value (dup regex removed) and added copula-following disambiguation, ordered after the comma rule so every grouped case is byte-identical to before (zero regression) and only comma-less prose like "…13 is 27635"→27635 is fixed.
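The value-independent cache key from Part B can be pictured with a short sketch. `verdict_cache_key`, the hashing scheme, and the example subject string are assumptions for illustration; the source only states that the key is built from subject_key and excludes the resolved value.

```python
import hashlib


def verdict_cache_key(kind: str, subject_key: str) -> str:
    """Hash only the claim kind and subject; the asserted value is excluded."""
    # A unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = f"{kind}\x1f{subject_key}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Two claims about the same subject that assert different values.
claim_a = {"kind": "count", "subject_key": "prime knots at crossing 13", "value": "49"}
claim_b = {"kind": "count", "subject_key": "prime knots at crossing 13", "value": "9,988"}

key_a = verdict_cache_key(claim_a["kind"], claim_a["subject_key"])
key_b = verdict_cache_key(claim_b["kind"], claim_b["subject_key"])
```

Since the value never enters the hash, `key_a` and `key_b` are identical, so the pre-resolution and post-resolution lookups of one fact hit the same verdict entry.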
Verification
(anti-stall + 27,635/2,000/95% stripped, structural numbers preserved); fabricated
DOI still blocks; paper-stage freeze (FR-011) through the real pipeline; SC-005
no-regression on the proven-good suite (10/10); PROJ-552 stall regression (SC-006).
Honest note
The LLM extractor is non-deterministic (one run found 6 empirical claims on the
real plan, another found 0). Improving its prompt raises recall but cannot
guarantee removal; the deterministic planning_scan pass is what provides the
guarantee. Tests reflect this: a deterministic test pins the addendum wiring + the
strip behavior, and the real-plan test passes even on a run where the LLM found 0.
Review guidance
Touches the claim layer that runs on every artifact across 600+ projects. Highest-risk
areas to review: the durable-placeholder render boundary (claims/pointer.py,
claims/service.py) and the subject_key cache key + asserted-value disambiguation
(claims/pointer.py, claims/canonical.py, fill/service.py).
🤖 Generated with Claude Code