chore(format): squash 135 PARTIAL-discharge feat commits into one PR (one CI run instead of 135) #1637
Merged
…GORITHM_LEVEL — closes 9/9 sweep
Bundles 9 algorithm-level decision rules + a stand-alone scalar
softmax reference (max-subtraction trick, returns SoftmaxError for
empty/non-finite input):
- SM-001 normalization: |sum - 1| < sqrt(n)*1e-6
- SM-002 positivity: all outputs > 0 (catches underflow-to-zero)
- SM-003 order preservation: argmax(softmax(x)) == argmax(x)
- SM-004 SIMD vs scalar: |simd - scalar| < 8 ULPs
- SM-005 single-element softmax([x]) == [1.0]
- SM-006 constant input -> uniform output [1/n; n]
- SM-007 input validation: empty input -> Err; NaN/Inf -> Err
- SM-008 length preserved: len(out) == len(in)
- SM-009 frame condition: input buffer byte-identical after call
27 tests cover the canonical softmax invariants, the [1000, -1000]
underflow band catching SM-002, and the SM-007 input-validation
error boundaries.
Contract: contracts/softmax-kernel-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
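The max-subtraction reference described above can be sketched in Rust as follows. This is a minimal sketch: the `SoftmaxError` variants and the exact signature are illustrative, not the crate's actual API.

```rust
/// Error type for invalid softmax input (illustrative names).
#[derive(Debug, PartialEq)]
enum SoftmaxError {
    EmptyInput,
    NonFiniteInput,
}

/// Scalar softmax with the max-subtraction trick: subtracting the
/// maximum before exponentiating keeps every exp() argument <= 0,
/// so the sum cannot overflow.
fn softmax(x: &[f32]) -> Result<Vec<f32>, SoftmaxError> {
    if x.is_empty() {
        return Err(SoftmaxError::EmptyInput); // SM-007
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err(SoftmaxError::NonFiniteInput); // SM-007
    }
    let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Ok(exps.iter().map(|e| e / sum).collect()) // SM-008: same length
}
```

Note that an extreme input such as `[1000.0, -1000.0]` still underflows `exp(-2000)` to exactly 0.0 in f32, which is precisely the band the SM-002 positivity rule is designed to catch.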
…LGORITHM_LEVEL — closes 10/10 sweep
Bundles 10 algorithm-level decision rules + a stand-alone reference
online-softmax scan (OnlineState, online_update, online_scan,
online_softmax, standard_softmax) so the algorithm-level rule is
testable offline against drift in any SIMD/GPU implementation:
- OSM-001 |online - standard| < 1e-5 element-wise
- OSM-002 |Σ y_i - 1.0| < 1e-6
- OSM-003 every output strictly > 0 (catches underflow-to-zero)
- OSM-004 shift invariance: softmax(x + c) ≈ softmax(x)
- OSM-005 decoder kv_lens {1, 6, 64, 448, 1500} all valid
(sum tolerance scales by sqrt(n) for naive-sum FP error)
- OSM-006 single-element softmax([x]) == [1.0]
- OSM-007 running max invariant: m_i = max(x_1..x_i)
- OSM-008 loop counter terminates at n
- OSM-009 normalizer recompute: d_i = Σ exp(x_j - m_i)
- OSM-010 partial sum invariant (same as OSM-009 in reference scan)
38 tests cover canonical [1,...,1] uniform, the [100, -100, 50]
extreme range, the underflow-to-zero positivity regression, and the
Whisper-canonical {1, 6, 64, 448, 1500} kv_lens.
Contract: contracts/online-softmax-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
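The one-pass rescaling pinned by OSM-007 (running max) and OSM-009 (normalizer recompute) can be sketched as below. The `OnlineState`/`online_update`/`online_softmax` names mirror the commit message, but the bodies are an illustrative sketch, not the reference implementation itself.

```rust
/// Running state for a one-pass (online) softmax scan.
struct OnlineState {
    m: f32, // running max of x_1..x_i          (OSM-007)
    d: f32, // running normalizer sum exp(x_j - m) (OSM-009)
}

/// Fold one element into the state, rescaling the partial
/// normalizer whenever the running max increases.
fn online_update(s: OnlineState, x: f32) -> OnlineState {
    let m_new = s.m.max(x);
    OnlineState {
        m: m_new,
        // exp(s.m - m_new) rescales the old partial sum to the new max.
        d: s.d * (s.m - m_new).exp() + (x - m_new).exp(),
    }
}

/// One scan pass to get (m_n, d_n), then a second pass to emit
/// y_i = exp(x_i - m_n) / d_n.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let init = OnlineState { m: f32::NEG_INFINITY, d: 0.0 };
    let s = x.iter().fold(init, |acc, &v| online_update(acc, v));
    x.iter().map(|&v| (v - s.m).exp() / s.d).collect()
}
```

OSM-001 then reduces to checking this against a two-pass standard softmax element-wise.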
…RITHM_LEVEL — closes 11/11 sweep
Bundles 11 algorithm-level decision rules + a stand-alone scalar
AdamW reference implementation:
- AW-001 decoupled weight decay: AdamW != L2-coupled Adam when
  wd > 0 AND theta != 0 (vacuous Pass otherwise per contract scope)
- AW-002 second moment v_t >= 0
- AW-003 bias correction: 1/(1 - beta^t) > 1 for beta in (0, 1) and
  t >= 1 (within the f32 non-underflow regime)
- AW-004 update finiteness with finite gradient and eps > 0
- AW-005 SIMD vs scalar: |simd - scalar| < 8 ULPs
- AW-006 zero-gradient boundary: only weight decay modifies theta
  (theta_new == theta_old * (1 - lr * wd) within FP tolerance)
- AW-007 hyperparameter validation: invalid (β1=0, β2=1, ε=0) → Err
- AW-008 frame condition: gradient buffer byte-unchanged after step
- AW-009 loop invariant: v_t >= 0 across N steps
- AW-010 first moment formula: m = β1·old_m + (1-β1)·g (catches the
  (1-β1)·old_m + β1·g coefficient swap)
- AW-011 loop variant: training terminates at exactly max_steps
Reference impl AdamWState + AdamWHyperparams + adamw_step()
provides the canonical scalar comparison target for AW-005 SIMD
parity testing and AW-007 hyperparam validation.
36 tests cover the canonical Loshchilov & Hutter 2019 formulation,
the AW-005 ULP distance computation across same-sign and
sign-crossing pairs, and the (β1=0, β2=1, ε=0) error boundaries.
Contract: contracts/adamw-kernel-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
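The decoupled-decay formulation that AW-001, AW-006, and AW-010 pin can be sketched as one scalar step (a minimal sketch of the Loshchilov & Hutter 2019 update; field and parameter names are illustrative, not the crate's `AdamWState`/`adamw_step` API, and hyperparameter validation is omitted):

```rust
/// Per-parameter optimizer state for the sketch.
struct AdamWState {
    m: f32, // first moment
    v: f32, // second moment (AW-002: stays >= 0)
    t: u32, // step counter
}

/// One scalar AdamW step. Decoupled weight decay (AW-001) is applied
/// directly to theta, NOT folded into the gradient as in L2-coupled Adam.
fn adamw_step(
    theta: f32, g: f32, s: &mut AdamWState,
    lr: f32, beta1: f32, beta2: f32, eps: f32, wd: f32,
) -> f32 {
    s.t += 1;
    s.m = beta1 * s.m + (1.0 - beta1) * g;     // AW-010: coefficient order
    s.v = beta2 * s.v + (1.0 - beta2) * g * g; // AW-002: non-negative
    let m_hat = s.m / (1.0 - beta1.powi(s.t as i32)); // AW-003
    let v_hat = s.v / (1.0 - beta2.powi(s.t as i32));
    // With g == 0, m_hat/v_hat are 0, so only the wd*theta term moves
    // theta (AW-006: theta_new == theta * (1 - lr*wd)).
    theta - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta)
}
```

Swapping the AW-010 coefficients to `(1-β1)·old_m + β1·g` would make the zero-gradient boundary and bias-correction tests diverge immediately, which is exactly the mutation class the contract targets.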
…AL_ALGORITHM_LEVEL — closes 15/15 sweep
Bundles 15 algorithm-level decision rules over markdown / YAML /
SVG / media integrity:
- DOC-001 heading hierarchy: no H1->H3 skip
- DOC-002 heading hierarchy: at most one H1
- DOC-003 link safety: no javascript: schemes (case-insensitive)
- DOC-004 code fence language: bare ``` is a violation
- DOC-005 table column parity: every row matches header
- DOC-006 SVG safety: no <script> tags (case-insensitive)
- DOC-007 README drift: actual == generated byte-exact
- DOC-008 SVG must declare viewBox
- DOC-009 required sections present (e.g. License)
- DOC-010 YAML balanced delimiters
- DOC-011 YAML no duplicate top-level keys
- DOC-012 magic-bytes vs extension agreement (PNG/JPEG/GIF/WebP
  detection)
- DOC-013 image dimensions <= 8192 x 8192, both non-zero
- DOC-014 animation frame count in (0, 1000]
- DOC-015 animation must not be infinite-loop (loop_count > 0)
49 tests cover security regression classes (XSS via javascript:
URLs, SVG <script> injection, PNG/JPEG extension spoofing) and the
canonical structural drift cases (heading skips, duplicate H1s,
missing License, unclosed YAML braces).
Contract: contracts/document-integrity-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…12 PARTIAL_ALGORITHM_LEVEL — closes 12/12 sweep
Bundles 12 algorithm-level decision rules for the architecture
role-set requirements, plus a self-contained reference model
(BASE_ROLES, QK_NORM_ROLES, BIAS_ROLES, ArchConstraints,
required_roles, from_architecture) that mirrors
realizar/src/arch_requirements.rs:
- ARCH-001 YAML/Rust set equality
- ARCH-002 base roles always present (BASE_ROLES subset)
- ARCH-003 the 4 (qk_norm, bias) cells produce 4 distinct non-empty sets
- ARCH-004 cardinalities {9, 11, 12, 14}
- ARCH-005 Qwen3: has QK norm, NOT bias (GH-279 root cause)
- ARCH-006 Qwen2: has bias, NOT QK norm (catches Qwen2/3 swap)
- ARCH-007 LLaMA/Mistral are base-only (9 roles, no extensions)
- ARCH-008 missing required role → reject (catches GH-279
silent-zero-weight regression)
- ARCH-009 optional roles do NOT trigger rejection
- ARCH-010 unknown architecture → base fallback
- ARCH-011 alias equivalence (qwen2.5/qwen, phi/phi3, llama/llama3,
gemma/gemma2)
- ARCH-012 monotonicity: enabling features only adds roles
37 tests cover canonical Qwen2 (12 roles), Qwen3 (11 roles), LLaMA
(9 roles) configurations + alias families + the GH-279 silent-zero
regression class.
Contract: contracts/architecture-requirements-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TIAL_ALGORITHM_LEVEL — closes 15/15 sweep
Bundles 15 algorithm-level decision rules for the GPU decode
profiler / cbtop report invariants:
- GDP-001 brick coverage >= 85% wall time
- GDP-002 graph-enabled coverage < 50% wall time
- GDP-003 sync mode dichotomy: Immediate > 100us, Deferred < 100us
- GDP-004 LmHead count == decoded_tokens (exactly 1 per token)
- GDP-005 brick avg ordering: LmHead > Gate > RmsNorm strict
- GDP-006 token unit separation: profiler.total > decoded_tokens
- GDP-007 coverage upper bound: brick_total_ns <= wall_ns
- GDP-008 reproducibility: per-token CV < 0.15 across runs
- GDP-009 report fidelity: actual_us within 1% of profiler avg
  (catches the B1 per-element vs per-call regression)
- GDP-010 score diversity: not all scores hardcoded to 100
  (catches the B3 compute_brick_score-not-called regression)
- GDP-011 brick count == 11 for Qwen 1.5B
- GDP-012 falsification accounting consistency (catches B10-B12
  hardcoded FalsificationSummary)
- GDP-013 LmHead per-decoded-token > 100us (catches a denominator
  using profiler.total_tokens)
- GDP-014 pmat scores all 0.0 (catches B7-B9 magic constants
  173.9 / 98.1 / 95.2)
- GDP-015 total_points != 137 (catches the B10 magic constant)
53 tests cover boundary sweeps, the canonical Qwen 1.5B
configuration, the exact regression classes named in the contract
(B1, B3, B7-B12), and FP edge cases.
Contract: contracts/gpu-decode-profiling-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…PARTIAL_ALGORITHM_LEVEL — closes 7 unbound
Bundles 7 algorithm-level decision rules for the qwen2-e2e
general-purpose verification gates (the SHIP-* gates 001..024 are
already bound at PARTIAL or DISCHARGED in their respective
modules):
- QW2E-001 verdict_from_param_count:
  |observed - 7,615,616,512| / target <= 0.5%. Pinned to the
  spec-canonical Qwen2.5-Coder-7B count.
- QW2E-002 verdict_from_flops_per_token: strict
  observed_flops == 2*P (canonical roofline formula).
- QW2E-003 verdict_from_memory_ordering: strict
  q4k < q6k < f16 < f32 byte ordering.
- QW2E-004 verdict_from_throughput_roofline:
  observed_tps <= min(bandwidth_tps, compute_tps).
- QW2E-005 verdict_from_obligation_coverage: every obligation has
  evidence; vacuous truth (zero obligations) rejected.
- QW2E-006 verdict_from_block_shape_preservation:
  shape(block(x)) == shape(x).
- QW2E-007 verdict_from_e2e_shape_conservation: tokens [seq] ->
  hidden [seq, hidden] -> logits [seq, vocab] with seq agreement.
37 tests cover boundary sweeps (tolerance bands, FLOPs off-by-one),
strict-equality enforcement (memory ordering, FLOPs formula),
domain violations (NaN, zero rooflines, counter inversions), and
the Qwen canonical [16, 3584]/[16, 152064] shapes.
Contract: contracts/qwen2-e2e-verification-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…..008 PARTIAL_ALGORITHM_LEVEL — closes 6 unbound
Bundles 6 algorithm-level decision rules for the Xet large-file
upload path:
- LFS-003 verdict_from_chunk_sizes: Pass iff every chunk except the
  last is in [8KiB, 128KiB] AND the last is in (0, 128KiB].
- LFS-004 verdict_from_xorb_sizes: Pass iff every xorb <= 64 MiB.
- LFS-005 verdict_from_upload_ordering: Pass iff every ShardSent
  has >= expected_xorb_count XorbAck events strictly preceding it.
- LFS-006 verdict_from_idempotent_replay: was_inserted:false
  treated as success (key Xet idempotency invariant).
- LFS-007 classify_retry/verdict_from_retry_classification:
  429/500/503/504 -> Retry, 401 -> RefreshThenRetryOnce,
  400/403/404 -> Abort, unknown statuses -> Abort (conservative
  default).
- LFS-008 xet_url_hash_encode/verdict_from_url_hash_encoding:
  8-byte LE-reversed hex (NOT naive hex). Pinned to the
  mathematically correct 64-char encoding of [0,1,...,31] (the
  contract's example string is a 62-char typo missing byte 0x10).
LFS-001/002/009/010 are already bound at runtime in
crates/aprender-core/src/hf_hub/xet.rs.
34 tests cover boundary sweeps, the conservative-Abort default for
unknown HTTP status codes, the Xet idempotency invariant, and the
canonical 32-byte hash encoding.
Contract: contracts/apr-publish-hf-large-file-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
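The LFS-008 "8-byte LE-reversed hex" encoding can be sketched as follows, assuming (as the commit's description implies) that the reversal is applied per 8-byte chunk, i.e. each chunk is read as a little-endian u64 and printed as hex. The function name mirrors the commit; the body is a sketch, not the bound implementation.

```rust
/// Encode a 32-byte hash as Xet-style URL hex: split into four
/// 8-byte chunks, byte-reverse each chunk (little-endian read),
/// then hex-encode. Naive per-byte hex would start "000102..."
/// instead of "070605...".
fn xet_url_hash_encode(hash: &[u8; 32]) -> String {
    hash.chunks(8)
        .flat_map(|chunk| chunk.iter().rev())
        .map(|b| format!("{:02x}", b))
        .collect()
}
```

For the canonical input `[0, 1, ..., 31]` this yields a full 64-character string in which byte `0x10` appears (inside the third chunk), which is how the contract's 62-character example string is caught as a typo.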
…_ALGORITHM_LEVEL — closes 4/4 sweep
Bundles 4 algorithm-level decision rules for README-vs-repo-state
claim consistency:
- README-001 verdict_from_crate_count: Pass iff
  claimed == observed AND both non-zero.
- README-002 verdict_from_contract_count: Pass iff
  claimed == observed AND both non-zero.
- README-003 verdict_from_cli_command_count: Pass iff
  claimed == observed AND observed >= 30 (guards against the
  empty-help regression — apr is documented as a 58-subcommand
  CLI).
- README-004 verdict_from_cookbook_link_presence: Pass iff
  readme_text contains the literal substring "apr-cookbook".
21 tests cover off-by-one mutations, zero-input domain errors, the
empty-help regression, link-renamed-without-sync, and the 30-subcmd
provenance pin (AC_README_003_MIN_REASONABLE).
Pins the algorithm-level rule against drift; the runtime-level
falsifier `bash scripts/check_readme_claims.sh` already exists per
the contract's evidence line.
Contract: contracts/readme-claims-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LGORITHM_LEVEL — closes 3/3 sweep
Bundles 3 algorithm-level decision rules for chat-template-v1:
- CHAT-001 verdict_from_chatml_render: Pass iff non-empty AND
  balanced <|im_start|>/<|im_end|> counts.
- CHAT-002 verdict_from_llama3_render: Pass iff starts with
  <|begin_of_text|> AND contains <|eot_id|>.
- CHAT-003 verdict_from_missing_template_outcome: Pass iff
  outcome == ExplicitError; all silent-success branches Fail
  (PMAT-237 process-validation rule).
17 tests cover the 3 decision rules with mutation-survey shape:
empty/missing-token fail bands + cross-template-family rejection
(ChatML in the Llama-3 slot) + an exhaustive 4-variant outcome
sweep.
Provides the algorithm-level pin against contract drift; the
runtime-level falsifier `cargo test ... chat_template` already
exists per the contract's evidence line.
Contract: contracts/chat-template-v1.yaml.
Spec: SHIP-TWO-001 §12 (chat template ship gates).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
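The CHAT-001 reduction is small enough to sketch directly (a minimal sketch returning a plain bool; the bound verdict type and function signature in the module may differ):

```rust
/// CHAT-001: a ChatML render passes iff it is non-empty and the
/// <|im_start|>/<|im_end|> marker counts balance. An unbalanced
/// render indicates a truncated or malformed template expansion.
fn chatml_render_ok(rendered: &str) -> bool {
    !rendered.is_empty()
        && rendered.matches("<|im_start|>").count()
            == rendered.matches("<|im_end|>").count()
}
```

A Llama-3 render fed into this check would typically fail CHAT-002's own prefix test rather than this one, which is why the cross-template-family rejection tests exercise both rules.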
…TIAL_ALGORITHM_LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-001 (apr
publish --manifest dry-run accepts valid YAML) AND -007 (HF repo
contains apr+safetensors+gguf 3-format coverage) per
`contracts/apr-cli-publish-extra-v1.yaml`.
## What this binds
PUB-EXTRA-001: `verdict_from_manifest_dry_run_exit(exit_code)` —
Pass iff `exit_code == 0`. Single-input exit-code shape.
PUB-EXTRA-007: `verdict_from_three_format_repo(apr, safetensors,
gguf)` — Pass iff `(apr, safetensors, gguf) == (1, 1, 1)`. Mirrors
QA-009's 3-format coverage but counts files-per-format instead of
exit-codes-per-format.
Pinned constant: `AC_PUB_EXTRA_007_REQUIRED_FORMATS = 3`.
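The PUB-EXTRA-007 reduction above can be sketched as (a minimal sketch; the `Verdict` enum and exact signature are illustrative stand-ins for the module's actual types):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// PUB-EXTRA-007: the published HF repo must contain exactly one
/// file per format. Any other cell of the {0,1,2}^3 cube fails.
fn verdict_from_three_format_repo(apr: u64, safetensors: u64, gguf: u64) -> Verdict {
    if (apr, safetensors, gguf) == (1, 1, 1) {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```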
## apr-cli-publish-extra-v1 sweep status
After this PR: 4 of 10 bound (001 + 002 + 006 + 007).
Remaining: 003 (--extra-file), 004 (no-manifest backward compat),
005/008 (script Python exclusion), 009/010 (preflight validation).
## Why bundle 001 + 007
Both are publish-pipeline integrity verdicts. PUB-EXTRA-001 gates
the manifest-consumption entry point; PUB-EXTRA-007 gates the
multi-format ship completeness. Same family of "publish
correctness" gates. Same precedent as `pub_cli_002_004`,
`qa_002_006`, `cmd_safety` (4), `pub_extra_002_006`.
## Why match QA-009 shape but with file counts
QA-009 (apr-cli-qa-v1) checks exit codes from `apr inspect`
on each of 3 formats — does the parser support each format?
PUB-EXTRA-007 checks file counts per extension in the published
HF repo — does the ship deliver each format?
Two complementary 3-format gates at different pipeline stages:
- QA-009: parser ingest (does apr support format X?)
- PUB-EXTRA-007: ship publish (did we ship format X?)
Both must Pass for full 3-format guarantee.
## Five-Whys
1. Why bind these two now? — Both gate publish-pipeline
correctness; without verdict pins, regressions in either
manifest consumption or 3-format ship completeness corrupt
the published artifact set.
2. Why distinct shapes? — 001 is single exit-code; 007 is
3-counter equality.
3. Why mirror QA-009 shape? — Cross-pipeline 3-format guarantee
needs both sides (ingest + ship); reuse signals consistency.
4. Why exhaustive 27-cell composite matrix in 007? — Mutation-
survey breadth; proves only `(1,1,1)` passes across the full
{0,1,2}³ cube.
5. Why 14 tests across 2 verdict sections? — 5 for 001, 9 for
007 (incl. exhaustive cube).
## Cross-reference
- `qa_009`: 3-format parser ingest (exit codes).
- `pub_extra_001_007`: 3-format ship publish (file counts).
The "three-format coverage" pattern now applied at both pipeline
endpoints.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
14 unit tests, all green.
…TIAL_ALGORITHM_LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-002 + 006 —
two CRITICAL ship-integrity gates per
`contracts/apr-cli-publish-extra-v1.yaml`.
## What this binds
PUB-EXTRA-002 (sha256-mismatch must abort PRE-upload):
`verdict_from_sha256_mismatch_abort(exit_code, hf_api_call_count)`
returns Pass iff `exit_code != 0 AND hf_api_call_count == 0`.
PUB-EXTRA-006 (post-upload sha256 round-trip):
`verdict_from_post_upload_roundtrip(manifest_sha256, pulled_sha256)`
returns Pass iff both 32 bytes AND byte-identical.
Pinned constant: `AC_PUB_EXTRA_006_SHA256_BYTES = 32`.
## Why this matters for shipping
PUB-EXTRA-002 is described in the contract as **CRITICAL**: "the
whole point of manifest-first publish is the sha256 guard — if it
fails, we ship garbage". Without a verdict pin, a regression that
lets the publish path skip the pre-flight sha256 check ships
corrupted artifacts to HuggingFace.
PUB-EXTRA-006 catches the complementary failure mode: even if
pre-upload validation passed, network transit corruption produces
artifacts whose post-upload sha256 differs from the manifest. Both
gates must Pass for ship integrity.
## Why two gates instead of one
The two gates protect different stages of the upload pipeline:
- 002: pre-upload (catches local-vs-manifest mismatch)
- 006: post-upload (catches transit corruption)
A failure at either stage corrupts the published artifact.
Independent verdicts ensure each stage's regression class is caught
independently.
## Why bundle in one module
Same shape family (both are ship-integrity guards on the publish
path). Same precedent as `pub_cli_002_004`, `qa_002_006`,
`cmd_safety` (4 verdicts), `ops_005_006`. Authoring two modules
with identical structure would be muda.
## Five-Whys
1. Why bind these two now? — Both are CRITICAL ship-integrity
   gates for MODEL-1/2 HuggingFace upload. Without verdict pins,
   regressions ship corrupted artifacts silently.
2. Why distinct verdict shapes? — 002 is conjunctive
   `(exit, network-count)`; 006 is byte-equality on SHA-256.
   Different stages, different shapes.
3. Why pre + post both? — Defense in depth: pre-upload catches
   local sources of corruption (filesystem write race, stale
   manifest); post-upload catches transit sources (HF endpoint
   bug, network corruption).
4. Why pin 32-byte SHA-256? — Catches the MD5-substitution class
   (matching-but-wrong-length test).
5. Why 18 tests across 2 verdict sections? — 7 for 002 (pass:
   pre-upload abort, panic exit, clap exit; fail: silent zero
   exit, uploads-before-abort, zero-exit-with-uploads), 11 for 006
   (provenance, identical match, realistic SHA-256, single-byte
   drift at first/middle/last, length errors, wrong-length-equal,
   full-32-position byte-flip sweep).
## Cross-reference
The sha256-byte-eq verdict is now applied in 4 modules:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)
- `pub_extra_002_006` (post-upload round-trip)
This is a strong reuse signal — promote to
`format::common::sha256_byte_eq` shared util in a follow-up.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
18 unit tests, all green (7 for 002, 11 for 006).
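The digest-pair reduction shared by these modules can be sketched as a single helper (a minimal sketch; the `format::common::sha256_byte_eq` name is the proposed follow-up utility, not code that exists yet, and the `Verdict` enum is an illustrative stand-in):

```rust
const SHA256_BYTES: usize = 32;

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// Two digests pass iff BOTH are exactly 32 bytes AND byte-identical.
/// The length pin catches the MD5-substitution class even when two
/// wrong-length digests happen to match each other.
fn sha256_byte_eq(a: &[u8], b: &[u8]) -> Verdict {
    if a.len() == SHA256_BYTES && b.len() == SHA256_BYTES && a == b {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```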
…LEVEL — closes sweep
Algorithm-level PARTIAL discharge for FALSIFY-CMD-SAFETY-001 + 002
+ 003 + 004 per `contracts/apr-cli-command-safety-v1.yaml`. All
four are bundled in a single module because each verdict has a
small, simple algorithm-level reduction.
## ✅ Closes 4/4 CMD-SAFETY sweep
**Seven contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4) ← this PR
## CMD-SAFETY-001: read-only no new files
`verdict_from_file_count_delta(before, after)` Pass iff equal.
Catches both new-file creation (mutation) and a file-count decrease
(unauthorized cleanup).
## CMD-SAFETY-002: mutating cmd usage error
`verdict_from_mutating_cmd_usage_error(stderr)` Pass iff stderr
contains `output` OR `required` (case-insensitive). Catches
`apr convert model.apr` silently running without the `-o` flag.
## CMD-SAFETY-003: SIGINT exit code <= 130
`verdict_from_sigint_exit_code(code)` Pass iff `0 <= code <= 130`.
Pinned constant `AC_CMD_SAFETY_003_MAX_EXIT = 130` (= 128 + 2,
SIGINT). Catches SIGKILL (137), which means the TUI didn't release
resources cleanly.
## CMD-SAFETY-004: every command classified
`verdict_from_command_classification(total, classified)` Pass iff
`total > 0 AND classified == total`. Adds another enforcement layer
for the canonical 58-cmd registry.
## Why bundle 4 verdicts in one module
Same precedent as `pub_cli_002_004` (2 verdicts) and `qa_002_006`
(2 verdicts). All 4 cmd-safety verdicts have small, simple
algorithm-level reductions; authoring 4 separate modules with
identical structure would be muda. Contract IDs stay distinct.
## Five-Whys
1. Why bind all 4 now? — Closes the 4/4 cmd-safety sweep; all are
   small verdicts.
2. Why one module? — Same shape family, same precedent as
   `pub_cli_002_004` / `qa_002_006`.
3. Why case-insensitive substring for CMD-SAFETY-002? — Clap error
   messages can be `error:` or `ERROR:` depending on colorization;
   a case-insensitive match is robust.
4. Why pin 130 for SIGINT? — Standard POSIX `128 + 2` for a SIGINT
   exit; > 130 (e.g., 137 SIGKILL) means cleanup handlers didn't
   run.
5. Why 24 tests in 5 sections? — 4 verdicts × ~5 tests each plus
   shared-primitive tests. Each verdict gets pass/fail/edge
   coverage.
## Cross-reference
The bundle pattern (multiple verdicts in one module) is now applied
in 3 modules: `pub_cli_002_004`, `qa_002_006`, `cmd_safety`. Each
bundles 2-4 small verdicts with shared structure.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
24 unit tests, all green.
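The CMD-SAFETY-003 reduction is small enough to sketch in full (a minimal sketch returning a plain bool; the module's actual verdict type may differ):

```rust
/// 128 + 2: the conventional POSIX exit status for death by SIGINT.
const AC_CMD_SAFETY_003_MAX_EXIT: i32 = 130;

/// CMD-SAFETY-003: an interrupted command must exit with a status
/// in [0, 130]. 137 (= 128 + SIGKILL 9) indicates the process had
/// to be killed, i.e. cleanup handlers never ran.
fn sigint_exit_ok(code: i32) -> bool {
    (0..=AC_CMD_SAFETY_003_MAX_EXIT).contains(&code)
}
```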
…LEVEL — closes 7/7 sweep
Algorithm-level PARTIAL discharge for FALSIFY-OPS-002 (GPU memory
returns to baseline after inference error) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_gpu_memory_leak_delta(baseline, post_error,
tolerance)` returns Pass iff:
1. `baseline <= 1 TiB`
2. `post_error <= 1 TiB`
3. `tolerance <= 256 MiB`
4. `|post_error - baseline| <= tolerance`
Two pinned constants:
- `AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES = 1 TiB`
- `AC_OPS_002_MAX_TOLERANCE_BYTES = 256 MiB`
## ✅ Closes the 7/7 OPS sweep on `apr-cli-operations-v1`
**Six contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7) ← this PR
## Why bounded-delta (not byte-equality)
Byte-equality would be too strict — the CUDA driver retains some
metadata pages after freeing user allocations (tens of MiB is
typical). A bounded delta with a tolerance band matches operational
reality.
## Why cap tolerance at 256 MiB
The verdict is meaningless if the caller can pass an arbitrary
tolerance — `tolerance = u64::MAX` would silently accept any leak.
Capping tolerance enforces a useful lower bound on detection
sensitivity:
- 256 MiB covers CUDA driver metadata generously.
- Below that ceiling, real leaks (per-request KV cache, 7B model
  copy, etc.) all produce deltas > 256 MiB and fail.
- A contributor cannot "tune the gate to pass" by setting tolerance
  higher.
The `fail_tolerance_8_gib_would_mask_model_leak` test explicitly
catches the adversarial-tolerance class.
## Why the 1 TiB GPU memory cap
Catches counter-corruption / unsigned-overflow regressions in the
memory reading. No consumer or datacenter GPU exceeds 1 TiB (H100
is 80 GiB, B200 is 192 GiB).
## Five-Whys
1. Why bind OPS-002 now? — Closes the 7/7 ops sweep. GPU memory
   leaks on the error path are silent in dev (1 error = 256 MiB
   leak; tolerated single-shot) but catastrophic in production
   (1000 errors/day → 256 GiB/day → OOM crash).
2. Why a 3-tuple of u64? — Algorithm-level pin; the actual
   nvidia-smi / cuMemGetInfo capture harness is FULL_DISCHARGE.
3. Why bounded delta (not byte-equality)? — The CUDA driver
   retains metadata; byte-equality is too strict.
4. Why cap tolerance at 256 MiB? — Prevents adversarial-tolerance
   gate-bypass; ensures meaningful detection sensitivity.
5. Why 21 tests across 7 sections? — Provenance pin (×2), pass
   band (×6: exact, ±32 MiB, exact-tolerance, zero, max-256 MiB,
   real workloads), leak fail (×3 incl. catastrophic 4 GiB),
   memory domain violations (×3), tolerance-too-large (×3), delta
   sweep (×8 probes), realistic GPU scenarios (×3: RTX 4090, A100
   80GB, KV cache leak).
## Cross-reference
The bounded-delta verdict pattern joins the catalog alongside:
- Threshold (`cov_001`, `data_inv_003`, `distill_train_009`)
- Linear-bound (`ops_007` token count)
- Conjunctive exit (`pub_cli_002_004`, `qa_002_006`)
- Byte-equality family (6 modules)
- Sequence-monotonicity (`ops_004`)
- Substring-containment (`pull_dataset_005`, `pub_cli_001`)
Each domain with measurement + tolerance now has a reference
pattern.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
21 unit tests, all green.
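The OPS-002 bounded-delta reduction can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
const AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES: u64 = 1 << 40; // 1 TiB
const AC_OPS_002_MAX_TOLERANCE_BYTES: u64 = 256 << 20; // 256 MiB

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-002: post-error GPU memory must return to within `tolerance`
/// of baseline. Both readings are capped at 1 TiB (counter-corruption
/// guard) and tolerance is capped at 256 MiB (adversarial-tolerance
/// guard: u64::MAX would otherwise accept any leak).
fn verdict_from_gpu_memory_leak_delta(baseline: u64, post_error: u64, tolerance: u64) -> Verdict {
    if baseline > AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES
        || post_error > AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES
        || tolerance > AC_OPS_002_MAX_TOLERANCE_BYTES
    {
        return Verdict::Fail;
    }
    if baseline.abs_diff(post_error) <= tolerance {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```

`abs_diff` avoids the unsigned-subtraction underflow that `post_error - baseline` would hit whenever memory shrank below baseline.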
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-004 (progress
percentage monotonically non-decreasing) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_progress_monotonicity(progress_sequence)` returns
Pass iff:
1. `!progress_sequence.is_empty()`
2. Every value is finite (no NaN, ±∞)
3. Every value is in `[0.0, 100.0]`
4. Every adjacent pair is non-decreasing (`a <= b`)
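The four conditions above can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-004: a progress sequence passes iff it is non-empty, every
/// value is finite and in [0, 100], and adjacent pairs never
/// decrease (non-strict, so throttled repeats like 25 -> 25 pass).
fn verdict_from_progress_monotonicity(seq: &[f64]) -> Verdict {
    if seq.is_empty() {
        return Verdict::Fail; // refuse vacuous Pass
    }
    if seq.iter().any(|v| !v.is_finite() || *v < 0.0 || *v > 100.0) {
        return Verdict::Fail; // domain check before the pairwise scan
    }
    if seq.windows(2).any(|w| w[0] > w[1]) {
        return Verdict::Fail; // a decrease anywhere fails the whole stream
    }
    Verdict::Pass
}
```

Note the NaN check cannot be folded into the `windows(2)` scan alone: `NaN > x` is false for every x, so a NaN-containing sequence would slip through a pure pairwise comparison.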
## Why non-strict ≤ (not strict <)
Throttled progress emission commonly produces same-value adjacent
pairs (e.g., 25.0 → 25.0 → 50.0 due to debouncing). Strict `<`
would reject normal progress streams. Non-strict matches contract
"never decreases" wording.
## Why the (slice → bool) shape (novel)
First sequence-based verdict in the algorithm-binding cycle. Most
prior verdicts take 1-3 scalar inputs; this one takes a `&[f64]`.
The shape is appropriate because monotonicity is inherently a
sequence property — no equivalent reduction to scalar inputs
exists without losing precision.
## Five-Whys
1. Why bind OPS-004 now? — Progress regression confuses users
("download is now at 30% after being at 60%?"). Without a
verdict pin, retry-mid-stream and concurrent-writer-overlap
regressions ship invisibly.
2. Why a `&[f64]` input? — Algorithm-level pin; the actual
progress-event capture is FULL_DISCHARGE. Sequence shape
matches the inherent nature of monotonicity.
3. Why non-strict `≤`? — Contract literal "never decreases";
throttled emissions produce same-value adjacent pairs.
4. Why include domain checks? — NaN/±∞/out-of-range values
could pass a naive `windows(2)` comparison if monotonic
between themselves; defense in depth.
5. Why 22 tests across 7 sections? — Pass band (×5: increasing,
repeats, single, resume, dense), regression fail (×4 incl.
ULP-level), empty fail (×1), NaN/±∞ (×3), out-of-range (×3),
boundary cases (×3 at 0/100/full-range), realistic apr pull
scenarios (×3 incl. retry-resets-progress, concurrent
overlap).
## Cross-reference
The sequence-monotonicity verdict joins the verdict-shape
catalog as a new primitive. Other sequence-based gates that
might use this shape:
- Training loss curves (should be roughly monotonic-decreasing)
- Layer index ordering in trace output
- Token sequence position numbering
If reused, promote to `format::common::is_monotonic_nondecreasing`.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
progress-event capture is FULL_DISCHARGE work for the
operations-pipeline implementation PR.
## Tests
22 unit tests, all green.
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-007 (token count
bounded by 4× input byte length — BPE worst case) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_token_count_bound(input_byte_len, token_count)`
returns Pass iff:
1. `input_byte_len > 0` (refuse vacuous Pass)
2. `token_count <= input_byte_len.checked_mul(4)` (with overflow
   guard)
Pinned constant: `AC_OPS_007_MAX_RATIO = 4`.
## Why pin the 4× ratio
BPE byte-level fallback emits at most ~4 tokens per input byte in
the worst case (UTF-8 4-byte sequences split into 4 byte-fallback
tokens when no merges apply). 4× is the published BPE worst-case
bound. Drift to 8× would mask token-explosion bugs (unbounded
merge expansion); drift to 2× would over-tighten and reject
pathological-but-valid CJK / emoji inputs at the byte-fallback
limit.
## Why the DoS-class regression matters
Token explosion is a denial-of-service vector: a 1KB adversarial
input producing 5MB of tokens causes:
- Unbounded GPU memory growth (KV cache scales with token count)
- Inference timeout / OOM
- API endpoint backpressure
The `fail_unbounded_memory_attack` test pins this with a 10:100k
ratio (10-byte input, 100k tokens — clearly adversarial).
## Why checked_mul
`input_byte_len * 4` overflows u64 at `input_byte_len > u64::MAX/4`.
Without `checked_mul`, the comparison would silently wrap to a
meaningless bound. The `fail_input_times_4_overflow` test catches
this.
## Five-Whys
1. Why bind OPS-007 now? — Token-explosion DoS regressions
   silently pass CI (CI doesn't run adversarial inputs); they only
   show up under production user load. Without a verdict pin, the
   bound drifts.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual
   adversarial-input harness is FULL_DISCHARGE.
3. Why 4× specifically? — Contract literal; matches the BPE
   worst-case byte-fallback bound.
4. Why inclusive `<=`? — Contract test wording `<= 4 * len`.
5. Why 18 tests across 7 sections? — Provenance pin (×1), pass
   band (×5: typical, exact-cap, 1-byte-1-token, realistic prompt,
   1M scale), token-explosion fail (×4 incl. DoS class),
   empty-input fail (×2), overflow protection (×2), boundary sweep
   (×10 probes around the 400 cap), realistic UTF-8 scenarios (×3:
   emoji 4-byte, CJK 3-byte, adversarial).
## Cross-reference
The linear-bound verdict pattern joins the catalog alongside
threshold (`cov_001`, `data_inv_003`), count-zero-tolerance
(many), and conjunctive-exit (`pub_cli_002_004`). Each domain with
linear-relationship invariants now has a reference pattern.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
adversarial-input tokenizer harness is FULL_DISCHARGE work for the
operations-pipeline implementation PR.
## Tests
18 unit tests, all green.
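The OPS-007 reduction, including the overflow guard, can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
const AC_OPS_007_MAX_RATIO: u64 = 4; // published BPE worst-case bound

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-007: token_count must not exceed 4x the input byte length.
/// checked_mul makes the bound computation explicit: if 4*len would
/// overflow u64, the verdict conservatively fails rather than
/// comparing against a wrapped, meaningless bound.
fn verdict_from_token_count_bound(input_byte_len: u64, token_count: u64) -> Verdict {
    if input_byte_len == 0 {
        return Verdict::Fail; // refuse vacuous Pass on empty input
    }
    match input_byte_len.checked_mul(AC_OPS_007_MAX_RATIO) {
        Some(bound) if token_count <= bound => Verdict::Pass,
        _ => Verdict::Fail,
    }
}
```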
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-003 (greedy
decoding is deterministic) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_greedy_decoding_pair(run_a, run_b)` returns Pass
iff:
1. Both outputs are non-empty
2. `run_a == run_b` (byte-identical)
## Same shape as bpe_inv_006
`bpe_inv_006` (encode determinism) uses positive byte-equality on
token-id sequences. `ops_003` uses positive byte-equality on
generated text outputs. Both gate determinism at different layers:
- `bpe_inv_006`: tokenizer deterministic (encode → same token IDs)
- `ops_003`: sampler deterministic (greedy decode → same text)
The shared shape signals a verdict-shape pattern: "two outputs
must match or Fail" applies to encode determinism, decode
determinism, reproducibility, and side-effect freedom.
## Why this matters for SHIP-007 root-cause work
The realistic regression classes pinned in Section 7 cover exactly
the bisection axes from `2026-04-27 findings.md`:
- HashMap iteration order in argmax tiebreak
- Floating-point associativity in parallel logit reduction
- Per-host code path divergence
If a future SHIP-007 fix introduces or reveals nondeterminism in
the sampling layer, this verdict catches it independently of the
layer-0 stage diff.
## Five-Whys
1. Why bind OPS-003 now? — Greedy-decoding non-determinism is the
   canonical "passes CI in dev, fails on user box" defect class;
   without a verdict pin, MODEL-1 ships with silent stochasticity.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual
   double-invocation harness is FULL_DISCHARGE.
3. Why byte-identical? — Determinism is binary; one ULP/bit of
   drift signals upstream non-determinism.
4. Why refuse empty? — A vacuous Pass would mask an `apr run`
   silent regression.
5. Why 18 tests across 7 sections? — Pass band (×4: short, 2+2,
   long, special chars), single-byte drift fail (×3:
   first/last/middle), length mismatch (×2), empty input (×3),
   symmetry (×2: pass + fail), full-position drift sweep (43
   positions), realistic apr run scenarios (×3 incl. SHIP-007
   regression classes — HashMap tiebreak, FP associativity).
## Cross-reference
- `bpe_inv_006`: encode determinism (tokenizer layer).
- `ops_003`: decode determinism (sampler layer). [this PR]
- `data_inv_005`/`ship_010`/`ops_001`: SHA-256 byte-equality
  (third application of the digest-pair pattern).
The byte-equality verdict family is now applied across 5+ modules
at different scopes (token IDs, generated text, SHA-256 digests,
filesystem snapshots, NFC byte sequences).
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
double-invocation `apr run --temperature 0` harness is
FULL_DISCHARGE work for the operations-pipeline implementation PR.
## Tests
18 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-001 (read-only commands have no filesystem side effects) per `contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_filesystem_snapshot_pair(before_snapshot, after_snapshot)` returns Pass iff:
1. Both digests are 32 bytes (SHA-256)
2. `before == after` (byte-identical)

Pinned constant: `AC_OPS_001_SHA256_BYTES = 32`.

## Why mirror data_inv_005's shape

Identical algorithm-level reduction:
- `data_inv_005` (corpus_sha256 reproducibility): two SHA-256 digests from independent ingest hosts must match.
- `ops_001` (read-only side-effect freedom): two SHA-256 digests of the filesystem before/after a "read-only" command must match.

Both verdicts share:
- 32-byte SHA-256 length pin
- Byte-identical equality
- Wrong-length-but-equal Fail (catches MD5 substitution)

The reuse signals a verdict-shape pattern: "two digests must match or Fail" applies to reproducibility, side-effect freedom, and any content-equality gate.

## Why pin SHA-256 length explicitly

Three regression classes covered:
1. Truncation (some hash libraries return only a prefix).
2. Padding (a truncated digest padded to 64 bytes for SHA-512).
3. Hash-function drift (SHA-3 or BLAKE3 substituted).

The `fail_wrong_length_but_equal` test (16-byte arrays matching each other) explicitly catches MD5 substitution, even when the bytes happen to be equal.

## Five-Whys

1. Why bind OPS-001 now? — Read-only commands writing files is a subtle defect class (lock files, metadata caches, mtime updates) that ships invisibly. CI passes; only filesystem audits catch it.
2. Why a `(&[u8], &[u8])` pair? — Algorithm-level pin; the actual `find -print0 | xargs sha256sum | sha256sum` snapshot harness is FULL_DISCHARGE.
3. Why byte-identical (no tolerance)? — Read-only is binary; any filesystem mutation invalidates the contract.
4. Why pin SHA-256 length? — Catches the hash-function drift class.
5. Why 15 tests across 7 sections? — Provenance pin (×1), pass band (×2), single-byte mutation (×3 incl. one-bit), wrong-length caller error (×4 incl. matching-but-wrong-length), 32-position byte-flip sweep, symmetry property, realistic apr scenarios (×3: clean run, lock-file write, metadata cache).

## Cross-reference

The 32-byte SHA-256 byte-equality verdict is now reused across:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)

If we add a fourth, this primitive should be promoted to a shared `format::common::sha256_byte_eq` utility in a follow-up.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual filesystem-snapshot harness is FULL_DISCHARGE work for the operations-pipeline implementation PR.

## Tests

15 unit tests, all green.
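A minimal sketch of the digest-pair rule described above, assuming a `bool` return for brevity (the shipped verdict type may differ):

```rust
// Pinned digest length: SHA-256 is exactly 32 bytes.
const AC_OPS_001_SHA256_BYTES: usize = 32;

// Sketch of the OPS-001 rule: two snapshot digests must be 32 bytes and identical.
fn verdict_from_filesystem_snapshot_pair(before: &[u8], after: &[u8]) -> bool {
    // Wrong-length digests (e.g. 16-byte MD5) fail even when equal to each other.
    before.len() == AC_OPS_001_SHA256_BYTES
        && after.len() == AC_OPS_001_SHA256_BYTES
        && before == after
}

fn main() {
    let clean = [0xABu8; 32];
    assert!(verdict_from_filesystem_snapshot_pair(&clean, &clean));

    // One-bit filesystem mutation flips the verdict.
    let mut mutated = clean;
    mutated[0] ^= 1;
    assert!(!verdict_from_filesystem_snapshot_pair(&clean, &mutated));

    // MD5 substitution: 16-byte digests that happen to match still Fail.
    let md5 = [0u8; 16];
    assert!(!verdict_from_filesystem_snapshot_pair(&md5, &md5));
    println!("ok");
}
```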
…VEL — closes 1/1 sweep

Algorithm-level PARTIAL discharge for FALSIFY-COV-001 (apr-cli line coverage >= 95%) per `contracts/apr-cli-coverage-v1.yaml`.

## What this binds

`verdict_from_line_coverage_percent(line_coverage_percent)` returns Pass iff:
1. `line_coverage_percent.is_finite()` (rules out NaN, ±∞)
2. `0.0 <= line_coverage_percent <= 100.0` (domain)
3. `line_coverage_percent >= 95.0` (contract floor, inclusive)

Pinned constant: `AC_COV_001_MIN_PERCENT = 95.0`.

## ✅ Closes the 1/1 COV sweep on `apr-cli-coverage-v1`

**Five contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1) ← this PR

## Why inclusive `>=` 95.0

Contract test wording: `>= 95.0`. The inclusive floor matches character-for-character. Boundary tests `pass_at_exact_floor_95_0` + `fail_just_below_floor_via_ulp` (using `f64::from_bits` for one ULP below) bracket the strict semantics.

## Why pin 95.0 (not 85.0)

Per CLAUDE.md "Quality Standards": "95% minimum test coverage (upgraded from 85%)". The verdict pins the upgraded threshold; the `fail_at_85_old_threshold` test explicitly catches a regression back to the pre-upgrade value.

## Why is_finite() + range checks

`cargo llvm-cov` could emit NaN (rare division by zero in coverage tooling), Inf (huge counters wrapping), or out-of-range values (a parsing bug treating a count as a percentage). Defense in depth catches all three.

## Five-Whys

1. Why bind COV-001 now? — Coverage drift is the kind of regression that ships silently with green CI (the threshold moves, not the code). Without a verdict pin, a contributor could quietly lower the floor in `make tier3` and ship.
2. Why an f64 input? — Algorithm-level pin; the `cargo llvm-cov | awk` extraction is FULL_DISCHARGE.
3. Why inclusive `>=`? — Contract literal. The ULP-below test brackets the strict semantics.
4. Why pin 95.0 vs 85.0? — CLAUDE.md upgraded the threshold; the verdict enforces the upgraded value.
5. Why 19 tests across 7 sections? — Provenance pin (×1), pass band (×4: floor, canonical 96.35, 99.99, 100), below-floor fail (×4 incl. 85% old threshold), NaN/±∞ (×3), out-of-domain (×3: -1, 100.001, 1e10), boundary sweep (11 probes + 1 ULP), realistic apr-cli scenarios (×2).

## Cross-reference

The threshold-based verdict pattern (`>= floor` with domain checks) is conceptually similar to:
- `data_inv_003` (Jaccard < 0.85, strict)
- `tok_par_002` (efficiency >= 80%, inclusive via cross-multiply)
- `distill_train_009` (val_loss strict <)

Each binds a different threshold semantic; this one is inclusive `>=`.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual `cargo llvm-cov report --summary-only | awk` extraction is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

19 unit tests, all green.
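The three-clause rule above fits on a few lines; a sketch with an assumed `bool` return, including the ULP-below boundary probe the commit message describes:

```rust
// Pinned contract floor (upgraded from 85.0 per CLAUDE.md "Quality Standards").
const AC_COV_001_MIN_PERCENT: f64 = 95.0;

// Sketch of the COV-001 rule: finite, in [0, 100], and at or above the floor.
fn verdict_from_line_coverage_percent(p: f64) -> bool {
    p.is_finite()                       // rules out NaN, +inf, -inf
        && (0.0..=100.0).contains(&p)   // domain check
        && p >= AC_COV_001_MIN_PERCENT  // inclusive contract floor
}

fn main() {
    assert!(verdict_from_line_coverage_percent(95.0));   // inclusive at the exact floor
    assert!(verdict_from_line_coverage_percent(100.0));
    assert!(!verdict_from_line_coverage_percent(85.0));  // pre-upgrade threshold must Fail
    assert!(!verdict_from_line_coverage_percent(f64::NAN));
    assert!(!verdict_from_line_coverage_percent(f64::INFINITY));
    assert!(!verdict_from_line_coverage_percent(-1.0));

    // One ULP below the floor must Fail under the inclusive `>=` semantics.
    let ulp_below = f64::from_bits(95.0f64.to_bits() - 1);
    assert!(!verdict_from_line_coverage_percent(ulp_below));
    println!("ok");
}
```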
Algorithm-level PARTIAL discharge for FALSIFY-QA-009 (3-format
coverage: GGUF + APR + SafeTensors) per
`contracts/apr-cli-qa-v1.yaml`.
## What this binds
`verdict_from_three_format_coverage(gguf_exit, apr_exit, safetensors_exit)`
returns Pass iff all three exit codes equal 0.
Pinned constant: `AC_QA_009_REQUIRED_FORMAT_COUNT = 3`.
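A minimal sketch of the conjunctive rule, assuming a `bool` return (the shipped verdict type may differ), together with the 8-cell `{0, 1}³` matrix argument:

```rust
// Pinned format count: GGUF + APR + SafeTensors.
const AC_QA_009_REQUIRED_FORMAT_COUNT: usize = 3;

// Sketch of the QA-009 rule: all three format checks must exit 0.
fn verdict_from_three_format_coverage(gguf_exit: i32, apr_exit: i32, safetensors_exit: i32) -> bool {
    gguf_exit == 0 && apr_exit == 0 && safetensors_exit == 0
}

fn main() {
    // Only the origin (0, 0, 0) passes.
    assert!(verdict_from_three_format_coverage(0, 0, 0));

    // Every other cell of {0, 1}^3 fails.
    for g in [0, 1] {
        for a in [0, 1] {
            for s in [0, 1] {
                if (g, a, s) != (0, 0, 0) {
                    assert!(!verdict_from_three_format_coverage(g, a, s));
                }
            }
        }
    }
    assert_eq!(AC_QA_009_REQUIRED_FORMAT_COUNT, 3);
    println!("ok");
}
```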
## Why conjunctive composition (all three or fail)
Per CLAUDE.md "Debugging: Use apr Tools First":
"All tools support GGUF, APR, and SafeTensors formats. If a tool
says 'format not supported', that's a BUG."
Each format is independently shipped as a contract, so
silently dropping one is a regression class. Conjunctive `&&`
catches:
- Single-format regression (most common: SafeTensors writer
panics on a new tensor variant).
- Multi-format collapse (rare: a refactor breaks all three).
The 8-cell composite matrix in Section 6 proves only `(0, 0, 0)`
passes — every other cell of `{0, 1}³` fails.
## Why pin format count at 3
Catches a regression where:
- A 4th format is silently added (e.g., ONNX) without bumping
the contract — would not be exercised by this verdict.
- One of the three is silently dropped from the test harness.
The `provenance_required_format_count_is_three` test pins this.
## Five-Whys
1. Why bind QA-009 now? — Format coverage is a CLAUDE.md-level
invariant; without a verdict pin, a SafeTensors regression
(e.g., new tensor variant the writer doesn't handle) ships
silently.
2. Why a 3-tuple of i32? — Algorithm-level pin; the
per-format `apr inspect` invocation harness is FULL_DISCHARGE.
3. Why conjunctive? — Each format is independent; AND-fold
catches any single regression.
4. Why 8-cell composite matrix? — Proves only origin (0, 0, 0)
passes; mutation testing would catch any subset-Pass mutation.
5. Why 15 tests across 7 sections? — Provenance pin (×1), pass
band (×1: only origin), single-format fail (×3 — one per
format), multi-format fail (×4 incl. all-three-fail),
non-1 exit fail (×3: panic, negative, i32::MAX), 8-cell
composite matrix, symmetry under permutation + exhaustive
4×4×4 sweep at canonical exit codes.
## Cross-reference
The "all-must-succeed conjunctive" pattern is conceptually
similar to `bpe_inv_002` (4 special tokens distinct + in range),
but scaled to 3-tuple exit codes instead of 4-tuple ID checks.
Both rely on conjunctive composition of independent gates.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
3-format `apr inspect` invocation harness is FULL_DISCHARGE
work for the qa-pipeline implementation PR.
## Tests
15 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-003 (JSON output
is valid — `jq .` exits 0) per `contracts/apr-cli-qa-v1.yaml`.
## What this binds
`verdict_from_json_validity(output_bytes, jq_exit_code)` returns
Pass iff:
1. `output_bytes` is non-empty (refuse vacuous Pass)
2. `jq_exit_code == 0` (jq accepted the input)
3. First non-whitespace byte is `{` or `[`
4. Last non-whitespace byte is `}` or `]`
5. Brace `{`/`}` counts balance AND bracket `[`/`]` counts balance
Five conjunctive checks composing JSON-shape sanity with jq's
parser verdict.
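The five checks compose as below. This is a sketch under the stated PARTIAL_ALGORITHM_LEVEL assumptions (byte-level brace counting, no string awareness); the `bool` return is an assumption for brevity.

```rust
// Sketch of the QA-003 rule: jq's parser verdict AND byte-level JSON-shape sanity.
fn verdict_from_json_validity(output: &[u8], jq_exit_code: i32) -> bool {
    // First/last non-whitespace byte; None means empty or whitespace-only.
    let first = output.iter().find(|b| !b.is_ascii_whitespace());
    let last = output.iter().rev().find(|b| !b.is_ascii_whitespace());
    let (first, last) = match (first, last) {
        (Some(&f), Some(&l)) => (f, l),
        _ => return false, // refuse vacuous Pass on empty output
    };
    // Byte-level balance counts (no string awareness at this level).
    let braces: i64 = output.iter()
        .map(|&b| match b { b'{' => 1, b'}' => -1, _ => 0 })
        .sum();
    let brackets: i64 = output.iter()
        .map(|&b| match b { b'[' => 1, b']' => -1, _ => 0 })
        .sum();

    jq_exit_code == 0
        && (first == b'{' || first == b'[')
        && (last == b'}' || last == b']')
        && braces == 0
        && brackets == 0
}

fn main() {
    assert!(verdict_from_json_validity(b"{\"k\": [1, 2]}", 0));
    assert!(!verdict_from_json_validity(b"{\"k\": [1, 2]}", 5));   // jq rejected
    assert!(!verdict_from_json_validity(b"", 0));                  // empty output
    assert!(!verdict_from_json_validity(b"plain text", 0));        // non-JSON shape
    assert!(!verdict_from_json_validity(b"{\"truncated\": [1", 0)); // imbalance
    println!("ok");
}
```

Both gates must agree: a stubbed jq (always exit 0) is caught by the structural checks, and a structurally plausible non-JSON blob is caught by jq.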
## Why redundant structural checks alongside jq?
Two failure modes the verdict must catch:
1. **jq missing or stubbed**: a CI environment where `jq` is
absent and the harness silently substitutes `true` (exit 0).
Without the structural checks, `(any-text, 0)` would Pass.
2. **jq spurious 0**: a jq regression returning 0 on whitespace
or non-JSON input. The structural checks (start with `{`/`[`,
end with `}`/`]`, balanced) catch all three classes:
- bare strings
- leading text + JSON
- trailing text after JSON
Both jq AND structural must agree to Pass.
## Documented limitation: braces inside strings
The brace-balance check is byte-level and counts `{`/`}` even
inside string literals like `"{not real}"`. Valid JSON such as
`{"k":"} extra"}` counts one `{` against two `}` (net -1) and
would spuriously fail the balance check; conversely, a malformed
input whose stray braces happen to cancel out can slip past the
counter (jq remains the backstop for that class).
This is intentional for PARTIAL_ALGORITHM_LEVEL: real
string-aware balance checking requires a JSON tokenizer,
which is FULL_DISCHARGE territory. The verdict's role is to
catch the common regression class (truncated output, missing
closing brace, double-opening) which the byte-counting check
handles correctly.
## Five-Whys
1. Why bind QA-003 now? — JSON output validity is the contract
for every `--json` flag (cf. QA-007). A regression that emits
malformed JSON breaks every downstream consumer.
2. Why a (&[u8], i32) pair? — Algorithm-level pin; subprocess
invocation + `jq .` pipe is FULL_DISCHARGE.
3. Why structural checks alongside jq? — Two failure classes
(missing jq, jq spurious-0) require defense in depth.
4. Why byte-level brace count (not real JSON parser)? —
PARTIAL_ALGORITHM_LEVEL pins the decision rule; full JSON
parser is FULL_DISCHARGE.
5. Why 22 tests across 7 sections? — Pass band (×5: object,
array, realistic, nested, with whitespace), jq-rejected fail
(×2), empty output fail (×2 incl. whitespace-only), non-JSON
shape fail (×3), bracket imbalance fail (×4 incl. realistic
truncation), edge cases (×3: minimal `{}`/`[]`, negative jq
exit), realistic apr scenarios (×3: qa results, trace stats,
text-instead-of-JSON regression).
## Cross-reference
Companion to QA-007 (`--json` flag changes output): QA-007
ensures `--json` does *something*; QA-003 ensures what it does
is well-formed JSON. Both gates must Pass independently.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
`apr inspect --json | jq .` pipeline is FULL_DISCHARGE work for
the qa-pipeline implementation PR.
## Tests
22 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-008 (no phantom subcommands — advertised commands have real implementations) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_phantom_scan(commands_advertised, phantom_count)` returns Pass iff:
1. `commands_advertised > 0` (refuse vacuous Pass on an empty CLI)
2. `phantom_count == 0` (zero-tolerance for phantom commands)
3. `phantom_count <= commands_advertised` (counter sanity)

## Why this is distinct from QA-001

QA-001 pins the *registered count* at 58 (catches add/drop drift in the registry vs the CLAUDE.md claim). QA-008 pins the *registry-vs-implementation gap* (catches commands that are advertised in `apr --help` but have no working `--help`-responding dispatch). A new subcommand added to the registry with its dispatch arm forgotten passes QA-001 (the count stays 58 if it replaced something) but fails QA-008 (advertised=58 but phantom=1).

## Why zero-tolerance

A phantom subcommand ships a "command not found" experience to users — the worst kind of CLI regression because:
- It's invisible in unit tests (the registry shows the command)
- It only surfaces in shell scripts / CI / user smoke runs
- It fails late and fails confusingly

## Five-Whys

1. Why bind QA-008 now? — Phantom subcommands are the most user-facing "command not found" regression class. Without a verdict pin, CI passes (the registry has 58 commands) but users hit "unimplemented" at runtime.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual per-subcommand `apr <c> --help` invocation harness is FULL_DISCHARGE.
3. Why distinct from QA-001? — Different regression classes: QA-001 catches count drift, QA-008 catches dispatch gaps. Both gates must Pass independently.
4. Why zero-tolerance? — One phantom is enough to break a user workflow; no tolerance band makes sense.
5. Why 16 tests across 7 sections? — Pass band (×3: 58, 1, 1M), zero-tolerance fail (×4 incl. handful, half, all), empty advertised (×2), partition violations (×2), boundary sweep (×7 incl. partition+1), zero-tolerance property (×5 sizes), realistic apr scenarios (×3 incl. minimal-features build).

## Cross-reference

Same shape and discipline as INV-DATA-001/002/007, INV-BPE-007, and QA-004 — the "zero-tolerance match count + total + partition" pattern is now applied to 6+ verdicts. Each gates a different domain (license, PII, UTF-8, UNK, NaN, phantom) with an identical algorithm-level reduction. Combined with QA-001 (registered count pin) and the existing 3-surface drift trio (test harness, contracts/apr-cli-commands-v1.yaml, CLAUDE.md), QA-008 forms the 5th enforcement layer for CLI command integrity.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual per-subcommand `apr <c> --help` exit-code-collecting harness is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

16 unit tests, all green.
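A sketch of the three-clause counter rule, assuming a `bool` return for brevity:

```rust
// Sketch of the QA-008 rule: non-empty CLI, sane counters, zero phantoms.
fn verdict_from_phantom_scan(commands_advertised: u64, phantom_count: u64) -> bool {
    commands_advertised > 0                      // refuse vacuous Pass on an empty CLI
        && phantom_count <= commands_advertised  // counter sanity (partition check)
        && phantom_count == 0                    // zero-tolerance for phantoms
}

fn main() {
    assert!(verdict_from_phantom_scan(58, 0));   // registry clean
    assert!(!verdict_from_phantom_scan(58, 1));  // one forgotten dispatch arm
    assert!(!verdict_from_phantom_scan(0, 0));   // empty CLI is a vacuous Pass, refused
    assert!(!verdict_from_phantom_scan(10, 11)); // counter corruption: partition violation
    println!("ok");
}
```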
Algorithm-level PARTIAL discharge for FALSIFY-QA-007 (`--json` flag changes output; not a no-op) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_json_flag_diff(default_output, json_output)` returns Pass iff:
1. Both byte slices are non-empty (refuse vacuous Pass)
2. `default_output != json_output` (inverse-equality)

## Why inverse-equality (novel verdict shape)

Most verdicts in the algorithm-binding cycle are positive-equality (byte-identical match) or zero-tolerance-count (no failures). This is the first inverse-equality verdict: the contract requires the two outputs to DIFFER. A regression where `--json` is parsed as a clap flag but doesn't route to a different formatter would produce identical bytes — a Pass by equality would be a Fail by inverse-equality.

## Why refuse empty inputs

Either output being empty means the apr subprocess silently failed. Refusing each individually catches the silent-failure class.

## Five-Whys

1. Why bind QA-007 now? — A `--json` no-op is a class of regression that ships with green CI: clap parses the flag, but the formatter dispatch silently uses the text writer. Without a verdict pin, downstream JSON parsers break in production.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual subprocess invocations producing the two outputs are FULL_DISCHARGE.
3. Why inverse-equality (not metadata sniffing)? — Contract test wording: `diff <(default) <(--json) | grep -q .` matches character-for-character. The JSON-validity check is FALSIFY-QA-003's territory; -007 is purely about non-equality.
4. Why include symmetry tests? — The verdict must be symmetric in (default, json) to avoid order-dependent bugs.
5. Why 18 tests across 7 sections? — Pass band (×4: realistic inspect, one-byte diff, length diff, qa outputs), no-op fail (×2 incl. long identical), empty-input fail (×3), single-byte edge (×2: appended + prepended), symmetry property (×3 incl. pass + fail-identical + fail-one-empty), inverse-equality property (×4 lengths), realistic 3-subcommand contrast (×3: inspect, diff, validate-unimplemented regression).

## Cross-reference

Inverse-equality is conceptually related to `bpe_inv_006`'s encode determinism (which uses positive equality). Together they cover both directions: "two runs must be identical" (006) and "two runs must differ" (007). Different contract ID, different direction, same primitive (byte-equality).

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual double-subprocess invocation (text + `--json`) is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

18 unit tests, all green.
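The inverse-equality shape in a few lines, assuming a `bool` return (a sketch, not the shipped module):

```rust
// Sketch of the QA-007 rule: both outputs present, and they must DIFFER.
fn verdict_from_json_flag_diff(default_output: &[u8], json_output: &[u8]) -> bool {
    // Either output being empty means the subprocess silently failed.
    !default_output.is_empty()
        && !json_output.is_empty()
        && default_output != json_output // inverse-equality: a --json no-op Fails
}

fn main() {
    assert!(verdict_from_json_flag_diff(b"name: m\n", b"{\"name\":\"m\"}"));
    assert!(!verdict_from_json_flag_diff(b"same bytes", b"same bytes")); // --json no-op
    assert!(!verdict_from_json_flag_diff(b"", b"{}"));                   // silent failure

    // Symmetric in its two arguments.
    assert_eq!(
        verdict_from_json_flag_diff(b"a", b"b"),
        verdict_from_json_flag_diff(b"b", b"a")
    );
    println!("ok");
}
```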
…EVEL — closes 7/7 INV-BPE sweep
Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-005 (NFC
idempotence + composed/decomposed equivalence) per
`contracts/tokenizer-bpe-v1.yaml`.
## What this binds
`verdict_from_nfc_idempotence(composed_nfc, decomposed_nfc, double_nfc)`
returns Pass iff:
1. All three byte slices are non-empty (refuse vacuous Pass)
2. `composed_nfc == decomposed_nfc` (composable equivalence)
3. `composed_nfc == double_nfc` (NFC idempotence)
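The three-way byte-equality rule can be sketched directly; the `bool` return and the hand-written byte fixtures are assumptions for the sketch (the real suite feeds bytes produced by an actual NFC implementation):

```rust
// Sketch of the INV-BPE-005 rule: three non-empty byte slices, all byte-identical.
fn verdict_from_nfc_idempotence(
    composed_nfc: &[u8],
    decomposed_nfc: &[u8],
    double_nfc: &[u8],
) -> bool {
    !composed_nfc.is_empty()
        && !decomposed_nfc.is_empty()
        && !double_nfc.is_empty()
        && composed_nfc == decomposed_nfc // composable equivalence
        && composed_nfc == double_nfc     // NFC idempotence
}

fn main() {
    // "café" in composed form (U+00E9); a correct NFC maps both input
    // forms, and a second NFC pass, to these same bytes.
    let cafe_nfc = "caf\u{e9}".as_bytes();
    assert!(verdict_from_nfc_idempotence(cafe_nfc, cafe_nfc, cafe_nfc));

    // Decomposed form left un-normalized ("e" + U+0301 combining acute):
    // a missing-NFC regression breaks composable equivalence.
    let decomposed_raw = "cafe\u{301}".as_bytes();
    assert!(!verdict_from_nfc_idempotence(cafe_nfc, decomposed_raw, cafe_nfc));

    // Vacuous Pass on all-empty input is refused.
    assert!(!verdict_from_nfc_idempotence(b"", b"", b""));
    println!("ok");
}
```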
## ✅ Closes the 7/7 INV-BPE sweep on `tokenizer-bpe-v1`
INV-BPE-001 (vocab range + paired model), 002 (special tokens),
003 (round-trip / FALSIFY-SHIP-012 dual), 004 (merge algebra),
005 (NFC idempotence), 006 (encode determinism), 007 (byte
coverage / no UNK) all bound at PARTIAL_ALGORITHM_LEVEL.
## Why three-way byte-equality
The contract pins TWO properties on the same NFC implementation:
- Composable equivalence: nfc("café"-composed) == nfc("café"-decomposed)
- Idempotence: nfc(nfc(x)) == nfc(x)
A regression in either dimension breaks MODEL-2's tokenizer
contract. Composing both into one verdict means a single PARTIAL
covers both gate paths; the test suite exercises each
independently:
- `fail_cafe_composed_vs_decomposed` catches missing-NFC.
- `fail_double_nfc_differs` catches non-idempotent NFC.
- `fail_hangul_precomposed_vs_decomposed` catches NFC missing on
multi-codepoint canonical equivalents.
## Why byte-equality (not str-equality)
The Rust `&[u8]` shape lets the verdict compare:
- Pre-NFC byte sequences (which may not be valid UTF-8)
- Post-NFC byte sequences from any NFC implementation (ICU, unic,
rust-stdlib, the tokenizer's own)
- Hex/UTF-8 fixtures in tests
`==` on `&[u8]` is character-for-character byte-identical; matches
the contract falsifier's "byte-equals" wording.
## Why refuse empty inputs
Vacuous Pass on `(&[], &[], &[])` would let an early-return
short-circuit silently pass the gate. Refusing all three
empty-input combinations explicitly catches it.
## Five-Whys
1. Why bind INV-BPE-005 now? — Closes the 7/7 INV-BPE sweep on
`tokenizer-bpe-v1`. NFC idempotence is required before BPE
encoding; without it, "café" and "café" tokenize differently
even though they're canonically equivalent — corrupts MODEL-2
tokenizer contract.
2. Why a 3-tuple of byte slices, not the actual NFC implementation? —
Algorithm-level pin; the actual NFC + tokenizer code that
produces the three forms is FULL_DISCHARGE work for the
corpus-tokenize PR.
3. Why three forms instead of two? — Composable equivalence and
idempotence are independent properties; a tokenizer can pass
one and fail the other. Three-way input lets the verdict
exercise both with one call.
4. Why byte-level rather than codepoint-level? — Catches
subtler regressions where two byte sequences decode to the
same codepoints under different NFC normalizations
(idempotence implies bytes, not just codepoints).
5. Why 20 tests across 7 sections? — Pass band (×5: ASCII,
composed-café, CJK, emoji, math), composed-vs-decomposed
fail (×3: café, distinct, single byte), idempotence fail
(×2), combined violations (×1), caller errors (×4: all empty
+ per-input empty), symmetry/three-way distinct (×2), and
realistic Hangul precomposed/decomposed pair + long mixed
text (×3).
## Cross-reference (informational)
This verdict's logic is structurally similar to `data_inv_007`
(UTF-8 + NFC round-trip) — both gate NFC correctness — but at
different scope:
- `data_inv_007` counts NFC failures across a corpus.
- `bpe_inv_005` proves NFC idempotence on a single fixture
triple.
The two together provide both unit-level (per-text) and
corpus-level (millions-of-files) NFC pinning.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual NFC
implementation that produces the three byte sequences is
FULL_DISCHARGE work for the corpus-tokenize implementation PR.
## Tests
20 unit tests, all green. **Closes the 7/7 INV-BPE gate sweep**
on `tokenizer-bpe-v1`.
…EVEL (#1159)

Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 (round-trip byte-equality on 10K held-out docs) per `contracts/tokenizer-bpe-v1.yaml`. Also discharges FALSIFY-SHIP-012 directly.

## What this binds

`verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures)` returns Pass iff:
1. `docs_scanned >= 10_000` (contract statistical floor)
2. `roundtrip_failures == 0` (zero-tolerance — non-injective tokenization corrupts the loss target)
3. `roundtrip_failures <= docs_scanned` (counter sanity)

Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.

## Why zero-tolerance

`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not injective on the input domain. Even one failure means MODEL-2's training loss is computed against tokens that don't reconstruct the original text — silently corrupting the cross-entropy gradient on that document. The contract falsifier ("Any non-zero diff bytes fails") admits no tolerance band.

## Why the 10K floor

The contract specifies a 10K-doc held-out corpus. Round-trip failures cluster around rare Unicode patterns (emoji ZWJ sequences, RTL combining marks, control sequences) that occur at frequencies of ~1-in-10K to 1-in-100K. A scan over 1K docs would miss those classes entirely; a scan over 100K would over-tax every smoke run. Pinning 10K matches the contract's statistical-power requirement.

## Why the partition-violation check

`roundtrip_failures > docs_scanned` indicates counter corruption — the failure set cannot exceed the total scanned. This catches a counter-rollover bug or a double-count regression.

## Five-Whys

1. Why bind INV-BPE-003 now? — Round-trip injectivity is the foundation of MODEL-2's training-target correctness; without a verdict pin, a regression in NFC handling or BPE merge-table loading silently corrupts every gradient.
2. Why a (u64, u64) pair, not the actual scanner? — Algorithm-level pin; the streaming round-trip scanner is FULL_DISCHARGE work for the corpus-tokenize PR.
3. Why pin the 10K floor? — Statistical-power match to the contract; smaller samples miss rare-Unicode classes.
4. Why the partition-violation check? — Counter corruption mustn't silently pass.
5. Why 17 tests across 7 sections? — Mutation survey: provenance pin (×1), pass band (×4: floor, +1, CSN-Python, huge), zero-tolerance fail (×4 incl. one-in-a-million), sample-size fail (×3), partition violation (×2), boundary sweep (×8 each dimension), and zero-tolerance property (×4 sizes).

## Cross-reference

This verdict ALSO discharges FALSIFY-SHIP-012 (the round-trip gate from SHIP-TWO-001 spec §15.4) — the contract identifies the two as the same gate. Two contract IDs share one PARTIAL discharge.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual streaming round-trip scanner that produces `(docs_scanned, roundtrip_failures)` is FULL_DISCHARGE work for the corpus-tokenize implementation PR.

## Tests

17 unit tests, all green.
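A sketch of the scan-counter rule, assuming a `bool` return for brevity:

```rust
// Pinned statistical floor from the contract: 10K held-out docs.
const AC_BPE_INV_003_REQUIRED_DOCS: u64 = 10_000;

// Sketch of the INV-BPE-003 rule: floor met, counters sane, zero failures.
fn verdict_from_roundtrip_scan(docs_scanned: u64, roundtrip_failures: u64) -> bool {
    docs_scanned >= AC_BPE_INV_003_REQUIRED_DOCS // statistical floor
        && roundtrip_failures <= docs_scanned    // counter sanity (partition check)
        && roundtrip_failures == 0               // zero-tolerance
}

fn main() {
    assert!(verdict_from_roundtrip_scan(10_000, 0));     // exactly at the floor
    assert!(!verdict_from_roundtrip_scan(9_999, 0));     // under-sampled
    assert!(!verdict_from_roundtrip_scan(1_000_000, 1)); // one-in-a-million still Fails
    assert!(!verdict_from_roundtrip_scan(10_000, 10_001)); // counter corruption
    println!("ok");
}
```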
Summary
A single squashed PR that combines 135 PRs' worth of format/PARTIAL-discharge feat commits into one diff. One CI run instead of 135.
Each individual PR adds one new file, `crates/aprender-core/src/format/<name>.rs` (verdict functions + tests), plus a `pub mod <name>;` line in `mod.rs`. They're structurally independent — no cross-dependencies — so a single combined diff is mechanically equivalent to merging the 135 PRs sequentially.
Why
With 250+ queued PRs all individually running CI, the merge queue is rate-limited by runner pool size (~6-8 PRs/hour after the #1632 sccache fix). Squashing the format-discharge cluster collapses 135 sequential workspace-test runs into 1, freeing the queue for the remaining unique-content PRs.
Files
135 new format modules, each binding a contract's FALSIFY-* gates at PARTIAL_ALGORITHM_LEVEL. See individual commit messages for per-contract context.
Test plan
- `cargo clippy -p aprender-core --lib -- -D warnings` clean
- `cargo check -p aprender-core --lib` clean
After merge
The 135 originating PRs will be auto-closed by GitHub (branch-equal-to-main after this lands).
🤖 Generated with Claude Code