chore(format): squash 135 PARTIAL-discharge feat commits into one PR (one CI run instead of 135)#1637

Merged
noahgift merged 135 commits into main from chore/format-discharge-squash-v3
May 12, 2026
Conversation

@noahgift

Summary

Single squashed PR that combines 135 PRs worth of format/PARTIAL discharge feat commits into one diff. One CI run instead of 135.

Each individual PR adds one new file, crates/aprender-core/src/format/<name>.rs (verdict functions + tests), plus a `pub mod <name>;` line in mod.rs. They're structurally independent — no cross-dependencies — so a single combined diff is mechanically equivalent to merging the 135 PRs sequentially.

Why

With 250+ queued PRs all individually running CI, the merge queue is rate-limited by runner pool size (~6-8 PRs/hour after the #1632 sccache fix). Squashing the format-discharge cluster collapses 135 sequential workspace-test runs into 1, freeing the queue for the remaining unique-content PRs.

Files

135 new format modules, each binding a contract's FALSIFY-* gates at PARTIAL_ALGORITHM_LEVEL. See individual commit messages for per-contract context.

Test plan

After merge

The 135 originating PRs will be auto-closed by GitHub (branch-equal-to-main after this lands).

🤖 Generated with Claude Code

@noahgift noahgift enabled auto-merge (squash) May 12, 2026 08:54
This was referenced May 12, 2026
noahgift and others added 25 commits May 12, 2026 14:09
…GORITHM_LEVEL — closes 9/9 sweep

Bundles 9 algorithm-level decision rules + a stand-alone scalar
softmax reference (max-subtraction trick, returns SoftmaxError for
empty/non-finite input):

- SM-001 normalization: |sum - 1| < sqrt(n)*1e-6
- SM-002 positivity: all outputs > 0 (catches underflow-to-zero)
- SM-003 order preservation: argmax(softmax(x)) == argmax(x)
- SM-004 SIMD vs scalar: |simd - scalar| < 8 ULPs
- SM-005 single-element softmax([x]) == [1.0]
- SM-006 constant input -> uniform output [1/n; n]
- SM-007 input validation: empty input -> Err; NaN/Inf -> Err
- SM-008 length preserved: len(out) == len(in)
- SM-009 frame condition: input buffer byte-identical after call
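
A minimal scalar reference with the max-subtraction trick, in the spirit of SM-001/SM-007 (hypothetical sketch; the module's actual `SoftmaxError` type and function names may differ):

```rust
/// Scalar softmax reference. Subtracting the max keeps exp() in a
/// representable range so large inputs don't overflow to infinity.
fn softmax_ref(x: &[f64]) -> Result<Vec<f64>, &'static str> {
    if x.is_empty() {
        return Err("empty input"); // SM-007: empty -> Err
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err("non-finite input"); // SM-007: NaN/Inf -> Err
    }
    let m = x.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = x.iter().map(|v| (v - m).exp()).collect();
    let sum: f64 = exps.iter().sum();
    Ok(exps.iter().map(|e| e / sum).collect())
}
```

Note that an input like [1000, -1000] still underflows the small element to exactly 0.0 in f64, which is precisely the band SM-002 is designed to catch.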

27 tests cover the canonical softmax invariants, the [1000, -1000]
underflow band catching SM-002, and the SM-007 empty/NaN/Inf
input-validation error boundaries.

Contract: contracts/softmax-kernel-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LGORITHM_LEVEL — closes 10/10 sweep

Bundles 10 algorithm-level decision rules + a stand-alone reference
online-softmax scan (OnlineState, online_update, online_scan,
online_softmax, standard_softmax) so the algorithm-level rule is
testable offline against drift in any SIMD/GPU implementation:

- OSM-001 |online - standard| < 1e-5 element-wise
- OSM-002 |Σ y_i - 1.0| < 1e-6
- OSM-003 every output strictly > 0 (catches underflow-to-zero)
- OSM-004 shift invariance: softmax(x + c) ≈ softmax(x)
- OSM-005 decoder kv_lens {1, 6, 64, 448, 1500} all valid
  (sum tolerance scales by sqrt(n) for naive-sum FP error)
- OSM-006 single-element softmax([x]) == [1.0]
- OSM-007 running max invariant: m_i = max(x_1..x_i)
- OSM-008 loop counter terminates at n
- OSM-009 normalizer recompute: d_i = Σ exp(x_j - m_i)
- OSM-010 partial sum invariant (same as OSM-009 in reference scan)
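
The single-pass recurrence behind OSM-007/OSM-009 can be sketched as follows (hypothetical field and function names; the commit's actual `OnlineState`/`online_update` API is not reproduced here):

```rust
/// Online-softmax state: running max m and rescaled normalizer d.
struct OnlineState { m: f64, d: f64 }

fn online_update(s: OnlineState, x: f64) -> OnlineState {
    let m_new = s.m.max(x);
    // When the running max moves, rescale the old partial sum by
    // exp(old_m - new_m) before adding the new term.
    OnlineState { m: m_new, d: s.d * (s.m - m_new).exp() + (x - m_new).exp() }
}

fn online_softmax(xs: &[f64]) -> Vec<f64> {
    let init = OnlineState { m: f64::NEG_INFINITY, d: 0.0 };
    let s = xs.iter().fold(init, |acc, &x| online_update(acc, x));
    xs.iter().map(|&x| (x - s.m).exp() / s.d).collect()
}
```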

38 tests cover canonical [1,...,1] uniform, the [100, -100, 50]
extreme range, the underflow-to-zero positivity regression, and the
Whisper-canonical {1, 6, 64, 448, 1500} kv_lens.

Contract: contracts/online-softmax-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…RITHM_LEVEL — closes 11/11 sweep

Bundles 11 algorithm-level decision rules + a stand-alone scalar
AdamW reference implementation:

- AW-001 decoupled weight decay: AdamW != L2-coupled Adam when
  wd > 0 AND theta != 0 (vacuous Pass otherwise per contract scope)
- AW-002 second moment v_t >= 0
- AW-003 bias correction: 1/(1 - beta^t) > 1 for beta in (0, 1) and
  t >= 1 (within the f32 non-underflow regime)
- AW-004 update finiteness with finite gradient and eps > 0
- AW-005 SIMD vs scalar: |simd - scalar| < 8 ULPs
- AW-006 zero-gradient boundary: only weight decay modifies theta
  (theta_new == theta_old * (1 - lr * wd) within FP tolerance)
- AW-007 hyperparameter validation: invalid (β1=0, β2=1, ε=0) → Err
- AW-008 frame condition: gradient buffer byte-unchanged after step
- AW-009 loop invariant: v_t >= 0 across N steps
- AW-010 first moment formula: m = β1·old_m + (1-β1)·g (catches
  the (1-β1)·old_m + β1·g coefficient swap)
- AW-011 loop variant: training terminates at exactly max_steps
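
A single-parameter sketch of the decoupled update (hypothetical struct and signature; the real `AdamWState`/`adamw_step` operate on buffers):

```rust
struct AdamWState { m: f64, v: f64, t: u32 }

fn adamw_step(theta: f64, g: f64, s: &mut AdamWState,
              lr: f64, beta1: f64, beta2: f64, eps: f64, wd: f64) -> f64 {
    s.t += 1;
    // AW-010: m = beta1 * old_m + (1 - beta1) * g (coefficient order matters).
    s.m = beta1 * s.m + (1.0 - beta1) * g;
    s.v = beta2 * s.v + (1.0 - beta2) * g * g; // AW-002: v >= 0 by construction
    let m_hat = s.m / (1.0 - beta1.powi(s.t as i32)); // AW-003 bias correction
    let v_hat = s.v / (1.0 - beta2.powi(s.t as i32));
    // AW-001 decoupled decay: wd multiplies theta directly, not the gradient.
    theta - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta)
}
```

With g = 0 the moment terms vanish and only weight decay moves theta, which is exactly the AW-006 zero-gradient boundary.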

Reference impl AdamWState + AdamWHyperparams + adamw_step()
provides the canonical scalar comparison target for AW-005 SIMD
parity testing and AW-007 hyperparam validation.

36 tests cover the canonical Loshchilov & Hutter 2019 formulation,
the AW-005 ULP distance computation across same-sign and
sign-crossing pairs, and the (β1=0, β2=1, ε=0) error boundaries.

Contract: contracts/adamw-kernel-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…AL_ALGORITHM_LEVEL — closes 15/15 sweep

Bundles 15 algorithm-level decision rules over markdown / YAML /
SVG / media integrity:

- DOC-001 heading hierarchy: no H1->H3 skip
- DOC-002 heading hierarchy: at most one H1
- DOC-003 link safety: no javascript: schemes (case-insensitive)
- DOC-004 code fence language: bare ``` is a violation
- DOC-005 table column parity: every row matches header
- DOC-006 SVG safety: no <script> tags (case-insensitive)
- DOC-007 README drift: actual == generated byte-exact
- DOC-008 SVG must declare viewBox
- DOC-009 required sections present (e.g. License)
- DOC-010 YAML balanced delimiters
- DOC-011 YAML no duplicate top-level keys
- DOC-012 magic-bytes vs extension agreement
  (PNG/JPEG/GIF/WebP detection)
- DOC-013 image dimensions <= 8192 x 8192, both non-zero
- DOC-014 animation frame count in (0, 1000]
- DOC-015 animation must not be infinite-loop (loop_count > 0)
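
The DOC-001/DOC-002 heading checks reduce to a short scan over heading levels (hypothetical helper; the real module's rule functions return verdicts, not bools):

```rust
/// Pass iff no heading skips more than one level downward (DOC-001)
/// and the document has at most one H1 (DOC-002).
fn heading_hierarchy_ok(levels: &[u8]) -> bool {
    let mut h1_count = 0;
    let mut prev = 0u8;
    for &level in levels {
        if level == 1 { h1_count += 1; }
        if prev != 0 && level > prev + 1 {
            return false; // e.g. H1 -> H3 skip
        }
        prev = level;
    }
    h1_count <= 1
}
```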

49 tests cover security regression classes (XSS via javascript:
URLs, SVG <script> injection, PNG/JPEG extension spoofing) and the
canonical structural drift cases (heading skips, duplicate H1s,
missing License, unclosed YAML braces).

Contract: contracts/document-integrity-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…12 PARTIAL_ALGORITHM_LEVEL — closes 12/12 sweep

Bundles 12 algorithm-level decision rules for the architecture
role-set requirements, plus a self-contained reference model
(BASE_ROLES, QK_NORM_ROLES, BIAS_ROLES, ArchConstraints,
required_roles, from_architecture) that mirrors
realizar/src/arch_requirements.rs:

- ARCH-001 YAML/Rust set equality
- ARCH-002 base roles always present (BASE_ROLES subset)
- ARCH-003 4 (qk_norm, bias) cells produce 4 distinct non-empty sets
- ARCH-004 cardinalities {9, 11, 12, 14}
- ARCH-005 Qwen3: has QK norm, NOT bias (GH-279 root cause)
- ARCH-006 Qwen2: has bias, NOT QK norm (catches Qwen2/3 swap)
- ARCH-007 LLaMA/Mistral are base-only (9 roles, no extensions)
- ARCH-008 missing required role → reject (catches GH-279
  silent-zero-weight regression)
- ARCH-009 optional roles do NOT trigger rejection
- ARCH-010 unknown architecture → base fallback
- ARCH-011 alias equivalence (qwen2.5/qwen, phi/phi3, llama/llama3,
  gemma/gemma2)
- ARCH-012 monotonicity: enabling features only adds roles
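
The role-set construction can be sketched like this (hypothetical role names chosen so the cardinalities land on the ARCH-004 set {9, 11, 12, 14}; the real `BASE_ROLES`/`required_roles` live in the reference model described above):

```rust
use std::collections::BTreeSet;

/// Base roles plus optional qk_norm / bias extensions. Enabling a
/// feature only ever adds roles (ARCH-012 monotonicity).
fn required_roles(qk_norm: bool, bias: bool) -> BTreeSet<&'static str> {
    let mut roles = BTreeSet::from([
        "embed", "q", "k", "v", "o", "gate", "up", "down", "norm",
    ]);
    if qk_norm { roles.extend(["q_norm", "k_norm"]); }
    if bias { roles.extend(["q_bias", "k_bias", "v_bias"]); }
    roles
}
```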

37 tests cover canonical Qwen2 (12 roles), Qwen3 (11 roles), LLaMA
(9 roles) configurations + alias families + the GH-279 silent-zero
regression class.

Contract: contracts/architecture-requirements-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TIAL_ALGORITHM_LEVEL — closes 15/15 sweep

Bundles 15 algorithm-level decision rules for the GPU decode
profiler / cbtop report invariants:

- GDP-001 brick coverage >= 85% wall time
- GDP-002 graph-enabled coverage < 50% wall time
- GDP-003 sync mode dichotomy: Immediate > 100us, Deferred < 100us
- GDP-004 LmHead count == decoded_tokens (exactly 1 per token)
- GDP-005 brick avg ordering: LmHead > Gate > RmsNorm strict
- GDP-006 token unit separation: profiler.total > decoded_tokens
- GDP-007 coverage upper bound: brick_total_ns <= wall_ns
- GDP-008 reproducibility: per-token CV < 0.15 across runs
- GDP-009 report fidelity: actual_us within 1% of profiler avg
  (catches B1 per-element vs per-call regression)
- GDP-010 score diversity: not all scores hardcoded to 100
  (catches B3 compute_brick_score-not-called regression)
- GDP-011 brick count == 11 for Qwen 1.5B
- GDP-012 falsification accounting consistency
  (catches B10-B12 hardcoded FalsificationSummary)
- GDP-013 LmHead per-decoded-token > 100us
  (catches denominator using profiler.total_tokens)
- GDP-014 pmat scores all 0.0 (catches B7-B9 magic constants
  173.9 / 98.1 / 95.2)
- GDP-015 total_points != 137 (catches B10 magic constant)
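
The GDP-008 reproducibility gate rests on a coefficient-of-variation computation, sketched here as a hypothetical helper (the real profiler's aggregation may differ, e.g. sample vs population variance):

```rust
/// CV = population standard deviation / mean. GDP-008 requires
/// per-token CV < 0.15 across runs.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let var = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}
```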

53 tests cover boundary sweeps, the canonical Qwen 1.5B
configuration, the exact regression classes named in the
contract (B1, B3, B7-B12), and FP edge cases.

Contract: contracts/gpu-decode-profiling-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…PARTIAL_ALGORITHM_LEVEL — closes 7 unbound

Bundles 7 algorithm-level decision rules for the qwen2-e2e
general-purpose verification gates (the SHIP-* gates 001..024 are
already bound at PARTIAL or DISCHARGED in their respective modules):

- QW2E-001 verdict_from_param_count: |observed - 7,615,616,512| / target
  <= 0.5%. Pinned to the spec-canonical Qwen2.5-Coder-7B count.
- QW2E-002 verdict_from_flops_per_token: strict observed_flops == 2*P
  (canonical roofline formula).
- QW2E-003 verdict_from_memory_ordering: strict q4k < q6k < f16 < f32
  byte ordering.
- QW2E-004 verdict_from_throughput_roofline: observed_tps <=
  min(bandwidth_tps, compute_tps).
- QW2E-005 verdict_from_obligation_coverage: every obligation has
  evidence; vacuous truth (zero obligations) rejected.
- QW2E-006 verdict_from_block_shape_preservation: shape(block(x)) == shape(x).
- QW2E-007 verdict_from_e2e_shape_conservation: tokens [seq] ->
  hidden [seq, hidden] -> logits [seq, vocab] with seq agreement.
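
The QW2E-001 tolerance band reduces to a relative-error check against the pinned count (hypothetical function name; the 0.5% bound and target are from the rule above):

```rust
/// Pinned spec-canonical Qwen2.5-Coder-7B parameter count.
const TARGET_PARAMS: u64 = 7_615_616_512;

/// Pass iff |observed - target| / target <= 0.5%.
fn verdict_param_count(observed: u64) -> bool {
    let rel = (observed as f64 - TARGET_PARAMS as f64).abs()
        / TARGET_PARAMS as f64;
    rel <= 0.005
}
```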

37 tests cover boundary sweeps (tolerance bands, FLOPs off-by-one),
strict-equality enforcement (memory ordering, FLOPs formula), domain
violations (NaN, zero rooflines, counter inversions), and the Qwen
canonical [16, 3584]/[16, 152064] shapes.

Contract: contracts/qwen2-e2e-verification-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…..008 PARTIAL_ALGORITHM_LEVEL — closes 6 unbound

Bundles 6 algorithm-level decision rules for the Xet large-file
upload path:

- LFS-003 verdict_from_chunk_sizes: Pass iff every chunk except the
  last is in [8KiB, 128KiB] AND the last is in (0, 128KiB].
- LFS-004 verdict_from_xorb_sizes: Pass iff every xorb <= 64 MiB.
- LFS-005 verdict_from_upload_ordering: Pass iff every ShardSent has
  >= expected_xorb_count XorbAck events strictly preceding it.
- LFS-006 verdict_from_idempotent_replay: was_inserted:false treated
  as success (key Xet idempotency invariant).
- LFS-007 classify_retry/verdict_from_retry_classification:
  429/500/503/504 -> Retry, 401 -> RefreshThenRetryOnce, 400/403/404
  -> Abort, unknown statuses -> Abort (conservative default).
- LFS-008 xet_url_hash_encode/verdict_from_url_hash_encoding:
  8-byte LE-reversed hex (NOT naive hex). Pinned to the
  mathematically correct 64-char encoding of [0,1,...,31] (the
  contract's example string is a 62-char typo missing byte 0x10).
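
The LFS-007 status-code table maps directly onto a match (hypothetical enum; the status partition is taken verbatim from the rule above):

```rust
#[derive(Debug, PartialEq)]
enum RetryAction { Retry, RefreshThenRetryOnce, Abort }

/// 429/500/503/504 are transient; 401 means refresh credentials once;
/// everything else, including unknown statuses, aborts conservatively.
fn classify_retry(status: u16) -> RetryAction {
    match status {
        429 | 500 | 503 | 504 => RetryAction::Retry,
        401 => RetryAction::RefreshThenRetryOnce,
        _ => RetryAction::Abort, // 400/403/404 and unknown statuses
    }
}
```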

LFS-001/002/009/010 are already bound at runtime in
crates/aprender-core/src/hf_hub/xet.rs.

34 tests cover boundary sweeps, the conservative-Abort default for
unknown HTTP status codes, the Xet idempotency invariant, and the
canonical 32-byte hash encoding.

Contract: contracts/apr-publish-hf-large-file-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_ALGORITHM_LEVEL — closes 4/4 sweep

Bundles 4 algorithm-level decision rules for README-vs-repo-state
claim consistency:

- README-001 verdict_from_crate_count: Pass iff claimed == observed
  AND both non-zero.
- README-002 verdict_from_contract_count: Pass iff claimed == observed
  AND both non-zero.
- README-003 verdict_from_cli_command_count: Pass iff claimed == observed
  AND observed >= 30 (guards against empty-help regression — apr is
  documented as a 58-subcommand CLI).
- README-004 verdict_from_cookbook_link_presence: Pass iff readme_text
  contains literal substring "apr-cookbook".
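
README-003's shape, sketched as a hypothetical helper (the 30-command floor is the `AC_README_003_MIN_REASONABLE` pin described below):

```rust
/// Pass iff the claimed and observed subcommand counts match AND the
/// observed count clears the 30-command floor (empty-help guard).
fn verdict_cli_command_count(claimed: u64, observed: u64) -> bool {
    claimed == observed && observed >= 30
}
```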

21 tests cover off-by-one mutations, zero-input domain errors,
empty-help regression, link-renamed-without-sync, and the 30-subcmd
provenance pin (AC_README_003_MIN_REASONABLE).

Pins the algorithm-level rule against drift; the runtime-level
falsifier `bash scripts/check_readme_claims.sh` already exists per
the contract's evidence line.

Contract: contracts/readme-claims-v1.yaml.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LGORITHM_LEVEL — closes 3/3 sweep

Bundles 3 algorithm-level decision rules for chat-template-v1:

- CHAT-001 verdict_from_chatml_render: Pass iff non-empty AND
  balanced <|im_start|>/<|im_end|> counts.
- CHAT-002 verdict_from_llama3_render: Pass iff starts with
  <|begin_of_text|> AND contains <|eot_id|>.
- CHAT-003 verdict_from_missing_template_outcome: Pass iff
  outcome == ExplicitError; all silent-success branches Fail
  (PMAT-237 process-validation rule).
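
CHAT-001's balance condition is a pair of substring counts (hypothetical helper; the real verdict function returns a verdict type, not a bool):

```rust
/// Pass iff the render is non-empty and the ChatML delimiter
/// counts are balanced.
fn chatml_balanced(rendered: &str) -> bool {
    !rendered.is_empty()
        && rendered.matches("<|im_start|>").count()
            == rendered.matches("<|im_end|>").count()
}
```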

17 tests cover the 3 decision rules with mutation-survey shape:
empty/missing-token fail bands + cross-template-family rejection
(ChatML in Llama-3 slot) + exhaustive 4-variant outcome sweep.

Provides the algorithm-level pin against contract drift; the
runtime-level falsifier `cargo test ... chat_template` already
exists per the contract's evidence line.

Contract: contracts/chat-template-v1.yaml.
Spec: SHIP-TWO-001 §12 (chat template ship gates).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TIAL_ALGORITHM_LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-001 (apr
publish --manifest dry-run accepts valid YAML) AND -007 (HF repo
contains apr+safetensors+gguf 3-format coverage) per
`contracts/apr-cli-publish-extra-v1.yaml`.

## What this binds

PUB-EXTRA-001: `verdict_from_manifest_dry_run_exit(exit_code)` —
Pass iff `exit_code == 0`. Single-input exit-code shape.

PUB-EXTRA-007: `verdict_from_three_format_repo(apr, safetensors,
gguf)` — Pass iff `(apr, safetensors, gguf) == (1, 1, 1)`. Mirrors
QA-009's 3-format coverage but counts files-per-format instead of
exit-codes-per-format.

Pinned constant: `AC_PUB_EXTRA_007_REQUIRED_FORMATS = 3`.
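
The PUB-EXTRA-007 verdict is an exact 3-tuple equality, shown here as a hypothetical sketch whose mutation-survey property (only one passing cell in the {0,1,2}³ cube) is easy to verify exhaustively:

```rust
/// Pass iff the repo contains exactly one file per format.
fn verdict_three_format_repo(apr: u32, safetensors: u32, gguf: u32) -> bool {
    (apr, safetensors, gguf) == (1, 1, 1)
}
```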

## apr-cli-publish-extra-v1 sweep status

After this PR: 4 of 10 bound (001 + 002 + 006 + 007).
Remaining: 003 (--extra-file), 004 (no-manifest backward compat),
005/008 (script Python exclusion), 009/010 (preflight validation).

## Why bundle 001 + 007

Both are publish-pipeline integrity verdicts. PUB-EXTRA-001 gates
the manifest-consumption entry point; PUB-EXTRA-007 gates the
multi-format ship completeness. Same family of "publish
correctness" gates. Same precedent as `pub_cli_002_004`,
`qa_002_006`, `cmd_safety` (4), `pub_extra_002_006`.

## Why match QA-009 shape but with file counts

QA-009 (apr-cli-qa-v1) checks exit codes from `apr inspect`
on each of 3 formats — does the parser support each format?
PUB-EXTRA-007 checks file counts per extension in the published
HF repo — does the ship deliver each format?

Two complementary 3-format gates at different pipeline stages:
- QA-009: parser ingest (does apr support format X?)
- PUB-EXTRA-007: ship publish (did we ship format X?)

Both must Pass for full 3-format guarantee.

## Five-Whys

1. Why bind these two now? — Both gate publish-pipeline
   correctness; without verdict pins, regressions in either
   manifest consumption or 3-format ship completeness corrupt
   the published artifact set.
2. Why distinct shapes? — 001 is single exit-code; 007 is
   3-counter equality.
3. Why mirror QA-009 shape? — Cross-pipeline 3-format guarantee
   needs both sides (ingest + ship); reuse signals consistency.
4. Why exhaustive 27-cell composite matrix in 007? — Mutation-
   survey breadth; proves only `(1,1,1)` passes across the full
   {0,1,2}³ cube.
5. Why 14 tests across 2 verdict sections? — 5 for 001, 9 for
   007 (incl. exhaustive cube).

## Cross-reference

- `qa_009`: 3-format parser ingest (exit codes).
- `pub_extra_001_007`: 3-format ship publish (file counts).

The "three-format coverage" pattern now applied at both pipeline
endpoints.

## Scope

PARTIAL_ALGORITHM_LEVEL only.

## Tests

14 unit tests, all green.
…TIAL_ALGORITHM_LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-002 + 006
— two CRITICAL ship-integrity gates per
`contracts/apr-cli-publish-extra-v1.yaml`.

## What this binds

PUB-EXTRA-002 (sha256-mismatch must abort PRE-upload):
`verdict_from_sha256_mismatch_abort(exit_code, hf_api_call_count)`
returns Pass iff `exit_code != 0 AND hf_api_call_count == 0`.

PUB-EXTRA-006 (post-upload sha256 round-trip):
`verdict_from_post_upload_roundtrip(manifest_sha256, pulled_sha256)`
returns Pass iff both 32 bytes AND byte-identical.

Pinned constant: `AC_PUB_EXTRA_006_SHA256_BYTES = 32`.
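
The PUB-EXTRA-006 shape, sketched as a hypothetical helper (length check first, so an MD5-length pair of equal digests still fails):

```rust
const SHA256_BYTES: usize = 32;

/// Pass iff both digests are exactly 32 bytes and byte-identical.
fn verdict_post_upload_roundtrip(manifest: &[u8], pulled: &[u8]) -> bool {
    manifest.len() == SHA256_BYTES
        && pulled.len() == SHA256_BYTES
        && manifest == pulled
}
```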

## Why this matters for shipping

PUB-EXTRA-002 is described in the contract as **CRITICAL**: "the
whole point of manifest-first publish is the sha256 guard — if it
fails, we ship garbage". Without a verdict pin, a regression that
allows the publish path to skip the pre-flight sha256 check ships
corrupted artifacts to HuggingFace.

PUB-EXTRA-006 catches the complementary failure mode: even if
pre-upload validation passed, network transit corruption produces
artifacts whose post-upload sha256 differs from manifest. Both
gates must Pass for ship integrity.

## Why two gates instead of one

The two gates protect different stages of the upload pipeline:
- 002: pre-upload (catches local-vs-manifest mismatch)
- 006: post-upload (catches transit corruption)

A failure mode at either stage corrupts the published artifact.
Independent verdicts ensure each stage's regression class is
caught independently.

## Why bundle in one module

Same shape family (both ship-integrity guards on the publish
path). Same precedent as `pub_cli_002_004`, `qa_002_006`,
`cmd_safety` (4 verdicts), `ops_005_006`. Authoring two modules
with identical structure would be muda.

## Five-Whys

1. Why bind these two now? — Both are CRITICAL ship-integrity
   gates for MODEL-1/2 HuggingFace upload. Without verdict pins,
   regressions ship corrupted artifacts silently.
2. Why distinct verdict shapes? — 002 is conjunctive
   `(exit, network-count)`; 006 is byte-equality on SHA-256.
   Different stages, different shapes.
3. Why pre + post both? — Defense in depth: pre-upload catches
   local sources of corruption (filesystem write race, manifest
   stale); post-upload catches transit sources (HF endpoint
   bug, network corruption).
4. Why pin 32-byte SHA-256? — Catches MD5 substitution
   class (matching-but-wrong-length test).
5. Why 18 tests across 2 verdict sections? — 7 for 002 (pass
   pre-upload abort, panic exit, clap exit; fail silent zero
   exit, uploads-before-abort, zero-exit-with-uploads), 11 for
   006 (provenance, identical match, realistic SHA-256,
   single-byte drift at first/middle/last, length errors,
   wrong-length-equal, full-32-position byte-flip sweep).

## Cross-reference

The sha256-byte-eq verdict is now applied in 4 modules:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)
- `pub_extra_002_006` (post-upload round-trip)

This is a strong reuse signal — promote to
`format::common::sha256_byte_eq` shared util in follow-up.

## Scope

PARTIAL_ALGORITHM_LEVEL only.

## Tests

18 unit tests, all green (7 for 002, 11 for 006).
…LEVEL — closes sweep

Algorithm-level PARTIAL discharge for FALSIFY-CMD-SAFETY-001 +
002 + 003 + 004 per `contracts/apr-cli-command-safety-v1.yaml`.
All four bundled in a single module because each verdict has a
small, simple algorithm-level reduction.

## ✅ Closes 4/4 CMD-SAFETY sweep

**Seven contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4) ← this PR

## CMD-SAFETY-001: read-only no new files

`verdict_from_file_count_delta(before, after)` Pass iff equal.
Catches both new-file creation (mutation) and file-count-decrease
(unauthorized cleanup).

## CMD-SAFETY-002: mutating cmd usage error

`verdict_from_mutating_cmd_usage_error(stderr)` Pass iff stderr
contains `output` OR `required` (case-insensitive). Catches
`apr convert model.apr` silently running without `-o` flag.

## CMD-SAFETY-003: SIGINT exit code <= 130

`verdict_from_sigint_exit_code(code)` Pass iff `0 <= code <= 130`.
Pinned constant `AC_CMD_SAFETY_003_MAX_EXIT = 130` (= 128 + 2,
SIGINT). Catches SIGKILL (137) which means TUI didn't release
resources cleanly.

## CMD-SAFETY-004: every command classified

`verdict_from_command_classification(total, classified)` Pass iff
`total > 0 AND classified == total`. Adds another enforcement
layer for the canonical 58-cmd registry.
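
The CMD-SAFETY-003 band, sketched as a hypothetical helper (130 = 128 + SIGINT per the POSIX signal-exit convention; 137 = SIGKILL fails):

```rust
/// Largest clean-shutdown exit code: 128 + 2 (SIGINT).
const MAX_SIGINT_EXIT: i32 = 130;

/// Pass iff 0 <= code <= 130; SIGKILL (137) means cleanup didn't run.
fn verdict_sigint_exit(code: i32) -> bool {
    (0..=MAX_SIGINT_EXIT).contains(&code)
}
```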

## Why bundle 4 verdicts in one module

Same precedent as `pub_cli_002_004` (2 verdicts), `qa_002_006`
(2 verdicts). All 4 cmd-safety verdicts have small, simple
algorithm-level reductions; authoring 4 separate modules with
identical structure would be muda. Contract IDs stay distinct.

## Five-Whys

1. Why bind all 4 now? — Closes 4/4 cmd-safety sweep, all small
   verdicts.
2. Why one module? — Same shape family, same precedent as
   `pub_cli_002_004` / `qa_002_006`.
3. Why case-insensitive substring for CMD-SAFETY-002? — Clap
   error messages can be `error:` or `ERROR:` depending on
   colorization; case-insensitive match is robust.
4. Why pin 130 for SIGINT? — Standard POSIX `128 + 2` for
   SIGINT exit; > 130 (e.g., 137 SIGKILL) means cleanup
   handlers didn't run.
5. Why 24 tests in 5 sections? — 4 verdicts × ~5 tests each
   plus shared-primitive tests. Each verdict gets pass/fail/edge
   coverage.

## Cross-reference

Bundle pattern (multiple verdicts in one module) now applied
to 3 modules: `pub_cli_002_004`, `qa_002_006`, `cmd_safety`.
Each bundles 2-4 small verdicts with shared structure.

## Scope

PARTIAL_ALGORITHM_LEVEL only.

## Tests

24 unit tests, all green.
…LEVEL — closes 7/7 sweep

Algorithm-level PARTIAL discharge for FALSIFY-OPS-002 (GPU memory
returns to baseline after inference error) per
`contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_gpu_memory_leak_delta(baseline, post_error, tolerance)`
returns Pass iff:

1. `baseline <= 1 TiB`
2. `post_error <= 1 TiB`
3. `tolerance <= 256 MiB`
4. `|post_error - baseline| <= tolerance`

Two pinned constants:
- `AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES = 1 TiB`
- `AC_OPS_002_MAX_TOLERANCE_BYTES = 256 MiB`
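
The four conditions compose into one conjunctive check, sketched here with the pinned caps as constants (hypothetical function name; `u64::abs_diff` avoids the unsigned-subtraction underflow):

```rust
const MAX_GPU_BYTES: u64 = 1 << 40;   // 1 TiB plausibility cap
const MAX_TOLERANCE: u64 = 256 << 20; // 256 MiB tolerance ceiling

/// Pass iff both readings are plausible, the tolerance is capped,
/// and the leak delta stays within tolerance.
fn verdict_gpu_leak(baseline: u64, post_error: u64, tolerance: u64) -> bool {
    baseline <= MAX_GPU_BYTES
        && post_error <= MAX_GPU_BYTES
        && tolerance <= MAX_TOLERANCE
        && post_error.abs_diff(baseline) <= tolerance
}
```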

## ✅ Closes the 7/7 OPS sweep on `apr-cli-operations-v1`

**Six contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7) ← this PR

## Why bounded-delta (not byte-equality)

Byte-equality would be too strict — CUDA driver retains some
metadata pages after freeing user allocations (tens of MiB
typical). Bounded delta with a tolerance band matches
operational reality.

## Why cap tolerance at 256 MiB

The verdict is meaningless if the caller can pass arbitrary
tolerance — `tolerance = u64::MAX` would silently accept any
leak. Capping tolerance enforces a useful lower bound on
detection sensitivity:

- 256 MiB covers CUDA driver metadata generously.
- Below that ceiling, real leaks (per-request KV cache, 7B
  model copy, etc.) all produce deltas > 256 MiB and fail.
- A contributor cannot "tune the gate to pass" by setting
  tolerance higher.

`fail_tolerance_8_gib_would_mask_model_leak` test explicitly
catches the adversarial-tolerance class.

## Why 1 TiB GPU memory cap

Catches counter corruption / unsigned-overflow regressions in
the memory reading. No consumer or datacenter GPU exceeds 1 TiB
(H100 is 80 GiB, B200 is 192 GiB).

## Five-Whys

1. Why bind OPS-002 now? — Closes 7/7 ops sweep. GPU memory
   leaks on the error path are silent in dev (1 error = 256 MiB
   leak; tolerated single-shot) but catastrophic in production
   (1000 errors/day → 256 GiB/day → OOM crash).
2. Why a 3-tuple of u64? — Algorithm-level pin; the actual
   nvidia-smi / cuMemGetInfo capture harness is FULL_DISCHARGE.
3. Why bounded delta (not byte-equality)? — CUDA driver retains
   metadata; byte-equality is too strict.
4. Why cap tolerance at 256 MiB? — Prevents adversarial-tolerance
   gate-bypass; ensures meaningful detection sensitivity.
5. Why 21 tests across 7 sections? — Provenance pin (×2), pass
   band (×6: exact, ±32 MiB, exact-tolerance, zero, max-256 MiB,
   real workloads), leak fail (×3 incl. catastrophic 4 GiB),
   memory domain violations (×3), tolerance-too-large (×3),
   delta sweep (×8 probes), realistic GPU scenarios (×3:
   RTX 4090, A100 80GB, KV cache leak).

## Cross-reference

Bounded-delta verdict pattern joins the catalog alongside:
- Threshold (`cov_001`, `data_inv_003`, `distill_train_009`)
- Linear-bound (`ops_007` token count)
- Conjunctive exit (`pub_cli_002_004`, `qa_002_006`)
- Byte-equality family (6 modules)
- Sequence-monotonicity (`ops_004`)
- Substring-containment (`pull_dataset_005`, `pub_cli_001`)

Each domain with measurement + tolerance now has a reference
pattern.

## Scope

PARTIAL_ALGORITHM_LEVEL only.

## Tests

21 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-004 (progress
percentage monotonically non-decreasing) per
`contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_progress_monotonicity(progress_sequence)` returns
Pass iff:

1. `!progress_sequence.is_empty()`
2. Every value is finite (no NaN, ±∞)
3. Every value is in `[0.0, 100.0]`
4. Every adjacent pair is non-decreasing (`a <= b`)
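
The four conditions above can be sketched directly over `windows(2)` (hypothetical function name; note the non-strict `<=` so throttled repeats pass):

```rust
/// Pass iff the sequence is non-empty, every value is finite and in
/// [0.0, 100.0], and every adjacent pair is non-decreasing.
fn verdict_progress_monotonic(seq: &[f64]) -> bool {
    !seq.is_empty()
        && seq.iter().all(|p| p.is_finite() && (0.0..=100.0).contains(p))
        && seq.windows(2).all(|w| w[0] <= w[1])
}
```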

## Why non-strict ≤ (not strict <)

Throttled progress emission commonly produces same-value adjacent
pairs (e.g., 25.0 → 25.0 → 50.0 due to debouncing). Strict `<`
would reject normal progress streams. Non-strict matches contract
"never decreases" wording.

## Why the (slice → bool) shape (novel)

First sequence-based verdict in the algorithm-binding cycle. Most
prior verdicts take 1-3 scalar inputs; this one takes a `&[f64]`.
The shape is appropriate because monotonicity is inherently a
sequence property — no equivalent reduction to scalar inputs
exists without losing precision.

## Five-Whys

1. Why bind OPS-004 now? — Progress regression confuses users
   ("download is now at 30% after being at 60%?"). Without a
   verdict pin, retry-mid-stream and concurrent-writer-overlap
   regressions ship invisibly.
2. Why a `&[f64]` input? — Algorithm-level pin; the actual
   progress-event capture is FULL_DISCHARGE. Sequence shape
   matches the inherent nature of monotonicity.
3. Why non-strict `≤`? — Contract literal "never decreases";
   throttled emissions produce same-value adjacent pairs.
4. Why include domain checks? — NaN/±∞/out-of-range values
   could pass a naive `windows(2)` comparison if monotonic
   between themselves; defense in depth.
5. Why 22 tests across 7 sections? — Pass band (×5: increasing,
   repeats, single, resume, dense), regression fail (×4 incl.
   ULP-level), empty fail (×1), NaN/±∞ (×3), out-of-range (×3),
   boundary cases (×3 at 0/100/full-range), realistic apr pull
   scenarios (×3 incl. retry-resets-progress, concurrent
   overlap).

## Cross-reference

The sequence-monotonicity verdict joins the verdict-shape
catalog as a new primitive. Other sequence-based gates that
might use this shape:
- Training loss curves (should be roughly monotonic-decreasing)
- Layer index ordering in trace output
- Token sequence position numbering

If reused, promote to `format::common::is_monotonic_nondecreasing`.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
progress-event capture is FULL_DISCHARGE work for the
operations-pipeline implementation PR.

## Tests

22 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-007 (token count
bounded by 4× input byte length — BPE worst case) per
`contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_token_count_bound(input_byte_len, token_count)`
returns Pass iff:

1. `input_byte_len > 0` (refuse vacuous Pass)
2. `token_count <= input_byte_len.checked_mul(4)` (with overflow
   guard)

Pinned constant: `AC_OPS_007_MAX_RATIO = 4`.
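
A hypothetical sketch of the bound, showing why `checked_mul` matters (a plain `*` would wrap near `u64::MAX / 4` and silently accept any token count):

```rust
const MAX_RATIO: u64 = 4; // published BPE byte-fallback worst case

/// Pass iff the input is non-empty and token_count <= 4 * input bytes,
/// with the multiplication overflow-guarded.
fn verdict_token_count_bound(input_byte_len: u64, token_count: u64) -> bool {
    input_byte_len > 0 // refuse vacuous Pass on empty input
        && input_byte_len
            .checked_mul(MAX_RATIO)
            .is_some_and(|cap| token_count <= cap)
}
```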

## Why pin 4× ratio

BPE byte-level fallback emits at most ~4 tokens per input byte in
the worst case (UTF-8 4-byte sequences split into 4 byte-fallback
tokens when no merges apply). 4× is the published BPE worst-case
bound.

Drift to 8× would mask token-explosion bugs (unbounded merge
expansion); drift to 2× would over-tighten and reject pathological-
but-valid CJK / emoji inputs at the byte-fallback limit.

## Why DoS-class regression matters

Token explosion is a denial-of-service vector: a 1KB adversarial
input producing 5MB of tokens causes:
- Unbounded GPU memory growth (KV cache scales with token count)
- Inference timeout / OOM
- API endpoint backpressure

The `fail_unbounded_memory_attack` test pins this with a 10:100k
ratio (10-byte input, 100k tokens — clearly adversarial).

## Why checked_mul

`input_byte_len * 4` overflows u64 at `input_byte_len >
u64::MAX/4`. Without `checked_mul`, the comparison would silently
wrap to a tiny number and accept any `token_count`. The
`fail_input_times_4_overflow` test catches this.

## Five-Whys

1. Why bind OPS-007 now? — Token-explosion DoS regressions silently
   pass CI (CI doesn't run adversarial inputs); only show up in
   production user load. Without a verdict pin, the bound drifts.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual
   adversarial-input harness is FULL_DISCHARGE.
3. Why 4× specifically? — Contract literal; matches BPE worst-case
   byte-fallback bound.
4. Why inclusive `<=`? — Contract test wording `<= 4 * len`.
5. Why 18 tests across 7 sections? — Provenance pin (×1), pass
   band (×5: typical, exact-cap, 1-byte-1-token, realistic prompt,
   1M scale), token-explosion fail (×4 incl. DoS class),
   empty-input fail (×2), overflow protection (×2), boundary sweep
   (×10 probes around 400-cap), realistic UTF-8 scenarios (×3:
   emoji 4-byte, CJK 3-byte, adversarial).

## Cross-reference

Linear-bound verdict pattern joins the catalog alongside
threshold (`cov_001`, `data_inv_003`), count-zero-tolerance
(many), and conjunctive-exit (`pub_cli_002_004`). Each domain
with linear-relationship invariants now has a reference pattern.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual adversarial-
input tokenizer harness is FULL_DISCHARGE work for the
operations-pipeline implementation PR.

## Tests

18 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-003 (greedy
decoding is deterministic) per
`contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_greedy_decoding_pair(run_a, run_b)` returns Pass iff:

1. Both outputs are non-empty
2. `run_a == run_b` (byte-identical)
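
Sketched as a hypothetical helper (determinism is binary, so the comparison is byte-exact, and the empty case is refused to avoid a vacuous Pass):

```rust
/// Pass iff both runs are non-empty and byte-identical.
fn verdict_greedy_pair(run_a: &[u8], run_b: &[u8]) -> bool {
    !run_a.is_empty() && !run_b.is_empty() && run_a == run_b
}
```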

## Same shape as bpe_inv_006

`bpe_inv_006` (encode determinism) uses positive byte-equality on
token-id sequences. `ops_003` uses positive byte-equality on
generated text outputs. Both gate determinism at different layers:

- `bpe_inv_006`: tokenizer deterministic (encode → same token IDs)
- `ops_003`: sampler deterministic (greedy decode → same text)

The shared shape signals a verdict-shape pattern: "two outputs
must match or Fail" applies to encode determinism, decode
determinism, reproducibility, side-effect freedom.

## Why this matters for SHIP-007 root-cause work

The realistic regression classes pinned in Section 7 cover
exactly the bisection axes from `2026-04-27 findings.md`:
- HashMap iteration order in argmax tiebreak
- Floating-point associativity in parallel logit reduction
- Per-host code path divergence

If a future SHIP-007 fix introduces or reveals nondeterminism in
the sampling layer, this verdict catches it independently of
the layer-0 stage diff.

## Five-Whys

1. Why bind OPS-003 now? — Greedy-decoding non-determinism is
   the canonical "passes CI in dev, fails on user box" defect
   class; without a verdict pin, MODEL-1 ship goes out with
   silent stochasticity.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual
   double-invocation harness is FULL_DISCHARGE.
3. Why byte-identical? — Determinism is binary; one ULP/bit of
   drift signals upstream non-determinism.
4. Why refuse empty? — A vacuous Pass would mask a silent
   `apr run` regression.
5. Why 18 tests across 7 sections? — Pass band (×4: short,
   2+2, long, special chars), single-byte drift fail (×3:
   first/last/middle), length mismatch (×2), empty input (×3),
   symmetry (×2: pass + fail), full-position drift sweep
   (43 positions), realistic apr run scenarios (×3 incl. SHIP-007
   regression classes — HashMap tiebreak, FP associativity).

## Cross-reference

- `bpe_inv_006`: encode determinism (tokenizer layer).
- `ops_003`: decode determinism (sampler layer). [this PR]
- `data_inv_005`/`ship_010`/`ops_001`: SHA-256 byte-equality
  (third application of the digest-pair pattern).

The byte-equality verdict family is now applied across 5+
modules at different scopes (token IDs, generated text, SHA-256
digests, filesystem snapshots, NFC byte sequences).

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual double-
invocation `apr run --temperature 0` harness is FULL_DISCHARGE
work for the operations-pipeline implementation PR.

## Tests

18 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-001 (read-only
commands have no filesystem side effects) per
`contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_filesystem_snapshot_pair(before_snapshot, after_snapshot)`
returns Pass iff:

1. Both digests are 32 bytes (SHA-256)
2. `before == after` (byte-identical)

Pinned constant: `AC_OPS_001_SHA256_BYTES = 32`.
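
As a sketch, the length pin plus byte-equality looks like the following; the function and constant names follow the PR text, but the body is an assumption about the shape, not the merged implementation:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

const AC_OPS_001_SHA256_BYTES: usize = 32;

// Pass iff both digests are 32 bytes (SHA-256) and byte-identical.
fn verdict_from_filesystem_snapshot_pair(before: &[u8], after: &[u8]) -> Verdict {
    if before.len() != AC_OPS_001_SHA256_BYTES || after.len() != AC_OPS_001_SHA256_BYTES {
        return Verdict::Fail; // wrong length catches truncation / hash-function drift
    }
    if before == after { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    let a = [0x42u8; 32];
    let mut b = a;
    assert_eq!(verdict_from_filesystem_snapshot_pair(&a, &b), Verdict::Pass);
    b[31] ^= 1; // one-bit filesystem mutation
    assert_eq!(verdict_from_filesystem_snapshot_pair(&a, &b), Verdict::Fail);
    // 16-byte digests that match each other still Fail (MD5 substitution class)
    assert_eq!(verdict_from_filesystem_snapshot_pair(&[7u8; 16], &[7u8; 16]), Verdict::Fail);
}
```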

## Why mirror data_inv_005's shape

Identical algorithm-level reduction:
- `data_inv_005` (corpus_sha256 reproducibility): two SHA-256
  digests from independent ingest hosts must match.
- `ops_001` (read-only side-effect freedom): two SHA-256 digests
  of the filesystem before/after a "read-only" command must match.

Both verdicts share:
- 32-byte SHA-256 length pin
- Byte-identical equality
- Wrong-length-but-equal Fail (catches MD5 substitution)

The reuse signals a verdict-shape pattern: "two digests must match
or Fail" applies to reproducibility, side-effect freedom, and any
content-equality gate.

## Why pin SHA-256 length explicitly

Three regression classes covered:
1. Truncation (some hash libraries return prefix only).
2. Padding (truncated digest padded to 64 bytes for SHA-512).
3. Hash-function drift (SHA-3, BLAKE3 substituted).

`fail_wrong_length_but_equal` test (16-byte arrays matching each
other) explicitly catches MD5 substitution, even when the bytes
"happen to be equal".

## Five-Whys

1. Why bind OPS-001 now? — Read-only commands writing files is a
   subtle defect class (lock files, metadata caches, mtime
   updates) that ships invisibly. CI passes; only filesystem
   audits catch it.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual
   `find -print0 | xargs sha256sum | sha256sum` snapshot harness
   is FULL_DISCHARGE.
3. Why byte-identical (no tolerance)? — Read-only is binary;
   any filesystem mutation invalidates the contract.
4. Why pin SHA-256 length? — Catches hash-function drift class.
5. Why 15 tests across 7 sections? — Provenance pin (×1),
   pass band (×2), single-byte mutation (×3 incl. one-bit),
   wrong-length caller error (×4 incl. matching-but-wrong-length),
   32-position byte-flip sweep, symmetry property, realistic apr
   scenarios (×3: clean run, lock-file write, metadata-cache).

## Cross-reference

The 32-byte SHA-256 byte-equality verdict is now reused across:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)

If we add a fourth, this primitive should be promoted to a
shared `format::common::sha256_byte_eq` utility in a follow-up.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual filesystem-
snapshot harness is FULL_DISCHARGE work for the operations-pipeline
implementation PR.

## Tests

15 unit tests, all green.
…VEL — closes 1/1 sweep

Algorithm-level PARTIAL discharge for FALSIFY-COV-001 (apr-cli line
coverage >= 95%) per `contracts/apr-cli-coverage-v1.yaml`.

## What this binds

`verdict_from_line_coverage_percent(line_coverage_percent)` returns
Pass iff:

1. `line_coverage_percent.is_finite()` (rules out NaN, ±∞)
2. `0.0 <= line_coverage_percent <= 100.0` (domain)
3. `line_coverage_percent >= 95.0` (contract floor, inclusive)

Pinned constant: `AC_COV_001_MIN_PERCENT = 95.0`.
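
The three checks compose as a short guard chain. A minimal sketch under the names above; the body is illustrative, not the merged code:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

const AC_COV_001_MIN_PERCENT: f64 = 95.0;

// Pass iff the percentage is finite, inside [0, 100], and at or above the floor.
fn verdict_from_line_coverage_percent(p: f64) -> Verdict {
    if !p.is_finite() || !(0.0..=100.0).contains(&p) {
        return Verdict::Fail; // NaN, ±inf, or out-of-domain tooling output
    }
    if p >= AC_COV_001_MIN_PERCENT { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    assert_eq!(verdict_from_line_coverage_percent(95.0), Verdict::Pass); // inclusive floor
    assert_eq!(verdict_from_line_coverage_percent(85.0), Verdict::Fail); // old threshold
    assert_eq!(verdict_from_line_coverage_percent(f64::NAN), Verdict::Fail);
    // one ULP below the floor still fails
    let ulp_below = f64::from_bits(95.0_f64.to_bits() - 1);
    assert_eq!(verdict_from_line_coverage_percent(ulp_below), Verdict::Fail);
}
```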

## ✅ Closes the 1/1 COV sweep on `apr-cli-coverage-v1`

**Five contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1) ← this PR

## Why inclusive `>=` 95.0

Contract test wording: `>= 95.0`. The inclusive floor matches the
contract character-for-character. Boundary tests `pass_at_exact_floor_95_0`
and `fail_just_below_floor_via_ulp` (using `f64::from_bits` for the
value one ULP below 95.0) bracket the inclusive boundary.

## Why pin 95.0 (not 85.0)

Per CLAUDE.md "Quality Standards": "95% minimum test coverage
(upgraded from 85%)". The verdict pins the upgraded threshold;
`fail_at_85_old_threshold` test explicitly catches a regression
back to the pre-upgrade value.

## Why is_finite() + range checks

`cargo llvm-cov` could emit NaN (rare division by zero in
coverage tooling), Inf (huge counters wrapping), or out-of-range
values (parsing bug treating a count as percentage). Defense in
depth catches all three.

## Five-Whys

1. Why bind COV-001 now? — Coverage drift is the kind of
   regression that ships silently with green CI (the threshold
   moves, not the code). Without a verdict pin, a contributor
   could quietly lower the floor in `make tier3` and ship.
2. Why an f64 input? — Algorithm-level pin; the
   `cargo llvm-cov | awk` extraction is FULL_DISCHARGE.
3. Why inclusive `>=`? — Contract literal. The ULP-below test
   brackets the boundary semantics.
4. Why pin 95.0 vs 85.0? — CLAUDE.md upgraded; verdict
   enforces upgraded value.
5. Why 19 tests across 7 sections? — Provenance pin (×1),
   pass band (×4: floor, canonical 96.35, 99.99, 100), below-
   floor fail (×4 incl. 85% old threshold), NaN/±∞ (×3),
   out-of-domain (×3: -1, 100.001, 1e10), boundary sweep (11
   probes + 1 ULP), realistic apr-cli scenarios (×2).

## Cross-reference

Threshold-based verdict pattern (`>= floor` with domain checks)
is conceptually similar to:
- `data_inv_003` (Jaccard < 0.85, strict)
- `tok_par_002` (efficiency >= 80%, inclusive via cross-multiply)
- `distill_train_009` (val_loss strict <)

Each binds a different threshold semantic; this one is
inclusive `>=`.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
`cargo llvm-cov report --summary-only | awk` extraction is
FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

19 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-009 (3-format
coverage: GGUF + APR + SafeTensors) per
`contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_three_format_coverage(gguf_exit, apr_exit, safetensors_exit)`
returns Pass iff all three exit codes equal 0.

Pinned constant: `AC_QA_009_REQUIRED_FORMAT_COUNT = 3`.
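
The conjunctive rule is a single AND-fold over the three exit codes. A minimal sketch under the signature above; the body is illustrative, not the merged code:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

// Pass iff all three per-format runs exited 0 (conjunctive composition:
// only the origin cell (0, 0, 0) of {0, 1}^3 passes).
fn verdict_from_three_format_coverage(gguf: i32, apr: i32, safetensors: i32) -> Verdict {
    if gguf == 0 && apr == 0 && safetensors == 0 {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

fn main() {
    assert_eq!(verdict_from_three_format_coverage(0, 0, 0), Verdict::Pass);   // origin only
    assert_eq!(verdict_from_three_format_coverage(0, 0, 1), Verdict::Fail);   // SafeTensors regression
    assert_eq!(verdict_from_three_format_coverage(101, 0, 0), Verdict::Fail); // panic-style exit
}
```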

## Why conjunctive composition (all three or fail)

Per CLAUDE.md "Debugging: Use apr Tools First":
"All tools support GGUF, APR, and SafeTensors formats. If a tool
says 'format not supported', that's a BUG."

Each format is independently shipped as a contract, so
silently dropping one is a regression class. Conjunctive `&&`
catches:
- Single-format regression (most common: SafeTensors writer
  panics on a new tensor variant).
- Multi-format collapse (rare: a refactor breaks all three).

The 8-cell composite matrix in Section 6 proves only `(0, 0, 0)`
passes — every other cell of `{0, 1}³` fails.

## Why pin format count at 3

Catches a regression where:
- A 4th format is silently added (e.g., ONNX) without bumping
  the contract — would not be exercised by this verdict.
- One of the three is silently dropped from the test harness.

The `provenance_required_format_count_is_three` test pins this.

## Five-Whys

1. Why bind QA-009 now? — Format coverage is a CLAUDE.md-level
   invariant; without a verdict pin, a SafeTensors regression
   (e.g., new tensor variant the writer doesn't handle) ships
   silently.
2. Why a 3-tuple of i32? — Algorithm-level pin; the
   per-format `apr inspect` invocation harness is FULL_DISCHARGE.
3. Why conjunctive? — Each format is independent; AND-fold
   catches any single regression.
4. Why 8-cell composite matrix? — Proves only origin (0, 0, 0)
   passes; mutation testing would catch any subset-Pass mutation.
5. Why 15 tests across 7 sections? — Provenance pin (×1), pass
   band (×1: only origin), single-format fail (×3 — one per
   format), multi-format fail (×4 incl. all-three-fail),
   non-1 exit fail (×3: panic, negative, i32::MAX), 8-cell
   composite matrix, symmetry under permutation + exhaustive
   4×4×4 sweep at canonical exit codes.

## Cross-reference

The "all-must-succeed conjunctive" pattern is conceptually
similar to `bpe_inv_002` (4 special tokens distinct + in range),
but scaled to 3-tuple exit codes instead of 4-tuple ID checks.
Both rely on conjunctive composition of independent gates.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
3-format `apr inspect` invocation harness is FULL_DISCHARGE
work for the qa-pipeline implementation PR.

## Tests

15 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-003 (JSON output
is valid — `jq .` exits 0) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_json_validity(output_bytes, jq_exit_code)` returns
Pass iff:

1. `output_bytes` is non-empty (refuse vacuous Pass)
2. `jq_exit_code == 0` (jq accepted the input)
3. First non-whitespace byte is `{` or `[`
4. Last non-whitespace byte is `}` or `]`
5. Brace `{`/`}` counts balance AND bracket `[`/`]` counts balance

Five conjunctive checks composing JSON-shape sanity with jq's
parser verdict.
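
A minimal sketch of the five conjunctive checks, assuming the `(&[u8], i32)` signature named later in this message; the byte-level counting mirrors the documented limitation below (braces inside string literals are counted too), and the body is illustrative rather than the merged code:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

// Pass iff output is non-empty, jq exited 0, the first/last non-whitespace
// bytes look like JSON, and byte-level brace/bracket counts both balance.
fn verdict_from_json_validity(output: &[u8], jq_exit: i32) -> Verdict {
    let first = output.iter().copied().find(|b| !b.is_ascii_whitespace());
    let last = output.iter().rev().copied().find(|b| !b.is_ascii_whitespace());
    let (first, last) = match (first, last) {
        (Some(f), Some(l)) => (f, l),
        _ => return Verdict::Fail, // empty or whitespace-only output
    };
    if jq_exit != 0 {
        return Verdict::Fail; // jq rejected the input
    }
    if !matches!(first, b'{' | b'[') || !matches!(last, b'}' | b']') {
        return Verdict::Fail; // leading/trailing non-JSON text
    }
    let (mut braces, mut brackets) = (0i64, 0i64);
    for &b in output {
        match b {
            b'{' => braces += 1,
            b'}' => braces -= 1,
            b'[' => brackets += 1,
            b']' => brackets -= 1,
            _ => {} // byte-level: braces inside string literals are counted too
        }
    }
    if braces == 0 && brackets == 0 { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    assert_eq!(verdict_from_json_validity(br#"{"ok": true}"#, 0), Verdict::Pass);
    assert_eq!(verdict_from_json_validity(b"not json", 0), Verdict::Fail);       // stubbed jq
    assert_eq!(verdict_from_json_validity(br#"{"k": [1, 2"#, 1), Verdict::Fail); // truncated
}
```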

## Why redundant structural checks alongside jq?

Two failure modes the verdict must catch:

1. **jq missing or stubbed**: a CI environment where `jq` is
   absent and the harness silently substitutes `true` (exit 0).
   Without the structural checks, `(any-text, 0)` would Pass.
2. **jq spurious 0**: a jq regression returning 0 on whitespace
   or non-JSON input. The structural checks (start with `{`/`[`,
   end with `}`/`]`, balanced) catch all three classes:
   - bare strings
   - leading text + JSON
   - trailing text after JSON

Both jq AND structural must agree to Pass.

## Documented limitation: braces inside strings

The brace-balance check is byte-level and counts `{`/`}` even
inside string literals like `"{not real}"`. This means valid JSON
such as `{"k":"} extra"}` counts +1-1-1 = -1 and Fails the
balance check even though `jq` accepts it.

This is intentional for PARTIAL_ALGORITHM_LEVEL: real
balanced-with-string-awareness parsing requires a JSON tokenizer,
which is FULL_DISCHARGE territory. The verdict's role is to
catch the common regression class (truncated output, missing
closing brace, double-opening) which the byte-counting check
handles correctly.

## Five-Whys

1. Why bind QA-003 now? — JSON output validity is the contract
   for every `--json` flag (cf. QA-007). A regression that emits
   malformed JSON breaks every downstream consumer.
2. Why a (&[u8], i32) pair? — Algorithm-level pin; subprocess
   invocation + `jq .` pipe is FULL_DISCHARGE.
3. Why structural checks alongside jq? — Two failure classes
   (missing jq, jq spurious-0) require defense in depth.
4. Why byte-level brace count (not real JSON parser)? —
   PARTIAL_ALGORITHM_LEVEL pins the decision rule; full JSON
   parser is FULL_DISCHARGE.
5. Why 22 tests across 7 sections? — Pass band (×5: object,
   array, realistic, nested, with whitespace), jq-rejected fail
   (×2), empty output fail (×2 incl. whitespace-only), non-JSON
   shape fail (×3), bracket imbalance fail (×4 incl. realistic
   truncation), edge cases (×3: minimal `{}`/`[]`, negative jq
   exit), realistic apr scenarios (×3: qa results, trace stats,
   text-instead-of-JSON regression).

## Cross-reference

Companion to QA-007 (`--json` flag changes output): QA-007
ensures `--json` does *something*; QA-003 ensures what it does
is well-formed JSON. Both gates must Pass independently.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
`apr inspect --json | jq .` pipeline is FULL_DISCHARGE work for
the qa-pipeline implementation PR.

## Tests

22 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-008 (no phantom
subcommands — advertised commands have real implementations) per
`contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_phantom_scan(commands_advertised, phantom_count)`
returns Pass iff:

1. `commands_advertised > 0` (refuse vacuous Pass on empty CLI)
2. `phantom_count == 0` (zero-tolerance phantom commands)
3. `phantom_count <= commands_advertised` (counter sanity)
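
The three guards compose as follows. A minimal sketch under the signature above; the body is an assumption about the shape, not the merged implementation:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

// Pass iff at least one command is advertised, zero are phantoms,
// and the phantom count is a sane subset of the advertised total.
fn verdict_from_phantom_scan(commands_advertised: u64, phantom_count: u64) -> Verdict {
    if commands_advertised == 0 {
        return Verdict::Fail; // empty CLI: refuse vacuous Pass
    }
    if phantom_count > commands_advertised {
        return Verdict::Fail; // counter corruption / partition violation
    }
    if phantom_count == 0 { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    assert_eq!(verdict_from_phantom_scan(58, 0), Verdict::Pass);
    assert_eq!(verdict_from_phantom_scan(58, 1), Verdict::Fail);  // one forgotten dispatch arm
    assert_eq!(verdict_from_phantom_scan(0, 0), Verdict::Fail);   // nothing advertised
    assert_eq!(verdict_from_phantom_scan(58, 59), Verdict::Fail); // impossible count
}
```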

## Why this is distinct from QA-001

QA-001 pins the *registered count* at 58 (catches add/drop drift in
the registry vs CLAUDE.md claim).
QA-008 pins the *registry-vs-implementation gap* (catches commands
that are advertised in `apr --help` but have no working
`--help`-responding dispatch).

A new subcommand added to the registry but with the dispatch arm
forgotten passes QA-001 (count stays 58 if it replaced something)
but fails QA-008 (advertised=58 but phantom=1).

## Why zero-tolerance

A phantom subcommand ships a "command not found" experience to
users — the worst kind of CLI regression because:
- It's invisible in unit tests (the registry shows the command)
- It only surfaces in shell scripts / CI / user smoke runs
- It fails late and fails confusingly

## Five-Whys

1. Why bind QA-008 now? — Phantom subcommands are the most
   user-facing "command not found" regression class. Without a
   verdict pin, CI passes (registry has 58 commands) but users
   hit "unimplemented" at runtime.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual
   per-subcommand `apr <c> --help` invocation harness is
   FULL_DISCHARGE.
3. Why distinct from QA-001? — Different regression classes:
   QA-001 catches count drift, QA-008 catches dispatch gaps.
   Both gates must Pass independently.
4. Why zero-tolerance? — One phantom is enough to break a user
   workflow; no tolerance band makes sense.
5. Why 16 tests across 7 sections? — Pass band (×3: 58, 1, 1M),
   zero-tolerance fail (×4 incl. handful, half, all), empty
   advertised (×2), partition violations (×2), boundary sweep
   (×7 incl. partition+1), zero-tolerance property (×5 sizes),
   realistic apr scenarios (×3 incl. minimal-features build).

## Cross-reference

Same shape and discipline as INV-DATA-001/002/007, INV-BPE-007,
QA-004 — the "zero-tolerance match count + total + partition"
pattern is now applied to 6+ verdicts. Each gates a different
domain (license, PII, UTF-8, UNK, NaN, phantom) but with
identical algorithm-level reduction.

Combined with QA-001 (registered count pin) and the existing
3-surface drift trio (test harness, contracts/apr-cli-commands-v1.yaml,
CLAUDE.md), QA-008 forms the 5th enforcement layer for CLI
command integrity.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
per-subcommand `apr <c> --help` exit-code-collecting harness is
FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

16 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-007 (--json flag
changes output; not a no-op) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_json_flag_diff(default_output, json_output)` returns
Pass iff:

1. Both byte slices are non-empty (refuse vacuous Pass)
2. `default_output != json_output` (inverse-equality)
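
The inverse-equality rule is the positive-equality shape with the comparison flipped. A minimal sketch under the signature above; illustrative, not the merged code:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

// Inverse-equality: Pass iff both outputs are non-empty AND they differ.
fn verdict_from_json_flag_diff(default_output: &[u8], json_output: &[u8]) -> Verdict {
    if default_output.is_empty() || json_output.is_empty() {
        return Verdict::Fail; // an empty output means the subprocess silently failed
    }
    if default_output != json_output { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    assert_eq!(verdict_from_json_flag_diff(b"name: m\n", br#"{"name":"m"}"#), Verdict::Pass);
    assert_eq!(verdict_from_json_flag_diff(b"same", b"same"), Verdict::Fail); // --json no-op
    assert_eq!(verdict_from_json_flag_diff(b"", br#"{}"#), Verdict::Fail);
}
```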

## Why inverse-equality (novel verdict shape)

Most verdicts in the algorithm-binding cycle are positive-equality
(byte-identical match) or zero-tolerance-count (no failures). This
is the first inverse-equality verdict: the contract requires the
two outputs to DIFFER. A regression where `--json` is parsed as a
clap flag but doesn't route to a different formatter would produce
identical bytes: a positive-equality verdict would Pass them, but
this inverse-equality verdict Fails them.

## Why refuse empty inputs

Either output being empty means the apr subprocess silently
failed. Refusing both individually catches the silent-failure
class.

## Five-Whys

1. Why bind QA-007 now? — `--json` no-op is a class of regression
   that ships with green CI: clap parses the flag, but the
   formatter dispatch silently uses the text writer. Without a
   verdict pin, downstream JSON parsers break in production.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual
   subprocess invocations producing the two outputs are
   FULL_DISCHARGE.
3. Why inverse-equality (not metadata sniffing)? — Contract test
   wording: `diff <(default) <(--json) | grep -q .` matches
   character-for-character. JSON-validity check is FALSIFY-QA-003's
   territory; -007 is purely about non-equality.
4. Why include symmetry tests? — Verdict must be symmetric in
   (default, json) to avoid order-dependent bugs.
5. Why 18 tests across 7 sections? — Pass band (×4: realistic
   inspect, one-byte diff, length diff, qa outputs), no-op fail
   (×2 incl. long identical), empty-input fail (×3), single-byte
   edge (×2: appended + prepended), symmetry property (×3 incl.
   pass + fail-identical + fail-one-empty), inverse-equality
   property (×4 lengths), realistic 3-subcommand contrast (×3:
   inspect, diff, validate-unimplemented regression).

## Cross-reference

Inverse-equality is conceptually related to `bpe_inv_006`'s
encode determinism (which uses positive equality). Together they
cover both directions: "two runs must be identical" (006) and
"two runs must differ" (007). Different contract ID, different
direction, same primitive (byte-equality).

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
double-subprocess invocation (text + --json) is FULL_DISCHARGE
work for the qa-pipeline implementation PR.

## Tests

18 unit tests, all green.
…EVEL — closes 7/7 INV-BPE sweep

Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-005 (NFC
idempotence + composed/decomposed equivalence) per
`contracts/tokenizer-bpe-v1.yaml`.

## What this binds

`verdict_from_nfc_idempotence(composed_nfc, decomposed_nfc, double_nfc)`
returns Pass iff:

1. All three byte slices are non-empty (refuse vacuous Pass)
2. `composed_nfc == decomposed_nfc` (composable equivalence)
3. `composed_nfc == double_nfc` (NFC idempotence)
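
The three-way rule is two byte-equalities behind a non-empty guard. A minimal sketch under the signature above; the body and the fixtures in `main` are illustrative, not the merged code:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

// Pass iff all three forms are non-empty, composed == decomposed
// (canonical equivalence), and composed == double (NFC idempotence).
fn verdict_from_nfc_idempotence(composed: &[u8], decomposed: &[u8], double: &[u8]) -> Verdict {
    if composed.is_empty() || decomposed.is_empty() || double.is_empty() {
        return Verdict::Fail; // refuse vacuous Pass
    }
    if composed == decomposed && composed == double {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}

fn main() {
    let cafe_nfc = "café".as_bytes(); // U+00E9: the composed NFC form
    assert_eq!(verdict_from_nfc_idempotence(cafe_nfc, cafe_nfc, cafe_nfc), Verdict::Pass);
    // nfc() left the decomposed form (e + U+0301) untouched: missing-NFC regression
    let decomposed = "cafe\u{0301}".as_bytes();
    assert_eq!(verdict_from_nfc_idempotence(cafe_nfc, decomposed, cafe_nfc), Verdict::Fail);
    assert_eq!(verdict_from_nfc_idempotence(b"", b"", b""), Verdict::Fail);
}
```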

## ✅ Closes the 7/7 INV-BPE sweep on `tokenizer-bpe-v1`
INV-BPE-001 (vocab range + paired model), 002 (special tokens),
003 (round-trip / FALSIFY-SHIP-012 dual), 004 (merge algebra),
005 (NFC idempotence), 006 (encode determinism), 007 (byte
coverage / no UNK) all bound at PARTIAL_ALGORITHM_LEVEL.

## Why three-way byte-equality

The contract pins TWO properties on the same NFC implementation:
- Composable equivalence: nfc("café"-composed) == nfc("café"-decomposed)
- Idempotence: nfc(nfc(x)) == nfc(x)

A regression in either dimension breaks MODEL-2's tokenizer
contract. Composing both into one verdict means a single PARTIAL
covers both gate paths; the test suite exercises each
independently:
- `fail_cafe_composed_vs_decomposed` catches missing-NFC.
- `fail_double_nfc_differs` catches non-idempotent NFC.
- `fail_hangul_precomposed_vs_decomposed` catches NFC missing on
  multi-codepoint canonical equivalents.

## Why byte-equality (not str-equality)

The Rust `&[u8]` shape lets the verdict compare:
- Pre-NFC byte sequences (which may not be valid UTF-8)
- Post-NFC byte sequences from any NFC implementation (ICU, unic,
  rust-stdlib, the tokenizer's own)
- Hex/UTF-8 fixtures in tests

`==` on `&[u8]` is character-for-character byte-identical; matches
the contract falsifier's "byte-equals" wording.

## Why refuse empty inputs

Vacuous Pass on `(&[], &[], &[])` would let an early-return
short-circuit silently pass the gate. Refusing all three
empty-input combinations explicitly catches it.

## Five-Whys

1. Why bind INV-BPE-005 now? — Closes the 7/7 INV-BPE sweep on
   `tokenizer-bpe-v1`. NFC idempotence is required before BPE
   encoding; without it, "café" and "café" tokenize differently
   even though they're canonically equivalent — corrupts MODEL-2
   tokenizer contract.
2. Why a 3-tuple of byte slices, not the actual NFC implementation? —
   Algorithm-level pin; the actual NFC + tokenizer code that
   produces the three forms is FULL_DISCHARGE work for the
   corpus-tokenize PR.
3. Why three forms instead of two? — Composable equivalence and
   idempotence are independent properties; a tokenizer can pass
   one and fail the other. Three-way input lets the verdict
   exercise both with one call.
4. Why byte-level rather than codepoint-level? — Catches
   subtler regressions where two byte sequences decode to the
   same codepoints under different NFC normalizations
   (idempotence implies bytes, not just codepoints).
5. Why 20 tests across 7 sections? — Pass band (×5: ASCII,
   composed-café, CJK, emoji, math), composed-vs-decomposed
   fail (×3: café, distinct, single byte), idempotence fail
   (×2), combined violations (×1), caller errors (×4: all empty
   + per-input empty), symmetry/three-way distinct (×2), and
   realistic Hangul precomposed/decomposed pair + long mixed
   text (×3).

## Cross-reference (informational)

This verdict's logic is structurally similar to `data_inv_007`
(UTF-8 + NFC round-trip) — both gate NFC correctness — but at
different scope:
- `data_inv_007` counts NFC failures across a corpus.
- `bpe_inv_005` proves NFC idempotence on a single fixture
  triple.

The two together provide both unit-level (per-text) and
corpus-level (millions-of-files) NFC pinning.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual NFC
implementation that produces the three byte sequences is
FULL_DISCHARGE work for the corpus-tokenize implementation PR.

## Tests

20 unit tests, all green. **Closes the 7/7 INV-BPE gate sweep**
on `tokenizer-bpe-v1`.
…EVEL (#1159)

Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 (round-trip
byte-equality on 10K held-out docs) per
`contracts/tokenizer-bpe-v1.yaml`. Also FALSIFY-SHIP-012 directly.

## What this binds

`verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures)`
returns Pass iff:

1. `docs_scanned >= 10_000` (contract statistical floor)
2. `roundtrip_failures == 0` (zero-tolerance — non-injective
   tokenization corrupts loss target)
3. `roundtrip_failures <= docs_scanned` (counter sanity)

Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.
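
The floor, zero-tolerance, and partition checks compose as follows. A minimal sketch under the names above; illustrative, not the merged implementation:

```rust
#[derive(Debug, PartialEq)]
enum Verdict { Pass, Fail }

const AC_BPE_INV_003_REQUIRED_DOCS: u64 = 10_000;

// Pass iff the scan covered at least 10K docs, had zero round-trip
// failures, and the failure count cannot exceed the scanned total.
fn verdict_from_roundtrip_scan(docs_scanned: u64, roundtrip_failures: u64) -> Verdict {
    if docs_scanned < AC_BPE_INV_003_REQUIRED_DOCS {
        return Verdict::Fail; // under-powered sample misses rare-Unicode classes
    }
    if roundtrip_failures > docs_scanned {
        return Verdict::Fail; // counter corruption / partition violation
    }
    if roundtrip_failures == 0 { Verdict::Pass } else { Verdict::Fail }
}

fn main() {
    assert_eq!(verdict_from_roundtrip_scan(10_000, 0), Verdict::Pass);    // exact floor
    assert_eq!(verdict_from_roundtrip_scan(1_000_000, 1), Verdict::Fail); // zero-tolerance
    assert_eq!(verdict_from_roundtrip_scan(9_999, 0), Verdict::Fail);     // below floor
}
```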

## Why zero-tolerance

`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not
injective on the input domain. Even one failure means MODEL-2's
training loss is computed against tokens that don't reconstruct the
original text — silently corrupting the cross-entropy gradient on
that document. The contract falsifier ("Any non-zero diff bytes
fails") admits no tolerance band.

## Why 10K floor

The contract specifies a 10K-doc held-out corpus. Round-trip
failures cluster around rare Unicode patterns (emoji ZWJ sequences,
RTL combining marks, control sequences) that occur at frequencies
of ~1-in-10K to 1-in-100K. A scan over 1K docs would miss those
classes entirely; a scan over 100K would over-tax every smoke run.
Pinning 10K matches the contract's stat-power requirement.

## Why partition-violation check

`roundtrip_failures > docs_scanned` indicates counter corruption —
the failure set cannot exceed the total scanned. Catches a counter-
rollover bug or double-count regression.

## Five-Whys

1. Why bind INV-BPE-003 now? — Round-trip injectivity is the
   foundation of MODEL-2's training-target correctness; without a
   verdict pin, a regression in NFC handling or BPE merge-table
   loading silently corrupts every gradient.
2. Why a (u64, u64) pair, not the actual scanner? —
   Algorithm-level pin; the streaming round-trip scanner is
   FULL_DISCHARGE work for the corpus-tokenize PR.
3. Why pin 10K floor? — Statistical-power match to contract;
   smaller samples miss rare-Unicode classes.
4. Why partition-violation check? — Counter corruption mustn't
   silently pass.
5. Why 17 tests across 7 sections? — Mutation survey: provenance
   pin (×1), pass band (×4: floor, +1, CSN-Python, huge), zero-
   tolerance fail (×4 incl. one-in-million), sample-size fail (×3),
   partition violation (×2), boundary sweep (×8 each dimension), and
   zero-tolerance property (×4 sizes).

## Cross-reference

This verdict ALSO discharges FALSIFY-SHIP-012 (round-trip gate from
SHIP-TWO-001 spec §15.4) — the contract identifies the two as the
same gate. Two contract IDs share one PARTIAL discharge.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual streaming
round-trip scanner that produces `(docs_scanned, roundtrip_failures)`
is FULL_DISCHARGE work for the corpus-tokenize implementation PR.

## Tests

17 unit tests, all green.
@noahgift noahgift force-pushed the chore/format-discharge-squash-v3 branch from 8089c51 to f595bd3 on May 12, 2026 12:09
@noahgift noahgift merged commit ba9ab03 into main May 12, 2026
10 checks passed
@noahgift noahgift deleted the chore/format-discharge-squash-v3 branch May 12, 2026 12:24