chore(format): squash 135 PARTIAL-discharge feat commits into one PR (one CI run instead of 135) #1637
Merged
…GORITHM_LEVEL — closes 9/9 sweep
Bundles 9 algorithm-level decision rules + a stand-alone scalar
softmax reference (max-subtraction trick, returns SoftmaxError for
empty/non-finite input):
- SM-001 normalization: |sum - 1| < sqrt(n)*1e-6
- SM-002 positivity: all outputs > 0 (catches underflow-to-zero)
- SM-003 order preservation: argmax(softmax(x)) == argmax(x)
- SM-004 SIMD vs scalar: |simd - scalar| < 8 ULPs
- SM-005 single-element softmax([x]) == [1.0]
- SM-006 constant input -> uniform output [1/n; n]
- SM-007 input validation: empty input -> Err; NaN/Inf -> Err
- SM-008 length preserved: len(out) == len(in)
- SM-009 frame condition: input buffer byte-identical after call
27 tests cover the canonical softmax invariants, the [1000, -1000]
underflow band catching SM-002, and the SM-007 input-validation
error boundaries.
Contract: contracts/softmax-kernel-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
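The max-subtraction reference described above can be sketched in Rust as follows. This is a minimal sketch: the `SoftmaxError` variants and the exact signature are illustrative, not the crate's actual API.

```rust
/// Error type for invalid softmax input (illustrative names).
#[derive(Debug, PartialEq)]
enum SoftmaxError {
    EmptyInput,
    NonFiniteInput,
}

/// Scalar softmax with the max-subtraction trick: subtracting the
/// maximum before exponentiating keeps every exp() argument <= 0,
/// so the sum cannot overflow.
fn softmax(x: &[f32]) -> Result<Vec<f32>, SoftmaxError> {
    if x.is_empty() {
        return Err(SoftmaxError::EmptyInput); // SM-007
    }
    if x.iter().any(|v| !v.is_finite()) {
        return Err(SoftmaxError::NonFiniteInput); // SM-007
    }
    let m = x.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = x.iter().map(|&v| (v - m).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Ok(exps.iter().map(|e| e / sum).collect()) // SM-008: same length
}
```

Note that an extreme input such as `[1000.0, -1000.0]` still underflows `exp(-2000)` to exactly 0.0 in f32, which is precisely the band the SM-002 positivity rule is designed to catch.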
…LGORITHM_LEVEL — closes 10/10 sweep
Bundles 10 algorithm-level decision rules + a stand-alone reference
online-softmax scan (OnlineState, online_update, online_scan,
online_softmax, standard_softmax) so the algorithm-level rule is
testable offline against drift in any SIMD/GPU implementation:
- OSM-001 |online - standard| < 1e-5 element-wise
- OSM-002 |Σ y_i - 1.0| < 1e-6
- OSM-003 every output strictly > 0 (catches underflow-to-zero)
- OSM-004 shift invariance: softmax(x + c) ≈ softmax(x)
- OSM-005 decoder kv_lens {1, 6, 64, 448, 1500} all valid
(sum tolerance scales by sqrt(n) for naive-sum FP error)
- OSM-006 single-element softmax([x]) == [1.0]
- OSM-007 running max invariant: m_i = max(x_1..x_i)
- OSM-008 loop counter terminates at n
- OSM-009 normalizer recompute: d_i = Σ exp(x_j - m_i)
- OSM-010 partial sum invariant (same as OSM-009 in reference scan)
38 tests cover canonical [1,...,1] uniform, the [100, -100, 50]
extreme range, the underflow-to-zero positivity regression, and the
Whisper-canonical {1, 6, 64, 448, 1500} kv_lens.
Contract: contracts/online-softmax-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
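The one-pass rescaling pinned by OSM-007 (running max) and OSM-009 (normalizer recompute) can be sketched as below. The `OnlineState`/`online_update`/`online_softmax` names mirror the commit message, but the bodies are an illustrative sketch, not the reference implementation itself.

```rust
/// Running state for a one-pass (online) softmax scan.
struct OnlineState {
    m: f32, // running max of x_1..x_i          (OSM-007)
    d: f32, // running normalizer sum exp(x_j - m) (OSM-009)
}

/// Fold one element into the state, rescaling the partial
/// normalizer whenever the running max increases.
fn online_update(s: OnlineState, x: f32) -> OnlineState {
    let m_new = s.m.max(x);
    OnlineState {
        m: m_new,
        // exp(s.m - m_new) rescales the old partial sum to the new max.
        d: s.d * (s.m - m_new).exp() + (x - m_new).exp(),
    }
}

/// One scan pass to get (m_n, d_n), then a second pass to emit
/// y_i = exp(x_i - m_n) / d_n.
fn online_softmax(x: &[f32]) -> Vec<f32> {
    let init = OnlineState { m: f32::NEG_INFINITY, d: 0.0 };
    let s = x.iter().fold(init, |acc, &v| online_update(acc, v));
    x.iter().map(|&v| (v - s.m).exp() / s.d).collect()
}
```

OSM-001 then reduces to checking this against a two-pass standard softmax element-wise.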
…RITHM_LEVEL — closes 11/11 sweep
Bundles 11 algorithm-level decision rules + a stand-alone scalar
AdamW reference implementation:
- AW-001 decoupled weight decay: AdamW != L2-coupled Adam when
  wd > 0 AND theta != 0 (vacuous Pass otherwise per contract scope)
- AW-002 second moment v_t >= 0
- AW-003 bias correction: 1/(1 - beta^t) > 1 for beta in (0, 1) and
  t >= 1 (within the f32 non-underflow regime)
- AW-004 update finiteness with finite gradient and eps > 0
- AW-005 SIMD vs scalar: |simd - scalar| < 8 ULPs
- AW-006 zero-gradient boundary: only weight decay modifies theta
  (theta_new == theta_old * (1 - lr * wd) within FP tolerance)
- AW-007 hyperparameter validation: invalid (β1=0, β2=1, ε=0) → Err
- AW-008 frame condition: gradient buffer byte-unchanged after step
- AW-009 loop invariant: v_t >= 0 across N steps
- AW-010 first moment formula: m = β1·old_m + (1-β1)·g (catches the
  (1-β1)·old_m + β1·g coefficient swap)
- AW-011 loop variant: training terminates at exactly max_steps
Reference impl AdamWState + AdamWHyperparams + adamw_step()
provides the canonical scalar comparison target for AW-005 SIMD
parity testing and AW-007 hyperparam validation.
36 tests cover the canonical Loshchilov & Hutter 2019 formulation,
the AW-005 ULP distance computation across same-sign and
sign-crossing pairs, and the (β1=0, β2=1, ε=0) error boundaries.
Contract: contracts/adamw-kernel-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
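The decoupled-decay formulation that AW-001, AW-006, and AW-010 pin can be sketched as one scalar step (a minimal sketch of the Loshchilov & Hutter 2019 update; field and parameter names are illustrative, not the crate's `AdamWState`/`adamw_step` API, and hyperparameter validation is omitted):

```rust
/// Per-parameter optimizer state for the sketch.
struct AdamWState {
    m: f32, // first moment
    v: f32, // second moment (AW-002: stays >= 0)
    t: u32, // step counter
}

/// One scalar AdamW step. Decoupled weight decay (AW-001) is applied
/// directly to theta, NOT folded into the gradient as in L2-coupled Adam.
fn adamw_step(
    theta: f32, g: f32, s: &mut AdamWState,
    lr: f32, beta1: f32, beta2: f32, eps: f32, wd: f32,
) -> f32 {
    s.t += 1;
    s.m = beta1 * s.m + (1.0 - beta1) * g;     // AW-010: coefficient order
    s.v = beta2 * s.v + (1.0 - beta2) * g * g; // AW-002: non-negative
    let m_hat = s.m / (1.0 - beta1.powi(s.t as i32)); // AW-003
    let v_hat = s.v / (1.0 - beta2.powi(s.t as i32));
    // With g == 0, m_hat/v_hat are 0, so only the wd*theta term moves
    // theta (AW-006: theta_new == theta * (1 - lr*wd)).
    theta - lr * (m_hat / (v_hat.sqrt() + eps) + wd * theta)
}
```

Swapping the AW-010 coefficients to `(1-β1)·old_m + β1·g` would make the zero-gradient boundary and bias-correction tests diverge immediately, which is exactly the mutation class the contract targets.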
…AL_ALGORITHM_LEVEL — closes 15/15 sweep
Bundles 15 algorithm-level decision rules over markdown / YAML /
SVG / media integrity:
- DOC-001 heading hierarchy: no H1->H3 skip
- DOC-002 heading hierarchy: at most one H1
- DOC-003 link safety: no javascript: schemes (case-insensitive)
- DOC-004 code fence language: bare ``` is a violation
- DOC-005 table column parity: every row matches header
- DOC-006 SVG safety: no <script> tags (case-insensitive)
- DOC-007 README drift: actual == generated byte-exact
- DOC-008 SVG must declare viewBox
- DOC-009 required sections present (e.g. License)
- DOC-010 YAML balanced delimiters
- DOC-011 YAML no duplicate top-level keys
- DOC-012 magic-bytes vs extension agreement (PNG/JPEG/GIF/WebP
  detection)
- DOC-013 image dimensions <= 8192 x 8192, both non-zero
- DOC-014 animation frame count in (0, 1000]
- DOC-015 animation must not be infinite-loop (loop_count > 0)
49 tests cover security regression classes (XSS via javascript:
URLs, SVG <script> injection, PNG/JPEG extension spoofing) and the
canonical structural drift cases (heading skips, duplicate H1s,
missing License, unclosed YAML braces).
Contract: contracts/document-integrity-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…12 PARTIAL_ALGORITHM_LEVEL — closes 12/12 sweep
Bundles 12 algorithm-level decision rules for the architecture
role-set requirements, plus a self-contained reference model
(BASE_ROLES, QK_NORM_ROLES, BIAS_ROLES, ArchConstraints,
required_roles, from_architecture) that mirrors
realizar/src/arch_requirements.rs:
- ARCH-001 YAML/Rust set equality
- ARCH-002 base roles always present (BASE_ROLES subset)
- ARCH-003 the 4 (qk_norm, bias) cells produce 4 distinct non-empty sets
- ARCH-004 cardinalities {9, 11, 12, 14}
- ARCH-005 Qwen3: has QK norm, NOT bias (GH-279 root cause)
- ARCH-006 Qwen2: has bias, NOT QK norm (catches Qwen2/3 swap)
- ARCH-007 LLaMA/Mistral are base-only (9 roles, no extensions)
- ARCH-008 missing required role → reject (catches GH-279
silent-zero-weight regression)
- ARCH-009 optional roles do NOT trigger rejection
- ARCH-010 unknown architecture → base fallback
- ARCH-011 alias equivalence (qwen2.5/qwen, phi/phi3, llama/llama3,
gemma/gemma2)
- ARCH-012 monotonicity: enabling features only adds roles
37 tests cover canonical Qwen2 (12 roles), Qwen3 (11 roles), LLaMA
(9 roles) configurations + alias families + the GH-279 silent-zero
regression class.
Contract: contracts/architecture-requirements-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…TIAL_ALGORITHM_LEVEL — closes 15/15 sweep
Bundles 15 algorithm-level decision rules for the GPU decode
profiler / cbtop report invariants:
- GDP-001 brick coverage >= 85% wall time
- GDP-002 graph-enabled coverage < 50% wall time
- GDP-003 sync mode dichotomy: Immediate > 100us, Deferred < 100us
- GDP-004 LmHead count == decoded_tokens (exactly 1 per token)
- GDP-005 brick avg ordering: LmHead > Gate > RmsNorm strict
- GDP-006 token unit separation: profiler.total > decoded_tokens
- GDP-007 coverage upper bound: brick_total_ns <= wall_ns
- GDP-008 reproducibility: per-token CV < 0.15 across runs
- GDP-009 report fidelity: actual_us within 1% of profiler avg
  (catches the B1 per-element vs per-call regression)
- GDP-010 score diversity: not all scores hardcoded to 100
  (catches the B3 compute_brick_score-not-called regression)
- GDP-011 brick count == 11 for Qwen 1.5B
- GDP-012 falsification accounting consistency (catches B10-B12
  hardcoded FalsificationSummary)
- GDP-013 LmHead per-decoded-token > 100us (catches a denominator
  using profiler.total_tokens)
- GDP-014 pmat scores all 0.0 (catches B7-B9 magic constants
  173.9 / 98.1 / 95.2)
- GDP-015 total_points != 137 (catches the B10 magic constant)
53 tests cover boundary sweeps, the canonical Qwen 1.5B
configuration, the exact regression classes named in the contract
(B1, B3, B7-B12), and FP edge cases.
Contract: contracts/gpu-decode-profiling-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…PARTIAL_ALGORITHM_LEVEL — closes 7 unbound
Bundles 7 algorithm-level decision rules for the qwen2-e2e
general-purpose verification gates (the SHIP-* gates 001..024 are
already bound at PARTIAL or DISCHARGED in their respective
modules):
- QW2E-001 verdict_from_param_count:
  |observed - 7,615,616,512| / target <= 0.5%. Pinned to the
  spec-canonical Qwen2.5-Coder-7B count.
- QW2E-002 verdict_from_flops_per_token: strict
  observed_flops == 2*P (canonical roofline formula).
- QW2E-003 verdict_from_memory_ordering: strict
  q4k < q6k < f16 < f32 byte ordering.
- QW2E-004 verdict_from_throughput_roofline:
  observed_tps <= min(bandwidth_tps, compute_tps).
- QW2E-005 verdict_from_obligation_coverage: every obligation has
  evidence; vacuous truth (zero obligations) rejected.
- QW2E-006 verdict_from_block_shape_preservation:
  shape(block(x)) == shape(x).
- QW2E-007 verdict_from_e2e_shape_conservation: tokens [seq] ->
  hidden [seq, hidden] -> logits [seq, vocab] with seq agreement.
37 tests cover boundary sweeps (tolerance bands, FLOPs off-by-one),
strict-equality enforcement (memory ordering, FLOPs formula),
domain violations (NaN, zero rooflines, counter inversions), and
the Qwen canonical [16, 3584]/[16, 152064] shapes.
Contract: contracts/qwen2-e2e-verification-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…..008 PARTIAL_ALGORITHM_LEVEL — closes 6 unbound
Bundles 6 algorithm-level decision rules for the Xet large-file
upload path:
- LFS-003 verdict_from_chunk_sizes: Pass iff every chunk except the
  last is in [8KiB, 128KiB] AND the last is in (0, 128KiB].
- LFS-004 verdict_from_xorb_sizes: Pass iff every xorb <= 64 MiB.
- LFS-005 verdict_from_upload_ordering: Pass iff every ShardSent
  has >= expected_xorb_count XorbAck events strictly preceding it.
- LFS-006 verdict_from_idempotent_replay: was_inserted:false
  treated as success (key Xet idempotency invariant).
- LFS-007 classify_retry/verdict_from_retry_classification:
  429/500/503/504 -> Retry, 401 -> RefreshThenRetryOnce,
  400/403/404 -> Abort, unknown statuses -> Abort (conservative
  default).
- LFS-008 xet_url_hash_encode/verdict_from_url_hash_encoding:
  8-byte LE-reversed hex (NOT naive hex). Pinned to the
  mathematically correct 64-char encoding of [0,1,...,31] (the
  contract's example string is a 62-char typo missing byte 0x10).
LFS-001/002/009/010 are already bound at runtime in
crates/aprender-core/src/hf_hub/xet.rs.
34 tests cover boundary sweeps, the conservative-Abort default for
unknown HTTP status codes, the Xet idempotency invariant, and the
canonical 32-byte hash encoding.
Contract: contracts/apr-publish-hf-large-file-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
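The LFS-008 "8-byte LE-reversed hex" encoding can be sketched as follows, assuming (as the commit's description implies) that the reversal is applied per 8-byte chunk, i.e. each chunk is read as a little-endian u64 and printed as hex. The function name mirrors the commit; the body is a sketch, not the bound implementation.

```rust
/// Encode a 32-byte hash as Xet-style URL hex: split into four
/// 8-byte chunks, byte-reverse each chunk (little-endian read),
/// then hex-encode. Naive per-byte hex would start "000102..."
/// instead of "070605...".
fn xet_url_hash_encode(hash: &[u8; 32]) -> String {
    hash.chunks(8)
        .flat_map(|chunk| chunk.iter().rev())
        .map(|b| format!("{:02x}", b))
        .collect()
}
```

For the canonical input `[0, 1, ..., 31]` this yields a full 64-character string in which byte `0x10` appears (inside the third chunk), which is how the contract's 62-character example string is caught as a typo.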
…_ALGORITHM_LEVEL — closes 4/4 sweep
Bundles 4 algorithm-level decision rules for README-vs-repo-state
claim consistency:
- README-001 verdict_from_crate_count: Pass iff
  claimed == observed AND both non-zero.
- README-002 verdict_from_contract_count: Pass iff
  claimed == observed AND both non-zero.
- README-003 verdict_from_cli_command_count: Pass iff
  claimed == observed AND observed >= 30 (guards against the
  empty-help regression — apr is documented as a 58-subcommand
  CLI).
- README-004 verdict_from_cookbook_link_presence: Pass iff
  readme_text contains the literal substring "apr-cookbook".
21 tests cover off-by-one mutations, zero-input domain errors, the
empty-help regression, link-renamed-without-sync, and the 30-subcmd
provenance pin (AC_README_003_MIN_REASONABLE).
Pins the algorithm-level rule against drift; the runtime-level
falsifier `bash scripts/check_readme_claims.sh` already exists per
the contract's evidence line.
Contract: contracts/readme-claims-v1.yaml.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…LGORITHM_LEVEL — closes 3/3 sweep
Bundles 3 algorithm-level decision rules for chat-template-v1:
- CHAT-001 verdict_from_chatml_render: Pass iff non-empty AND
  balanced <|im_start|>/<|im_end|> counts.
- CHAT-002 verdict_from_llama3_render: Pass iff starts with
  <|begin_of_text|> AND contains <|eot_id|>.
- CHAT-003 verdict_from_missing_template_outcome: Pass iff
  outcome == ExplicitError; all silent-success branches Fail
  (PMAT-237 process-validation rule).
17 tests cover the 3 decision rules with mutation-survey shape:
empty/missing-token fail bands + cross-template-family rejection
(ChatML in the Llama-3 slot) + an exhaustive 4-variant outcome
sweep.
Provides the algorithm-level pin against contract drift; the
runtime-level falsifier `cargo test ... chat_template` already
exists per the contract's evidence line.
Contract: contracts/chat-template-v1.yaml.
Spec: SHIP-TWO-001 §12 (chat template ship gates).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
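The CHAT-001 reduction is small enough to sketch directly (a minimal sketch returning a plain bool; the bound verdict type and function signature in the module may differ):

```rust
/// CHAT-001: a ChatML render passes iff it is non-empty and the
/// <|im_start|>/<|im_end|> marker counts balance. An unbalanced
/// render indicates a truncated or malformed template expansion.
fn chatml_render_ok(rendered: &str) -> bool {
    !rendered.is_empty()
        && rendered.matches("<|im_start|>").count()
            == rendered.matches("<|im_end|>").count()
}
```

A Llama-3 render fed into this check would typically fail CHAT-002's own prefix test rather than this one, which is why the cross-template-family rejection tests exercise both rules.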
…TIAL_ALGORITHM_LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-001 (apr
publish --manifest dry-run accepts valid YAML) AND -007 (HF repo
contains apr+safetensors+gguf 3-format coverage) per
`contracts/apr-cli-publish-extra-v1.yaml`.
## What this binds
PUB-EXTRA-001: `verdict_from_manifest_dry_run_exit(exit_code)` —
Pass iff `exit_code == 0`. Single-input exit-code shape.
PUB-EXTRA-007: `verdict_from_three_format_repo(apr, safetensors,
gguf)` — Pass iff `(apr, safetensors, gguf) == (1, 1, 1)`. Mirrors
QA-009's 3-format coverage but counts files-per-format instead of
exit-codes-per-format.
Pinned constant: `AC_PUB_EXTRA_007_REQUIRED_FORMATS = 3`.
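The PUB-EXTRA-007 reduction above can be sketched as (a minimal sketch; the `Verdict` enum and exact signature are illustrative stand-ins for the module's actual types):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// PUB-EXTRA-007: the published HF repo must contain exactly one
/// file per format. Any other cell of the {0,1,2}^3 cube fails.
fn verdict_from_three_format_repo(apr: u64, safetensors: u64, gguf: u64) -> Verdict {
    if (apr, safetensors, gguf) == (1, 1, 1) {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```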
## apr-cli-publish-extra-v1 sweep status
After this PR: 4 of 10 bound (001 + 002 + 006 + 007).
Remaining: 003 (--extra-file), 004 (no-manifest backward compat),
005/008 (script Python exclusion), 009/010 (preflight validation).
## Why bundle 001 + 007
Both are publish-pipeline integrity verdicts. PUB-EXTRA-001 gates
the manifest-consumption entry point; PUB-EXTRA-007 gates the
multi-format ship completeness. Same family of "publish
correctness" gates. Same precedent as `pub_cli_002_004`,
`qa_002_006`, `cmd_safety` (4), `pub_extra_002_006`.
## Why match QA-009 shape but with file counts
QA-009 (apr-cli-qa-v1) checks exit codes from `apr inspect`
on each of 3 formats — does the parser support each format?
PUB-EXTRA-007 checks file counts per extension in the published
HF repo — does the ship deliver each format?
Two complementary 3-format gates at different pipeline stages:
- QA-009: parser ingest (does apr support format X?)
- PUB-EXTRA-007: ship publish (did we ship format X?)
Both must Pass for full 3-format guarantee.
## Five-Whys
1. Why bind these two now? — Both gate publish-pipeline
correctness; without verdict pins, regressions in either
manifest consumption or 3-format ship completeness corrupt
the published artifact set.
2. Why distinct shapes? — 001 is single exit-code; 007 is
3-counter equality.
3. Why mirror QA-009 shape? — Cross-pipeline 3-format guarantee
needs both sides (ingest + ship); reuse signals consistency.
4. Why exhaustive 27-cell composite matrix in 007? — Mutation-
survey breadth; proves only `(1,1,1)` passes across the full
{0,1,2}³ cube.
5. Why 14 tests across 2 verdict sections? — 5 for 001, 9 for
007 (incl. exhaustive cube).
## Cross-reference
- `qa_009`: 3-format parser ingest (exit codes).
- `pub_extra_001_007`: 3-format ship publish (file counts).
The "three-format coverage" pattern now applied at both pipeline
endpoints.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
14 unit tests, all green.
…TIAL_ALGORITHM_LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-PUB-EXTRA-002 + 006 —
two CRITICAL ship-integrity gates per
`contracts/apr-cli-publish-extra-v1.yaml`.
## What this binds
PUB-EXTRA-002 (sha256-mismatch must abort PRE-upload):
`verdict_from_sha256_mismatch_abort(exit_code, hf_api_call_count)`
returns Pass iff `exit_code != 0 AND hf_api_call_count == 0`.
PUB-EXTRA-006 (post-upload sha256 round-trip):
`verdict_from_post_upload_roundtrip(manifest_sha256, pulled_sha256)`
returns Pass iff both 32 bytes AND byte-identical.
Pinned constant: `AC_PUB_EXTRA_006_SHA256_BYTES = 32`.
## Why this matters for shipping
PUB-EXTRA-002 is described in the contract as **CRITICAL**: "the
whole point of manifest-first publish is the sha256 guard — if it
fails, we ship garbage". Without a verdict pin, a regression that
lets the publish path skip the pre-flight sha256 check ships
corrupted artifacts to HuggingFace.
PUB-EXTRA-006 catches the complementary failure mode: even if
pre-upload validation passed, network transit corruption produces
artifacts whose post-upload sha256 differs from the manifest. Both
gates must Pass for ship integrity.
## Why two gates instead of one
The two gates protect different stages of the upload pipeline:
- 002: pre-upload (catches local-vs-manifest mismatch)
- 006: post-upload (catches transit corruption)
A failure at either stage corrupts the published artifact.
Independent verdicts ensure each stage's regression class is caught
independently.
## Why bundle in one module
Same shape family (both are ship-integrity guards on the publish
path). Same precedent as `pub_cli_002_004`, `qa_002_006`,
`cmd_safety` (4 verdicts), `ops_005_006`. Authoring two modules
with identical structure would be muda.
## Five-Whys
1. Why bind these two now? — Both are CRITICAL ship-integrity
   gates for MODEL-1/2 HuggingFace upload. Without verdict pins,
   regressions ship corrupted artifacts silently.
2. Why distinct verdict shapes? — 002 is conjunctive
   `(exit, network-count)`; 006 is byte-equality on SHA-256.
   Different stages, different shapes.
3. Why pre + post both? — Defense in depth: pre-upload catches
   local sources of corruption (filesystem write race, stale
   manifest); post-upload catches transit sources (HF endpoint
   bug, network corruption).
4. Why pin 32-byte SHA-256? — Catches the MD5-substitution class
   (matching-but-wrong-length test).
5. Why 18 tests across 2 verdict sections? — 7 for 002 (pass:
   pre-upload abort, panic exit, clap exit; fail: silent zero
   exit, uploads-before-abort, zero-exit-with-uploads), 11 for 006
   (provenance, identical match, realistic SHA-256, single-byte
   drift at first/middle/last, length errors, wrong-length-equal,
   full-32-position byte-flip sweep).
## Cross-reference
The sha256-byte-eq verdict is now applied in 4 modules:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)
- `pub_extra_002_006` (post-upload round-trip)
This is a strong reuse signal — promote to
`format::common::sha256_byte_eq` shared util in a follow-up.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
18 unit tests, all green (7 for 002, 11 for 006).
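The digest-pair reduction shared by these modules can be sketched as a single helper (a minimal sketch; the `format::common::sha256_byte_eq` name is the proposed follow-up utility, not code that exists yet, and the `Verdict` enum is an illustrative stand-in):

```rust
const SHA256_BYTES: usize = 32;

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// Two digests pass iff BOTH are exactly 32 bytes AND byte-identical.
/// The length pin catches the MD5-substitution class even when two
/// wrong-length digests happen to match each other.
fn sha256_byte_eq(a: &[u8], b: &[u8]) -> Verdict {
    if a.len() == SHA256_BYTES && b.len() == SHA256_BYTES && a == b {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```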
…LEVEL — closes sweep
Algorithm-level PARTIAL discharge for FALSIFY-CMD-SAFETY-001 + 002
+ 003 + 004 per `contracts/apr-cli-command-safety-v1.yaml`. All
four are bundled in a single module because each verdict has a
small, simple algorithm-level reduction.
## ✅ Closes 4/4 CMD-SAFETY sweep
**Seven contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7)
- `apr-cli-command-safety-v1` (4/4) ← this PR
## CMD-SAFETY-001: read-only no new files
`verdict_from_file_count_delta(before, after)` Pass iff equal.
Catches both new-file creation (mutation) and a file-count decrease
(unauthorized cleanup).
## CMD-SAFETY-002: mutating cmd usage error
`verdict_from_mutating_cmd_usage_error(stderr)` Pass iff stderr
contains `output` OR `required` (case-insensitive). Catches
`apr convert model.apr` silently running without the `-o` flag.
## CMD-SAFETY-003: SIGINT exit code <= 130
`verdict_from_sigint_exit_code(code)` Pass iff `0 <= code <= 130`.
Pinned constant `AC_CMD_SAFETY_003_MAX_EXIT = 130` (= 128 + 2,
SIGINT). Catches SIGKILL (137), which means the TUI didn't release
resources cleanly.
## CMD-SAFETY-004: every command classified
`verdict_from_command_classification(total, classified)` Pass iff
`total > 0 AND classified == total`. Adds another enforcement layer
for the canonical 58-cmd registry.
## Why bundle 4 verdicts in one module
Same precedent as `pub_cli_002_004` (2 verdicts) and `qa_002_006`
(2 verdicts). All 4 cmd-safety verdicts have small, simple
algorithm-level reductions; authoring 4 separate modules with
identical structure would be muda. Contract IDs stay distinct.
## Five-Whys
1. Why bind all 4 now? — Closes the 4/4 cmd-safety sweep; all are
   small verdicts.
2. Why one module? — Same shape family, same precedent as
   `pub_cli_002_004` / `qa_002_006`.
3. Why case-insensitive substring for CMD-SAFETY-002? — Clap error
   messages can be `error:` or `ERROR:` depending on colorization;
   a case-insensitive match is robust.
4. Why pin 130 for SIGINT? — Standard POSIX `128 + 2` for a SIGINT
   exit; > 130 (e.g., 137 SIGKILL) means cleanup handlers didn't
   run.
5. Why 24 tests in 5 sections? — 4 verdicts × ~5 tests each plus
   shared-primitive tests. Each verdict gets pass/fail/edge
   coverage.
## Cross-reference
The bundle pattern (multiple verdicts in one module) is now applied
in 3 modules: `pub_cli_002_004`, `qa_002_006`, `cmd_safety`. Each
bundles 2-4 small verdicts with shared structure.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
24 unit tests, all green.
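The CMD-SAFETY-003 reduction is small enough to sketch in full (a minimal sketch returning a plain bool; the module's actual verdict type may differ):

```rust
/// 128 + 2: the conventional POSIX exit status for death by SIGINT.
const AC_CMD_SAFETY_003_MAX_EXIT: i32 = 130;

/// CMD-SAFETY-003: an interrupted command must exit with a status
/// in [0, 130]. 137 (= 128 + SIGKILL 9) indicates the process had
/// to be killed, i.e. cleanup handlers never ran.
fn sigint_exit_ok(code: i32) -> bool {
    (0..=AC_CMD_SAFETY_003_MAX_EXIT).contains(&code)
}
```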
…LEVEL — closes 7/7 sweep
Algorithm-level PARTIAL discharge for FALSIFY-OPS-002 (GPU memory
returns to baseline after inference error) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_gpu_memory_leak_delta(baseline, post_error,
tolerance)` returns Pass iff:
1. `baseline <= 1 TiB`
2. `post_error <= 1 TiB`
3. `tolerance <= 256 MiB`
4. `|post_error - baseline| <= tolerance`
Two pinned constants:
- `AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES = 1 TiB`
- `AC_OPS_002_MAX_TOLERANCE_BYTES = 256 MiB`
## ✅ Closes the 7/7 OPS sweep on `apr-cli-operations-v1`
**Six contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1)
- `apr-cli-operations-v1` (7/7) ← this PR
## Why bounded-delta (not byte-equality)
Byte-equality would be too strict — the CUDA driver retains some
metadata pages after freeing user allocations (tens of MiB is
typical). A bounded delta with a tolerance band matches operational
reality.
## Why cap tolerance at 256 MiB
The verdict is meaningless if the caller can pass an arbitrary
tolerance — `tolerance = u64::MAX` would silently accept any leak.
Capping tolerance enforces a useful lower bound on detection
sensitivity:
- 256 MiB covers CUDA driver metadata generously.
- Below that ceiling, real leaks (per-request KV cache, 7B model
  copy, etc.) all produce deltas > 256 MiB and fail.
- A contributor cannot "tune the gate to pass" by setting tolerance
  higher.
The `fail_tolerance_8_gib_would_mask_model_leak` test explicitly
catches the adversarial-tolerance class.
## Why the 1 TiB GPU memory cap
Catches counter-corruption / unsigned-overflow regressions in the
memory reading. No consumer or datacenter GPU exceeds 1 TiB (H100
is 80 GiB, B200 is 192 GiB).
## Five-Whys
1. Why bind OPS-002 now? — Closes the 7/7 ops sweep. GPU memory
   leaks on the error path are silent in dev (1 error = 256 MiB
   leak; tolerated single-shot) but catastrophic in production
   (1000 errors/day → 256 GiB/day → OOM crash).
2. Why a 3-tuple of u64? — Algorithm-level pin; the actual
   nvidia-smi / cuMemGetInfo capture harness is FULL_DISCHARGE.
3. Why bounded delta (not byte-equality)? — The CUDA driver
   retains metadata; byte-equality is too strict.
4. Why cap tolerance at 256 MiB? — Prevents adversarial-tolerance
   gate-bypass; ensures meaningful detection sensitivity.
5. Why 21 tests across 7 sections? — Provenance pin (×2), pass
   band (×6: exact, ±32 MiB, exact-tolerance, zero, max-256 MiB,
   real workloads), leak fail (×3 incl. catastrophic 4 GiB),
   memory domain violations (×3), tolerance-too-large (×3), delta
   sweep (×8 probes), realistic GPU scenarios (×3: RTX 4090, A100
   80GB, KV cache leak).
## Cross-reference
The bounded-delta verdict pattern joins the catalog alongside:
- Threshold (`cov_001`, `data_inv_003`, `distill_train_009`)
- Linear-bound (`ops_007` token count)
- Conjunctive exit (`pub_cli_002_004`, `qa_002_006`)
- Byte-equality family (6 modules)
- Sequence-monotonicity (`ops_004`)
- Substring-containment (`pull_dataset_005`, `pub_cli_001`)
Each domain with measurement + tolerance now has a reference
pattern.
## Scope
PARTIAL_ALGORITHM_LEVEL only.
## Tests
21 unit tests, all green.
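The OPS-002 bounded-delta reduction can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
const AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES: u64 = 1 << 40; // 1 TiB
const AC_OPS_002_MAX_TOLERANCE_BYTES: u64 = 256 << 20; // 256 MiB

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-002: post-error GPU memory must return to within `tolerance`
/// of baseline. Both readings are capped at 1 TiB (counter-corruption
/// guard) and tolerance is capped at 256 MiB (adversarial-tolerance
/// guard: u64::MAX would otherwise accept any leak).
fn verdict_from_gpu_memory_leak_delta(baseline: u64, post_error: u64, tolerance: u64) -> Verdict {
    if baseline > AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES
        || post_error > AC_OPS_002_MAX_PLAUSIBLE_GPU_BYTES
        || tolerance > AC_OPS_002_MAX_TOLERANCE_BYTES
    {
        return Verdict::Fail;
    }
    if baseline.abs_diff(post_error) <= tolerance {
        Verdict::Pass
    } else {
        Verdict::Fail
    }
}
```

`abs_diff` avoids the unsigned-subtraction underflow that `post_error - baseline` would hit whenever memory shrank below baseline.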
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-004 (progress
percentage monotonically non-decreasing) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_progress_monotonicity(progress_sequence)` returns
Pass iff:
1. `!progress_sequence.is_empty()`
2. Every value is finite (no NaN, ±∞)
3. Every value is in `[0.0, 100.0]`
4. Every adjacent pair is non-decreasing (`a <= b`)
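The four conditions above can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-004: a progress sequence passes iff it is non-empty, every
/// value is finite and in [0, 100], and adjacent pairs never
/// decrease (non-strict, so throttled repeats like 25 -> 25 pass).
fn verdict_from_progress_monotonicity(seq: &[f64]) -> Verdict {
    if seq.is_empty() {
        return Verdict::Fail; // refuse vacuous Pass
    }
    if seq.iter().any(|v| !v.is_finite() || *v < 0.0 || *v > 100.0) {
        return Verdict::Fail; // domain check before the pairwise scan
    }
    if seq.windows(2).any(|w| w[0] > w[1]) {
        return Verdict::Fail; // a decrease anywhere fails the whole stream
    }
    Verdict::Pass
}
```

Note the NaN check cannot be folded into the `windows(2)` scan alone: `NaN > x` is false for every x, so a NaN-containing sequence would slip through a pure pairwise comparison.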
## Why non-strict ≤ (not strict <)
Throttled progress emission commonly produces same-value adjacent
pairs (e.g., 25.0 → 25.0 → 50.0 due to debouncing). Strict `<`
would reject normal progress streams. Non-strict matches contract
"never decreases" wording.
## Why the (slice → bool) shape (novel)
First sequence-based verdict in the algorithm-binding cycle. Most
prior verdicts take 1-3 scalar inputs; this one takes a `&[f64]`.
The shape is appropriate because monotonicity is inherently a
sequence property — no equivalent reduction to scalar inputs
exists without losing precision.
## Five-Whys
1. Why bind OPS-004 now? — Progress regression confuses users
("download is now at 30% after being at 60%?"). Without a
verdict pin, retry-mid-stream and concurrent-writer-overlap
regressions ship invisibly.
2. Why a `&[f64]` input? — Algorithm-level pin; the actual
progress-event capture is FULL_DISCHARGE. Sequence shape
matches the inherent nature of monotonicity.
3. Why non-strict `≤`? — Contract literal "never decreases";
throttled emissions produce same-value adjacent pairs.
4. Why include domain checks? — NaN/±∞/out-of-range values
could pass a naive `windows(2)` comparison if monotonic
between themselves; defense in depth.
5. Why 22 tests across 7 sections? — Pass band (×5: increasing,
repeats, single, resume, dense), regression fail (×4 incl.
ULP-level), empty fail (×1), NaN/±∞ (×3), out-of-range (×3),
boundary cases (×3 at 0/100/full-range), realistic apr pull
scenarios (×3 incl. retry-resets-progress, concurrent
overlap).
## Cross-reference
The sequence-monotonicity verdict joins the verdict-shape
catalog as a new primitive. Other sequence-based gates that
might use this shape:
- Training loss curves (should be roughly monotonic-decreasing)
- Layer index ordering in trace output
- Token sequence position numbering
If reused, promote to `format::common::is_monotonic_nondecreasing`.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
progress-event capture is FULL_DISCHARGE work for the
operations-pipeline implementation PR.
## Tests
22 unit tests, all green.
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-007 (token count
bounded by 4× input byte length — BPE worst case) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_token_count_bound(input_byte_len, token_count)`
returns Pass iff:
1. `input_byte_len > 0` (refuse vacuous Pass)
2. `token_count <= input_byte_len.checked_mul(4)` (with overflow
   guard)
Pinned constant: `AC_OPS_007_MAX_RATIO = 4`.
## Why pin the 4× ratio
BPE byte-level fallback emits at most ~4 tokens per input byte in
the worst case (UTF-8 4-byte sequences split into 4 byte-fallback
tokens when no merges apply). 4× is the published BPE worst-case
bound. Drift to 8× would mask token-explosion bugs (unbounded
merge expansion); drift to 2× would over-tighten and reject
pathological-but-valid CJK / emoji inputs at the byte-fallback
limit.
## Why the DoS-class regression matters
Token explosion is a denial-of-service vector: a 1KB adversarial
input producing 5MB of tokens causes:
- Unbounded GPU memory growth (KV cache scales with token count)
- Inference timeout / OOM
- API endpoint backpressure
The `fail_unbounded_memory_attack` test pins this with a 10:100k
ratio (10-byte input, 100k tokens — clearly adversarial).
## Why checked_mul
`input_byte_len * 4` overflows u64 at `input_byte_len > u64::MAX/4`.
Without `checked_mul`, the comparison would silently wrap to a
meaningless bound. The `fail_input_times_4_overflow` test catches
this.
## Five-Whys
1. Why bind OPS-007 now? — Token-explosion DoS regressions
   silently pass CI (CI doesn't run adversarial inputs); they only
   show up under production user load. Without a verdict pin, the
   bound drifts.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual
   adversarial-input harness is FULL_DISCHARGE.
3. Why 4× specifically? — Contract literal; matches the BPE
   worst-case byte-fallback bound.
4. Why inclusive `<=`? — Contract test wording `<= 4 * len`.
5. Why 18 tests across 7 sections? — Provenance pin (×1), pass
   band (×5: typical, exact-cap, 1-byte-1-token, realistic prompt,
   1M scale), token-explosion fail (×4 incl. DoS class),
   empty-input fail (×2), overflow protection (×2), boundary sweep
   (×10 probes around the 400 cap), realistic UTF-8 scenarios (×3:
   emoji 4-byte, CJK 3-byte, adversarial).
## Cross-reference
The linear-bound verdict pattern joins the catalog alongside
threshold (`cov_001`, `data_inv_003`), count-zero-tolerance
(many), and conjunctive-exit (`pub_cli_002_004`). Each domain with
linear-relationship invariants now has a reference pattern.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
adversarial-input tokenizer harness is FULL_DISCHARGE work for the
operations-pipeline implementation PR.
## Tests
18 unit tests, all green.
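The OPS-007 reduction, including the overflow guard, can be sketched as (a minimal sketch; the `Verdict` enum stands in for the module's actual verdict type):

```rust
const AC_OPS_007_MAX_RATIO: u64 = 4; // published BPE worst-case bound

#[derive(Debug, PartialEq)]
enum Verdict {
    Pass,
    Fail,
}

/// OPS-007: token_count must not exceed 4x the input byte length.
/// checked_mul makes the bound computation explicit: if 4*len would
/// overflow u64, the verdict conservatively fails rather than
/// comparing against a wrapped, meaningless bound.
fn verdict_from_token_count_bound(input_byte_len: u64, token_count: u64) -> Verdict {
    if input_byte_len == 0 {
        return Verdict::Fail; // refuse vacuous Pass on empty input
    }
    match input_byte_len.checked_mul(AC_OPS_007_MAX_RATIO) {
        Some(bound) if token_count <= bound => Verdict::Pass,
        _ => Verdict::Fail,
    }
}
```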
…LEVEL
Algorithm-level PARTIAL discharge for FALSIFY-OPS-003 (greedy
decoding is deterministic) per
`contracts/apr-cli-operations-v1.yaml`.
## What this binds
`verdict_from_greedy_decoding_pair(run_a, run_b)` returns Pass
iff:
1. Both outputs are non-empty
2. `run_a == run_b` (byte-identical)
## Same shape as bpe_inv_006
`bpe_inv_006` (encode determinism) uses positive byte-equality on
token-id sequences. `ops_003` uses positive byte-equality on
generated text outputs. Both gate determinism at different layers:
- `bpe_inv_006`: tokenizer deterministic (encode → same token IDs)
- `ops_003`: sampler deterministic (greedy decode → same text)
The shared shape signals a verdict-shape pattern: "two outputs
must match or Fail" applies to encode determinism, decode
determinism, reproducibility, and side-effect freedom.
## Why this matters for SHIP-007 root-cause work
The realistic regression classes pinned in Section 7 cover exactly
the bisection axes from `2026-04-27 findings.md`:
- HashMap iteration order in argmax tiebreak
- Floating-point associativity in parallel logit reduction
- Per-host code path divergence
If a future SHIP-007 fix introduces or reveals nondeterminism in
the sampling layer, this verdict catches it independently of the
layer-0 stage diff.
## Five-Whys
1. Why bind OPS-003 now? — Greedy-decoding non-determinism is the
   canonical "passes CI in dev, fails on user box" defect class;
   without a verdict pin, MODEL-1 ships with silent stochasticity.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual
   double-invocation harness is FULL_DISCHARGE.
3. Why byte-identical? — Determinism is binary; one ULP/bit of
   drift signals upstream non-determinism.
4. Why refuse empty? — A vacuous Pass would mask an `apr run`
   silent regression.
5. Why 18 tests across 7 sections? — Pass band (×4: short, 2+2,
   long, special chars), single-byte drift fail (×3:
   first/last/middle), length mismatch (×2), empty input (×3),
   symmetry (×2: pass + fail), full-position drift sweep (43
   positions), realistic apr run scenarios (×3 incl. SHIP-007
   regression classes — HashMap tiebreak, FP associativity).
## Cross-reference
- `bpe_inv_006`: encode determinism (tokenizer layer).
- `ops_003`: decode determinism (sampler layer). [this PR]
- `data_inv_005`/`ship_010`/`ops_001`: SHA-256 byte-equality
  (third application of the digest-pair pattern).
The byte-equality verdict family is now applied across 5+ modules
at different scopes (token IDs, generated text, SHA-256 digests,
filesystem snapshots, NFC byte sequences).
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual
double-invocation `apr run --temperature 0` harness is
FULL_DISCHARGE work for the operations-pipeline implementation PR.
## Tests
18 unit tests, all green.
…LEVEL

Algorithm-level PARTIAL discharge for FALSIFY-OPS-001 (read-only commands have no filesystem side effects) per `contracts/apr-cli-operations-v1.yaml`.

## What this binds

`verdict_from_filesystem_snapshot_pair(before_snapshot, after_snapshot)` returns Pass iff:
1. Both digests are 32 bytes (SHA-256)
2. `before == after` (byte-identical)

Pinned constant: `AC_OPS_001_SHA256_BYTES = 32`.

## Why mirror data_inv_005's shape

Identical algorithm-level reduction:
- `data_inv_005` (corpus_sha256 reproducibility): two SHA-256 digests from independent ingest hosts must match.
- `ops_001` (read-only side-effect freedom): two SHA-256 digests of the filesystem before/after a "read-only" command must match.

Both verdicts share:
- 32-byte SHA-256 length pin
- Byte-identical equality
- Wrong-length-but-equal Fail (catches MD5 substitution)

The reuse signals a verdict-shape pattern: "two digests must match or Fail" applies to reproducibility, side-effect freedom, and any content-equality gate.

## Why pin SHA-256 length explicitly

Three regression classes covered:
1. Truncation (some hash libraries return only a prefix).
2. Padding (a truncated digest padded to 64 bytes for SHA-512).
3. Hash-function drift (SHA-3 or BLAKE3 substituted).

The `fail_wrong_length_but_equal` test (16-byte arrays matching each other) explicitly catches MD5 substitution, even when the bytes happen to be equal.

## Five-Whys

1. Why bind OPS-001 now? — Read-only commands writing files is a subtle defect class (lock files, metadata caches, mtime updates) that ships invisibly. CI passes; only filesystem audits catch it.
2. Why a `(&[u8], &[u8])` pair? — Algorithm-level pin; the actual `find -print0 | xargs sha256sum | sha256sum` snapshot harness is FULL_DISCHARGE.
3. Why byte-identical (no tolerance)? — Read-only is binary; any filesystem mutation invalidates the contract.
4. Why pin SHA-256 length? — Catches the hash-function drift class.
5. Why 15 tests across 7 sections? — Provenance pin (×1), pass band (×2), single-byte mutation (×3 incl. one-bit), wrong-length caller error (×4 incl. matching-but-wrong-length), 32-position byte-flip sweep, symmetry property, realistic apr scenarios (×3: clean run, lock-file write, metadata cache).

## Cross-reference

The 32-byte SHA-256 byte-equality verdict is now reused across:
- `data_inv_005` (corpus reproducibility)
- `ship_010` (published-artifact byte identity)
- `ops_001` (filesystem snapshot equality)

If we add a fourth, this primitive should be promoted to a shared `format::common::sha256_byte_eq` utility in a follow-up.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring into the actual filesystem-snapshot harness is FULL_DISCHARGE work for the operations-pipeline implementation PR.

## Tests

15 unit tests, all green.
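A minimal sketch of the digest-pair rule described above, assuming a `bool` return for brevity (the shipped verdict type may differ):

```rust
// Pinned digest length: SHA-256 is exactly 32 bytes.
const AC_OPS_001_SHA256_BYTES: usize = 32;

// Sketch of the OPS-001 rule: two snapshot digests must be 32 bytes and identical.
fn verdict_from_filesystem_snapshot_pair(before: &[u8], after: &[u8]) -> bool {
    // Wrong-length digests (e.g. 16-byte MD5) fail even when equal to each other.
    before.len() == AC_OPS_001_SHA256_BYTES
        && after.len() == AC_OPS_001_SHA256_BYTES
        && before == after
}

fn main() {
    let clean = [0xABu8; 32];
    assert!(verdict_from_filesystem_snapshot_pair(&clean, &clean));

    // One-bit filesystem mutation flips the verdict.
    let mut mutated = clean;
    mutated[0] ^= 1;
    assert!(!verdict_from_filesystem_snapshot_pair(&clean, &mutated));

    // MD5 substitution: 16-byte digests that happen to match still Fail.
    let md5 = [0u8; 16];
    assert!(!verdict_from_filesystem_snapshot_pair(&md5, &md5));
    println!("ok");
}
```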
…VEL — closes 1/1 sweep

Algorithm-level PARTIAL discharge for FALSIFY-COV-001 (apr-cli line coverage >= 95%) per `contracts/apr-cli-coverage-v1.yaml`.

## What this binds

`verdict_from_line_coverage_percent(line_coverage_percent)` returns Pass iff:
1. `line_coverage_percent.is_finite()` (rules out NaN, ±∞)
2. `0.0 <= line_coverage_percent <= 100.0` (domain)
3. `line_coverage_percent >= 95.0` (contract floor, inclusive)

Pinned constant: `AC_COV_001_MIN_PERCENT = 95.0`.

## ✅ Closes the 1/1 COV sweep on `apr-cli-coverage-v1`

**Five contract families now fully algorithm-bound at PARTIAL:**
- `dataset-thestack-python-v1` (7/7)
- `tokenizer-bpe-v1` (7/7)
- `apr-cli-publish-v1` (4/4)
- `apr-cli-qa-v1` (10/10)
- `apr-cli-coverage-v1` (1/1) ← this PR

## Why inclusive `>=` 95.0

Contract test wording: `>= 95.0`. The inclusive floor matches character-for-character. Boundary tests `pass_at_exact_floor_95_0` + `fail_just_below_floor_via_ulp` (using `f64::from_bits` for one ULP below) bracket the strict semantics.

## Why pin 95.0 (not 85.0)

Per CLAUDE.md "Quality Standards": "95% minimum test coverage (upgraded from 85%)". The verdict pins the upgraded threshold; the `fail_at_85_old_threshold` test explicitly catches a regression back to the pre-upgrade value.

## Why is_finite() + range checks

`cargo llvm-cov` could emit NaN (rare division by zero in coverage tooling), Inf (huge counters wrapping), or out-of-range values (a parsing bug treating a count as a percentage). Defense in depth catches all three.

## Five-Whys

1. Why bind COV-001 now? — Coverage drift is the kind of regression that ships silently with green CI (the threshold moves, not the code). Without a verdict pin, a contributor could quietly lower the floor in `make tier3` and ship.
2. Why an f64 input? — Algorithm-level pin; the `cargo llvm-cov | awk` extraction is FULL_DISCHARGE.
3. Why inclusive `>=`? — Contract literal. The ULP-below test brackets the strict semantics.
4. Why pin 95.0 vs 85.0? — CLAUDE.md upgraded the threshold; the verdict enforces the upgraded value.
5. Why 19 tests across 7 sections? — Provenance pin (×1), pass band (×4: floor, canonical 96.35, 99.99, 100), below-floor fail (×4 incl. 85% old threshold), NaN/±∞ (×3), out-of-domain (×3: -1, 100.001, 1e10), boundary sweep (11 probes + 1 ULP), realistic apr-cli scenarios (×2).

## Cross-reference

The threshold-based verdict pattern (`>= floor` with domain checks) is conceptually similar to:
- `data_inv_003` (Jaccard < 0.85, strict)
- `tok_par_002` (efficiency >= 80%, inclusive via cross-multiply)
- `distill_train_009` (val_loss strict <)

Each binds a different threshold semantic; this one is inclusive `>=`.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual `cargo llvm-cov report --summary-only | awk` extraction is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

19 unit tests, all green.
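The three-clause rule above fits on a few lines; a sketch with an assumed `bool` return, including the ULP-below boundary probe the commit message describes:

```rust
// Pinned contract floor (upgraded from 85.0 per CLAUDE.md "Quality Standards").
const AC_COV_001_MIN_PERCENT: f64 = 95.0;

// Sketch of the COV-001 rule: finite, in [0, 100], and at or above the floor.
fn verdict_from_line_coverage_percent(p: f64) -> bool {
    p.is_finite()                       // rules out NaN, +inf, -inf
        && (0.0..=100.0).contains(&p)   // domain check
        && p >= AC_COV_001_MIN_PERCENT  // inclusive contract floor
}

fn main() {
    assert!(verdict_from_line_coverage_percent(95.0));   // inclusive at the exact floor
    assert!(verdict_from_line_coverage_percent(100.0));
    assert!(!verdict_from_line_coverage_percent(85.0));  // pre-upgrade threshold must Fail
    assert!(!verdict_from_line_coverage_percent(f64::NAN));
    assert!(!verdict_from_line_coverage_percent(f64::INFINITY));
    assert!(!verdict_from_line_coverage_percent(-1.0));

    // One ULP below the floor must Fail under the inclusive `>=` semantics.
    let ulp_below = f64::from_bits(95.0f64.to_bits() - 1);
    assert!(!verdict_from_line_coverage_percent(ulp_below));
    println!("ok");
}
```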
Algorithm-level PARTIAL discharge for FALSIFY-QA-009 (3-format
coverage: GGUF + APR + SafeTensors) per
`contracts/apr-cli-qa-v1.yaml`.
## What this binds
`verdict_from_three_format_coverage(gguf_exit, apr_exit, safetensors_exit)`
returns Pass iff all three exit codes equal 0.
Pinned constant: `AC_QA_009_REQUIRED_FORMAT_COUNT = 3`.
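A minimal sketch of the conjunctive rule, assuming a `bool` return (the shipped verdict type may differ), together with the 8-cell `{0, 1}³` matrix argument:

```rust
// Pinned format count: GGUF + APR + SafeTensors.
const AC_QA_009_REQUIRED_FORMAT_COUNT: usize = 3;

// Sketch of the QA-009 rule: all three format checks must exit 0.
fn verdict_from_three_format_coverage(gguf_exit: i32, apr_exit: i32, safetensors_exit: i32) -> bool {
    gguf_exit == 0 && apr_exit == 0 && safetensors_exit == 0
}

fn main() {
    // Only the origin (0, 0, 0) passes.
    assert!(verdict_from_three_format_coverage(0, 0, 0));

    // Every other cell of {0, 1}^3 fails.
    for g in [0, 1] {
        for a in [0, 1] {
            for s in [0, 1] {
                if (g, a, s) != (0, 0, 0) {
                    assert!(!verdict_from_three_format_coverage(g, a, s));
                }
            }
        }
    }
    assert_eq!(AC_QA_009_REQUIRED_FORMAT_COUNT, 3);
    println!("ok");
}
```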
## Why conjunctive composition (all three or fail)
Per CLAUDE.md "Debugging: Use apr Tools First":
"All tools support GGUF, APR, and SafeTensors formats. If a tool
says 'format not supported', that's a BUG."
Each format is independently shipped as a contract, so
silently dropping one is a regression class. Conjunctive `&&`
catches:
- Single-format regression (most common: SafeTensors writer
panics on a new tensor variant).
- Multi-format collapse (rare: a refactor breaks all three).
The 8-cell composite matrix in Section 6 proves only `(0, 0, 0)`
passes — every other cell of `{0, 1}³` fails.
## Why pin format count at 3
Catches a regression where:
- A 4th format is silently added (e.g., ONNX) without bumping
the contract — would not be exercised by this verdict.
- One of the three is silently dropped from the test harness.
The `provenance_required_format_count_is_three` test pins this.
## Five-Whys
1. Why bind QA-009 now? — Format coverage is a CLAUDE.md-level
invariant; without a verdict pin, a SafeTensors regression
(e.g., new tensor variant the writer doesn't handle) ships
silently.
2. Why a 3-tuple of i32? — Algorithm-level pin; the
per-format `apr inspect` invocation harness is FULL_DISCHARGE.
3. Why conjunctive? — Each format is independent; AND-fold
catches any single regression.
4. Why 8-cell composite matrix? — Proves only origin (0, 0, 0)
passes; mutation testing would catch any subset-Pass mutation.
5. Why 15 tests across 7 sections? — Provenance pin (×1), pass
band (×1: only origin), single-format fail (×3 — one per
format), multi-format fail (×4 incl. all-three-fail),
non-1 exit fail (×3: panic, negative, i32::MAX), 8-cell
composite matrix, symmetry under permutation + exhaustive
4×4×4 sweep at canonical exit codes.
## Cross-reference
The "all-must-succeed conjunctive" pattern is conceptually
similar to `bpe_inv_002` (4 special tokens distinct + in range),
but scaled to 3-tuple exit codes instead of 4-tuple ID checks.
Both rely on conjunctive composition of independent gates.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
3-format `apr inspect` invocation harness is FULL_DISCHARGE
work for the qa-pipeline implementation PR.
## Tests
15 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-003 (JSON output
is valid — `jq .` exits 0) per `contracts/apr-cli-qa-v1.yaml`.
## What this binds
`verdict_from_json_validity(output_bytes, jq_exit_code)` returns
Pass iff:
1. `output_bytes` is non-empty (refuse vacuous Pass)
2. `jq_exit_code == 0` (jq accepted the input)
3. First non-whitespace byte is `{` or `[`
4. Last non-whitespace byte is `}` or `]`
5. Brace `{`/`}` counts balance AND bracket `[`/`]` counts balance
Five conjunctive checks composing JSON-shape sanity with jq's
parser verdict.
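The five checks compose as below. This is a sketch under the stated PARTIAL_ALGORITHM_LEVEL assumptions (byte-level brace counting, no string awareness); the `bool` return is an assumption for brevity.

```rust
// Sketch of the QA-003 rule: jq's parser verdict AND byte-level JSON-shape sanity.
fn verdict_from_json_validity(output: &[u8], jq_exit_code: i32) -> bool {
    // First/last non-whitespace byte; None means empty or whitespace-only.
    let first = output.iter().find(|b| !b.is_ascii_whitespace());
    let last = output.iter().rev().find(|b| !b.is_ascii_whitespace());
    let (first, last) = match (first, last) {
        (Some(&f), Some(&l)) => (f, l),
        _ => return false, // refuse vacuous Pass on empty output
    };
    // Byte-level balance counts (no string awareness at this level).
    let braces: i64 = output.iter()
        .map(|&b| match b { b'{' => 1, b'}' => -1, _ => 0 })
        .sum();
    let brackets: i64 = output.iter()
        .map(|&b| match b { b'[' => 1, b']' => -1, _ => 0 })
        .sum();

    jq_exit_code == 0
        && (first == b'{' || first == b'[')
        && (last == b'}' || last == b']')
        && braces == 0
        && brackets == 0
}

fn main() {
    assert!(verdict_from_json_validity(b"{\"k\": [1, 2]}", 0));
    assert!(!verdict_from_json_validity(b"{\"k\": [1, 2]}", 5));   // jq rejected
    assert!(!verdict_from_json_validity(b"", 0));                  // empty output
    assert!(!verdict_from_json_validity(b"plain text", 0));        // non-JSON shape
    assert!(!verdict_from_json_validity(b"{\"truncated\": [1", 0)); // imbalance
    println!("ok");
}
```

Both gates must agree: a stubbed jq (always exit 0) is caught by the structural checks, and a structurally plausible non-JSON blob is caught by jq.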
## Why redundant structural checks alongside jq?
Two failure modes the verdict must catch:
1. **jq missing or stubbed**: a CI environment where `jq` is
absent and the harness silently substitutes `true` (exit 0).
Without the structural checks, `(any-text, 0)` would Pass.
2. **jq spurious 0**: a jq regression returning 0 on whitespace
or non-JSON input. The structural checks (start with `{`/`[`,
end with `}`/`]`, balanced) catch all three classes:
- bare strings
- leading text + JSON
- trailing text after JSON
Both jq AND structural must agree to Pass.
## Documented limitation: braces inside strings
The brace-balance check is byte-level and counts `{`/`}` even
inside string literals like `"{not real}"`. Valid JSON such as
`{"k":"} extra"}` counts one `{` against two `}` (net -1) and
would spuriously fail the balance check; conversely, a malformed
input whose stray braces happen to cancel out can slip past the
counter (jq remains the backstop for that class).
This is intentional for PARTIAL_ALGORITHM_LEVEL: real
string-aware balance checking requires a JSON tokenizer,
which is FULL_DISCHARGE territory. The verdict's role is to
catch the common regression class (truncated output, missing
closing brace, double-opening) which the byte-counting check
handles correctly.
## Five-Whys
1. Why bind QA-003 now? — JSON output validity is the contract
for every `--json` flag (cf. QA-007). A regression that emits
malformed JSON breaks every downstream consumer.
2. Why a (&[u8], i32) pair? — Algorithm-level pin; subprocess
invocation + `jq .` pipe is FULL_DISCHARGE.
3. Why structural checks alongside jq? — Two failure classes
(missing jq, jq spurious-0) require defense in depth.
4. Why byte-level brace count (not real JSON parser)? —
PARTIAL_ALGORITHM_LEVEL pins the decision rule; full JSON
parser is FULL_DISCHARGE.
5. Why 22 tests across 7 sections? — Pass band (×5: object,
array, realistic, nested, with whitespace), jq-rejected fail
(×2), empty output fail (×2 incl. whitespace-only), non-JSON
shape fail (×3), bracket imbalance fail (×4 incl. realistic
truncation), edge cases (×3: minimal `{}`/`[]`, negative jq
exit), realistic apr scenarios (×3: qa results, trace stats,
text-instead-of-JSON regression).
## Cross-reference
Companion to QA-007 (`--json` flag changes output): QA-007
ensures `--json` does *something*; QA-003 ensures what it does
is well-formed JSON. Both gates must Pass independently.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual
`apr inspect --json | jq .` pipeline is FULL_DISCHARGE work for
the qa-pipeline implementation PR.
## Tests
22 unit tests, all green.
Algorithm-level PARTIAL discharge for FALSIFY-QA-008 (no phantom subcommands — advertised commands have real implementations) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_phantom_scan(commands_advertised, phantom_count)` returns Pass iff:
1. `commands_advertised > 0` (refuse vacuous Pass on an empty CLI)
2. `phantom_count == 0` (zero-tolerance for phantom commands)
3. `phantom_count <= commands_advertised` (counter sanity)

## Why this is distinct from QA-001

QA-001 pins the *registered count* at 58 (catches add/drop drift in the registry vs the CLAUDE.md claim). QA-008 pins the *registry-vs-implementation gap* (catches commands that are advertised in `apr --help` but have no working `--help`-responding dispatch). A new subcommand added to the registry with its dispatch arm forgotten passes QA-001 (the count stays 58 if it replaced something) but fails QA-008 (advertised=58 but phantom=1).

## Why zero-tolerance

A phantom subcommand ships a "command not found" experience to users — the worst kind of CLI regression because:
- It's invisible in unit tests (the registry shows the command)
- It only surfaces in shell scripts / CI / user smoke runs
- It fails late and fails confusingly

## Five-Whys

1. Why bind QA-008 now? — Phantom subcommands are the most user-facing "command not found" regression class. Without a verdict pin, CI passes (the registry has 58 commands) but users hit "unimplemented" at runtime.
2. Why a (u64, u64) pair? — Algorithm-level pin; the actual per-subcommand `apr <c> --help` invocation harness is FULL_DISCHARGE.
3. Why distinct from QA-001? — Different regression classes: QA-001 catches count drift, QA-008 catches dispatch gaps. Both gates must Pass independently.
4. Why zero-tolerance? — One phantom is enough to break a user workflow; no tolerance band makes sense.
5. Why 16 tests across 7 sections? — Pass band (×3: 58, 1, 1M), zero-tolerance fail (×4 incl. handful, half, all), empty advertised (×2), partition violations (×2), boundary sweep (×7 incl. partition+1), zero-tolerance property (×5 sizes), realistic apr scenarios (×3 incl. minimal-features build).

## Cross-reference

Same shape and discipline as INV-DATA-001/002/007, INV-BPE-007, and QA-004 — the "zero-tolerance match count + total + partition" pattern is now applied to 6+ verdicts. Each gates a different domain (license, PII, UTF-8, UNK, NaN, phantom) with an identical algorithm-level reduction. Combined with QA-001 (registered count pin) and the existing 3-surface drift trio (test harness, contracts/apr-cli-commands-v1.yaml, CLAUDE.md), QA-008 forms the 5th enforcement layer for CLI command integrity.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual per-subcommand `apr <c> --help` exit-code-collecting harness is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

16 unit tests, all green.
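A sketch of the three-clause counter rule, assuming a `bool` return for brevity:

```rust
// Sketch of the QA-008 rule: non-empty CLI, sane counters, zero phantoms.
fn verdict_from_phantom_scan(commands_advertised: u64, phantom_count: u64) -> bool {
    commands_advertised > 0                      // refuse vacuous Pass on an empty CLI
        && phantom_count <= commands_advertised  // counter sanity (partition check)
        && phantom_count == 0                    // zero-tolerance for phantoms
}

fn main() {
    assert!(verdict_from_phantom_scan(58, 0));   // registry clean
    assert!(!verdict_from_phantom_scan(58, 1));  // one forgotten dispatch arm
    assert!(!verdict_from_phantom_scan(0, 0));   // empty CLI is a vacuous Pass, refused
    assert!(!verdict_from_phantom_scan(10, 11)); // counter corruption: partition violation
    println!("ok");
}
```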
Algorithm-level PARTIAL discharge for FALSIFY-QA-007 (`--json` flag changes output; not a no-op) per `contracts/apr-cli-qa-v1.yaml`.

## What this binds

`verdict_from_json_flag_diff(default_output, json_output)` returns Pass iff:
1. Both byte slices are non-empty (refuse vacuous Pass)
2. `default_output != json_output` (inverse-equality)

## Why inverse-equality (novel verdict shape)

Most verdicts in the algorithm-binding cycle are positive-equality (byte-identical match) or zero-tolerance-count (no failures). This is the first inverse-equality verdict: the contract requires the two outputs to DIFFER. A regression where `--json` is parsed as a clap flag but doesn't route to a different formatter would produce identical bytes — a Pass by equality would be a Fail by inverse-equality.

## Why refuse empty inputs

Either output being empty means the apr subprocess silently failed. Refusing each individually catches the silent-failure class.

## Five-Whys

1. Why bind QA-007 now? — A `--json` no-op is a class of regression that ships with green CI: clap parses the flag, but the formatter dispatch silently uses the text writer. Without a verdict pin, downstream JSON parsers break in production.
2. Why a (&[u8], &[u8]) pair? — Algorithm-level pin; the actual subprocess invocations producing the two outputs are FULL_DISCHARGE.
3. Why inverse-equality (not metadata sniffing)? — Contract test wording: `diff <(default) <(--json) | grep -q .` matches character-for-character. The JSON-validity check is FALSIFY-QA-003's territory; -007 is purely about non-equality.
4. Why include symmetry tests? — The verdict must be symmetric in (default, json) to avoid order-dependent bugs.
5. Why 18 tests across 7 sections? — Pass band (×4: realistic inspect, one-byte diff, length diff, qa outputs), no-op fail (×2 incl. long identical), empty-input fail (×3), single-byte edge (×2: appended + prepended), symmetry property (×3 incl. pass + fail-identical + fail-one-empty), inverse-equality property (×4 lengths), realistic 3-subcommand contrast (×3: inspect, diff, validate-unimplemented regression).

## Cross-reference

Inverse-equality is conceptually related to `bpe_inv_006`'s encode determinism (which uses positive equality). Together they cover both directions: "two runs must be identical" (006) and "two runs must differ" (007). Different contract ID, different direction, same primitive (byte-equality).

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual double-subprocess invocation (text + `--json`) is FULL_DISCHARGE work for the qa-pipeline implementation PR.

## Tests

18 unit tests, all green.
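The inverse-equality shape in a few lines, assuming a `bool` return (a sketch, not the shipped module):

```rust
// Sketch of the QA-007 rule: both outputs present, and they must DIFFER.
fn verdict_from_json_flag_diff(default_output: &[u8], json_output: &[u8]) -> bool {
    // Either output being empty means the subprocess silently failed.
    !default_output.is_empty()
        && !json_output.is_empty()
        && default_output != json_output // inverse-equality: a --json no-op Fails
}

fn main() {
    assert!(verdict_from_json_flag_diff(b"name: m\n", b"{\"name\":\"m\"}"));
    assert!(!verdict_from_json_flag_diff(b"same bytes", b"same bytes")); // --json no-op
    assert!(!verdict_from_json_flag_diff(b"", b"{}"));                   // silent failure

    // Symmetric in its two arguments.
    assert_eq!(
        verdict_from_json_flag_diff(b"a", b"b"),
        verdict_from_json_flag_diff(b"b", b"a")
    );
    println!("ok");
}
```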
…EVEL — closes 7/7 INV-BPE sweep
Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-005 (NFC
idempotence + composed/decomposed equivalence) per
`contracts/tokenizer-bpe-v1.yaml`.
## What this binds
`verdict_from_nfc_idempotence(composed_nfc, decomposed_nfc, double_nfc)`
returns Pass iff:
1. All three byte slices are non-empty (refuse vacuous Pass)
2. `composed_nfc == decomposed_nfc` (composable equivalence)
3. `composed_nfc == double_nfc` (NFC idempotence)
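The three-way byte-equality rule can be sketched directly; the `bool` return and the hand-written byte fixtures are assumptions for the sketch (the real suite feeds bytes produced by an actual NFC implementation):

```rust
// Sketch of the INV-BPE-005 rule: three non-empty byte slices, all byte-identical.
fn verdict_from_nfc_idempotence(
    composed_nfc: &[u8],
    decomposed_nfc: &[u8],
    double_nfc: &[u8],
) -> bool {
    !composed_nfc.is_empty()
        && !decomposed_nfc.is_empty()
        && !double_nfc.is_empty()
        && composed_nfc == decomposed_nfc // composable equivalence
        && composed_nfc == double_nfc     // NFC idempotence
}

fn main() {
    // "café" in composed form (U+00E9); a correct NFC maps both input
    // forms, and a second NFC pass, to these same bytes.
    let cafe_nfc = "caf\u{e9}".as_bytes();
    assert!(verdict_from_nfc_idempotence(cafe_nfc, cafe_nfc, cafe_nfc));

    // Decomposed form left un-normalized ("e" + U+0301 combining acute):
    // a missing-NFC regression breaks composable equivalence.
    let decomposed_raw = "cafe\u{301}".as_bytes();
    assert!(!verdict_from_nfc_idempotence(cafe_nfc, decomposed_raw, cafe_nfc));

    // Vacuous Pass on all-empty input is refused.
    assert!(!verdict_from_nfc_idempotence(b"", b"", b""));
    println!("ok");
}
```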
## ✅ Closes the 7/7 INV-BPE sweep on `tokenizer-bpe-v1`
INV-BPE-001 (vocab range + paired model), 002 (special tokens),
003 (round-trip / FALSIFY-SHIP-012 dual), 004 (merge algebra),
005 (NFC idempotence), 006 (encode determinism), 007 (byte
coverage / no UNK) all bound at PARTIAL_ALGORITHM_LEVEL.
## Why three-way byte-equality
The contract pins TWO properties on the same NFC implementation:
- Composable equivalence: nfc("café"-composed) == nfc("café"-decomposed)
- Idempotence: nfc(nfc(x)) == nfc(x)
A regression in either dimension breaks MODEL-2's tokenizer
contract. Composing both into one verdict means a single PARTIAL
covers both gate paths; the test suite exercises each
independently:
- `fail_cafe_composed_vs_decomposed` catches missing-NFC.
- `fail_double_nfc_differs` catches non-idempotent NFC.
- `fail_hangul_precomposed_vs_decomposed` catches NFC missing on
multi-codepoint canonical equivalents.
## Why byte-equality (not str-equality)
The Rust `&[u8]` shape lets the verdict compare:
- Pre-NFC byte sequences (which may not be valid UTF-8)
- Post-NFC byte sequences from any NFC implementation (ICU, unic,
rust-stdlib, the tokenizer's own)
- Hex/UTF-8 fixtures in tests
`==` on `&[u8]` is character-for-character byte-identical; matches
the contract falsifier's "byte-equals" wording.
## Why refuse empty inputs
Vacuous Pass on `(&[], &[], &[])` would let an early-return
short-circuit silently pass the gate. Refusing all three
empty-input combinations explicitly catches it.
## Five-Whys
1. Why bind INV-BPE-005 now? — Closes the 7/7 INV-BPE sweep on
`tokenizer-bpe-v1`. NFC idempotence is required before BPE
encoding; without it, "café" and "café" tokenize differently
even though they're canonically equivalent — corrupts MODEL-2
tokenizer contract.
2. Why a 3-tuple of byte slices, not the actual NFC implementation? —
Algorithm-level pin; the actual NFC + tokenizer code that
produces the three forms is FULL_DISCHARGE work for the
corpus-tokenize PR.
3. Why three forms instead of two? — Composable equivalence and
idempotence are independent properties; a tokenizer can pass
one and fail the other. Three-way input lets the verdict
exercise both with one call.
4. Why byte-level rather than codepoint-level? — Catches
subtler regressions where two byte sequences decode to the
same codepoints under different NFC normalizations
(idempotence implies bytes, not just codepoints).
5. Why 20 tests across 7 sections? — Pass band (×5: ASCII,
composed-café, CJK, emoji, math), composed-vs-decomposed
fail (×3: café, distinct, single byte), idempotence fail
(×2), combined violations (×1), caller errors (×4: all empty
+ per-input empty), symmetry/three-way distinct (×2), and
realistic Hangul precomposed/decomposed pair + long mixed
text (×3).
## Cross-reference (informational)
This verdict's logic is structurally similar to `data_inv_007`
(UTF-8 + NFC round-trip) — both gate NFC correctness — but at
different scope:
- `data_inv_007` counts NFC failures across a corpus.
- `bpe_inv_005` proves NFC idempotence on a single fixture
triple.
The two together provide both unit-level (per-text) and
corpus-level (millions-of-files) NFC pinning.
## Scope
PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual NFC
implementation that produces the three byte sequences is
FULL_DISCHARGE work for the corpus-tokenize implementation PR.
## Tests
20 unit tests, all green. **Closes the 7/7 INV-BPE gate sweep**
on `tokenizer-bpe-v1`.
…EVEL (#1159)

Algorithm-level PARTIAL discharge for FALSIFY/INV-BPE-003 (round-trip byte-equality on 10K held-out docs) per `contracts/tokenizer-bpe-v1.yaml`. Also discharges FALSIFY-SHIP-012 directly.

## What this binds

`verdict_from_roundtrip_scan(docs_scanned, roundtrip_failures)` returns Pass iff:
1. `docs_scanned >= 10_000` (contract statistical floor)
2. `roundtrip_failures == 0` (zero-tolerance — non-injective tokenization corrupts the loss target)
3. `roundtrip_failures <= docs_scanned` (counter sanity)

Pinned constant: `AC_BPE_INV_003_REQUIRED_DOCS = 10_000`.

## Why zero-tolerance

`decode(encode(nfc(text))) != nfc(text)` means the tokenizer is not injective on the input domain. Even one failure means MODEL-2's training loss is computed against tokens that don't reconstruct the original text — silently corrupting the cross-entropy gradient on that document. The contract falsifier ("Any non-zero diff bytes fails") admits no tolerance band.

## Why the 10K floor

The contract specifies a 10K-doc held-out corpus. Round-trip failures cluster around rare Unicode patterns (emoji ZWJ sequences, RTL combining marks, control sequences) that occur at frequencies of ~1-in-10K to 1-in-100K. A scan over 1K docs would miss those classes entirely; a scan over 100K would over-tax every smoke run. Pinning 10K matches the contract's statistical-power requirement.

## Why the partition-violation check

`roundtrip_failures > docs_scanned` indicates counter corruption — the failure set cannot exceed the total scanned. This catches a counter-rollover bug or a double-count regression.

## Five-Whys

1. Why bind INV-BPE-003 now? — Round-trip injectivity is the foundation of MODEL-2's training-target correctness; without a verdict pin, a regression in NFC handling or BPE merge-table loading silently corrupts every gradient.
2. Why a (u64, u64) pair, not the actual scanner? — Algorithm-level pin; the streaming round-trip scanner is FULL_DISCHARGE work for the corpus-tokenize PR.
3. Why pin the 10K floor? — Statistical-power match to the contract; smaller samples miss rare-Unicode classes.
4. Why the partition-violation check? — Counter corruption mustn't silently pass.
5. Why 17 tests across 7 sections? — Mutation survey: provenance pin (×1), pass band (×4: floor, +1, CSN-Python, huge), zero-tolerance fail (×4 incl. one-in-a-million), sample-size fail (×3), partition violation (×2), boundary sweep (×8 each dimension), and zero-tolerance property (×4 sizes).

## Cross-reference

This verdict ALSO discharges FALSIFY-SHIP-012 (the round-trip gate from SHIP-TWO-001 spec §15.4) — the contract identifies the two as the same gate. Two contract IDs share one PARTIAL discharge.

## Scope

PARTIAL_ALGORITHM_LEVEL only. Wiring this into the actual streaming round-trip scanner that produces `(docs_scanned, roundtrip_failures)` is FULL_DISCHARGE work for the corpus-tokenize implementation PR.

## Tests

17 unit tests, all green.
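A sketch of the scan-counter rule, assuming a `bool` return for brevity:

```rust
// Pinned statistical floor from the contract: 10K held-out docs.
const AC_BPE_INV_003_REQUIRED_DOCS: u64 = 10_000;

// Sketch of the INV-BPE-003 rule: floor met, counters sane, zero failures.
fn verdict_from_roundtrip_scan(docs_scanned: u64, roundtrip_failures: u64) -> bool {
    docs_scanned >= AC_BPE_INV_003_REQUIRED_DOCS // statistical floor
        && roundtrip_failures <= docs_scanned    // counter sanity (partition check)
        && roundtrip_failures == 0               // zero-tolerance
}

fn main() {
    assert!(verdict_from_roundtrip_scan(10_000, 0));     // exactly at the floor
    assert!(!verdict_from_roundtrip_scan(9_999, 0));     // under-sampled
    assert!(!verdict_from_roundtrip_scan(1_000_000, 1)); // one-in-a-million still Fails
    assert!(!verdict_from_roundtrip_scan(10_000, 10_001)); // counter corruption
    println!("ok");
}
```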
Summary
A single squashed PR that combines 135 PRs' worth of format/PARTIAL-discharge feat commits into one diff. One CI run instead of 135.
Each individual PR adds one new file, `crates/aprender-core/src/format/<name>.rs` (verdict functions + tests), plus a `pub mod <name>;` line in `mod.rs`. They're structurally independent — no cross-dependencies — so a single combined diff is mechanically equivalent to merging the 135 PRs sequentially.
Why
With 250+ queued PRs all individually running CI, the merge queue is rate-limited by runner pool size (~6-8 PRs/hour after the #1632 sccache fix). Squashing the format-discharge cluster collapses 135 sequential workspace-test runs into 1, freeing the queue for the remaining unique-content PRs.
Files
135 new format modules, each binding a contract's FALSIFY-* gates at PARTIAL_ALGORITHM_LEVEL. See individual commit messages for per-contract context.
Test plan
- `cargo clippy -p aprender-core --lib -- -D warnings` clean
- `cargo check -p aprender-core --lib` clean
After merge
The 135 originating PRs will be auto-closed by GitHub (branch-equal-to-main after this lands).
🤖 Generated with Claude Code