Enforce manifest schema validation and fix lineage@1 self-reference#4
Enforce manifest schema validation and fix lineage@1 self-reference#4BigBirdReturns wants to merge 6 commits into
Conversation
The verifier previously checked only integrity.merkle_root, leaving E_MANIFEST_SCHEMA decorative. Manifest field drift (top-level created_at, wrong spec_version, missing license, malformed extension ids) passed silently. Add _validate_manifest_schema enforcing all spec 5.2 required fields and 5.3 optional-field shape, run before signature verification so structural defects fail fast and independently of crypto. With the net in place, three builder-side drifts it now catches are fixed (future compiles only; existing signed shards remain valid since they are signed over their own bytes, and the Merkle root excludes manifest.json per spec section 4): - compiler_generic: created_at moved from manifest top level into metadata.created_at (spec 5.2) - cli compile: spec_version "1.1.0" -> "1.0.0" (spec 5.2) - cli compile: add required license.spdx block (was omitted entirely) Also bump axm_verify.__version__ 1.1.0 -> 1.2.0 to match pyproject. Tests: 6 new conformance cases exercising E_MANIFEST_SCHEMA directly (top-level created_at, missing created_at, wrong spec_version, bad integrity.algorithm, non-integer statistics, malformed extension id). Update one stale e2e assertion that pinned the old spec_version drift. Out of scope (per plan): ext parquet @1 rename and lineage self-referential shard_id (frozen-schema RFC). Pre-existing ML-DSA stub failures in the full suite are test-ordering artifacts, unchanged by this commit.
The backend-contract tests importlib.reload(axm_build.sign) with a fake dilithium module bound in sys.modules. reload mutates the module in place, binding fake _mldsa44_*_raw functions (pk=b"p"*1312, sig=b"z"*2420). monkeypatch restores sys.modules and __import__ on teardown but cannot un-mutate the reloaded module, so the fake backend leaked into every later test importing axm_build.sign — breaking wrong-key rejection and making failures depend on collection order (the 5 order-dependent ML-DSA failures seen in the full suite). Add an autouse finalizer that reloads axm_build.sign after each contract test. With no monkeypatch dependency it tears down after monkeypatch, so the reload runs against the restored real import system and re-binds the genuine backend. Full suite: 80 passed, 3 skipped (was 5 order-dependent failures).
COMPATIBILITY and spec 5.3 require extension tables on disk as
"<name>@<version>.parquet" — the "@<version>" suffix is part of the name.
The compiler wrote bare "<name>.parquet" and synthesized the version only in
the manifest. Rename across all four coupled sites so writer, manifest,
backfill, and registry agree:
- compiler_generic writer paths: locators/references/temporal/lineage now
written as "<name>@1.parquet"
- extension detection: the filename now carries the version, so the stem
("<name>@1") IS the identifier; stop re-appending "@1" (which would have
produced "locators@1@1")
- backfill_lineage_shard_id: read "ext/lineage@1.parquet" so the two-pass
shard_id backfill still finds the file
- EXTENSION_REGISTRY "file" entries updated to the on-disk names
Guard test (test_compiler_lineage_ships_without_pending_sentinel): compile
with supersedes set, read whichever lineage file the writer produced, assert
no __PENDING__ sentinel survives. This catches the specific drift where the
writer moves to lineage@1.parquet but the backfill keeps the old path and
silently no-ops, shipping __PENDING__ in signed, Merkle-covered bytes that the
verifier never reads. Asserts sentinel absence only — not that the embedded id
equals manifest shard_id, which the lineage RFC must resolve. Verified the
guard fails when the backfill drift is reintroduced.
Full suite: 81 passed, 3 skipped.
lineage@1.shard_id tries to store the content-addressed id of the shard it ships in, computed over bytes that include the column, so it can never be consistent: the embedded id is the pass-1 pre-image root while the manifest shard_id is the pass-2 root. Document the contradiction and propose options (A: lineage@2 drops the self column and the two-pass build — recommended; B: redefine the column as a pre-image id; C rejected: excluding lineage from the Merkle breaks tamper-evidence). The schema is frozen, so resolution is a version bump and the spec owner's call. No behavior change in this commit.
The manifest-shape validator checks declared extension ids are well-formed, but the disk-side rules of spec 10 were unenforced: a shard could carry ext files with no extensions key, or a bare unversioned filename, and pass. The suite even encoded this — test_shard_with_ext_file_covered_by_merkle added a bare spatial.parquet, re-signed, and asserted PASS. Add _validate_extensions_consistency, run after the Merkle check (so it only sees byte-consistent shards and never preempts E_MERKLE_MISMATCH): - every ext/ file must match '<name>@<version>.parquet' - ext/ non-empty -> manifest.extensions lists exactly the on-disk identifiers - ext/ empty/absent -> manifest.extensions omitted (hash stability) Tests: make the covered-by-merkle case spec-conformant (versioned name + declared), and add three rejection tests (undeclared file, bare name, declared but absent). Mutation-checked: disabling the new call reverts the undeclared and declared-but-absent cases to passing a bad shard. Full suite: 84 passed, 3 skipped.
RFC 0002 is accepted with its conclusion revised by the facts: lineage@1 was never published in spec/extensions/, never carried by a test vector or the gold shard, and nothing released depends on it — it is still protospec. So the self-referential shard_id column is removed in place rather than versioned to a conservative lineage@2. lineage@1 is now predecessor-oriented: (supersedes_shard_id, action, timestamp, note) The owning shard is identified by manifest.shard_id; rows do not repeat it. Because no field depends on the final shard_id, the lineage bytes are final before hashing: removed backfill_lineage_shard_id() and the two-pass Merkle build, the __PENDING__ sentinel, and the self column. - ext_schemas: LINEAGE_SCHEMA drops shard_id; sort_key/stable_join become supersedes_shard_id (predecessor-oriented). - spec/extensions/lineage@1.md: first published extension schema (protospec). - COMPATIBILITY.md §6: scoped carve-out — naming convention frozen and verifier-enforced; extension schemas protospec until published + freeze tag. Frozen kernel unaffected. - CHANGELOG 1.3.0: announces the pre-freeze schema change loudly (old shape, new shape, why) since the repos are public. - Test: replace the __PENDING__ sentinel guard with the corrected invariant — no shard_id column, exactly the four predecessor columns, shard verifies. - Version 1.2.0 -> 1.3.0 (compiler output changed). RFC 0002 status: Accepted.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d07d15fc8b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if not isinstance(s, dict) or not _is_str(s.get("path")) or not _is_str(s.get("hash")): | ||
| _err(errors, ErrorCode.E_MANIFEST_SCHEMA, | ||
| f"sources[{i}] must have string path and hash") |
There was a problem hiding this comment.
Validate source paths and hashes, not just their types
With this check, a signed manifest can still set sources[0].path to a non-content/ path or sources[0].hash to any string (for example "bogus") and pass verification, because the later content checks use provenance/span hashes and never consult manifest.sources. Since spec 5.2 defines the path as a relative POSIX path under content/ and the hash as a SHA-256 hex value, this leaves malformed manifests accepted by the new schema validator; please reject non-content/ paths and non-64-hex hashes here, and ideally compare the manifest entries to the content files.
Useful? React with 👍 / 👎.
| ext_dir = root / "ext" | ||
| ext_files = [] | ||
| if ext_dir.is_dir(): | ||
| ext_files = [f for f in ext_dir.iterdir() if f.is_file() and not f.name.startswith(".")] |
There was a problem hiding this comment.
Reject nested files under ext/ when checking extensions
This only inspects regular files that are direct children of ext/, so a shard with a signed file such as ext/nested/spatial@1.parquet is covered by the Merkle tree but ext_files is empty and an omitted manifest.extensions is accepted. That bypasses the new spec-10 rule that non-empty ext/ contents must be declared and follow ext/<name>@<version>.parquet; please either walk ext/ recursively and reject nested paths or reject directories under ext/.
Useful? React with 👍 / 👎.
Summary
This PR adds comprehensive manifest schema validation (spec 5.2/5.3) and corrects the
lineage@1extension schema by removing its self-referentialshard_idcolumn. The changes enforce the specification more strictly while simplifying the lineage build process.Key Changes
Manifest Schema Validation
_validate_manifest_schema()to enforce spec 5.2 required fields and 5.3 optional-field shape before signature verificationspec_version == "1.0.0", required string fields (shard_id,metadata.{title, namespace, created_at},publisher.{id, name},license.spdx)metadata.created_atas RFC 3339 timestamp using regexsourcesas non-empty array with{path, hash}objectsintegrity.algorithm == "blake3"andstatistics.{entities, claims}as integersextensionsarray identifiers match<name>@<version>pattern_validate_extensions_consistency()to enforce spec 10 disk-side rulesext/files follow<name>@<version>.parquetnamingextensionslist exactly matches files on diskext/directories don't have anextensionskeyLineage Schema Correction (RFC 0002)
shard_idcolumn fromlineage@1schema — it was self-referential and could never be consistent(shard_id, supersedes_shard_id, action, timestamp, note)with__PENDING__sentinel(supersedes_shard_id, action, timestamp, note)— predecessor-oriented onlycompiler_generic.py_write_lineage_extension()to write final bytes directly without sentinel or backfillEXTENSION_REGISTRYto reflect new sort key (supersedes_shard_idinstead ofshard_id)Extension Naming Convention Enforcement
locators@1.parquet,references@1.parquet,temporal@1.parquet,lineage@1.parquet)extensionslist generation to use file stems directly (avoiding@1@1duplication bug)Documentation & Tests
spec/extensions/lineage@1.md— first published extension schema (marked protospec)shard_idcolumn and verifies correctlyVersion & Metadata Updates
spec_versionin CLI from "1.1.0" to "1.0.0" (spec compliance)Notable Implementation Details
_is_str(),_is_int()) to distinguish JSON types correctly (e.g., booleans are ints in Python)https://claude.ai/code/session_01L8nRvckwUsAjDrPYchC6r6