Skip to content

Enforce manifest schema validation and fix lineage@1 self-reference#4

Open
BigBirdReturns wants to merge 6 commits into
mainfrom
claude/affectionate-galileo-j4ypws
Open

Enforce manifest schema validation and fix lineage@1 self-reference#4
BigBirdReturns wants to merge 6 commits into
mainfrom
claude/affectionate-galileo-j4ypws

Conversation

@BigBirdReturns

Copy link
Copy Markdown
Owner

Summary

This PR adds comprehensive manifest schema validation (spec 5.2/5.3) and corrects the lineage@1 extension schema by removing its self-referential shard_id column. The changes enforce the specification more strictly while simplifying the lineage build process.

Key Changes

Manifest Schema Validation

  • Added _validate_manifest_schema() to enforce spec 5.2 required fields and 5.3 optional-field shape before signature verification
    • Validates spec_version == "1.0.0", required string fields (shard_id, metadata.{title, namespace, created_at}, publisher.{id, name}, license.spdx)
    • Validates metadata.created_at as RFC 3339 timestamp using regex
    • Validates sources as non-empty array with {path, hash} objects
    • Validates integrity.algorithm == "blake3" and statistics.{entities, claims} as integers
    • Validates extensions array identifiers match <name>@<version> pattern
  • Added _validate_extensions_consistency() to enforce spec 10 disk-side rules
    • Ensures all ext/ files follow <name>@<version>.parquet naming
    • Verifies manifest extensions list exactly matches files on disk
    • Ensures empty ext/ directories don't have an extensions key
  • Both validators run at appropriate points in the verification pipeline (schema before crypto, extensions after Merkle)

Lineage Schema Correction (RFC 0002)

  • Removed shard_id column from lineage@1 schema — it was self-referential and could never be consistent
    • Old shape: (shard_id, supersedes_shard_id, action, timestamp, note) with __PENDING__ sentinel
    • New shape: (supersedes_shard_id, action, timestamp, note) — predecessor-oriented only
  • Eliminated two-pass Merkle build and backfill logic from compiler_generic.py
    • Single deterministic pass now suffices since lineage bytes no longer depend on the final shard id
  • Updated _write_lineage_extension() to write final bytes directly without sentinel or backfill
  • Updated EXTENSION_REGISTRY to reflect new sort key (supersedes_shard_id instead of shard_id)

Extension Naming Convention Enforcement

  • Fixed extension filename generation to use versioned names (locators@1.parquet, references@1.parquet, temporal@1.parquet, lineage@1.parquet)
  • Corrected manifest extensions list generation to use file stems directly (avoiding @1@1 duplication bug)
  • Added regex validators for RFC 3339 timestamps, extension identifiers, and extension filenames

Documentation & Tests

  • Added spec/extensions/lineage@1.md — first published extension schema (marked protospec)
  • Added RFC 0002 document explaining the lineage self-reference correction and decision rationale
  • Updated COMPATIBILITY.md §6 to clarify extension naming is frozen while extension schemas are protospec
  • Added comprehensive test coverage for manifest schema validation failures
  • Added tests for extension envelope consistency (missing files, undeclared files, bare filenames)
  • Added test verifying lineage@1 has no shard_id column and verifies correctly

Version & Metadata Updates

  • Bumped version to 1.3.0 (protospec schema correction)
  • Updated CHANGELOG with detailed explanation of lineage@1 changes and backward compatibility notes
  • Fixed spec_version in CLI from "1.1.0" to "1.0.0" (spec compliance)

Notable Implementation Details

  • Schema validation uses helper functions (_is_str(), _is_int()) to distinguish JSON types correctly (e.g., booleans are ints in Python)
  • RFC 3339 regex is lenient on fractional seconds and offset form but strict on date-time skeleton
  • Extension identifiers follow pattern `^[a-z0-9_]+@\

https://claude.ai/code/session_01L8nRvckwUsAjDrPYchC6r6

claude added 6 commits June 9, 2026 18:51
The verifier previously checked only integrity.merkle_root, leaving
E_MANIFEST_SCHEMA decorative. Manifest field drift (top-level created_at,
wrong spec_version, missing license, malformed extension ids) passed
silently. Add _validate_manifest_schema enforcing all spec 5.2 required
fields and 5.3 optional-field shape, run before signature verification so
structural defects fail fast and independently of crypto.

With the net in place, three builder-side drifts it now catches are fixed
(future compiles only; existing signed shards remain valid since they are
signed over their own bytes, and the Merkle root excludes manifest.json
per spec section 4):

- compiler_generic: created_at moved from manifest top level into
  metadata.created_at (spec 5.2)
- cli compile: spec_version "1.1.0" -> "1.0.0" (spec 5.2)
- cli compile: add required license.spdx block (was omitted entirely)

Also bump axm_verify.__version__ 1.1.0 -> 1.2.0 to match pyproject.

Tests: 6 new conformance cases exercising E_MANIFEST_SCHEMA directly
(top-level created_at, missing created_at, wrong spec_version, bad
integrity.algorithm, non-integer statistics, malformed extension id).
Update one stale e2e assertion that pinned the old spec_version drift.

Out of scope (per plan): ext parquet @1 rename and lineage self-referential
shard_id (frozen-schema RFC). Pre-existing ML-DSA stub failures in the full
suite are test-ordering artifacts, unchanged by this commit.
The backend-contract tests importlib.reload(axm_build.sign) with a fake
dilithium module bound in sys.modules. reload mutates the module in place,
binding fake _mldsa44_*_raw functions (pk=b"p"*1312, sig=b"z"*2420). monkeypatch
restores sys.modules and __import__ on teardown but cannot un-mutate the
reloaded module, so the fake backend leaked into every later test importing
axm_build.sign — breaking wrong-key rejection and making failures depend on
collection order (the 5 order-dependent ML-DSA failures seen in the full suite).

Add an autouse finalizer that reloads axm_build.sign after each contract test.
With no monkeypatch dependency it tears down after monkeypatch, so the reload
runs against the restored real import system and re-binds the genuine backend.

Full suite: 80 passed, 3 skipped (was 5 order-dependent failures).
COMPATIBILITY and spec 5.3 require extension tables on disk as
"<name>@<version>.parquet" — the "@<version>" suffix is part of the name.
The compiler wrote bare "<name>.parquet" and synthesized the version only in
the manifest. Rename across all four coupled sites so writer, manifest,
backfill, and registry agree:

- compiler_generic writer paths: locators/references/temporal/lineage now
  written as "<name>@1.parquet"
- extension detection: the filename now carries the version, so the stem
  ("<name>@1") IS the identifier; stop re-appending "@1" (which would have
  produced "locators@1@1")
- backfill_lineage_shard_id: read "ext/lineage@1.parquet" so the two-pass
  shard_id backfill still finds the file
- EXTENSION_REGISTRY "file" entries updated to the on-disk names

Guard test (test_compiler_lineage_ships_without_pending_sentinel): compile
with supersedes set, read whichever lineage file the writer produced, assert
no __PENDING__ sentinel survives. This catches the specific drift where the
writer moves to lineage@1.parquet but the backfill keeps the old path and
silently no-ops, shipping __PENDING__ in signed, Merkle-covered bytes that the
verifier never reads. Asserts sentinel absence only — not that the embedded id
equals manifest shard_id, which the lineage RFC must resolve. Verified the
guard fails when the backfill drift is reintroduced.

Full suite: 81 passed, 3 skipped.
lineage@1.shard_id tries to store the content-addressed id of the shard it
ships in, computed over bytes that include the column, so it can never be
consistent: the embedded id is the pass-1 pre-image root while the manifest
shard_id is the pass-2 root. Document the contradiction and propose options
(A: lineage@2 drops the self column and the two-pass build — recommended;
B: redefine the column as a pre-image id; C rejected: excluding lineage from
the Merkle breaks tamper-evidence). The schema is frozen, so resolution is a
version bump and the spec owner's call. No behavior change in this commit.
The manifest-shape validator checks declared extension ids are well-formed,
but the disk-side rules of spec 10 were unenforced: a shard could carry ext
files with no extensions key, or a bare unversioned filename, and pass. The
suite even encoded this — test_shard_with_ext_file_covered_by_merkle added a
bare spatial.parquet, re-signed, and asserted PASS.

Add _validate_extensions_consistency, run after the Merkle check (so it only
sees byte-consistent shards and never preempts E_MERKLE_MISMATCH):

- every ext/ file must match '<name>@<version>.parquet'
- ext/ non-empty -> manifest.extensions lists exactly the on-disk identifiers
- ext/ empty/absent -> manifest.extensions omitted (hash stability)

Tests: make the covered-by-merkle case spec-conformant (versioned name +
declared), and add three rejection tests (undeclared file, bare name, declared
but absent). Mutation-checked: disabling the new call reverts the undeclared
and declared-but-absent cases to passing a bad shard.

Full suite: 84 passed, 3 skipped.
RFC 0002 is accepted with its conclusion revised by the facts: lineage@1 was
never published in spec/extensions/, never carried by a test vector or the gold
shard, and nothing released depends on it — it is still protospec. So the
self-referential shard_id column is removed in place rather than versioned to a
conservative lineage@2.

lineage@1 is now predecessor-oriented:
  (supersedes_shard_id, action, timestamp, note)
The owning shard is identified by manifest.shard_id; rows do not repeat it.
Because no field depends on the final shard_id, the lineage bytes are final
before hashing: removed backfill_lineage_shard_id() and the two-pass Merkle
build, the __PENDING__ sentinel, and the self column.

- ext_schemas: LINEAGE_SCHEMA drops shard_id; sort_key/stable_join become
  supersedes_shard_id (predecessor-oriented).
- spec/extensions/lineage@1.md: first published extension schema (protospec).
- COMPATIBILITY.md §6: scoped carve-out — naming convention frozen and
  verifier-enforced; extension schemas protospec until published + freeze tag.
  Frozen kernel unaffected.
- CHANGELOG 1.3.0: announces the pre-freeze schema change loudly (old shape,
  new shape, why) since the repos are public.
- Test: replace the __PENDING__ sentinel guard with the corrected invariant —
  no shard_id column, exactly the four predecessor columns, shard verifies.
- Version 1.2.0 -> 1.3.0 (compiler output changed).

RFC 0002 status: Accepted.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d07d15fc8b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/axm_verify/logic.py
Comment on lines +222 to +224
if not isinstance(s, dict) or not _is_str(s.get("path")) or not _is_str(s.get("hash")):
_err(errors, ErrorCode.E_MANIFEST_SCHEMA,
f"sources[{i}] must have string path and hash")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Validate source paths and hashes, not just their types

With this check, a signed manifest can still set sources[0].path to a non-content/ path or sources[0].hash to any string (for example "bogus") and pass verification, because the later content checks use provenance/span hashes and never consult manifest.sources. Since spec 5.2 defines the path as a relative POSIX path under content/ and the hash as a SHA-256 hex value, this leaves malformed manifests accepted by the new schema validator; please reject non-content/ paths and non-64-hex hashes here, and ideally compare the manifest entries to the content files.

Useful? React with 👍 / 👎.

Comment thread src/axm_verify/logic.py
Comment on lines +280 to +283
ext_dir = root / "ext"
ext_files = []
if ext_dir.is_dir():
ext_files = [f for f in ext_dir.iterdir() if f.is_file() and not f.name.startswith(".")]

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject nested files under ext/ when checking extensions

This only inspects regular files that are direct children of ext/, so a shard with a signed file such as ext/nested/spatial@1.parquet is covered by the Merkle tree but ext_files is empty and an omitted manifest.extensions is accepted. That bypasses the new spec-10 rule that non-empty ext/ contents must be declared and follow ext/<name>@<version>.parquet; please either walk ext/ recursively and reject nested paths or reject directories under ext/.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants