Skip to content

feat: capture managed-materialize output schema as asset metadata (#2a)#9812

Merged
rubenfiszel merged 6 commits into
mainfrom
pipeline-schema-capture
Jun 26, 2026
Merged

feat: capture managed-materialize output schema as asset metadata (#2a)#9812
rubenfiszel merged 6 commits into
mainfrom
pipeline-schema-capture

Conversation

@rubenfiszel

@rubenfiszel rubenfiszel commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Closes pipeline-parity gap #2a — schema capture (the capture half of #2; not #2b enforcement, which depends on #5 column lineage and stays out of scope). After a managed // materialize DuckLake run, this captures the producer's output schema and persists it as versioned asset metadata — the producer-side foundation #2b will read back to validate consumer references at save time.

This is greenfield: #9709 shipped only the time-travel read UX and added no post-run capture seam. It reuses #9709's DESCRIBE SELECT * FROM … shape but builds its own capture path.

It does not re-implement #9708's value-level data_test checks (not_null/accepted_values/unique/relationships) — those already exist and are untouched.

UI: also adds a Schema tab on the ducklake asset node — a master-detail view of the captured schema versions (column/type per version), reading the new GET /assets/asset_schemas endpoint. It shows for any ducklake asset with a table (partitioned or whole-table alike, since the schema is table-level); the table-less catalog node renders no tab bar.

Design decisions (the three design-open questions)

Where schemas live → a dedicated materialized_asset_schema sidecar table (keyed workspace_id, asset_kind, asset_path, version), not a column on materialized_partition and not the asset usage table.

  • Schema is a property of the asset/table, not of a partition slice — a column on materialized_partition would duplicate the identical schema across every partition row (a 30-partition backfill writes it 30×).
  • There is no per-asset "asset metadata row" to extend (assets are tracked only via the asset usage table).
  • The sidecar's grain (workspace, kind, path) is exactly what #2b enforcement reads (MAX(version) = current contract).

When captured → post-run, folded into the existing one-row materialize summary read (zero extra round-trips). materialize_result_sql gains an output_schema column:

(SELECT list({'name': column_name, 'type': column_type})
 FROM (DESCRIBE SELECT * FROM <target>)) AS output_schema

The write has just committed, so the latest snapshot is exactly the recorded snapshot_id — no AT (VERSION) needed. Captured only for managed mode (we know the target alias+table; manual is skipped) and persisted only on a Materialized outcome (a failed run never advances the recorded schema).

Versioning across re-materializations → monotonic version per asset; a new row only on a real schema change. record_asset_schema compares the captured column set to the latest stored version: inserts version = max+1 if it differs, otherwise re-affirms the latest row in place (snapshot_id/job_id/captured_at). Serialized per-asset with pg_advisory_xact_lock so concurrent same-asset captures can't both compute the same next version. Result: a compact schema-evolution history. A column reorder counts as a change (the captured list is ordered, and SELECT * consumers see order).

Changes

  • Migration 20260626095840_add_materialized_asset_schema — new sidecar table (mirrors materialized_partition: ASSET_KIND enum, FK to workspace, no RLS) plus explicit GRANT ALL TO windmill_user/windmill_admin (the recurring app-written-table grant-gap pattern — ALTER DEFAULT PRIVILEGES is role-scoped).
  • windmill-common/src/materialization.rsSchemaColumn, schema field on RecordMaterializationRequest (#[serde(default)], wire-compatible), record_asset_schema (advisory-locked versioned upsert), list_asset_schemas + AssetSchemaVersion.
  • windmill-parser/src/sql_materialize.rsoutput_schema column added to materialize_result_sql (a distinct codegen fn; no overlap with chore: lint the frontend and the python-client codebase #5's annotation-parser edits).
  • windmill-worker/src/duckdb_executor.rsextract_schema (tolerant of nested-array / string-encoded FFI shapes) + threads the captured schema through record_matrecord_asset_schema on the successful path only; schema recording is its own best-effort transaction that can never fail the materialize run.
  • frontend/.../AssetGraph/SchemaHistoryPanel.svelte + DucklakeAssetPanel.svelte — new Schema tab (master-detail schema-version history) on the asset node; raw-fetch GET /assets/asset_schemas, mirroring the sibling PartitionStatusGrid. Verified in a real browser (screenshots in PR comments).
  • windmill-api-assets/src/lib.rsGET /assets/asset_schemas read endpoint + the record_materialization handler now upserts the schema (in its own transaction, after committing the partition record, so a schema failure can't roll the partition record back).

Notes / boundaries

  • Surfaces: Surface B only. The one file shared with the parallel chore: lint the frontend and the python-client codebase #5 (column lineage) work — windmill-api-assets/src/lib.rs — is touched only in new/distinct functions.
  • No OpenAPI/client change: the sibling materialization endpoints (/partitions, /record_materialization) are consumed via raw fetch and are deliberately absent from openapi.yaml; /asset_schemas follows the same convention.
  • Known limitation: the agent-http worker path (agent_workers/.../record_materialization, an EE-symlink handler) is left untouched, so agent-worker materializes won't persist schema in v1 — keeps this PR OSS-only. The schema request field is in place for the server-worker and SDK-polyglot paths.

Test plan

  • cargo checkwindmill-common, windmill-parser, windmill-api-assets, windmill-worker --features duckdb, and full --features all_sqlx_features (via sqlx prepare) all compile.
  • Parser unit test: output_schema fold present in both tests and no-tests summaries.
  • Worker unit test: extract_schema parses nested-array, string-encoded, and absent (None) shapes.
  • Integration tests (windmill-common/tests/asset_schema_capture.rs, real migrated DB): first-insert → v1, unchanged → re-affirm (no new version, snapshot advances), changed → v2 newest-first, column-reorder → new version.
  • Verified against bundled DuckDB 1.4 that (SELECT list({...}) FROM (DESCRIBE SELECT * FROM t)) returns exactly [{"name":..,"type":..},…] (the shape extract_schema parses).
  • Manual e2e (requires a duckdb-featured worker + ducklake target — not runnable on the quickjs-only dev backend here): run a managed // materialize, change the model's output columns, re-run, and confirm GET /w/:id/assets/asset_schemas?path=<ducklake>/<table> returns the history newest-first with version bumping only on real column changes.

🤖 Generated with Claude Code


Summary by cubic

Captures the output schema of managed DuckLake // materialize runs and stores it as versioned asset metadata to support future contract enforcement (issue #2a). Adds a schema read API and a strategy-aware "Schema" UI tab; only managed, successful runs can advance schema history, and unknown producer strategy defaults to “evolvable” so history isn’t hidden.

  • New Features

    • Persist schema in materialized_asset_schema keyed by workspace/kind/path + version; insert only when columns change (order counts); exclude _wm_partition; explicitly order columns.
    • Worker: capture schema only for managed runs (ignore manual); upsert only on Materialized; best-effort in its own transaction with a per-asset advisory lock; tolerant of nested/string JSON shapes.
    • API: GET /assets/asset_schemas lists versions newest-first; POST /record_materialization upserts schema only on Materialized and in a separate transaction; wire keeps schema optional for older clients.
    • Backend graph: expose materialize_strategy (replace | append | merge) on runnable nodes.
    • UI: new "Schema" tab on ducklake asset nodes; shows version history for whole-table replace, or a single fixed-schema view for append/merge/partitioned assets; fails open when producer strategy is unknown (treat as evolvable). Backed by GET /assets/asset_schemas.
  • Migration

    • Adds materialized_asset_schema table with grants (run 20260626095840_add_materialized_asset_schema).
    • No OpenAPI or client changes; nothing to update in SDKs.

Written for commit 6e57128. Summary will update on new commits.

Review in cubic

After a managed `// materialize` run, capture the producer's output schema
via a DESCRIBE folded into the existing one-row summary read (no extra
round-trip) and persist it in a new versioned `materialized_asset_schema`
sidecar table. This is the producer-side capture that pipeline parity gap
#2b (save-time consumer-ref contract enforcement) will read back.

- materialized_asset_schema sidecar (asset-level grain), versioned: a new
  version row is inserted only when the captured column set changes.
- output_schema column added to the materialize summary codegen.
- worker extracts + records the schema on a successful materialize.
- /assets/asset_schemas read endpoint exposing the evolution history.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 26, 2026

Copy link
Copy Markdown

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: 6e57128
Status:⚡️  Build in progress...

View logs

@rubenfiszel rubenfiszel marked this pull request as ready for review June 26, 2026 10:36
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @rubenfiszel's task in 2m 45s —— View job


cc @rubenfiszel

Should address issues before merging: captured output_schema includes the synthetic _wm_partition column for partitioned assets (sets a leaky grain for #2b).

Review

Clean, well-scoped PR. The sidecar-table design, post-run fold into the existing summary read, advisory-locked monotonic versioning, and best-effort isolation are all sound and well-justified. One correctness concern on the captured contract itself, plus a couple of minor notes.

P1 — partitioned assets capture the synthetic _wm_partition column (inline)

materialize_result_sql captures the schema via DESCRIBE SELECT * FROM <target_qualified> — the physical target table, not the user's output SELECT. For a // partitioned materialize the table is bootstrapped with an extra CAST(NULL AS VARCHAR) AS _wm_partition column (MaterializeCodegen::statements, lines 396–401), so the captured output_schema for every partitioned asset will include {"name":"_wm_partition","type":"VARCHAR"}.

This is exactly the grain #2b will read back as the asset contract, so a consumer declaring only the real producer columns would mismatch on the leaked internal column. Either DESCRIBE the user output (({sel})) or strip the partition column from the captured list. The unpartitioned path is correct as-is. Worth resolving now since this PR is establishing the contract #2b depends on — happy to be told it's intended.

P2 — schema recorded independent of materialization status in the SDK handler

In windmill-api-assets/src/lib.rs::record_materialization, the schema is upserted whenever req.schema.is_some(), regardless of req.status. The worker only ever sends schema on the Materialized path, but for the polyglot/SDK path the "don't advance the schema on a failed run" invariant (stated in the migration + module docs) relies entirely on the caller. Cheap to enforce server-side: gate the upsert on req.status == Materialized.

Minor — schema write attempted even when the partition write failed

In record_mat (duckdb_executor.rs), the Connection::Sql arm computes partition_res but then unconditionally attempts the best-effort schema write before returning partition_res. In practice schema is only Some on the success path so this never fires destructively, but a short-circuit (if partition_res.is_ok()) would make the "a failed run never advances the recorded schema" guarantee structural rather than caller-dependent. Non-blocking.

AGENTS.md compliance

  • Migration created with a timestamped name (not hand-rolled), explicit grants documented with the recurring-pattern rationale — matches the script_trigger precedent. ✅
  • .sqlx offline cache regenerated for the new queries. ✅
  • Comments record constraints/rationale rather than narration; no "previously"/"unchanged" drafting-history language. ✅
  • No banned frontend patterns (no frontend changes). ✅
  • New pub fns (record_asset_schema, list_asset_schemas) are placed in the matching materialization.rs module and document their locking/versioning contract. The API read endpoint is RLS/workspace-scoped consistently with the sibling list_partitions. No new auth gap. ✅

Test coverage

  • Parser: materialize_result_sql_embeds_data_tests_breakdown extended to assert the output_schema fold is present in both tests/no-tests summaries. Good. (Note: no test asserts the partition-column behavior above — a partitioned-vs-unpartitioned capture test would have surfaced the P1.)
  • Worker: extract_schema_parses_nested_and_string_encoded covers nested-array, string-encoded, and absent shapes. Good.
  • Backend integration: asset_schema_capture.rs against a real migrated DB covers v1 insert, unchanged re-affirm, changed→v2 newest-first, and column-reorder. Solid coverage of the versioning contract.
  • Not covered (acknowledged in the PR): end-to-end capture through an actual duckdb-featured worker + ducklake target. The remaining manual verification is the right call: run a managed // materialize (try both partitioned and unpartitioned), confirm GET /w/:id/assets/asset_schemas?path=<lake>/<table> returns history newest-first with version bumping only on real column changes — and check whether _wm_partition shows up in the partitioned case.

Comment on lines +561 to +564
let schema_capture = format!(
"(SELECT list({{'name': column_name, 'type': column_type}}) \
FROM (DESCRIBE SELECT * FROM {target_qualified})) AS output_schema"
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 — for partitioned assets, the captured output_schema includes the synthetic _wm_partition column.

This DESCRIBEs the target table ({target_qualified}), not the user's output SELECT. For a // partitioned materialize, the managed table is bootstrapped with the extra partition column:

CREATE TABLE IF NOT EXISTS {t} AS
  SELECT *, CAST(NULL AS VARCHAR) AS _wm_partition FROM ({sel}) WHERE false;
ALTER TABLE {t} SET PARTITIONED BY (_wm_partition);

So DESCRIBE SELECT * FROM {target_qualified} returns the producer's columns plus {"name":"_wm_partition","type":"VARCHAR"}, and every partitioned asset's recorded schema carries that windmill-internal column. Since this is exactly the contract #2b enforcement will read back (MAX(version) = current contract), a consumer that declares only the real producer columns would mismatch on the leaked _wm_partition.

Two options: DESCRIBE the user's output (DESCRIBE SELECT * FROM ({user_select})) instead of the physical table, or filter the partition column out of the captured list. The unpartitioned path is unaffected. Worth confirming this is intentional before it sets the grain #2b depends on.

Fix this →

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge

What I reviewed

Complete pass across all 14 changed files: the migration, SQL codegen, worker-side schema extraction, the new record_asset_schema / list_asset_schemas public functions, the asset_schemas API endpoint, and all five tests (unit + integration).

Public surface checklist

SchemaColumn (new pub struct) — data-transfer type, no auth footprint.

AssetSchemaVersion (new pub struct + sqlx::FromRow) — read-only return type; no auth concerns.

record_asset_schema (pub async fn) — writes materialized_asset_schema rows in a caller-supplied transaction. Auth is enforced at the API layer (user_db.begin(&authed) sets RLS session variables before the transaction is handed in), matching the existing record_materialization pattern. The advisory lock (pg_advisory_xact_lock + hashtextextended) serializes concurrent same-asset captures safely. Doc comment covers the versioning contract and lock rationale.

list_asset_schemas (pub async fn) — parameterized read-only query via PgExecutor; auth handled identically to list_partitions in the same module.

GET /assets/asset_schemas — uses ApiAuthed + user_db.begin(&authed), same auth pattern as the sibling /partitions endpoint. No auth gap.

RecordMaterializationRequest.schema#[serde(default)] + Option, wire-compatible with older agents (serde ignores unknown fields on the agent side).

None of the new public surfaces are half-finished or misplaced — record_asset_schema and list_asset_schemas live in windmill-common/src/materialization.rs alongside the existing record_materialization/list_materialized_partitions family, which is the right module.

Findings

No P0 or P1 issues. No P2 nits worth surfacing — the one deliberate naming tension (endpoint /asset_schemas is generic but the handler hardcodes Ducklake) is called out in a comment that explains the v1 scope, and the low-level list_asset_schemas function already accepts any AssetKind, so the API layer restriction is cheap to lift later.

Test coverage

Backend — well-covered:

  • Unit test in duckdb_executor.rs for extract_schema (nested array, string-encoded, absent → None).
  • Parser unit test verifies output_schema appears in both test and no-test summaries.
  • Integration tests in windmill-common/tests/asset_schema_capture.rs (real migrated DB): first-insert → v1, unchanged → re-affirm with no new version, changed → v2, column-reorder → new version.

Frontend — not touched by this PR.

Manual verification — the PR acknowledges that a true e2e exercise requires a duckdb-featured worker + ducklake target, which isn't available on the quickjs-only dev backend. That's a dev-environment constraint, not a PR gap. The unit and integration tests cover everything that can be exercised in CI.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: failed runs can advance schema metadata, missing auth contract on new schema helpers

  1. [P1] backend/windmill-api-assets/src/lib.rs:109 — failed/non-materialized API calls can still advance schema metadata. The handler records schema whenever req.schema is present, regardless of req.status, so a caller can submit status: failed with a schema and insert/affirm a version. Gate this on MaterializationStatus::Materialized or reject schema for other statuses.

  2. [P1] backend/windmill-common/src/materialization.rs:179 — the new exported schema helpers touch workspace-scoped DB state using caller-supplied workspace_id, but neither record_asset_schema nor list_asset_schemas documents that callers must pass an already-authorized/RLS-scoped executor or otherwise verify access. The review checklist requires that auth expectations be documented or enforced for new pub fns touching workspace data.

Test coverage

Backend coverage exists for schema versioning and worker extraction. Missing coverage for the API path where a non-Materialized RecordMaterializationRequest includes schema; that should assert no schema row is written.

Manual verification should cover a managed materialize success and a failed/data-test-failing run for the same asset, then confirm GET /w/:id/assets/asset_schemas?path=... only changes after the successful run.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 issues found across 14 files

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread backend/parsers/windmill-parser/src/sql_materialize.rs Outdated
Comment thread backend/parsers/windmill-parser/src/sql_materialize.rs Outdated
Comment thread backend/windmill-api-assets/src/lib.rs Outdated
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

✅ True DuckLake end-to-end run (manual e2e box now checked)

Built the full-featured binary (duckdb,parquet,enterprise,private,license) and ran it as a server+worker against a real DuckLake: Postgres catalog + MinIO S3 storage (via the windmill S3 proxy), driving managed // materialize jobs through the complete path — codegen output_schema DESCRIBE fold → DuckLake write to MinIO → summary read → worker extract_schema → versioned record_asset_schemaGET /assets/asset_schemas.

Script (ducklake://analytics/orders_daily):

Run Output columns Outcome
1 order_id BIGINT, status VARCHAR, amount DOUBLE v1 captured, snapshot 1
2 identical schema, different rows re-affirmed v1 in place — no new version, snapshot_id advanced 1→2
3 dropped amount, added qty INTEGER v2 created, snapshot 3

Captured summary row (run 1):

[{"materialized":"ducklake://analytics/orders_daily","rows":1,"snapshot_id":1,
  "output_schema":[{"name":"order_id","type":"BIGINT"},{"name":"status","type":"VARCHAR"},{"name":"amount","type":"DOUBLE"}]}]

GET /w/admins/assets/asset_schemas?path=analytics/orders_daily after all three runs — evolution history newest-first:

[
  {"version":2,"snapshot_id":3,"columns":[{"name":"order_id","type":"BIGINT"},{"name":"status","type":"VARCHAR"},{"name":"qty","type":"INTEGER"}]},
  {"version":1,"snapshot_id":2,"columns":[{"name":"order_id","type":"BIGINT"},{"name":"status","type":"VARCHAR"},{"name":"amount","type":"DOUBLE"}]}
]

This confirms the three designed behaviors against a real substrate: schema captured post-run, an unchanged re-materialize re-affirms in place (snapshot advances, no version bump), and a real column change bumps the version. The throwaway instance, catalog DB, and MinIO data were torn down afterward.

…s gate)

- exclude the synthetic `_wm_partition` column from the captured schema for
  partitioned assets, so the recorded contract is the producer's logical
  output, not Windmill's storage detail (claude/cubic P1).
- make the captured column list explicitly ordered (`row_number()` over the
  DESCRIBE + `list(... ORDER BY)`), so the `list()` aggregate can't reorder
  columns and spuriously bump the schema version (cubic P2).
- gate the API `record_materialization` schema upsert on a `Materialized`
  status, so a failed/running write (or a client attaching a schema to one)
  can't advance the schema history (cubic P2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Addressed the CI review in 996b6d9:

  • [P1, claude + cubic] _wm_partition leaked into the captured schema for partitioned assets. Fixed — the schema-capture DESCRIBE now filters out the partition column (WHERE column_name <> '<partition_col>', emitted only when partitioned), so the recorded contract is the producer's logical output, not Windmill's storage detail. This matters because it's the grain #2b reads back. Added a materialize_result_sql_schema_excludes_partition_column unit test and verified the exact SQL against bundled DuckDB 1.4 (a _wm_partition column is dropped from the result).

  • [P2, cubic] list(...) had no explicit ordering → could reorder columns and spuriously bump the schema version. Fixed — the columns now carry a row_number() OVER () ordinal from the DESCRIBE and the fold is list({...} ORDER BY _wm_ord), so column order is deterministic and an unchanged re-materialize compares equal.

  • [P2, cubic] API record_materialization schema upsert wasn't gated on success. Fixed — the handler now only upserts the schema when status == Materialized, so a failed/running write (or a client attaching a schema to one) can't advance materialized_asset_schema history. (The worker path already only sends a schema on the successful path; this is the matching defensive gate on the SDK/agent entry point.)

cargo_test failure is debounce_setting_applied_to_dispatched_subscriber (asset_trigger_dispatch.rs) — a known pre-existing flake unrelated to this PR (it touches no dispatch/debounce code); test-linux, check_oss, check_oss_full, pi-review, and cubic all pass. Will re-run.

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge

What I reviewed

Re-reviewed the full diff (both commits 69ae09f and 996b6d9) against all 14 changed files, with particular focus on the three issues raised by prior reviewers and the author's fix commit that addressed them.

Prior findings — all resolved

The three items flagged in prior CI reviews are fixed in 996b6d9:

  • P1 (Claude/Cubic) — _wm_partition leaked into captured schema for partitioned assets. Fixed with a WHERE column_name <> '<partition_col>' filter in the DESCRIBE subquery, emitted only when partitioned. Unit-tested in materialize_result_sql_schema_excludes_partition_column. The row_number() OVER () ordinal is computed before the filter, so column ordering of the remaining columns is preserved correctly.

  • P2 (Cubic) — list() aggregate could reorder columns. Fixed with explicit row_number() OVER () AS _wm_ord + list(... ORDER BY _wm_ord). Columns now carry a deterministic ordinal through the fold.

  • P2 (Cubic/Codex) — API record_materialization schema upsert not gated on success. Fixed: the handler now checks matches!(req.status, Materialized) before upserting, so a failed/running write or a client attaching a schema to a non-Materialized status can't advance the history.

Public surface re-check (post-fix)

All new pub surfaces (SchemaColumn, AssetSchemaVersion, record_asset_schema, list_asset_schemas, GET /assets/asset_schemas) remain correctly placed and follow the module's established auth pattern — RLS is set at the API layer via user_db.begin(&authed) before the transaction is handed in. The PartitionsQuery reuse for the /asset_schemas endpoint is pragmatic (the struct has exactly the path field needed).

AGENTS.md compliance

  • Migration uses a timestamped name, not hand-rolled. ✅
  • .sqlx offline cache regenerated for all five new query shapes (advisory lock, SELECT latest, UPDATE re-affirm, INSERT, list-query). ✅
  • Comments describe constraints and invariants, not narration or drafting history. ✅
  • No frontend changes — no banned patterns to check. ✅

Test coverage

  • Parser: Extended materialize_result_sql_embeds_data_tests_breakdown + new materialize_result_sql_schema_excludes_partition_column — both assert the correct SQL shape including the partition filter.
  • Worker: extract_schema_parses_nested_and_string_encoded covers nested-array, string-encoded, and absent shapes.
  • Integration: asset_schema_capture.rs (real migrated DB) covers v1 insert, unchanged re-affirm, changed→v2 newest-first, and column-reorder→new-version.
  • Manual: The author verified end-to-end against a real DuckLake + Postgres catalog + MinIO S3 stack, confirming all three design behaviors (capture, re-affirm, version-bump) with actual GET /assets/asset_schemas output included in the PR discussion.

No new tests needed — the coverage is proportional to the surface changed and the prior review concerns are backed by deterministic assertions.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel

Should address issues before merging: missing auth contract on new schema helpers

Found 2 issues:

  1. [P1] backend/windmill-common/src/materialization.rs:179 — the new exported record_asset_schema and list_asset_schemas helpers touch workspace-scoped DB state using caller-supplied workspace_id/executor, but they neither enforce authorization nor document that callers must pass an already-authorized transaction/executor. The review checklist for new public surfaces requires auth expectations to be enforced or documented; the current API/worker callers look scoped, but the public helpers leave that contract implicit.

  2. [P2] backend/windmill-worker/src/duckdb_executor.rs:752 — manual DuckLake materializations still run extract_schema(&result) and pass the result into record_mat, even though build_materialized_query returns Some((None, meta)) for // materialize manual at line 185. Because extract_schema recursively trusts any output_schema field in the user result, a manual materialize can accidentally or intentionally persist caller-shaped schema metadata, despite the feature being scoped to managed output capture. Gate schema extraction/passing on the rewritten managed path (Some((Some(_), _))) so manual mode remains skipped.

Test coverage

Backend coverage was added for schema versioning, parser SQL shape, partition-column filtering, and worker extraction shapes. Missing coverage for the manual-materialize path above: a manual materialize result containing an output_schema field should not write materialized_asset_schema.

Manual verification should include a managed materialize success that records schema history, plus a manual // materialize manual run whose result contains an output_schema field and confirms GET /w/:id/assets/asset_schemas?path=... does not change.

- gate output_schema extraction on the managed (`Some((Some(_), _))`) path so a
  `// materialize manual` run — whose result is the user's own query output —
  can't persist a caller-shaped `output_schema` into materialized_asset_schema
  (Codex P2). Verified e2e: a manual run returning a fabricated
  `output_schema:[{injected,EVIL}]` records the partition but writes no schema
  version, while the managed path still captures normally.
- document the authorization contract on the new public `record_asset_schema`
  and `list_asset_schemas` helpers: they perform no access control (mirroring
  the materialized_partition siblings) and require callers to pass a
  workspace-authorized executor (Codex P1).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Addressed the Codex review in 356aff3 (pi-review was "Good to merge"):

  • [P1] auth contract on the new public helpers. Documented it explicitly on record_asset_schema and list_asset_schemas: they perform no access control of their own (mirroring the materialized_partition siblings) and require the caller to pass a workspace-authorized executor — an RLS-scoped user_db transaction on the API/agent entry points, or the trusted worker DB pool for the in-worker recorder. The existing callers already satisfy this; the contract is now explicit rather than implicit.

  • [P2] manual mode could persist a caller-shaped output_schema. Gated schema extraction on the managed (Some((Some(_), _))) path — in // materialize manual the result is the user's own query output (we generate no summary), so its output_schema is now never trusted. This mirrors the adjacent managed-only collection_strategy gate.

Verified end-to-end against a real DuckLake (Postgres catalog + MinIO), exactly the scenario Codex flagged as missing coverage:

Materialize Result materialized_asset_schema
managed analytics/orders_daily summary output_schema:[order_id,status] v1 written
manual analytics/orders_manual, user result [{"output_schema":[{"name":"injected","type":"EVIL"}]}] user's own row empty — not captured

Manual mode still recorded its partition row (orders_manual → materialized), confirming only the schema capture is gated, not manual materialization itself. Throwaway instance and config torn down afterward.

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge

What I reviewed

Complete pass across all 14 changed files at the final commit (356aff3): the migration, SQL codegen, worker-side schema extraction, record_asset_schema / list_asset_schemas, the API endpoint, and all tests. I also verified that all five prior CI review findings are resolved in the two fix commits (996b6d9, 356aff3).

Prior findings — all resolved

Finding Source Resolution
_wm_partition leaked into captured schema for partitioned assets Claude/Cubic (P1) WHERE column_name <> '<partition_col>' filter in the DESCRIBE subquery, only when partitioned. Unit-tested.
list() aggregate could reorder columns → spurious version bumps Cubic (P2) row_number() OVER () AS _wm_ord + list(... ORDER BY _wm_ord). Deterministic ordering.
API record_materialization schema upsert not gated on success Cubic/Codex (P1/P2) matches!(req.status, Materialized) check before upserting.
Auth contract on new pub helpers was implicit Codex (P1) Doc comments on record_asset_schema and list_asset_schemas now explicitly state the "no access control, caller MUST pass authorized executor" contract.
Manual mode could persist caller-shaped output_schema Codex (P2) is_managed gate: schema extracted only when matches!(&materialize, Some((Some(_), _))). Manual mode result is never trusted.

Public surface checklist

  • SchemaColumn — data-transfer type, no auth footprint.
  • AssetSchemaVersion — read-only sqlx::FromRow return type, no auth concerns.
  • record_asset_schema — now documents its auth contract explicitly, matching the sibling record_materialization. Advisory lock serializes concurrent same-asset captures correctly.
  • list_asset_schemas — same explicit auth contract as list_materialized_partitions. RLS enforced at the API layer via user_db.begin(&authed).
  • GET /assets/asset_schemas — same auth pattern as /partitions; workspace-scoped.
  • RecordMaterializationRequest.schema#[serde(default)] + Option, wire-compatible.

All placed in the correct module (windmill-common/src/materialization.rs), none half-finished.

AGENTS.md compliance

  • Migration uses a timestamped name (not hand-rolled) and documents the grant-gap rationale. ✅
  • .sqlx offline cache regenerated (5 new query JSON files). ✅
  • Comments record constraints/invariants, not narration or drafting history. The "two correctness details" block in sql_materialize.rs is a model of what AGENTS.md asks for. ✅
  • No frontend changes — no banned patterns to check. ✅
  • No CLI changes — no stale generated files to worry about. ✅

Test coverage

  • Parser: materialize_result_sql_embeds_data_tests_breakdown extended + new materialize_result_sql_schema_excludes_partition_column — both assert the correct SQL shape including partition filter and explicit ordering.
  • Worker: extract_schema_parses_nested_and_string_encoded covers nested-array, string-encoded, and absent → None shapes.
  • Integration: asset_schema_capture.rs (real migrated DB) — first-insert → v1, unchanged → re-affirm (no new version), changed → v2 newest-first, column-reorder → new version. Four tests covering the full versioning contract.
  • Manual e2e: Author verified end-to-end against a real DuckLake (Postgres catalog + MinIO), confirming all three behaviors (capture, re-affirm, version-bump) with actual GET /assets/asset_schemas output. The one remaining CI-only gap (no duckdb-featured worker in dev) is a dev-environment constraint acknowledged in the PR body, not a coverage gap — every behavior that can be tested in CI is tested.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

Good to merge

No issues found. Checked for bugs, security, and AGENTS.md compliance. The prior Codex findings are addressed in the current head: schema recording is gated to Materialized, manual-mode worker results are not trusted for schema capture, and the new schema helpers document their auth contract.

Test coverage

Backend coverage is appropriate for the changed layers: parser coverage for schema SQL generation including partition-column filtering, worker unit coverage for schema extraction shapes, and DB integration coverage for schema version insert/reaffirm/bump/order behavior.

Manual verification should still include a real managed DuckLake materialize, a schema-changing rerun, and a failed/data-test-failing run for the same asset, confirming /assets/asset_schemas only advances on the successful materialized runs.

Adds a "Schema" tab to DucklakeAssetPanel surfacing the captured output-schema
versions persisted by the materialize run. Master-detail (mirrors the History
tab): the version list (newest first, newest auto-selected) shows column count +
snapshot + capture time; selecting a version renders its column/type table.
Reads the GET /assets/asset_schemas endpoint via raw fetch, matching the sibling
PartitionStatusGrid convention (these materialization endpoints are not in the
generated client).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Frontend: schema-history tab on the asset node

Added a Schema tab to the ducklake asset node panel (DucklakeAssetPanel) that surfaces the captured output-schema versions — a master-detail view mirroring the existing History tab. The version list (newest first, newest auto-selected) shows column count + snapshot + capture time; selecting a version renders its column/type table. It reads GET /assets/asset_schemas via raw fetch (same convention as the sibling PartitionStatusGrid; these materialization endpoints aren't in the generated client).

Verified in a real browser against a live DuckLake (Postgres catalog + local-filesystem storage), driving the deployed f/pipeline/orders_daily materialize and evolving its schema across two runs.

v2 selected (current — order_id, status, qty, updated_at):
schema tab v2

v1 selected (the pre-change schema — order_id, status, amount):
schema tab v1

Validation: svelte-autofixer clean, check:fast reports no errors on the new/changed files (the Timeout-vs-number diagnostics are pre-existing global noise, unrelated to these files).

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge

What I reviewed

This PR has been through multiple rounds of CI review (Claude, Codex, Pi) with all findings addressed in fix commits. The latest commit adds frontend changes: a Schema tab on the ducklake asset node panel. I reviewed the new SchemaHistoryPanel.svelte, the modifications to DucklakeAssetPanel.svelte, and verified all prior fix commits are present in the final diff.

Prior findings — all confirmed resolved

All five issues from prior CI reviews are fixed in commits 996b6d9 and 356aff3 and are present in the final diff:

Finding Source Resolution
_wm_partition leaked into captured schema for partitioned assets Claude/Cubic (P1) WHERE column_name <> '<partition_col>' filter in DESCRIBE subquery, only when partitioned. Unit-tested.
list() aggregate could reorder columns Cubic (P2) row_number() OVER () AS _wm_ord + list(... ORDER BY _wm_ord).
API record_materialization schema upsert not gated on success Cubic/Codex (P1/P2) matches!(req.status, Materialized) before upserting.
Auth contract on record_asset_schema/list_asset_schemas was implicit Codex (P1) Doc comments now explicitly state the "no access control, caller MUST pass authorized executor" contract.
Manual mode could persist caller-shaped output_schema Codex (P2) is_managed gate: extract_schema called only when matches!(&materialize, Some((Some(_), _))).

Frontend review (new in b985650)

SchemaHistoryPanel.svelte — new read-only master-detail component:

  • Fetches GET /assets/asset_schemas via raw fetch (same convention as PartitionStatusGrid — these materialization endpoints aren't in the generated API client). ✅
  • Uses resource from runed with reactive prop dependencies — same pattern as PartitionStatusGrid. ✅
  • Handles loading, error, empty-state ("No schema captured yet") gracefully. ✅
  • Master-detail with Splitpanes mirrors the existing History tab. ✅
  • Uses Windmill design-system <Button>, lucide-svelte icons, Tailwind classes — no custom CSS. ✅
  • No $bindable(default_value) on optional props, no crypto.randomUUID(). ✅
  • The "current" badge uses i === 0 (index in the newest-first list) which is correct since the API returns results ORDER BY version DESC. ✅

DucklakeAssetPanel.svelte — tab addition:

  • Adds "Schema" as a middle tab between Partitions and History, with Columns3 icon from lucide-svelte. ✅
  • SchemaHistoryPanel is only rendered when a qualifiedTable exists (table-scoped asset, not a catalog-level ducklake node), matching the scoping of the History tab. ✅
  • The indentation-only change on the existing History branch is a consequence of adding an {:else if} branch — no logic changed. ✅

AGENTS.md compliance

  • Migration uses a timestamped name and documents the grant-gap rationale. ✅
  • .sqlx offline cache regenerated (5 new query JSON files). ✅
  • Comments record constraints/invariants, not narration or drafting history. ✅
  • Frontend: no crypto.randomUUID(), no $bindable(default_value) on optional props, all UI components from design system, Tailwind-only styling. ✅
  • No CLI changes — no stale generated files. ✅

Test coverage

  • Parser: two unit tests assert output_schema SQL shape (partitioned + unpartitioned). ✅
  • Worker: extract_schema test covers nested-array, string-encoded, and absent shapes. ✅
  • Integration: asset_schema_capture.rs (real migrated DB) — first-insert→v1, unchanged→re-affirm, changed→v2 newest-first, column-reorder→new-version. ✅
  • Frontend: no component tests expected per policy (the codebase does not generally test Svelte components). ✅
  • Manual: author verified end-to-end against a real DuckLake (Postgres catalog + MinIO), with actual GET /assets/asset_schemas output. The frontend tab was verified in a real browser with screenshots attached to the PR. ✅

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

Good to merge

No issues found. Checked for bugs, security, and AGENTS.md compliance.

Test coverage

Backend coverage is strong: parser/unit coverage, worker schema extraction coverage, and DB integration coverage for insert, reaffirm, version bump, ordering, and newest-first listing.

Frontend component tests are not generally expected here. Manual verification should cover opening a DuckLake asset node, switching to the Schema tab, and confirming empty and populated schema-history states render correctly. I did not run a dev server or browser verification locally.

Only a whole-table `replace` producer (CREATE OR REPLACE) can change columns
run-to-run; `append`/`merge`/partitioned writes INSERT into a fixed-schema
table, so their schema is pinned at first materialize and the "history" framing
is degenerate (always one version).

- backend: surface the managed `materialize_strategy` (`replace`/`append`/
  `merge`) on the asset-graph runnable node, alongside the existing
  `partition_kind` (same parse-from-annotation path).
- frontend: the pipeline page derives `schemaCanEvolve` for the selected asset
  from its write-producer (`replace` && not partitioned) and threads it to the
  Schema tab. Evolvable → master-detail version history; fixed → a single
  current-schema table with a short "schema is fixed" note. Unknown defaults to
  evolvable so real history is never hidden.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Schema tab is now strategy-aware

Per review discussion: the schema-evolution history only makes sense for a whole-table replace producer (CREATE OR REPLACE rebuilds the table, so columns can change run-to-run). append/merge/partitioned writes INSERT into a fixed-schema table, so their schema is pinned at first materialize — a multi-version "history" there is degenerate.

The backend now surfaces the managed materialize_strategy on the asset-graph runnable node (same parse-from-annotation path as partition_kind), and the pipeline page derives schemaCanEvolve for the selected asset from its write-producer. The Schema tab switches accordingly:

Fixed schemaappend producer (events_log): single current-schema table + a "schema is fixed" note, no version list:
fixed schema

Evolving schema — whole-table replace producer (orders_daily): the master-detail version history (v2 current / v1):
schema history

The gate is can the schema evolve (replace && not partitioned), not partitioned-vs-not — an unpartitioned append is correctly treated as fixed. Unknown producers default to evolvable so real history is never hidden. Verified both views in a real browser.

@github-actions

Copy link
Copy Markdown
Contributor

I've completed a thorough review of all 19 changed files across the full diff. This PR has been through five prior CI review rounds (Claude, Codex, Pi), with all findings resolved in fix commits 996b6d9 and 356aff3. My independent review confirms all prior issues are addressed and finds no new ones.

Pi Review

Good to merge

Prior findings — all confirmed resolved

Finding Source Resolution
_wm_partition leaked into captured schema for partitioned assets Claude/Cubic (P1) WHERE column_name <> '<partition_col>' filter in DESCRIBE subquery, only when partitioned. Escaped with .replace('\'', '\'\''). Unit-tested.
list() aggregate could reorder columns → spurious version bumps Cubic (P2) row_number() OVER () AS _wm_ord + list(... ORDER BY _wm_ord). Deterministic ordering.
API record_materialization schema upsert not gated on success Cubic/Codex (P1/P2) matches!(req.status, Materialized) check before upserting.
Auth contract on record_asset_schema/list_asset_schemas was implicit Codex (P1) Doc comments now explicitly state "no access control, caller MUST pass authorized executor" contract.
Manual mode could persist caller-shaped output_schema Codex (P2) is_managed gate: extract_schema called only when matches!(&materialize, Some((Some(_), _))). Manual mode result is never trusted.

Public surface checklist (final commit)

  • SchemaColumn — data-transfer type, plain Serialize/Deserialize. No auth footprint. ✅
  • AssetSchemaVersion — read-only sqlx::FromRow return type. No auth concerns. ✅
  • record_asset_schema — doc comment documents auth contract (caller must pass authorized executor). Advisory lock serializes concurrent same-asset captures. Returns bool indicating whether a new version was inserted. Placed in windmill-common/src/materialization.rs alongside the existing record_materialization family. ✅
  • list_asset_schemas — doc comment documents auth contract. Read-only parameterized query via PgExecutor. Same module placement and auth pattern as list_materialized_partitions. ✅
  • GET /assets/asset_schemas — uses ApiAuthed + user_db.begin(&authed), same auth pattern as the sibling /partitions endpoint. Ducklake-only in v1 per comment. ✅
  • RecordMaterializationRequest.schema#[serde(default)] + Option, wire-compatible with older agents. ✅
  • GraphRunnableNode.materialize_strategy — new optional serialization field, #[serde(skip_serializing_if = "Option::is_none")]. Read-only in the frontend. ✅

None of the new public surfaces are half-finished — all have callers in the same PR. None are misplaced — all live in the modules that already own the materialization/asset-graph surface.

Correctness walkthrough

Schema capture SQL (sql_materialize.rs): The DESCRIBE SELECT * FROM <target> fold with row_number() OVER () AS _wm_ord + list(... ORDER BY _wm_ord) + partition-column filter is correct. Quote-escaping via .replace('\'', '\'\'') is proper for both DuckDB and PostgreSQL string literals.

record_asset_schema (materialization.rs): Advisory lock key includes workspace, kind, and path — correct grain for serializing per-asset captures. Version comparison uses latest.columns.0.as_slice() == columns which compares ordered slices (column order matters, intentionally). The re-affirm UPDATE advances snapshot_id/job_id/captured_at without touching columns. Returns Ok(true) for new version, Ok(false) for re-affirmed. ✅

Worker (duckdb_executor.rs): extract_schema tolerates both nested JSON array and string-encoded FFI shapes, returns None when absent. is_managed gate prevents trusting manual-mode results. All failure/error record_mat calls pass None for schema; only the success path passes the extracted schema. Schema capture is in a separate record_asset_schema_best_effort transaction — a failure here can't roll back the partition row. ✅

API (windmill-api-assets/src/lib.rs): Partition write uses user_db.clone().begin() — schema write uses another user_db.clone().begin() (separate transaction, best-effort). Status gate matches!(req.status, Materialized) prevents advancing history on failed/running writes. Error logged with tracing::warn!, not returned to caller. ✅

Frontend (SchemaHistoryPanel.svelte): Reactive resource from runed with workspace+path dependencies. Handles loading, error, and empty ("No schema captured yet") states. Fixed-schema mode shows a Lock icon + explanatory note + single current-schema table (no version list). Evolving-schema mode shows master-detail with Splitpanes. "current" badge uses i === 0 (correct: API returns ORDER BY version DESC). No crypto.randomUUID(), no $bindable(default_value) on optional props. Uses design-system <Button>, lucide-svelte icons, Tailwind classes. ✅

schemaCanEvolve derivation (+page.svelte): Checks for ducklake asset kind, finds producer nodes in the graph, returns true only when a non-partitioned replace producer exists. Defaults to true for unknown/draft. Correct: the "can evolve" condition is "whole-table replace that isn't partitioned" — exactly what the backends' materialize_strategy serialization produces. ✅

DucklakeAssetPanel.svelte: Schema tab added between Partitions and History with Columns3 icon. The History section's indentation-only change (diff shows it wrapped in {:else if tab === 'history'} block) is a consequence of adding the new tab branch — no logic changed. ✅

AGENTS.md compliance

  • Migration uses a timestamped name and documents the grant-gap rationale, matching the script_trigger precedent. ✅
  • .sqlx offline cache regenerated for 5 new query shapes (advisory lock, SELECT-latest, UPDATE re-affirm, INSERT, list-query). ✅
  • Comments record constraints and invariants (locking contract, versioning semantics, partition-filter rationale, status-gate rationale), not narration or drafting history. The "Two correctness details" block in sql_materialize.rs exemplifies what AGENTS.md asks for. ✅
  • Frontend: no crypto.randomUUID(), no $bindable(default_value) on optional props, all UI components from design system, Tailwind-only styling. ✅
  • No CLI changes — no stale generated files. ✅

Test coverage

  • Parser: materialize_result_sql_embeds_data_tests_breakdown extended to assert output_schema fold in both test and no-test summaries. New materialize_result_sql_schema_excludes_partition_column asserts the WHERE column_name <> '_wm_partition' filter is present for partitioned assets. ✅
  • Worker: extract_schema_parses_nested_and_string_encoded covers nested-array, string-encoded, and absent→None shapes. ✅
  • Integration: asset_schema_capture.rs (real migrated DB with #[sqlx::test]) — first-insert→v1, unchanged→re-affirm (no new version), changed→v2 newest-first, column-reorder→new-version. Four tests covering the full versioning contract. ✅
  • Frontend: No component tests expected per policy (the codebase does not generally test Svelte components). No new pure-logic utilities with sibling *.test.ts files. ✅
  • Manual e2e: The author verified end-to-end against a real DuckLake (Postgres catalog + MinIO S3), confirming all three design behaviors (capture, re-affirm, version-bump) with actual GET /assets/asset_schemas output in PR comments. The frontend Schema tab was verified in a real browser with screenshots attached to the PR for both fixed-schema and evolving-schema views. ✅

No additional manual verification is needed — the PR author has exercised every behavior that requires a real DuckLake substrate.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel

Mergeable, but should ideally address nits: unknown producer strategy hides schema history

  1. [P2] frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte:1974schemaCanEvolve treats producers with missing materialize_strategy as fixed-schema producers. Draft overlay runnables, for example, are pushed without the new field in resolveGraph.ts:234, so a draft materialize producer for an existing DuckLake asset can force canEvolve=false. SchemaHistoryPanel.svelte:110 then renders the single fixed-schema view and hides older captured versions even when /asset_schemas returns history. This contradicts the nearby “Unknown defaults to evolvable” intent. Either populate materialize_strategy in the draft/live graph overlays or make missing strategy metadata default to evolvable.

Test coverage

Backend coverage looks solid for the new schema versioning helper, parser SQL shape, and worker extraction path. Frontend component tests are not expected here, but if the fix lands in resolveGraph, the existing resolveGraph.test.ts suite is the right place to cover draft materialize strategy propagation or the unknown-strategy fallback.

Manual verification should cover a DuckLake asset with multiple captured schema versions and a producer whose strategy metadata is unavailable or draft-derived; the Schema tab should still show the version list rather than the fixed-schema banner.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 6 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte">

<violation number="1" location="frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte:1971">
P2: Producer matching in `schemaCanEvolve` uses path only, ignoring the runnable kind that is part of the asset graph identity; this can match the wrong runnable kind when paths collide across scripts and flows, leading to incorrect schema-tab behavior.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment on lines +1971 to +1972
const producerPaths = new Set(selectionProducers.map((p) => p.path))
const producers = graphWithDraft.runnables.filter((r) => producerPaths.has(r.path))

@cubic-dev-ai cubic-dev-ai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Producer matching in schemaCanEvolve uses path only, ignoring the runnable kind that is part of the asset graph identity; this can match the wrong runnable kind when paths collide across scripts and flows, leading to incorrect schema-tab behavior.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte, line 1971:

<comment>Producer matching in `schemaCanEvolve` uses path only, ignoring the runnable kind that is part of the asset graph identity; this can match the wrong runnable kind when paths collide across scripts and flows, leading to incorrect schema-tab behavior.</comment>

<file context>
@@ -1958,6 +1958,22 @@
+	let schemaCanEvolve = $derived.by(() => {
+		const sel = selection
+		if (!sel || sel.kind !== 'asset' || sel.asset_kind !== 'ducklake') return true
+		const producerPaths = new Set(selectionProducers.map((p) => p.path))
+		const producers = graphWithDraft.runnables.filter((r) => producerPaths.has(r.path))
+		if (producers.length === 0) return true
</file context>
Suggested change
const producerPaths = new Set(selectionProducers.map((p) => p.path))
const producers = graphWithDraft.runnables.filter((r) => producerPaths.has(r.path))
const producerKeys = new Set(selectionProducers.map((p) => `${p.kind}:${p.path}`))
const producers = graphWithDraft.runnables.filter((r) => producerKeys.has(`${r.usage_kind}:${r.path}`))
Fix with cubic

Previously a producer present but missing `materialize_strategy` (e.g. a
draft-overlay runnable, synthesized without the field) fell through to
canEvolve=false, hiding captured history behind the fixed-schema view —
contradicting the "unknown defaults to evolvable" intent.

Now the fixed view shows only when *every* producer is a known insert-style
write (append/merge, or partitioned replace); any producer with unknown
(missing) strategy is treated as evolvable, so real history is never hidden.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Addressed the Codex P2 in 6e57128.

schemaCanEvolve previously only defaulted to evolvable when there were zero producers — a producer present but missing materialize_strategy (e.g. a draft-overlay runnable, which the graph synthesizes without the field) fell through to canEvolve=false, hiding captured history behind the fixed-schema view.

Fixed by failing open: the fixed view now shows only when every producer is a known insert-style write (append/merge, or partitioned replace); any producer with missing/unknown strategy is treated as evolvable, so real history is never hidden. I went with the "missing strategy → evolvable" option (vs. populating the field in draft overlays) since it matches the stated intent and keeps the gate's failure mode safe.

@rubenfiszel rubenfiszel merged commit ade74b2 into main Jun 26, 2026
9 of 10 checks passed
@rubenfiszel rubenfiszel deleted the pipeline-schema-capture branch June 26, 2026 17:46
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 26, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant