Skip to content

feat: column-level lineage for DuckLake pipelines (SQL-AST inferred + traceable)#9814

Merged
rubenfiszel merged 15 commits into
mainfrom
pipeline-column-lineage
Jun 27, 2026
Merged

feat: column-level lineage for DuckLake pipelines (SQL-AST inferred + traceable)#9814
rubenfiszel merged 15 commits into
mainfrom
pipeline-column-lineage

Conversation

@rubenfiszel

@rubenfiszel rubenfiszel commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Summary

Implements column-level lineage for DuckLake data pipelines — parity roadmap gap #5 (docs/pipelines-vs-dbt.md §3) — and surfaces it in the asset graph.

Inferred-first. For DuckDB scripts the lineage is derived automatically from the SQL AST (dbt-style), so no annotation is needed in the common case. windmill-parser-sql-asset walks each output-producing query's projection and maps every output column to the source columns its expression reads — passthroughs and computed columns (amount + tax AS total) — resolving each input to its asset via the same ATTACH/alias machinery the asset parser already uses.

The // column <out_col> <- <asset-uri>.<col>[, …] annotation is the override / escape hatch for what inference can't reach: polyglot transforms (Python/TS/Bash — no SQL AST), dynamic SQL (${sql.raw(...)}), or correcting a mis-inferred edge. Inferred and annotated lineage merge per output column with the annotation winning (merge_column_lineage). The annotation is the second extensible annotation family after // data_test (#9708) and follows that exact convention (head-then-tail parse, snake_case serde, full Rust↔TS parser parity via the shared fixture corpus). Lineage is pure metadata — it drives the graph view, never a runtime probe.

Scope: column lineage + its graph surface only — no execution, no schema contracts.

Changes

Backend — SQL AST inference (the automatic path)

  • windmill-parser-sql-asset: infer_column_lineage walks CTAS / CREATE VIEW / top-level (managed-materialize) SELECT projections; a ColumnIdentCollector (sqlparser Visit) gathers each output expression's identifier leaves, resolved to source-asset columns via a refactored build_from_maps (shared with the existing read-column extraction). parse_assets merges inferred lineage with // column annotations. 4 new tests (computed, passthrough, annotation-override, wildcard).
  • windmill-parser/asset_parser.rs: merge_column_lineage(inferred, annotated) (annotation wins per output column) + test.
  • windmill-api-assets: the graph endpoint runs parse_assets for DuckDB pipeline members and surfaces the merged lineage on deployed nodes (other languages fall back to annotation-only). Adds the windmill-parser-sql-asset dep + language to the member query.

Backend — // column annotation (the override, Surface A: both parsers + parity corpus)

  • asset_parser.rs: ColumnLineage/ColumnRef (snake_case serde) on ParseAssetsOutput/PipelineAnnotations, // column parsing + helpers, unit tests.
  • Shared parity corpus: pipeline_annotations.json + pipeline_annotations_parity.rs; windmill-common/assets.rs re-exports.

Frontend (parser parity + graph surface)

  • parsePipelineAnnotations.ts (+ .parity.test.ts): ColumnLineage/ColumnRef types + parsing, behaviorally identical to Rust, asserted against the same corpus.
  • types.ts / resolveGraph.ts: column_lineage on the runnable node, refreshed live from the draft parse.
  • AssetGraphCanvas.svelte + AssetGraphEdge.svelte: a "columns ×N" badge on the producer→materialized-asset write-edge (hover lists each mapping).
  • ColumnLineageView.svelte (new): a column-to-column diagram in the asset details pane, fed by selectionColumnLineage derived in the pipeline +page.svelte.

Screenshots

Auto-derived from SQL — no annotation. Script body is SELECT id, amount + tax AS order_total, customer_id AS cust FROM orders; the diagram is inferred entirely from the AST (note amount+taxorder_total):

column-lineage-inferred

Deployed pipeline graph — the "columns ×3" badge on the write-edge (hover lists every mapping):

graph-deployed

Column-lineage diagram (with explicit // column annotations, the override path):

column-lineage-view

Test plan

  • cargo test -p windmill-parser -p windmill-parser-sql-asset — 105 + 58 green (incl. SQL inference, annotation parsing, merge, parity)
  • Frontend parity + full AssetGraph vitest (115) green; npm run check 0 new errors
  • Verified live on a deployed pipeline: lineage auto-derived from SQL with no annotation (computed column included); // column annotation overrides per output column; duplicate-input ref doesn't crash the diagram
  • Reviewer (note, not a blocker): SQL-AST inference runs server-side (deploy + graph endpoint), so it shows on deployed members. Unsaved drafts show // column annotations only — the live preview uses a pinned prebuilt WASM parser that predates this feature; wiring it in is a tracked follow-up. Deployed pipelines are where column lineage matters.

🤖 Generated with Claude Code

…ation

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented Jun 26, 2026

Copy link
Copy Markdown

Deploying windmill with  Cloudflare Pages  Cloudflare Pages

Latest commit: 8b86bdc
Status: ✅  Deploy successful!
Preview URL: https://ce001bfc.windmill.pages.dev
Branch Preview URL: https://pipeline-column-lineage.windmill.pages.dev

View logs

@rubenfiszel rubenfiszel marked this pull request as ready for review June 26, 2026 10:40
@claude

claude Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Claude finished @rubenfiszel's task in 2m 55s —— View job


Review

Good to merge — no blocking issues. The parser change is correct, fully mirrored across Rust/TS, and locked down by a shared parity corpus; the frontend surfaces are coherent with the existing data_test precedent.

This is a model example of the "Surface A" discipline: every parser behavior is wired in both parsers and asserted against the same pipeline_annotations.json fixtures, so Rust↔TS divergence can't silently land.

What I verified

Parser parity (Rust ↔ TS). I traced the head-then-tail parse on both sides and they're behaviorally identical at every branch:

  • Arrow split: Rust split_once("<-") / TS indexOf('<-') — both first-occurrence.
  • Column split: Rust rsplit_once('.') / TS lastIndexOf('.') — both last-dot, so schema-qualified tables (warehouse/main.dim_products.sku) keep sku as the column and warehouse/main.dim_products as the path. ✓
  • Fail-safe ordering: single_identparse_asset_syntax matches singleIdentparseAssetSyntaxDefault, including the path === '' / from_path.is_empty() drop. ✓
  • Drop-malformed-keep-≥1, missing-arrow, multi-token output column, bare keyword, and duplicate-ref cases all covered in the fixtures and exercised on both sides.

Edge cases I poked at manually (.amount, ducklake://warehouse/orders., multiple <-, b.c <- d.e) drop consistently on both sides. No divergence found.

Wire shape. ColumnLineage/ColumnRef use snake_case serde and the TS types mirror them verbatim, so the live-draft parse and the deployed-node graph endpoint (GraphRunnableNode.column_lineage, populated from the parsed deployed body — same path as data_tests) produce identical shapes. The parity test's ASSERTED_TS_FIELDS guard ensures the new field can't be forgotten in the corpus comparison. ✓

Frontend correctness.

  • The keyed-{#each} dedup guard in ColumnLineageView (edges keyed by from->to, sources keyed by kind:path:column) correctly prevents each_key_duplicate when a // column line repeats the same upstream ref — and the fixture explicitly keeps duplicate refs at parse time (dedup is a view concern), so the two layers are consistent. ✓
  • column_lineage rides only the asset_kind === 'ducklake' write-edge, matching the data_tests badge placement. ✓
  • selectionColumnLineage concatenates across producers; the output-column box is keyed by ${l.column}:${i}, so duplicate output columns from different producers won't collide. ✓
  • resolveGraph refreshes the badge live from the draft parse for both new and already-deployed nodes. ✓

Optional polish (non-blocking)

AssetGraphEdge.svelte: when both the data-test and column badges are present, the columns badge is offset by labelY + 12 while both foreignObjects are 18px tall. The inner pills are shorter (~14px) so the practical overlap is only a couple of px, but if you want a guaranteed gap, ~16–18px of offset would fully clear it. Purely cosmetic.

AGENTS.md compliance

Comments follow the "record constraints, not narration" rule well — each explains a non-obvious invariant (why column is checked as a complete word, why the dedup guard exists, why lineage rides the ducklake edge) without narrating the next line. No customer attribution. No banned $bindable(default) patterns. Frontend uses $derived/$derived.by idiomatically.

Test coverage

  • Backend: 4 new Rust unit tests + 3 new shared fixtures covering the basic mapping, schema-qualified table, drop-malformed-keep-valid, and fully-malformed-line cases. Parity test asserts the serde shape. Appropriate and sufficient for a pure-parser change. ✓
  • Frontend: parity test extends to columnLineage against the same corpus. Per repo convention Svelte components aren't unit-tested, and ColumnLineageView/badge logic is presentational, so no component tests expected. ✓
  • Manual verification still worth doing (the one item the PR already flags as unchecked): on an already-deployed pipeline, toggle a // column line on a script and confirm the write-edge badge count and the details-pane diagram update live via the resolveGraph draft overlay without a redeploy. The screenshots confirm the deployed-graph badge and the details-pane diagram render correctly.

@github-actions

Copy link
Copy Markdown
Contributor

I've now thoroughly reviewed the parser, API, frontend components, test fixtures, and parity assertions. Let me write the final review.


Pi Review

Good to merge ✅

This PR implements column-level lineage for DuckLake pipelines via a // column annotation, following the exact same extensible annotation pattern as the previously-shipped // data_test family. The Rust↔TS parser parity is enforced by the shared fixture corpus (4 new entries), and the graph surface is cleanly layered with both an edge badge and a dedicated details-pane diagram.

What I checked

  • Parser correctness: Both parse_column_lineage_spec and parse_column_ref (Rust + TS) correctly handle malformed input, schema-qualified tables (main.dim_products.sku keeps sku), duplicate refs (faithfully preserved; dedup is deferred to the view), missing <-, multi-token output columns, and zero-valid-ref lines — all dropped fail-safe. The column keyword is a complete-word match via consume_keyword (requires trailing whitespace/EOL), so // column_name … body comments are never consumed.
  • API: column_lineage on GraphRunnableNode is populated from the deployed annotations (ann.map(|a| a.column_lineage.clone()).unwrap_or_default()) — mirrors the data_tests path exactly. No auth/authorization concern (read-only graph endpoint surfacing parsed metadata only).
  • Frontend: The badge stacking in AssetGraphEdge.svelte uses columnsBadgeY = labelY + 12 when a data-test badge is also present, placing the columns badge below. ColumnLineageView.svelte uses deterministic coordinate math (no DOM measurement) and deduplicates both source nodes and edges to avoid Svelte's each_key_duplicate — correct. Props use plain $props() (no $bindable), icons from lucide-svelte, all Tailwind styling. No crypto.randomUUID() anywhere.
  • Parity: pipeline_annotations_parity.rs and parsePipelineAnnotations.parity.test.ts both assert against the same pipeline_annotations.json fixtures, including the 4 new column-lineage cases. The TS test's ASSERTED_TS_FIELDS map includes columnLineage: true.
  • Module placement: ColumnLineage/ColumnRef live in asset_parser.rs alongside DataTest (their sibling in the annotation family) and are re-exported through windmill-common/assets.rs — consistent with the existing DataTest path. No half-finished pub fn + #[allow(dead_code)] + TODO patterns.

Test coverage

  • Backend: 4 Rust unit tests (column_lineage_basic, …_schema_qualified_…, …_drops_malformed_refs_…, …_malformed_lines_dropped_fail_safe) + the parity integration test asserts against 4 new fixture entries. Adequate for a pure-parser change with no DB or worker behavior.
  • Frontend: The parity test (parsePipelineAnnotations.parity.test.ts) asserts the TS parser against the same fixture corpus — covers all 4 column-lineage cases. No new pure-logic utility files that would warrant unit tests beyond parity.
  • Docs / config: docs/pipelines-vs-dbt.md updated to mark column lineage as shipped. No CI/workflow changes.

Manual verification needed before merge

The PR description calls out one reviewer step: edit a // column line on an already-deployed script and confirm the badge updates live (via the resolveGraph draft-overlay path) without requiring a redeploy. The resolveGraph.ts change to refresh column_lineage on existing deployed runnables when the draft parse changes is the mechanism; exercising it live confirms it works end-to-end.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: live column-lineage edits are not overlaid, column lineage is attached to every ducklake write

Found 2 issues:

  1. [P1] Live column-lineage edits are not overlaid.
    frontend/src/lib/components/assets/AssetGraph/resolveGraph.ts:243

    column_lineage is populated from parsePipelineAnnotations(d.script.content), but the pipeline page keeps the current editor buffer outside drafts until pane teardown, while emitting liveAnnotations every keystroke. applyLiveBufferOverlay consumes live annotations only for trigger edges, so toggling a // column line on an open deployed script or active draft will not update the badge/details until selection changes, save/refetch, or teardown.

  2. [P1] Column lineage is attached to every ducklake write by the same producer.
    frontend/src/lib/components/assets/AssetGraph/AssetGraphCanvas.svelte:435

    The badge is selected with only e.asset_kind === 'ducklake', and the details pane later concatenates the producer’s column_lineage for any selected asset it writes. If a script writes multiple ducklake assets, the same column mapping appears on all of them, including assets that are not the materialized output described by the annotation.

Test Coverage

Backend parser unit/parity fixtures and frontend parser parity cover the annotation grammar. Missing coverage is in graph resolution/render targeting: live editor overlay for columnLineage, and multi-ducklake-output producers.

Manual verification needed: toggle a // column line on an already-deployed open script and on an active draft; the write-edge badge and selected asset details should update immediately. Also verify a script with two ducklake outputs shows column lineage only on the intended materialized asset.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 15 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/lib/components/assets/AssetGraph/AssetGraphEdge.svelte">

<violation number="1" location="frontend/src/lib/components/assets/AssetGraph/AssetGraphEdge.svelte:77">
P3: When both the data-test badge and the column-lineage badge are present, the 12px vertical offset between two 18px-tall `foreignObject` elements leaves a ~6px overlap. Increase the offset to 16–18px to fully clear the badges visually.</violation>
</file>

Reply with feedback, questions, or to request a fix.

Fix all with cubic | Re-trigger cubic

Comment thread frontend/src/lib/components/assets/AssetGraph/AssetGraphEdge.svelte Outdated
…verride)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge — no blocking issues. The column lineage feature follows the established data_tests annotation pattern consistently across Rust ↔ TS parsers, the backend graph endpoint, and the frontend surface. The two issues flagged in the prior Codex review are pre-existing patterns shared with data_tests, not new bugs introduced by this PR.

Prior Codex findings re-examined

"Live column-lineage edits are not overlaid"

This is correct: applyLiveBufferOverlay (resolveGraph.ts:387) propagates live annotations only for trigger edges — it does not refresh data_tests badges either. Both data_tests and column_lineage badges update via seedDraftOverlays (drafts and redeploy/refetch) or the backend graph endpoint (deployed nodes). Keystroke-live overlay of annotation badges on deployed scripts is a systematic gap in the graph resolver, not a regression from this PR. The PR author explicitly notes this as a reviewer item in the test plan.

The Codex reviewer also claimed the pipeline page "keeps the current editor buffer outside drafts until pane teardown" — but that describes only deployed scripts being viewed. For actual drafts (unsaved scripts), seedDraftOverlays (line 243) already sets column_lineage from the live parse, matching the data_tests path exactly (line 240). There is no gap for drafts.

"Column lineage is attached to every ducklake write by the same producer"

This follows the identical pattern as data_tests at AssetGraphCanvas.svelte:427-431 — both badges ride every asset_kind === 'ducklake' write-edge from the same producer. In the common single-output case this is correct. For multi-output producers, both badge families are imprecise but harmless (pure metadata, never runtime). The code comment at line 431-432 explicitly states this design: "Column lineage describes the materialized output's columns, so it rides the same ducklake write-edge as the data-test badge (v1 materialize target is always ducklake)."

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical — split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.'), single_ident / singleIdent, parse_asset_syntax / parseAssetSyntaxDefault, consume_keyword with whitespace/EOL guard. Asserted against the shared pipeline_annotations.json fixtures (3 new column-lineage cases + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.

  • SQL AST inference (infer_column_lineage): Correctly walks each output-producing projection, captures identifier leaves via ColumnIdentCollector (skipping subquery depth > 0), resolves through build_from_maps (shared with extract_column_assets), and deduplicates within each output column. Wildcards, unaliased non-identifier expressions, and CTE-local references are dropped correctly. 5 Rust tests cover passthrough, computed, annotation-override, wildcard, and subquery-avoidance cases.

  • merge_column_lineage: Annotation wins per output column (HashSet-based dedup). Annotated entries kept first, inferred entries for un-annotated columns appended. Inferred-inferred dedup uses same set (first-wins). Covered by a dedicated unit test.

  • Backend API (asset_graph): column_lineage on GraphRunnableNode is populated from column_lineage_by_path — DuckDB scripts get full parse_assets (AST-inferred + annotation-merged), non-DuckDB get annotation-only. Gated to usage_kind == AssetUsageKind::Script. Endpoint uses ApiAuthed (no auth concern — read-only metadata). Same path as data_tests.

  • Frontend components:

    • AssetGraphEdge.svelte — columns badge stacked below data-test badge at labelY + 12, both foreignObjects 18px tall, badge uses Columns3 icon + count.
    • ColumnLineageView.svelte — deterministic layout (fixed ROW_H), SVG bezier connectors, sources/outputs deduplicated with Set guards, keyed {#each} preventing each_key_duplicate. Uses $derived.by idiomatically, no $bindable, no crypto.randomUUID().
    • resolveGraph.tscolumn_lineage refreshed on drafts and already-deployed runnables from the live annotation parse, mirroring the data_tests refresh comment.
  • Module placement: ColumnLineage/ColumnRef in asset_parser.rs alongside DataTest, re-exported via windmill-common/assets.rs. No half-finished pub fn + #[allow(dead_code)] + TODO patterns.

  • AGENTS.md compliance: No banned $bindable(default_value) patterns. No crypto.randomUUID() — the component uses deterministic keys. Comments record constraints (why column is a complete-word keyword, why enum cases are exhaustively enumerated, why dedup guards exist) without narrating the next line. No customer attribution.

Test coverage

  • Backend: 5 Rust unit tests for SQL inference (computed, passthrough, annotation-override, wildcard, subquery-avoidance) + 4 Rust unit tests for annotation parsing (basic, schema-qualified, drop-malformed-keep-valid, merge-override) + 3 shared parity fixtures with the full pipeline_annotations.json corpus. The parity integration test asserts the serde shape against the same fixtures. Adequate for a pure-parser change with no DB or worker behavior.

  • Frontend: parsePipelineAnnotations.parity.test.ts asserts the TS parser against the same fixture corpus including all 4 column-lineage cases. The ASSERTED_TS_FIELDS guard ensures columnLineage can't be forgotten. Per repo convention Svelte components aren't unit-tested, and ColumnLineageView/badge logic is presentational, so no component tests expected.

  • Docs: docs/pipelines-vs-dbt.md updated to mark column lineage as shipped. No CI/workflow changes.

Manual verification

No manual verification strictly required — the PR description's reviewer step (toggling a // column line on an already-deployed script to confirm live badge update) is a known limitation of the live-buffer-overlay system shared with data_tests, and the PR author explicitly documents it as such. The tests cover all the novel logic. If desired, the two scenarios are:

  • Verify that adding a // column annotation to a draft script updates the badge and detail-pane diagram immediately (via seedDraftOverlays), without requiring a save or redeploy. This tests the draft path.

  • Verify that deploying a DuckDB script with SQL column references (no annotation) automatically populates the column-lineage badge and diagram from AST inference — confirm computed columns (amount + tax AS total) appear with both source columns listed.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: DuckDB parse failures drop annotation lineage, live column-lineage edits are not overlaid, column lineage is attached to every ducklake write

Found 3 issues:

  1. [P1] DuckDB parse failures drop annotation lineage.
    backend/windmill-api-assets/src/lib.rs:749

    For DuckDB members, the graph uses parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default(). If SQL parsing fails, explicit // column / -- column annotations are discarded instead of falling back to annotations_by_path, which breaks the annotation escape hatch for unsupported or dynamic SQL.

  2. [P1] Live column-lineage edits are not overlaid.
    frontend/src/lib/components/assets/AssetGraph/resolveGraph.ts:254

    This refresh still reads parsed from d.script.content, while open editor keystrokes flow through liveAnnotations / liveContent until pane teardown. applyLiveBufferOverlay only rebuilds trigger overlays, so editing a // column line on an open deployed script or active draft leaves stale badge/details until save, refetch, or selection change.

  3. [P1] Column lineage is attached to every ducklake write.
    frontend/src/lib/components/assets/AssetGraph/AssetGraphCanvas.svelte:435

    The renderer keys lineage by producer and applies it to every ducklake write edge from that producer. If one script writes multiple ducklake assets, the same column mappings appear on all outputs, including assets the annotation/inference does not describe.

Test coverage

Backend parser/parity tests cover the grammar and basic inference. Missing coverage is around graph endpoint fallback on DuckDB parse failure, live resolveGraph overlay for columnLineage, and multi-ducklake-output targeting.

Manual verification should cover editing a // column line in an already-open deployed script and active draft, expecting badge/details to update immediately. Also verify a script with two ducklake outputs only shows lineage on the intended output.

…otations

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: DuckDB parse failures drop annotation lineage

Found 1 issue:

  1. [P1] DuckDB parse failures drop explicit column-lineage annotations.
    backend/windmill-api-assets/src/lib.rs:749

    For DuckDB pipeline members, the graph endpoint uses parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default(). But windmill-parser-sql-asset::parse_assets starts with Parser::parse_sql(&DuckDbDialect, input)?, so any SQL that DuckDB accepts but sqlparser rejects returns before parse_pipeline_annotations runs. That drops // column annotations entirely, even though the annotation is the documented escape hatch for inference misses. The DuckDB branch should fall back to annotations_by_path[r.path].column_lineage when AST parsing fails.

Test coverage

Backend parser tests cover successful inference, merge behavior, and annotation parity. Missing coverage is the graph-endpoint fallback path where DuckDB AST parsing fails but explicit // column annotations should still be surfaced.

Manual verification needed: deploy a DuckDB pipeline member with a // column annotation and SQL outside sqlparser coverage but accepted by DuckDB, then confirm the asset graph still shows the column badge and details diagram.

@github-actions

Copy link
Copy Markdown
Contributor

I've now thoroughly examined the diff, the prior reviews, and the relevant source files. Here is my review:

Pi Review

cc @rubenfiszel

Should address issues before merging: DuckDB parse failures drop annotation lineage

[P1] DuckDB parse failures silently discard // column annotations

File: backend/windmill-api-assets/src/lib.rs, lines 749–751

For DuckDB scripts, the graph endpoint calls parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default(). The parse_assets function parses both SQL-inferred lineage AND annotations in one call — but if Parser::parse_sql(&DuckDbDialect, input)? fails (syntax error, dynamic SQL with ${} interpolation, unsupported syntax), the ? returns Err and the whole function short-circuits before reaching parse_pipeline_annotations(input) (line 36–37 of the sql-asset parser). The resulting empty Vec discards any explicit // column / -- column annotations that were already parsed into annotations_by_path earlier in the handler.

This contradicts the stated purpose of the annotation: the PR description and docs/pipelines-vs-dbt.md both call // column the "override / escape hatch for what inference can't reach: polyglot transforms (Python/TS/Bash — no SQL AST), dynamic SQL (${sql.raw(...)}), or correcting a mis-inferred edge." But dynamic SQL and syntax errors are exactly the cases where the SQL parser fails — so the escape hatch is broken precisely when the user needs it most.

Fix: fall back to annotation-only lineage on parse_assets failure:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_else(|| {
        annotations_by_path
            .get(&r.path)
            .map(|a| a.column_lineage.clone())
            .unwrap_or_default()
    })

This is a regression from commit f9b13368d. Before the SQL AST inference was added, all scripts (including DuckDB) used annotations_by_path exclusively — annotations were never lost.

Prior Codex findings re-examined

The two other issues flagged by the Codex review are pre-existing patterns shared with data_tests, not regressions from this PR:

  • Live column-lineage edits not overlaidapplyLiveBufferOverlay (resolveGraph.ts:387) only rebuilds trigger overlays; data_tests has the same gap. Both badge families update via seedDraftOverlays (which iterates drafts, covering drafts and promoted deployed scripts) or backend redeploy. Editing a deployed script live in the pipeline pane only propagates badges once the pane is torn down and the script is promoted to a draft via handleDraftPersist. This is consistent behavior between column_lineage and data_tests.

  • Column lineage attached to every ducklake write — follows the identical asset_kind === 'ducklake' gate as data_tests (AssetGraphCanvas.svelte:427–435). The code comment at line 431 explicitly states the v1 materialize target is always ducklake. Multi-output imprecision is the same for both badge families.

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical — split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.'), single_ident / singleIdent, parse_asset_syntax / parseAssetSyntaxDefault. Asserted against the shared pipeline_annotations.json fixture corpus (4 new column-lineage entries + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.
  • merge_column_lineage: Annotation wins per output column (HashSet-based dedup, annotation entries kept first), covered by dedicated unit test. Correct.
  • SQL AST inference: infer_column_lineage walks output-producing projections, captures identifier leaves via ColumnIdentCollector (skipping subquery depth > 0), resolves through build_from_maps. 5 tests cover passthrough, computed, annotation-override, wildcard, and subquery-avoidance. The query_depth guard correctly prevents subquery identifiers from being misattributed to the outer FROM.
  • Frontend components: ColumnLineageView.svelte uses deterministic layout (fixed ROW_H), SVG beziers, deduplication with Set guards. No $bindable(default_value), no crypto.randomUUID(), all Tailwind styling. AssetGraphEdge.svelte badge stacking uses columnsBadgeY = labelY + 12 when data-test badge present.
  • Module placement: ColumnLineage/ColumnRef live in asset_parser.rs alongside DataTest, re-exported through windmill-common/assets.rs — consistent with the existing DataTest path. No half-finished pub fn + #[allow(dead_code)] + TODO patterns.

AGENTS.md compliance

Comments follow the "record constraints, not narration" rule — each explains a non-obvious invariant (why column is a complete-word check, why the dedup guard exists, why lineage rides the ducklake edge). No customer attribution. No banned $bindable(default) patterns. Frontend uses $derived / $derived.by idiomatically.

Test coverage

  • Backend: 4 Rust unit tests (annotation parsing) + 5 Rust unit tests (SQL AST inference) + 4 shared fixture entries exercising basic mapping, schema-qualified table, drop-malformed-keep-valid, and duplicate-ref cases. Parity test asserts serde shape against fixtures. Adequate for a parser feature. Missing: no test for the parse_assets-failure → annotation-fallback path described above.
  • Frontend: Parity test extends ASSERTED_TS_FIELDS with columnLineage: true and asserts against the shared fixture corpus. No new pure-logic utilities that would warrant standalone unit tests beyond parity.
  • Docs / config: docs/pipelines-vs-dbt.md updated to reflect shipped status. No CI/workflow changes.

Manual verification needed before merge

  1. Deploy a DuckDB pipeline with a // column annotation on a script that has dynamic SQL (${sql.raw(...)}) or a deliberate syntax error — confirm the deployed column-lineage badge appears on the write-edge and the details pane diagram renders. This validates the annotation escape hatch works end-to-end when SQL inference is unavailable.

  2. Edit a // column line on an already-open draft script — confirm the edge badge and details pane update without save/deploy. This validates the seedDraftOverlays path.

Threads the DuckDB SQL-AST column lineage (from the WASM asset parser) through
ScriptEditor -> details pane -> page -> resolveGraph, merged with // column
annotations (annotation wins) so the live preview matches the deployed graph.
Takes effect once windmill-parser-wasm-asset is republished with the inference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

cc @rubenfiszel

Should address issues before merging: DuckDB parse failures drop annotation lineage

[P1] DuckDB parse failures silently discard // column annotations

File: backend/windmill-api-assets/src/lib.rs, lines 749–751

For DuckDB pipeline members, the graph endpoint constructs column lineage with:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_default()

parse_assets (in backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, line 36) opens with Parser::parse_sql(&DuckDbDialect, input)?. If the SQL parser rejects the input — dynamic SQL with ${sql.raw(...)}, syntax outside sqlparser's coverage, or a body that DuckDB itself accepts — the ? returns Err before reaching parse_pipeline_annotations(input) on line 33. The .unwrap_or_default() then yields an empty Vec, and any // column / -- column annotation lines are lost.

This contradicts the stated purpose of the annotation. The PR description and docs/pipelines-vs-dbt.md both describe // column as the "override / escape hatch for what inference can't reach: … dynamic SQL (${sql.raw(...))". But dynamic SQL is precisely the case where parse_assets fails — so the escape hatch is broken when the user most needs it.

Fix: fall back to annotation-only lineage on parse failure:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_else(|_| {
        annotations_by_path
            .get(&r.path)
            .map(|a| a.column_lineage.clone())
            .unwrap_or_default()
    })

annotations_by_path is already computed ~20 lines earlier (line ~728) and contains the annotation-only lineage for every pipeline member. The non-DuckDB branch already uses it. This is a regression from commit f9b13368d — before SQL AST inference was added, all scripts (including DuckDB) relied on annotations_by_path exclusively, and annotations were never lost.

Prior Codex findings re-examined

The two other P1 issues flagged by the prior Codex reviews are pre-existing patterns shared with data_tests, not regressions from this PR:

  • Live column-lineage edits not overlaid. applyLiveBufferOverlay (resolveGraph.ts) rebuilds only trigger overlays — it doesn't refresh data_tests badges either. Both badge families update via seedDraftOverlays for drafts and promoted deployed scripts (latest commit 324a52020 added column_lineage there), and via the backend graph endpoint for deployed nodes. The PR description's test plan explicitly notes that unsaved drafts surface only // column annotations (WASM inference is a tracked follow-up). Consistent with data_tests.

  • Column lineage attached to every ducklake write. Follows the identical asset_kind === 'ducklake' gate as data_tests (AssetGraphCanvas.svelte:427–435). The code comment at line 431 explicitly states the v1 materialize target is always ducklake. Multi-output imprecision is the same for both badge families.

What I verified

  • Parser parity (Rust ↔ TS). parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical — split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.'), single_ident / singleIdent, parse_asset_syntax / parseAssetSyntaxDefault. Asserted against the shared pipeline_annotations.json fixture corpus (4 new column-lineage entries + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true. ✓

  • merge_column_lineage. Annotation wins per output column (HashSet-based dedup), covered by a dedicated Rust unit test and mirrored in TS mergeColumnLineage. ✓

  • SQL AST inference. infer_column_lineage walks each output-producing projection, captures identifier leaves via ColumnIdentCollector (skipping subquery depth > 0), resolves through build_from_maps. 4 tests cover passthrough, computed, annotation-override, wildcard. ✓

  • Frontend components. ColumnLineageView.svelte uses deterministic SVG layout (fixed ROW_H), deduplicates sources and edges with Set guards (prevents each_key_duplicate), and uses plain $props() / $derived idiomatically. AssetGraphEdge.svelte stacks the columns badge below the data-test badge at labelY + 12. ✓

  • AGENTS.md compliance. No crypto.randomUUID() anywhere. No banned $bindable(default_value) patterns (inferredColumnLineage = $bindable() — no default value, correct). Comments state invariants (why column is a complete-word keyword, why the dedup guard exists) without narrating the next line. No customer attribution. ✓

  • Module placement. ColumnLineage/ColumnRef live in asset_parser.rs alongside DataTest (their sibling in the annotation family), re-exported through windmill-common/assets.rs — consistent. No half-finished pub fn + #[allow(dead_code)] + TODO. ✓

Test coverage

  • Backend. 4 Rust unit tests (computed, passthrough, annotation-override, wildcard) + merge_column_lineage unit test + parity integration test asserting against 4 new fixture entries. Adequate for a parser/surface change. ✓
  • Frontend. Parity test asserts TS parser against the same fixture corpus (all 4 column-lineage cases). Per repo convention, Svelte components are not unit-tested. No new pure-logic utility file that would warrant additional unit tests. ✓
  • Docs. docs/pipelines-vs-dbt.md updated to mark column lineage as shipped. No CI or workflow changes. ✓

Manual verification needed before merge

Deploy a DuckDB pipeline member whose body includes a // column annotation but also contains SQL that DuckDB accepts and sqlparser rejects (e.g., a // column x <- s3:///bucket/orders.amount line in a script using ${sql.raw(...)} for the actual query body). Confirm the asset graph still shows the column badge and details diagram for that script. After the fix described above, this scenario must work — it is the annotation's primary escape-hatch use case.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 7 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte">

<violation number="1" location="frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte:375">
P2: `liveColumnLineage` can leak across script switches because it is a persistent `$state` variable that is forwarded in `onAssetsChange` without resetting when `script.path` changes. During a transition from script A to script B, the `$effect` re-runs as soon as `script.path` updates but before the new `ScriptEditor` has computed new lineage, pairing B’s path with A’s stale data. If the new script triggers an inference error, the `watch` in `ScriptEditor` returns without updating the binding, so the leak becomes permanent.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

// Body-inferred column lineage (DuckDB SQL AST), bound out of ScriptEditor
// alongside `liveBodyAssets` and forwarded so the live graph can show
// inferred column lineage on the edited script before it deploys.
let liveColumnLineage = $state<ColumnLineage[] | undefined>(undefined)

@cubic-dev-ai cubic-dev-ai Bot Jun 26, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: liveColumnLineage can leak across script switches because it is a persistent $state variable that is forwarded in onAssetsChange without resetting when script.path changes. During a transition from script A to script B, the $effect re-runs as soon as script.path updates but before the new ScriptEditor has computed new lineage, pairing B’s path with A’s stale data. If the new script triggers an inference error, the watch in ScriptEditor returns without updating the binding, so the leak becomes permanent.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte, line 375:

<comment>`liveColumnLineage` can leak across script switches because it is a persistent `$state` variable that is forwarded in `onAssetsChange` without resetting when `script.path` changes. During a transition from script A to script B, the `$effect` re-runs as soon as `script.path` updates but before the new `ScriptEditor` has computed new lineage, pairing B’s path with A’s stale data. If the new script triggers an inference error, the `watch` in `ScriptEditor` returns without updating the binding, so the leak becomes permanent.</comment>

<file context>
@@ -365,6 +369,10 @@
+	// Body-inferred column lineage (DuckDB SQL AST), bound out of ScriptEditor
+	// alongside `liveBodyAssets` and forwarded so the live graph can show
+	// inferred column lineage on the edited script before it deploys.
+	let liveColumnLineage = $state<ColumnLineage[] | undefined>(undefined)
 
 	// Bumped when the runs panel reports a watched job has reached a
</file context>
Fix with cubic

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns are not resolved for inferred lineage

Found 2 issues:

  1. [P1] DuckDB parse failures drop explicit column-lineage annotations.
    backend/windmill-api-assets/src/lib.rs:749

    The DuckDB graph path calls windmill_parser_sql_asset::parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default(). But parse_assets starts with Parser::parse_sql(...)? before it reaches parse_pipeline_annotations, so any unsupported/dynamic DuckDB SQL returns Err and the graph discards the already-parsed // column annotations in annotations_by_path. That breaks the documented escape hatch exactly when AST inference cannot run. The DuckDB branch should fall back to annotation-only lineage on parse failure.

  2. [P1] JOIN relations are omitted from inferred column lineage.
    backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:355

    build_from_maps only maps each top-level table_with_joins.relation and never walks table_with_joins.joins; it also treats from_tables.len() == 1 as an unambiguous single table even when that one entry contains joins. For a common query like SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ..., c.region cannot resolve at all, while an unqualified region can be wrongly attributed to the left table. Since this feeds infer_column_lineage, deployed graph lineage will be missing or wrong for joined DuckDB transforms.

Test coverage

Backend parser tests cover successful inference, annotation override, wildcard, subquery avoidance, and parser parity. Missing coverage for the two failing paths above: graph fallback when DuckDB AST parsing fails but annotations exist, and inferred lineage through JOIN aliases/bare columns.

Frontend parser parity and graph display coverage look appropriate for the UI layer; no Svelte component tests are expected by repo policy. Manual verification should include a DuckDB member with explicit // column annotations plus SQL outside sqlparser coverage, and a DuckDB transform joining two source assets where output columns come from both sides.

rubenfiszel and others added 2 commits June 26, 2026 13:58
…olumn-lineage inference

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…annotations

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions

Copy link
Copy Markdown
Contributor

I now have a clear picture of the two real issues. Let me write the review.

Pi Review

cc @rubenfiszel

Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns not resolved for inferred lineage

[P1] DuckDB parse failures silently discard // column annotations

File: backend/windmill-api-assets/src/lib.rs, lines 749–752
Root cause: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, line 22

For DuckDB pipeline members, the graph endpoint computes column lineage with:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_default()

parse_assets opens with Parser::parse_sql(&DuckDbDialect, input)? — the ? short-circuits the entire function on any parse error, before reaching parse_pipeline_annotations(input) on line 37. The .unwrap_or_default() then yields Vec::new(), and any // column / -- column annotation lines are lost.

This contradicts the PR's own stated purpose for the annotation. The PR description and docs/pipelines-vs-dbt.md both describe // column as the "override / escape hatch for what inference can't reach: polyglot transforms (Python/TS/Bash — no SQL AST), dynamic SQL (${sql.raw(...)})…". But dynamic SQL with interpolation and DuckDB syntax outside sqlparser's coverage are precisely the cases where the SQL parser fails — so the escape hatch is broken when the user needs it most.

annotations_by_path (computed ~20 lines earlier, line ~728) already contains every member's annotation-only lineage. The non-DuckDB branch already falls back to it. The DuckDB branch should do the same on parse failure:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_else(|_| {
        annotations_by_path
            .get(&r.path)
            .cloned()
            .unwrap_or_default()
    })

This is a regression from commit f9b13368d — before SQL AST inference was wired in, all scripts (including DuckDB) relied on annotations_by_path exclusively, and annotations were never dropped.

[P1] JOIN columns are not resolved for inferred column lineage

File: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, lines 324–395 (build_from_maps)

build_from_maps walks from_tables (each a TableWithJoins) but processes only table_with_joins.relation — it never iterates table_with_joins.joins. Joined table aliases are therefore absent from table_to_asset, and infer_column_lineage (via resolve_column_ref) cannot resolve references to them.

For SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON o.cust_id = c.id:

  • c.region fails entirely — c is not in table_to_asset
  • An unqualified region would be misattributed to orders (the single from_tables entry sets single_table to the left relation, even when it has JOINs)

This affects only the new infer_column_lineage path (the extract_column_assets function that build_from_maps was refactored from had the same limitation pre-PR, but column-level asset extraction is a different, less-observable feature). The // column annotation is the documented escape hatch, but since the PR promotes "inferred-first" lineage as its headline feature and JOIN queries are a common pattern, silently-wrong inference undermines the feature's correctness.

handle_table_with_joins (line 310–317) already walks JOINs correctly for asset-level collection; build_from_maps should mirror that by iterating table_with_joins.joins and adding each joined table to table_to_asset keyed by its alias (or table name if unaliased).

Prior Codex findings re-examined

The two other P1 issues flagged by prior Codex reviews are pre-existing patterns shared with data_tests, not regressions from this PR:

  • Live column-lineage edits not overlaid. applyLiveBufferOverlay (resolveGraph.ts) rebuilds only trigger overlays — it doesn't refresh data_tests badges either. Both badge families update via seedDraftOverlays (which iterates drafts, covering drafts and promoted deployed scripts). This PR's seedDraftOverlays change (lines 235–272) correctly mirrors the data_tests refresh path and is consistent with the existing behavior.

  • Column lineage attached to every ducklake write. Follows the identical asset_kind === 'ducklake' gate as data_tests (AssetGraphCanvas.svelte lines 427–435). The code comment at line 431 explicitly documents the v1 constraint. Multi-output imprecision is identical for both badge families.

New public surfaces

  • merge_column_lineage (Rust pub fn) / mergeColumnLineage (TS export function) — pure data merge, no auth needed. Placed alongside sibling DataTest types in asset_parser.rs. ✓
  • ColumnLineage / ColumnRef (Rust pub struct + TS export type) — serde types, re-exported through windmill-common/assets.rs. Placed with DataTest sibling. Snake_case serde used consistently. ✓
  • ColumnLineageView.svelte — presentational SVG component, deterministic layout, no auth. Props use plain $props() (no banned $bindable(default)). ✓
  • Frontend TS functions (parseColumnRef, parseColumnLineageSpec) — head-then-tail parse, behaviorally identical to Rust, asserted against shared corpus. ✓

No half-finished pub fn + TODO patterns, no crypto.randomUUID() in frontend code. Comments follow the AGENTS.md "record constraints, not narration" rule.

Test coverage

  • Backend: 5 Rust tests for SQL inference (passthrough, computed, annotation-override, wildcard, subquery-avoidance) + 5 annotation tests (basic, schema-qualified, drop-malformed, merge, malformed-lines) + parity integration test asserting against 4 new fixture entries in pipeline_annotations.json. Adequate for the parser layer, but no test exercises build_from_maps with JOIN tables and no test exercises the DuckDB-parse-failure fallback path.
  • Frontend: Parity test extends ASSERTED_TS_FIELDS to include columnLineage, asserting TS parser against the shared fixture corpus (4 column-lineage cases). mergeColumnLineage has 2 dedicated vitest cases. Per repo convention Svelte components aren't unit-tested. Adequate for the parser parity concern.
  • Docs: docs/pipelines-vs-dbt.md updated to reflect shipped status. No CI/workflow changes.

Manual verification needed before merge

Deploy a DuckDB pipeline member with SQL that uses ${sql.raw(...)} or DuckDB-specific syntax that sqlparser rejects, but includes a valid // column total <- ducklake://warehouse/orders.amount annotation. The asset graph should show the "columns ×1" badge on the write-edge and the column-lineage diagram in the details pane. If the badge doesn't appear (because the parse failure silently dropped the annotation), the fix described above is needed.

Deploy a DuckDB pipeline member with a JOIN query like SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON o.cust_id = c.id. Inspect the column-lineage diagram — if region is missing (no edge from customers.region) or misattributed to orders, the JOIN resolution fix is needed.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns are not resolved, draft-lineage docs are stale

Found 3 issues:

  1. [P1] DuckDB parse failures drop explicit column-lineage annotations.
    backend/windmill-api-assets/src/lib.rs:749

    The DuckDB graph path calls windmill_parser_sql_asset::parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default(). But parse_assets starts with Parser::parse_sql(...)? before it reaches parse_pipeline_annotations, so unsupported/dynamic DuckDB SQL returns Err and the graph discards the already-parsed // column annotations in annotations_by_path. That breaks the documented escape hatch exactly when AST inference cannot run. The DuckDB branch should fall back to annotation-only lineage on parse failure.

  2. [P1] JOIN relations are omitted from inferred column lineage.
    backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:355

    build_from_maps only maps each top-level table_with_joins.relation and never walks table_with_joins.joins; it also treats from_tables.len() == 1 as an unambiguous single table even when that one entry contains joins. For SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ..., c.region cannot resolve, while an unqualified region can be wrongly attributed to orders. Since this feeds infer_column_lineage, deployed graph lineage will be missing or wrong for common joined DuckDB transforms.

  3. [P2] The docs still say live draft SQL inference is not wired.
    docs/pipelines-vs-dbt.md:111

    The doc says unsaved drafts show only // column annotations because the WASM parser predates column_lineage, and calls republishing/consuming it a follow-up. This PR now bumps windmill-parser-wasm-asset to 1.740.0 and wires inferredColumnLineage through ScriptEditor and resolveGraph, so the documented behavior is stale.

Test coverage

Backend parser tests cover successful inference, annotation override, wildcard handling, subquery avoidance, merge behavior, and annotation parity. Missing coverage for the two failing backend paths above: graph fallback when DuckDB AST parsing fails but annotations exist, and inferred lineage through JOIN aliases/bare columns.

Frontend parser parity and graph display coverage look appropriate for the UI layer; no Svelte component tests are expected by repo policy. Manual verification should include a DuckDB member with explicit // column annotations plus SQL outside sqlparser coverage, and a DuckDB transform joining two source assets where output columns come from both sides.

@github-actions

Copy link
Copy Markdown
Contributor

I now have a complete picture. Here is the review:

Pi Review

cc @rubenfiszel

Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns not resolved for inferred lineage

[P1] DuckDB parse failures silently discard // column annotations

File: backend/windmill-api-assets/src/lib.rs, lines 747–751
Root cause: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, line 22

For DuckDB pipeline members, the graph endpoint computes column lineage with:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_default()

parse_assets (line 22) opens with Parser::parse_sql(&DuckDbDialect, input)? — the ? short-circuits the entire function on any syntax error before reaching parse_pipeline_annotations(input) on line 37. The .unwrap_or_default() then yields Vec::new(), silently discarding any // column / -- column annotation lines the user wrote.

This contradicts the PR's own stated purpose. The PR description and docs/pipelines-vs-dbt.md (lines 96–98) both describe // column as the "override / escape hatch for what inference can't reach: … dynamic SQL (${sql.raw(...)})". But dynamic SQL with ${} interpolation and DuckDB syntax outside sqlparser's coverage are precisely the cases where the SQL parser fails — so the escape hatch is broken when the user most needs it.

annotations_by_path is already computed ~20 lines earlier (line ~728) and contains every member's annotation-only lineage. The non-DuckDB branch already falls back to it. The DuckDB branch should do the same on parse failure:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_else(|_| {
        annotations_by_path
            .get(&r.path)
            .cloned()
            .unwrap_or_default()
    })

This is a regression from commit f9b13368d — before SQL AST inference was wired in, all scripts (including DuckDB) relied on annotations_by_path exclusively, and annotations were never dropped.

[P1] JOIN relations are omitted from the alias map — inferred column lineage can't resolve joined tables

File: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, lines 324–393 (build_from_maps)

build_from_maps (the shared resolution function for both infer_column_lineage and extract_column_assets) iterates from_tables (each a TableWithJoins) but only processes table_with_joins.relation in its loop body — it never walks table_with_joins.joins. Joined table aliases are absent from table_to_asset, so infer_column_lineage (via resolve_column_ref) cannot resolve references to them.

For a common query like:

SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON o.cust_id = c.id
  • c.region fails entirely — c is not in table_to_asset
  • An unqualified region (no alias) would be misattributed to orders (the single from_tables entry sets single_table to the left relation, even when it has JOINs)

This is a clear juxtaposition: the sibling function handle_table_with_joins (line 300–315, in the same file) does walk joins correctly for asset-level collection. build_from_maps should mirror that by iterating table_with_joins.joins and adding each joined table to table_to_asset keyed by its alias (or table name if unaliased).

The pre-existing extract_column_assets had this same join limitation pre-PR (verified at base commit 210ea3cc — it also never walked joins). But the PR now exposes it to users through the headline "inferred-first" column lineage feature, where silently-missing or wrong lineage for joined queries undermines correctness. The // column annotation is the documented escape hatch, but since this is the common case where users would expect inference to work, it warrants fixing.

Prior Codex findings re-examined

The other two P1 issues flagged by prior Codex reviews are pre-existing patterns shared with data_tests, not regressions from this PR:

  • Live column-lineage edits not overlaid through applyLiveBufferOverlay. applyLiveBufferOverlay (resolveGraph.ts:400) rebuilds only trigger overlays — it doesn't refresh data_tests badges either. Both badge families update via seedDraftOverlays (line 230), which now (since commit 324a52020) includes mergeColumnLineage for the open draft. A re-edited deployed script is promoted to a draft via handleDraftPersist, so annotation edits on an already-open deployed script propagate through the draft path after promotion. Consistent with data_tests.

  • Column lineage attached to every ducklake write from the same producer. Follows the identical asset_kind === 'ducklake' gate as data_tests (AssetGraphCanvas.svelte:427–435). The code comment at line 431 explicitly states the v1 materialize target is always ducklake. Multi-output imprecision is the same for both badge families.

What I also verified

  • Parser parity (Rust ↔ TS). parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical — split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.'), single_ident / singleIdent, parse_asset_syntax / parseAssetSyntaxDefault, consume_keyword with whitespace/EOL guard. Asserted against the shared pipeline_annotations.json fixture corpus (4 new column-lineage entries + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true. ✓

  • merge_column_lineage. Annotation wins per output column (HashSet-based dedup, annotation entries kept first). Covered by a dedicated Rust unit test and mirrored in TS mergeColumnLineage with its own unit test. ✓

  • SQL AST inference (otherwise). infer_column_lineage correctly walks output-producing projections, captures identifier leaves via ColumnIdentCollector (skipping subquery depth > 0), resolves through build_from_maps. 5 tests cover passthrough, computed, annotation-override, wildcard, and subquery-avoidance — all correct for single-table FROM. ✓ (barring the JOIN gap above)

  • Backend API, frontend components, module placement. No other issues found. ColumnLineage/ColumnRef follow the same data_tests precedent (same module, same re-export path, same serde). Frontend components use $derived, no banned $bindable(default), no crypto.randomUUID().

Test coverage

  • Backend: 5 Rust unit tests for inference + 1 for merge_column_lineage + parity test. Missing: (a) test for graph-endpoint fallback when DuckDB AST parsing fails but annotations exist; (b) test for inferred lineage through JOIN aliases/bare columns. Both are the exact scenarios of the two issues above.
  • Frontend: Parity test asserts column-lineage parsing against the shared fixture corpus. No Svelte component tests expected per repo convention. ✓

Manual verification needed before merge

  1. Deploy a DuckDB pipeline member with a // column annotation and SQL outside sqlparser's coverage (e.g., // column total <- dl.orders.amount above a SELECT 1 — trivial parse-fail, or dynamic SQL with ${sql.raw('FROM orders')}). Open the asset graph and confirm the write-edge badge and asset details pane still show column lineage. Current behavior: they won't. After the proposed fix: they will.

  2. Deploy a DuckDB pipeline member that joins two source assets, e.g.:

    ATTACH 'ducklake://warehouse' AS dl
    CREATE TABLE dl.enriched AS
    SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON o.cust_id = c.id

    Open the column-lineage diagram for the output asset. Current behavior: id resolves; region is absent (not resolved). Add an explicit // column region <- dl.customers.region annotation; after the P1.1 fix is applied, the annotated lineage should appear for region.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns are not resolved

Found 2 issues:

  1. [P1] DuckDB parse failures drop explicit column-lineage annotations.
    backend/windmill-api-assets/src/lib.rs:749

    The graph endpoint uses windmill_parser_sql_asset::parse_assets(&r.content).map(|o| o.column_lineage).unwrap_or_default() for DuckDB members. But parse_assets starts with Parser::parse_sql(...)? before it reaches parse_pipeline_annotations, so unsupported DuckDB syntax or dynamic SQL can return Err and discard the already-parsed // column annotations from annotations_by_path. That breaks the documented annotation escape hatch exactly when AST inference cannot run. The DuckDB branch should fall back to annotation-only lineage on parse failure.

  2. [P1] JOIN relations are omitted from inferred column lineage.
    backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:355

    build_from_maps only maps each table_with_joins.relation; it never walks table_with_joins.joins. It also treats from_tables.len() == 1 as an unambiguous single table even when that one TableWithJoins contains joins. For SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ..., c.region cannot resolve, while a bare region can be wrongly attributed to the left table. Since this feeds infer_column_lineage, deployed graph lineage will be missing or wrong for common joined DuckDB transforms.

Test coverage

Backend parser tests cover successful inference, annotation override, wildcard handling, subquery avoidance, merge behavior, and annotation parity. Missing coverage for the two failing backend paths above: graph fallback when DuckDB AST parsing fails but annotations exist, and inferred lineage through JOIN aliases and bare columns.

Frontend parser parity and graph display coverage look appropriate for this UI layer; no Svelte component tests are expected by repo policy.

Manual verification should include a DuckDB member with explicit // column annotations plus SQL outside sqlparser coverage, confirming the graph still shows the column badge/details. Also verify a DuckDB transform joining two source assets where output columns come from both sides.

@rubenfiszel

Copy link
Copy Markdown
Contributor Author

End-to-end verified with the published windmill-parser-wasm-asset@1.740.0 (dev server serving v1.740.0-4-g5ea7717163):

  • The published WASM parse_assets_sql emits correct inferred column_lineage directly (Node-tested), incl. the computed order_total ← amount, tax.
  • Deployed pipeline graph renders the inferred column-lineage diagram (no annotations):

lineage view

  • The live editor opens cleanly with the new wiring (ScriptEditor inferredColumnLineage bindable → pane → page → resolveGraph merge); inferAssets runs and the write-edge badge is present:

edit open

Note: the Playwright MCP was down this session, so these were captured via a headless Playwright script driving system Chromium.

Stitches every producer's column_lineage into a pipeline-wide column graph
(columnLineageGraph.ts) and replaces the single-hop diagram with an
interactive ColumnLineageTrace: select an asset to see its columns' full
upstream/downstream lineage across scripts; click any column to highlight its
complete transitive impact set (forward + backward) and dim the rest.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Transitive column-lineage trace (impact analysis) — turns the static diagram into a tool.

New columnLineageGraph.ts stitches every producer's column_lineage into a pipeline-wide column graph; ColumnLineageTrace.svelte renders the connected neighborhood of a selected asset's columns (laid out by hop depth) and lets you click any column to highlight its full transitive impact set — forward and backward — across scripts.

Multi-hop chain in the demo: orders.amount + tax → orders_daily.order_total → revenue.total_revenue.

Select the sink asset warehouse/revenue → its columns' full upstream lineage across both scripts:

full trace

Click orders.amount → its complete impact path lights up (amount → order_total → total_revenue), everything else dims. Note tax (a co-contributor to order_total, not in amount's lineage) is correctly dimmed:

click to trace

Tests: columnLineageGraph.test.ts covers multi-hop stitching, bidirectional trace, and depth layout (124 AssetGraph tests green). Verified via headless Playwright against the running app.

@github-actions

Copy link
Copy Markdown
Contributor

Now I have all the information needed. Here's my review:

Pi Review

cc @rubenfiszel

Should address issues before merging: DuckDB parse failures drop annotation lineage, JOIN columns not resolved for inferred lineage

[P1] DuckDB parse failures silently discard // column annotations

File: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, line 22
Consumer: backend/windmill-api-assets/src/lib.rs, line ~749

parse_assets opens with:

let statements = Parser::parse_sql(&DuckDbDialect, input)?;

The ? short-circuits the entire function on any SQL parse error — before reaching parse_pipeline_annotations(input) on line 37. Downstream in the graph endpoint, the caller uses:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_default()

.unwrap_or_default() yields Vec::new(), silently discarding every // column / -- column line the user wrote.

This contradicts the PR's own stated purpose. The PR description and docs/pipelines-vs-dbt.md (lines 96–98) both describe // column as the "override / escape hatch for what inference can't reach: … dynamic SQL (${sql.raw(...)})". But dynamic SQL with ${} interpolation and DuckDB syntax outside sqlparser's coverage are precisely the cases where the SQL parser fails — so the escape hatch is broken when the user most needs it.

annotations_by_path is already computed ~20 lines earlier (line ~728) and contains every member's annotation-only lineage. The non-DuckDB branch already falls back to it. The DuckDB branch should do the same on parse failure:

windmill_parser_sql_asset::parse_assets(&r.content)
    .map(|o| o.column_lineage)
    .unwrap_or_else(|_| {
        annotations_by_path
            .get(&r.path)
            .cloned()
            .unwrap_or_default()
    })

This is a regression from commit f9b13368d — before SQL AST inference was wired in, all scripts (including DuckDB) relied on annotations_by_path exclusively, and annotations were never dropped.

[P1] JOIN relations are omitted from the alias map — inferred column lineage can't resolve joined tables

File: backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, lines 324–393 (build_from_maps)

build_from_maps iterates from_tables (each a TableWithJoins) but only processes table_with_joins.relation in its loop body — it never walks table_with_joins.joins. Joined table aliases are therefore absent from table_to_asset, and infer_column_lineage (via resolve_column_ref) cannot resolve references to them.

For a common query:

SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON o.cust_id = c.id
  • c.region fails entirely — c is not in table_to_asset
  • An unqualified region (no alias) would be misattributed to orders because from_tables.len() == 1 sets single_table to the left relation, even when it has JOINs

The sibling function handle_table_with_joins (lines 297–317, same file) does walk joins correctly for asset-level collection. build_from_maps should mirror that by iterating table_with_joins.joins and adding each joined table to table_to_asset keyed by its alias (or table name if unaliased).

The pre-existing extract_column_assets shared this same join limitation before this PR, but the PR now exposes the gap to users through the headline "inferred-first" column lineage feature — silently missing or wrong lineage for joined queries undermines correctness. The // column annotation is the documented escape hatch, but since this is the common case where users would expect inference to work, it warrants fixing.

Prior findings re-examined

The two other issues flagged by prior reviews are pre-existing patterns shared with data_tests, not regressions from this PR:

  • Live column-lineage edits not overlaid. applyLiveBufferOverlay (resolveGraph.ts:405) rebuilds only trigger edges — it doesn't refresh data_tests badges either. Both badge families update via seedDraftOverlays for drafts (which now correctly handles column_lineage via mergeColumnLineage) and via redeploy/refetch for deployed nodes. The gap is that editing a // column line on an already-deployed open script doesn't refresh until the pane is torn down and the script is promoted to a draft. This is the same gap data_tests has — a systematic limitation, not a PR-introduced bug.

  • Column lineage attached to every ducklake write. Follows the identical asset_kind === 'ducklake' gate as data_tests (AssetGraphCanvas.svelte:427–435). The code comment explicitly states the v1 materialize target is always ducklake. Multi-output imprecision is the same for both badge families.

What I verified

  • Parser parity (Rust ↔ TS). parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical — split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.'), single_ident / singleIdent, parse_asset_syntax / parseAssetSyntaxDefault, consume_keyword with whitespace/EOL guard. Asserted against the shared pipeline_annotations.json fixtures. TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true. ✓

  • merge_column_lineage. Annotation wins per output column (HashSet-based dedup, annotation entries kept first). Covered by a dedicated Rust unit test and mirrored in TS mergeColumnLineage. ✓

  • Backend API. column_lineage on GraphRunnableNode is gated to usage_kind == AssetUsageKind::Script. Read-only graph endpoint, no auth concern. #[serde(skip_serializing_if = "Vec::is_empty", default)] on the field. Same path as data_tests. ✓

  • Frontend components. ColumnLineageTrace.svelte uses $props() (no $bindable with defaults on optional props), deterministic coordinate math, lucide-svelte icons. columnLineageGraph.ts has cycle-guarded depth computation, iterative transitive closure. columnLineageGraph.test.ts covers multi-hop stitching, bidirectional trace, and depth layout. No crypto.randomUUID() anywhere. ✓

  • AGENTS.md compliance. No banned $bindable(default_value) on optional props. Comments follow "constraints, not narration" — each explains a non-obvious invariant without describing the next line. No customer attribution. ✓

Test coverage

  • Backend: 4 Rust unit tests (passthrough, computed, annotation-override, wildcard) + merge_column_lineage test + annotation parity test. Adequate for parser correctness, but missing coverage for the two P1 failures: graph-endpoint fallback when DuckDB AST parsing fails but annotations exist, and inferred lineage through JOIN aliases / bare columns in joined queries.

  • Frontend: parsePipelineAnnotations.parity.test.ts asserts TS parser against the shared fixture corpus. columnLineageGraph.test.ts (new) covers multi-hop stitching, bidirectional trace, and depth layout with 124 AssetGraph tests green. parsePipelineAnnotations.test.ts extended with column-specific cases. Per repo convention, Svelte components are not unit-tested, and ColumnLineageTrace/badge logic is presentational. ✓

Manual verification needed before merge

Deploy a DuckDB pipeline member with a // column annotation and SQL that sqlparser rejects but DuckDB accepts (e.g., SELECT * FROM read_csv('s3://…') with ${sql.raw('…')} interpolation) — confirm the asset graph still shows the column badge and details diagram from the explicit annotation. Also deploy a DuckDB transform that joins two source assets (SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c ON …) and verify the inferred column-lineage diagram correctly attributes columns to both source assets, not just the left table.

@rubenfiszel

Copy link
Copy Markdown
Contributor Author

Addressed second Codex pass (commit 4319d74)

Both new P1s fixed and verified live against the running graph API:

  • [P1] JOIN inputs were droppedbuild_from_maps now maps every FROM entry and its joins into the alias map, and single-table bare-column attribution requires no joins (so it isn't misattributed). SELECT o.id, c.region FROM dl.orders o JOIN dl.customers c … now resolves c.region from customers. Added test_infer_lineage_resolves_joined_inputs. Verified: the deployed graph now reports customer_name ← customers.name and region ← customers.region for a real JOIN.

  • [P1] Multi-output anchor was a guess — the column graph now anchors a producer's lineage to its declared // materialize target (surfaced as materialize_target on the runnable node) instead of picking a ducklake write-edge (unordered for deployed graphs, ambiguous for multi-output). Falls back to a write-edge only when there's no materialize annotation. Tests updated to assert the materialize-target precedence + the write-edge fallback. Verified: materialize_target is returned per runnable.

Tests: 9 columnLineageGraph + 60 windmill-parser-sql-asset (incl. JOIN) green; both backend crates compile clean; frontend baseline unchanged. No SQL query changed (no sqlx cache delta).

@github-actions

Copy link
Copy Markdown
Contributor

Now I have a complete picture. Let me write the review.

Pi Review

Good to merge — all three P1 issues flagged by prior reviews have been addressed in the fix commits. No new issues introduced.

Prior review findings — all fixed

  1. [P1] DuckDB parse failures drop annotation lineageFixed (49ff11215). The graph endpoint now calls .unwrap_or_else(|_| annotated()), falling back to annotations_by_path when SQL parsing fails. The documented escape hatch survives syntax errors and dynamic SQL.

  2. [P1] JOIN columns not resolved for inferred lineageFixed (4319d744f). build_from_maps now iterates table_with_joins.joins and inserts each joined table into table_to_asset; single_table is Some only when from_tables[0].joins.is_empty(). The test test_infer_lineage_resolves_joined_inputs covers a two-sided JOIN with a computed column mixing both sides and a bare column correctly dropped as ambiguous.

  3. [P1] Column lineage attached to every ducklake writeFixed (4319d744f). columnLineageGraph.ts now uses the // materialize target as the authoritative output-asset anchor, falling back to a ducklake write-edge only when no annotation exists. The test "anchors to the // materialize target, not a guessed write-edge" covers multi-output scripts.

What I verified

  • Parser parity (Rust ↔ TS). parse_column_lineage_spec / parseColumnLineageSpec are behaviorally identical (split_once("<-") vs indexOf('<-'), rsplit_once('.') vs lastIndexOf('.')). Asserted against 4 new column-lineage fixtures + parity tests on both sides. ✓
  • merge_column_lineage. Annotation wins per output column, mirrored in Rust and TS with matching merge tests. ✓
  • SQL AST inference. infer_column_lineage walks output projections, captures identifier leaves via ColumnIdentCollector (skipping subqueries at depth >0), resolves through the refactored build_from_maps. 6 tests cover passthrough, computed, annotation-override, wildcard, JOIN resolution, and subquery-avoidance. ✓
  • Backend API. column_lineage on GraphRunnableNode populated from column_lineage_by_path with DuckDB → full parse_assets (inferred + annotated) and fallback to annotation-only on parse failure. materialize_target surfaced from // materialize annotation. Same auth posture as data_tests — read-only metadata, no new public surface. ✓
  • Frontend components. AssetGraphEdge.svelte stacks the columns badge below the data-test badge at columnsBadgeY = labelY + 18. ColumnLineageTrace.svelte uses deterministic coordinate math. No $bindable(default_value) antipatterns, no crypto.randomUUID(). ✓
  • Node IDs. colNodeId uses JSON.stringify for collision-proof encoding — covers # and : in paths/columns. ✓

Live overlay gap (pre-existing, not a regression)

applyLiveBufferOverlay (resolveGraph.ts) rebuilds only trigger overlays — it doesn't refresh data_tests or column_lineage badges on keystroke for deployed scripts in the open editor. Both badge families face the same gap. For the common path, seedDraftOverlays does merge WASM-inferred lineage with annotations and updates badges for drafts and promoted deployed scripts. The // column annotation text itself stays stale in the draft entry until pane teardown, same as // data_test — the PR's liveBodyAssets.columnLineage provides live WASM-inferred edges but the annotation escape path is still a draft-cycle for the literal annotation text. This is consistent with data_tests and not a regression.

Test coverage

  • Backend: 5 unit tests for annotation parsing + merge + fail-safe; 6 tests for SQL AST inference including JOIN resolution and subquery-avoidance; parity integration test against shared fixture corpus. Adequate for parser + API changes.
  • Frontend: Parity test against shared fixture corpus; columnLineageGraph.test.ts with 8 tests (multi-hop stitching, bidirectional trace, depth layout, materialize-target anchoring, collision-proof ids). Adequate for pure-logic utilities.
  • Docs: pipelines-vs-dbt.md updated to shipped status, accurately reflecting live WASM inference. No CI/workflow changes touch coverage expectations.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: column-lineage badge attaches to every DuckLake write

Found 1 issues:

  1. [P1] Column-lineage badge still renders on every DuckLake write edge.
    frontend/src/lib/components/assets/AssetGraph/AssetGraphCanvas.svelte:435

    Current head anchors the details-pane column graph via materialize_target, but the canvas badge path still keys only on producer id plus e.asset_kind === 'ducklake'. A script with // materialize ducklake://warehouse/final and an additional DuckLake write will show the same column mapping badge and hover text on both write edges, including the non-materialized output. The badge should use the same materialize-target check as buildColumnGraph, falling back only when the output is genuinely unambiguous.

Test coverage
Backend parser/inference and Rust↔TS annotation parity coverage looks adequate for the grammar and merge behavior. Frontend pure logic has parser and columnLineageGraph tests, including materialize-target anchoring for the details graph.

Missing coverage is specifically the canvas badge targeting: add a graph/render-level case with one producer, a materialize_target, and multiple DuckLake write edges, asserting the column badge appears only on the materialized target edge.

Manual verification: deploy or draft a DuckDB pipeline member that materializes one DuckLake asset and also writes another DuckLake table; the column badge and hover mappings should appear only on the // materialize target edge, while selecting the materialized asset should still show the column trace.

The canvas badge keyed on `e.asset_kind === 'ducklake'`, so a multi-output
producer showed the same column mapping on every ducklake write-edge. Use the
same materialize-target anchor as buildColumnGraph: the badge lands only on the
declared output's edge, falling back to the ducklake write-edge when there's no
materialize annotation. (Codex P1)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel rubenfiszel changed the title feat: column-level lineage for ducklake pipelines via // column annotation feat: column-level lineage for DuckLake pipelines (SQL-AST inferred + traceable) Jun 26, 2026
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[P1] Badge on every ducklake write-edge — fixed (e2cdcee). The canvas badge now uses the same materialize-target anchor as buildColumnGraph: it lands only on the declared // materialize output's write-edge, falling back to the ducklake write-edge when there's no materialize annotation. A multi-output producer no longer shows the column mapping on its non-materialized writes.

# Conflicts:
#	backend/windmill-api-assets/src/lib.rs
#	frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte
#	frontend/src/lib/components/assets/AssetGraph/types.ts
#	frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte
@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge.

All three P1 issues flagged in prior review rounds have been resolved in the latest commits (49ff11215 through e2cdceeeb):

Previously flagged issues — now fixed

  1. DuckDB parse failures drop // column annotations — Fixed in 49ff11215. The graph endpoint (backend/windmill-api-assets/src/lib.rs, line ~832) now falls back via unwrap_or_else(|_| annotated()), preserving explicit annotations even when sqlparser rejects the SQL body.

  2. JOIN columns not resolved for inferred lineage — Fixed in 4319d744f. build_from_maps (line ~395 in windmill-parser-sql-asset/src/asset_parser.rs) now iterates table_with_joins.joins alongside the base relation, and the single_table guard requires from_tables[0].joins.is_empty(). Covered by test_infer_lineage_resolves_joined_inputs.

  3. Multi-output column trace anchors to the wrong asset — Fixed in 4319d744f + e2cdceeeb. Both buildColumnGraph (columnLineageGraph.ts) and the canvas badge gating (AssetGraphCanvas.svelte) now prefer the declared materialize_target from the backend, falling back to a ducklake write-edge only when no // materialize annotation exists. The backend (GraphRunnableNode) sends materialize_target alongside column_lineage.

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec, parse_column_ref / parseColumnRef, and merge_column_lineage / mergeColumnLineage are behaviorally identical. Asserted against the shared pipeline_annotations.json fixture corpus (4 new column-lineage entries + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.

  • SQL AST inference (infer_column_lineage): Correctly walks projections, captures identifier leaves via ColumnIdentCollector (subquery depth-gated), resolves through the shared build_from_maps, and merges with annotations via merge_column_lineage. 5 tests cover passthrough, computed columns, annotation override, wildcards, subquery avoidance, and JOIN resolution.

  • colNodeId collision-proofing: JSON-encoded [kind, path, column] tuple — tested explicitly.

  • $bindable() pattern: inferredColumnLineage = $bindable() on the optional ColumnLineage[] prop in ScriptEditor.svelte — no default value, so it doesn't trigger the banned $bindable(default_value) pattern.

  • Live overlay behavior: applyLiveBufferOverlay only handles trigger overlays (asset + native triggers). column_lineage updates for edited // column lines flow through seedDraftOverlays on draft persist — identical to the pre-existing data_tests pattern. WASM-inferred column lineage (DuckDB AST) IS live-overlaid via liveBodyAssets.columnLineage, merged with annotation-wins precedence in seedDraftOverlays. Consistent and deliberate.

Test coverage

  • Backend: Rust unit tests cover SQL inference (computed, passthrough, annotation override, wildcard, JOIN), annotation parsing (basic, schema-qualified, malformed-ref drop), merge_column_lineage, and Rust↔TS parity via the shared fixture corpus. Integration coverage for the graph endpoint's DuckDB fallback path and for build_from_maps with real DuckDB ATTACH/alias resolution would be ideal but is not expected at this layer — the parser unit tests and parity corpus provide reasonable confidence.

  • Frontend: columnLineageGraph.test.ts covers 8 cases (transitive stitching, materialize_target anchoring, fallback, collision-proof IDs, unanchorable producer skip, bidirectional trace, sibling isolation). parsePipelineAnnotations.test.ts and .parity.test.ts cover the annotation parser. No Svelte component tests expected per repo policy.

  • Docs / CI / config: docs/pipelines-vs-dbt.md updated to reflect the current state (WASM inference wired, live drafts work). No automated tests expected for docs changes.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Mergeable, but should ideally address nits: column trace uses draft graph in deployed view

Found 1 issue:

  1. [P2] Column trace can show hidden draft lineage in deployed view.
    frontend/src/routes/(root)/(logged)/pipeline/[folder]/+page.svelte:1994

    displayGraph correctly switches to deployedGraph in View mode unless "Show drafts" is enabled, but the new column trace is always built from graphWithDraft. That lets the details pane show unsaved draft column lineage for a deployed DuckLake asset while the canvas is intentionally showing deployed-only state. The trace should be built from displayGraph so it matches the graph the user is looking at.

Test coverage

Backend parser/parity coverage and frontend column-graph utility tests look solid for the new parser and graph stitching logic. The remaining gap is page-level manual verification: in View mode with drafts hidden, select a deployed DuckLake asset while an unsaved draft changes its // column lineage, and confirm the details trace matches the deployed-only canvas until drafts are shown or edit mode is active.

The transitive column trace was built from graphWithDraft regardless of mode, so
in View with drafts hidden it could surface draft `// column` lineage the
deployed canvas doesn't show. Build it from `displayGraph` (the graph the canvas
actually renders) so the trace matches: draft overlays in edit / show-drafts,
deployed-only in plain View. (Codex P2)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[Codex P2] Column trace could show draft lineage in View — fixed (4459df7). The trace graph is now built from displayGraph (the exact graph the canvas renders) instead of graphWithDraft, so it matches the canvas: draft overlays in edit / show-drafts, deployed-only in plain View.

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge — all three P1 issues flagged by prior reviews have been addressed in the fix commits. No new issues introduced.

Prior review findings — all fixed

  1. [P1] DuckDB parse failures drop annotation lineageFixed (49ff11215). column_lineage_by_path now uses .unwrap_or_else(|_| annotated()), falling back to the annotation-only lineage from annotations_by_path when SQL parsing fails. The documented escape hatch survives syntax errors and dynamic SQL (${sql.raw(...)}).

  2. [P1] JOIN columns not resolved for inferred lineageFixed (4319d744f). build_from_maps now iterates table_with_joins.joins and inserts each joined table into table_to_asset. single_table is Some only when from_tables[0].joins.is_empty(). The test test_infer_lineage_resolves_joined_inputs validates a two-sided JOIN resolution.

  3. [P1] Multi-output anchor + badge placementFixed (4319d744f + e2cdceeeb). Both buildColumnGraph (for the trace view) and AssetGraphCanvas (for the edge badge) now use materialize_target as the authoritative output-asset anchor. The canvas checks producerMaterializeTarget.get(runnableId) to place the columns badge only on the declared target edge, falling back to a ducklake write-edge when no // materialize annotation exists.

What I verified

  • DuckDB parse failure fallback: column_lineage_by_path closure calls annotated() (which reads annotations_by_path) on parse error, so explicit // column annotations are preserved even when sqlparser rejects the DuckDB body. ✓
  • JOIN resolution: build_from_maps (lines ~355–395) walks table_with_joins.joins alongside table_with_joins.relation, mirroring the existing handle_table_with_joins pattern. single_table is correctly gated on joins.is_empty(). ✓
  • Badge targeting: Canvas constructs producerMaterializeTarget map from r.materialize_target and checks matTarget ? exact_match : ducklake before attaching column_lineage to the edge. This matches the buildColumnGraph materialize-target precedence. ✓
  • Parser parity: parseColumnLineageSpec / parse_column_lineage_spec are behaviorally identical — indexOf('<-') vs split_once("<-"), lastIndexOf('.') vs rsplit_once('.'), singleIdent match. Asserted against the shared pipeline_annotations.json fixture corpus with 4 new column-lineage entries. ✓
  • merge_column_lineage / mergeColumnLineage: Annotation wins per output column, mirrored in Rust and TS with matching unit tests. ✓
  • SQL AST inference: infer_column_lineage walks output projections, captures identifier leaves via ColumnIdentCollector (skipping subqueries at depth > 0), resolves through build_from_maps. 6 tests cover passthrough, computed, annotation-override, wildcard, JOIN resolution, and subquery-avoidance. ✓
  • Node ID collision-proofing: colNodeId uses JSON.stringify([kind, path, column]) so # / : inside paths or columns can't merge distinct columns. ✓
  • No AGENTS.md violations: No $bindable(default_value) on optional props (all $bindable() calls are bare); no crypto.randomUUID(); ScriptEditor clears inferredColumnLineage on parse errors to prevent cross-script leakage. ✓
  • SQLx offline cache: The new query (adding language column to the pipeline-member fetch) has a corresponding .sqlx/query-*.json file in the diff. ✓
  • Performance gating: columnGraph derived is gated to selection?.kind === 'asset' && selection.asset_kind === 'ducklake' so buildColumnGraph doesn't run on every keystroke when the trace UI isn't shown. ✓

Live overlay gap (pre-existing, not a regression)

applyLiveBufferOverlay in resolveGraph.ts rebuilds only trigger overlays — it doesn't refresh data_tests or column_lineage badges on keystroke for deployed scripts in the open editor. Both badge families face the same gap. The PR's seedDraftOverlays merges WASM-inferred lineage with annotations for drafts and promoted deployed scripts, and the // column annotation text stays stale in the draft entry until pane teardown — same as // data_test. This is consistent behavior and not a regression.

Test coverage

  • Backend: 5 unit tests for annotation parsing + merge + fail-safe; 6 tests for SQL AST inference including JOIN resolution, wildcard, computed columns, and subquery-avoidance; parity integration test against shared fixture corpus. Adequate coverage for parser + API changes. The graph-endpoint fallback path (unwrap_or_else) is exercised indirectly through the existing parse-error test fixture (dynamic SQL patterns exhaust sqlparser's grammar).
  • Frontend: Parasity test against shared fixture corpus; columnLineageGraph.test.ts with 8 tests (multi-hop stitching, bidirectional trace, depth layout, materialize-target anchoring, write-edge fallback, collision-proof IDs, producers without output assets); parsePipelineAnnotations.test.ts with mergeColumnLineage cases; parsePipelineAnnotations.parity.test.ts asserts columnLineage field. No Svelte component tests expected per repo policy.
  • CI / docs: No automated tests expected — docs/pipelines-vs-dbt.md is a documentation update, Cargo.lock / package-lock.json / package.json are dep bumps.

Manual verification suggested

Verify a DuckDB pipeline member with explicit // column annotations alongside SQL that sqlparser rejects (e.g., ${sql.raw(...)} interpolation or DuckDB-specific syntax). Deploy and open the pipeline graph — the columns badge should appear on the materialized output's write-edge, and hovering should list the annotated mappings. This confirms the parse-failure fallback works end-to-end.

Verify a DuckDB transform that JOINs two source assets with output columns from both sides. Deploy and inspect the deployed graph's column lineage trace — columns from both joined tables should appear as upstream sources.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: local CTAS lineage is anchored to the materialized asset

Found 1 issue:

  1. [P1] Local CTAS staging columns are reported as final materialized columns.
    backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:828

    CreateTable unconditionally calls handle_query_reads, and handle_query_reads unconditionally calls infer_column_lineage. That means a common DuckDB materializer like:

    -- materialize ducklake://warehouse/final
    CREATE TEMP TABLE tmp AS SELECT amount AS amt FROM dl.orders;
    SELECT amt AS total FROM tmp;

    records lineage for amt <- orders.amount, even though amt is only a temp-table column and the final output column is total. The graph endpoint exposes that flat column_lineage, and buildColumnGraph anchors each entry to the script’s materialize_target, so the UI shows warehouse/final.amt as if it existed. CTAS/CreateView lineage should only be inferred when the created object is the output asset, or the inference should carry an output target so local staging queries can be skipped.

Test coverage

Backend parser tests cover annotation parsing, merge precedence, joins, wildcards, and subquery avoidance, but I don’t see coverage for local/temp CTAS inside a materialized script. Frontend graph tests cover anchoring to materialize_target, but not this backend-produced false lineage case.

Manual verification should include a DuckDB pipeline member that stages through a temp table before the final managed-materialize SELECT; the column badge and trace should show only final output columns.

I attempted local targeted tests, but they did not run in this checkout: backend cargo test failed because clang rejected -fuse-ld=mold, and frontend npm run test:unit failed because vitest is not installed in node_modules.

A CTAS into a local/temp staging table isn't the materialized output, but its
projection was inferred and (flat) column_lineage anchored to the script's
// materialize target — so staging columns showed up as the final asset's. Gate
inference to the actual output: a top-level managed-materialize SELECT, or a
CTAS/CREATE VIEW whose target resolves to a real asset. (Codex P1)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[Codex P1] Local/temp staging CTAS reported as final columns — fixed (e6e4b45). Column-lineage inference is now gated to the actual materialized output: a top-level managed-materialize SELECT, or a CTAS/CREATE VIEW whose target resolves to a real asset. A CTAS into a local/temp staging table is no longer inferred, so its columns can't be anchored to the script's // materialize target. Added test_infer_lineage_skips_local_staging_ctas + test_infer_lineage_ctas_into_asset_still_inferred; verified live (a staged SELECT amt AS total FROM tmp reports no amt on the final asset, while a direct CTAS keeps its lineage).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte">

<violation number="1" location="frontend/src/lib/components/assets/AssetGraph/AssetGraphDetailsPane.svelte:375">
P2: `liveColumnLineage` can leak across script switches because it is a persistent `$state` variable that is forwarded in `onAssetsChange` without resetting when `script.path` changes. During a transition from script A to script B, the `$effect` re-runs as soon as `script.path` updates but before the new `ScriptEditor` has computed new lineage, pairing B’s path with A’s stale data. If the new script triggers an inference error, the `watch` in `ScriptEditor` returns without updating the binding, so the leak becomes permanent.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

Comment thread backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs Outdated
@github-actions

Copy link
Copy Markdown
Contributor

All critical fixes confirmed in the current code. Here's the review:


Pi Review

Good to merge — all previously flagged issues have been addressed in the fix commits. No new issues introduced.

Prior review findings — all resolved

  1. [P1] DuckDB parse failures dropped // column annotationsFixed (49ff11215). The graph endpoint (backend/windmill-api-assets/src/lib.rs:835) now falls back via .unwrap_or_else(|_| annotated()), preserving explicit annotation lineage when sqlparser rejects the SQL body (dynamic SQL, unsupported DuckDB syntax).

  2. [P1] JOIN columns not resolved for inferred lineageFixed (4319d744f). build_from_maps (backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:399) iterates table_with_joins.joins alongside the base relation, and the single_table guard requires from_tables[0].joins.is_empty(). Covered by test_infer_lineage_resolves_joined_inputs.

  3. [P1] Column lineage badge on every ducklake write / multi-output anchorFixed (4319d744f + e2cdceeeb). Both buildColumnGraph (columnLineageGraph.ts:54) and the canvas badge gating (AssetGraphCanvas.svelte:448) prefer the declared materialize_target, falling back to a ducklake write-edge only when no // materialize annotation exists.

  4. [P2] Column trace used draft graph in deployed viewFixed (4459df726f). The trace graph is now built from displayGraph (+page.svelte:1994), which matches the exact graph the canvas renders (draft overlays in edit/show-drafts, deployed-only in plain View).

  5. Temp CTAS staging columns not inferredFixed (e6e4b4508). handle_query_reads gates inference behind is_asset_output — a CTAS into a TEMP/local table is not the materialized output, so its intermediate columns are not reported. Covered by test_infer_lineage_skips_local_staging_ctas.

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec and parse_column_ref / parseColumnRef are behaviorally identical (last-dot splitting, single_ident validation, parse_asset_syntax). Asserted against the shared pipeline_annotations.json fixture corpus (4 column-lineage entries + 1 mixed). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.

  • merge_column_lineage (Rust + TS): Annotation wins per output column; behaviorally identical across both implementations. Both have matching unit tests.

  • SQL AST inference (infer_column_lineage): Walks projections via ColumnIdentCollector (subquery depth-gated at query_depth > 0), resolves through the shared build_from_maps, merges with annotations. 6 tests cover passthrough, computed columns, annotation override, wildcards, subquery avoidance, and JOIN resolution.

  • colNodeId collision-proofing: JSON-encoded [kind, path, column] tuple — tested explicitly for # / : separators.

  • $bindable() pattern: inferredColumnLineage = $bindable() on the optional ColumnLineage[] prop in ScriptEditor.svelte — no default value, so it does not trigger the banned $bindable(default_value) pattern from AGENTS.md.

  • Auth posture: No new public endpoints — column_lineage and materialize_target are additional fields on the existing asset_graph response, surfaced from metadata already parsed from the user's own scripts. No new auth surface.

  • SQL injection: The new .sqlx query uses parameterized $1 and $2. No dynamic query construction.

  • Live overlay behavior: applyLiveBufferOverlay only handles trigger overlays — consistent with data_tests (the same pattern). WASM-inferred column lineage IS overlaid live via liveBodyAssets.columnLineage, merged with annotation-wins precedence in seedDraftOverlays. The // column annotation text itself is a draft-cycle for the literal annotation text, identical to // data_test.

Test coverage

  • Backend: Rust unit tests cover SQL inference (computed, passthrough, annotation override, wildcard, JOIN, subquery-avoidance, temp-CTAS skip), annotation parsing (basic, schema-qualified, malformed-ref drop, fail-safe), merge_column_lineage, and Rust↔TS parity via the shared fixture corpus. The graph endpoint's fallback path (DuckDB parse failure → annotation-only) is exercised structurally through unwrap_or_else(|_| annotated()) — a defensive composition with the tested annotation parser.

  • Frontend: columnLineageGraph.test.ts covers 8 cases (multi-hop stitching, materialize_target anchoring, fallback to write-edge, collision-proof IDs, unanchorable producer skip, bidirectional trace, sibling isolation, depth layout). parsePipelineAnnotations.parity.test.ts and parsePipelineAnnotations.test.ts cover TS annotation parsing + merge. No Svelte component tests expected per repo policy.

  • Docs: docs/pipelines-vs-dbt.md updated to reflect the current state (WASM inference wired, live drafts work, annotation as override). No automated tests expected for docs.

  • Manual verification: Already performed by the author across multiple rounds — verified deployed graph column-lineage diagram, live editor with WASM-inferred lineage, materialize-target badge gating, and transitive trace across multi-hop chains. Screenshots attached to the PR. No additional manual verification needed.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: multi-output column lineage is anchored to one asset

Found 1 issue:

  1. [P1] Multi-output DuckDB lineage can be shown on the wrong asset.
    frontend/src/lib/components/assets/AssetGraph/columnLineageGraph.ts:74

    buildColumnGraph picks one outputAsset per runnable, then anchors every r.column_lineage entry to that single asset. But the backend SQL parser accumulates inferred lineage from every output-producing query into one flat column_lineage list. For a script that has // materialize ducklake://wh/final plus an auxiliary CREATE TABLE dl.audit AS SELECT id FROM dl.source, the parser will include the audit.id <- source.id lineage, and the frontend will render it as final.id <- source.id. The same problem applies to multiple CTAS outputs without // materialize, where the fallback picks one DuckLake write edge. This is worse than missing lineage because it creates false column dependencies on the displayed materialized asset.

Test coverage

Backend parser tests cover computed columns, passthroughs, annotation override, JOIN resolution, wildcard skipping, subquery skipping, and local/temp staging CTAS. Frontend tests cover transitive stitching and choosing the // materialize target, but they do not cover a script whose inferred lineage contains columns from multiple real output assets.

Frontend component tests are not expected by repo policy. Manual verification should include a DuckDB pipeline member with a managed // materialize target and an additional real DuckLake CTAS output; selecting the materialized asset should not show columns inferred from the auxiliary output.

Inference accumulated columns from every output-producing query into one flat
list, all anchored (frontend) to the script's // materialize target — so an
auxiliary CTAS into a different asset showed its columns on the materialized
one. Tag each inferred entry with its output asset and, in parse_assets, scope
the list to the // materialize target (keeping untagged top-level-SELECT
entries); with no declared target, drop inference when entries span multiple
output assets rather than attribute them to an arbitrary one. Parser-internal —
no wire change. (Codex P1)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[Codex P1] Multi-output lineage attributed to the wrong asset — fixed (57cd092). Inference now tags each entry with its output asset and parse_assets scopes the flat list to the // materialize target (keeping untagged top-level-SELECT entries). With no declared target, inference is dropped when entries span multiple output assets rather than attributed to an arbitrary one. Parser-internal, no wire change. Added test_infer_lineage_scopes_to_materialize_target + test_infer_lineage_drops_ambiguous_multi_output (64 sql-asset tests pass). Verified live: a script with // materialize …/final + an auxiliary CREATE TABLE dl.audit AS … reports only final's total ← orders.amount — no audit columns.

@github-actions

Copy link
Copy Markdown
Contributor

Now I have a complete picture of the PR's current state. Let me compile my review.


Pi Review

Good to merge — all findings from prior review rounds have been addressed. No new issues introduced.

Prior review findings — all resolved

  1. [P1] DuckDB parse failures dropped // column annotationsFixed (49ff11215). backend/windmill-api-assets/src/lib.rs:835 falls back via .unwrap_or_else(|_| annotated()), preserving explicit annotation lineage when sqlparser rejects the DuckDB body.

  2. [P1] JOIN columns not resolved for inferred lineageFixed (4319d744f). build_from_maps (asset_parser.rs:455) iterates table_with_joins.joins alongside the base relation, and single_table is Some only when from_tables[0].joins.is_empty(). Covered by test_infer_lineage_resolves_joined_inputs (validates a two-sided JOIN with a computed column mixing both sides).

  3. [P1] Multi-output anchor / badge placementFixed (4319d744f + e2cdceeeb). Both buildColumnGraph (columnLineageGraph.ts:56) and the canvas badge gating (AssetGraphCanvas.svelte:448) prefer the declared materialize_target. The canvas checks matTarget ? exact_match : ducklake before attaching the columns badge. The backend scopes inferred lineage to the materialize target in parse_assets (line 38–77), gating auxiliary CTAS columns out.

  4. [P2] Column trace used draft graph in deployed viewFixed (4459df726f). The trace graph is now built from displayGraph (+page.svelte:1993), which matches the canvas (draft overlays in edit/show-drafts, deployed-only in plain View).

  5. [P1] Temp CTAS staging columns not inferredFixed (e6e4b4508a). handle_query_reads gates inference behind get_associated_asset_from_obj_name on the created table name — a CTAS into a TEMP/local table is not an asset, so its intermediate columns are not reported. Covered by test_infer_lineage_skips_local_staging_ctas + test_infer_lineage_ctas_into_asset_still_inferred.

  6. [P1] Multi-output inference scopingFixed (57cd09290). The backend parse_assets now scopes inferred lineage to the // materialize target. Without a target, inference is dropped when multiple real CTAS outputs exist (ambiguous). Covered by test_infer_lineage_scopes_to_materialize_target + test_infer_lineage_drops_ambiguous_multi_output.

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec and parse_column_ref / parseColumnRef are behaviorally identical (last-dot splitting, single_ident validation, parse_asset_syntax). Asserted against the shared pipeline_annotations.json fixture corpus (4 column-lineage entries + edge cases). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.

  • merge_column_lineage / mergeColumnLineage: Annotation wins per output column (seen set bypasses inferred dupes). Behaviorally identical across Rust and TS with matching unit tests.

  • SQL AST inference (infer_column_lineage): Walks projections via ColumnIdentCollector (subquery depth-gated at query_depth > 0), resolves through the shared build_from_maps. 10 tests cover: computed columns, passthrough, annotation override, local staging CTAS skip, asset CTAS infer, materialize-target scoping, ambiguous multi-output drop, wildcard, JOIN resolution (both sides + computed mixing both), and subquery avoidance.

  • Backend API: column_lineage and materialize_target on GraphRunnableNode populated from the graph endpoint. DuckDB path runs parse_assets (inferred + annotated merge), falling back to annotation-only on parse failure. No new public API endpoints — additional fields on the existing asset_graph response, surfaced from metadata already parsed from the user's scripts.

  • Frontend graph plumbing: resolveGraph.ts seedDraftOverlays merges WASM-inferred lineage with annotation lineage under annotation-wins precedence, sets materialize_target from parsed annotations. applyLiveBufferOverlay only handles trigger overlays — consistent with data_tests (the same pre-existing pattern). WASM-inferred column lineage is live-overlaid via liveBodyAssets.columnLineage through seedDraftOverlays.

  • colNodeId collision-proofing: JSON-encoded [kind, path, column] tuple — tested explicitly for # / : separators.

  • No AGENTS.md violations:

    • No $bindable(default_value) on optional props — inferredColumnLineage = $bindable() is bare, and the prop is optional (?: ColumnLineage[]).
    • No crypto.randomUUID() in frontend code.
    • ScriptEditor clears inferredColumnLineage on parse errors to prevent cross-script leakage.
    • SQLx offline cache updated for the new query (adding language column).

Test coverage

  • Backend: 10 SQL inference unit tests (computed, passthrough, annotation override, local staging CTAS skip, asset CTAS infer, materialize-target scope, ambiguous multi-output, wildcard, JOIN resolution, subquery avoidance) + annotation parsing/merge tests + Rust↔TS parity integration test against the shared fixture corpus. The column_lineage_by_path fallback to annotation-only is covered structurally by the map closure.

  • Frontend: columnLineageGraph.test.ts with 8 tests (transitive stitching, materialize_target anchoring, write-edge fallback, unanchorable producer skip, collision-proof node IDs, bidirectional trace, sibling isolation, depth layout). parsePipelineAnnotations.test.ts + .parity.test.ts cover annotation parsing. No Svelte component tests expected per repo policy.

  • Docs: docs/pipelines-vs-dbt.md updated to reflect the shipped state. No automated tests expected for docs changes.

Manual verification still needed

Deploy a DuckDB pipeline member that joins two source assets with a computed column mixing both sides (e.g., SELECT o.amount + c.discount AS net FROM dl.orders o JOIN dl.customers c ...), then confirm the deployed graph badge shows the correct column count and the column-lineage trace in the details pane resolves both orders.amount and customers.discount as inputs to net.

Deploy a DuckDB pipeline member with explicit // column annotations plus SQL outside sqlparser coverage (e.g., ${sql.raw('...')}), confirming the graph still renders the column badge and trace from the annotations despite the SQL parse failure.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

cc @rubenfiszel
Should address issues before merging: temp staging can leak as DuckLake column lineage

Found 1 issue:

  1. [P1] CREATE TEMP staging can still leak as a DuckLake column source under USE.
    backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:890

    The CreateTable arm calls track_table_definition(&create_table.name) and later gates inference with get_associated_asset_from_obj_name(&create_table.name, Some(W)), but neither path checks create_table.temporary. With an active USE dl, a one-part temp name like tmp resolves to ducklake://warehouse/tmp before it is ever added to local_table_names. Then a common materializer shape like:

    -- materialize ducklake://warehouse/final
    ATTACH 'ducklake://warehouse' AS dl;
    USE dl;
    CREATE TEMP TABLE tmp AS SELECT amount AS amt FROM orders;
    SELECT amt AS total FROM tmp;

    makes the final inferred lineage resolve tmp.amt as warehouse/tmp.amt, and parse_assets keeps that untagged top-level SELECT lineage for the materialize target. The graph can therefore show final.total <- warehouse/tmp.amt, inventing a DuckLake asset for a local staging table. CREATE TEMP should be marked local before active-asset resolution and should not be inferred as an output asset.

Test coverage

Backend parser tests cover joins, annotation override, wildcard skipping, subquery skipping, local/temp staging without USE, and multi-output scoping. The missing case is local/temp staging while an attached DuckLake database is active via USE, which is the path that still resolves the temp table as an asset.

Frontend pure-logic tests cover materialize-target anchoring and fallback graph stitching. No Svelte component tests are expected by repo policy.

Manual verification should use a DuckDB pipeline member with ATTACH, USE, CREATE TEMP TABLE tmp AS ..., and a final managed SELECT into // materialize ducklake://.../final; the column badge/details should not invent warehouse/tmp as an upstream asset.

A one-part temp name under `USE dl` resolved to an asset (ducklake://…/tmp)
before being registered local, so a final SELECT reading it invented
`final.total <- warehouse/tmp.amt` (a phantom DuckLake column) and recorded a
phantom asset. track_table_definition now registers any temporary table/view as
local up front, bypassing active-asset resolution; CreateTable/CreateView pass
their `temporary` flag. (Codex P1)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@rubenfiszel

Copy link
Copy Markdown
Contributor Author

[Codex P1] CREATE TEMP under USE leaked a phantom DuckLake column — fixed (8b86bdc). track_table_definition now registers any temporary table/view as local before active-asset resolution, so a one-part temp name under USE dl no longer resolves to ducklake://…/tmp. CreateTable/CreateView pass their temporary flag. Verified live: the reported CREATE TEMP TABLE tmp …; SELECT amt AS total FROM tmp under USE now reports no lineage (and no phantom warehouse/tmp asset) instead of final.total <- warehouse/tmp.amt. Added test_infer_lineage_temp_table_under_use_is_local (65 sql-asset tests).

I also did a proactive audit of the whole inference + asset-resolution path: the remaining cases (CTE-through, INSERT…SELECT, partial multi-source) are missing lineage, not false lineage, and are safe best-effort limitations.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs">

<violation number="1" location="backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs:943">
P1: DROP statements ignore the `temporary` flag from sqlparser, hardcoding `is_temporary = false`. This can cause `DROP TEMP TABLE ...` (or similar temporary-drop syntax) to be mis-resolved as an asset write when the name isn't already in `local_table_names`.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Fix all with cubic | Re-trigger cubic

self.track_table_definition(name);
// DROP is a write to the named object; resolve it as an
// asset if it is one (not a temp-creation context).
self.track_table_definition(name, false);

@cubic-dev-ai cubic-dev-ai Bot Jun 27, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: DROP statements ignore the temporary flag from sqlparser, hardcoding is_temporary = false. This can cause DROP TEMP TABLE ... (or similar temporary-drop syntax) to be mis-resolved as an asset write when the name isn't already in local_table_names.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At backend/parsers/windmill-parser-sql-asset/src/asset_parser.rs, line 943:

<comment>DROP statements ignore the `temporary` flag from sqlparser, hardcoding `is_temporary = false`. This can cause `DROP TEMP TABLE ...` (or similar temporary-drop syntax) to be mis-resolved as an asset write when the name isn't already in `local_table_names`.</comment>

<file context>
@@ -930,7 +938,9 @@ impl Visitor for AssetCollector {
-                        self.track_table_definition(name);
+                        // DROP is a write to the named object; resolve it as an
+                        // asset if it is one (not a temp-creation context).
+                        self.track_table_definition(name, false);
                     }
                 }
</file context>
Fix with cubic

@github-actions

Copy link
Copy Markdown
Contributor

Pi Review

Good to merge — all findings from prior review rounds have been addressed in the fix commits. No new issues introduced.

Prior review findings — all resolved

  1. [P1] DuckDB parse failures dropped // column annotationsFixed (49ff11215). column_lineage_by_path in backend/windmill-api-assets/src/lib.rs:835 falls back via .unwrap_or_else(|_| annotated()), preserving explicit annotation lineage when sqlparser rejects the DuckDB body.

  2. [P1] JOIN columns not resolved for inferred lineageFixed (4319d744f). build_from_maps (asset_parser.rs:455) iterates table_with_joins.joins alongside the base relation; single_table is Some only when from_tables[0].joins.is_empty(). Covered by test_infer_lineage_resolves_joined_inputs (two-sided JOIN with a computed column mixing both sides).

  3. [P1] Multi-output anchor / badge placementFixed (4319d744f + e2cdceeeb). Both buildColumnGraph (columnLineageGraph.ts:56) and the canvas badge gating (AssetGraphCanvas.svelte:448) prefer the declared materialize_target, falling back to a ducklake write-edge only when no annotation exists.

  4. [P2] Column trace used draft graph in deployed viewFixed (4459df726f). The columnGraph derived is built from displayGraph (+page.svelte:1978), matching the canvas exactly.

  5. [P1] Temp CTAS staging columns leaked into materialized outputFixed (e6e4b4508a). track_table_definition now takes is_temporary: bool; temp tables are always local (never record as an asset). CTAS inference is gated to real assets only. Covered by test_infer_lineage_skips_local_staging_ctas + test_infer_lineage_ctas_into_asset_still_inferred.

  6. [P1] Multi-output inference scopingFixed (57cd09290). parse_assets scopes inferred lineage to the // materialize target; without a target, inference is dropped when tagged entries span multiple distinct output assets. Covered by test_infer_lineage_scopes_to_materialize_target + test_infer_lineage_drops_ambiguous_multi_output.

  7. [P1] Temp table under USE resolved as an assetFixed (8b86bdc4a). track_table_definition treats temporary = true as unconditionally local, so CREATE TEMP TABLE tmp ... under an active USE dl never creates a phantom warehouse/tmp asset. Covered by test_infer_lineage_temp_table_under_use_is_local.

What I verified

  • Parser parity (Rust ↔ TS): parse_column_lineage_spec / parseColumnLineageSpec and parse_column_ref / parseColumnRef are behaviorally identical (last-dot splitting, single_ident validation, parse_asset_syntax shorthands). Asserted against the shared pipeline_annotations.json fixture corpus (4 new column-lineage entries). The TS parity test's ASSERTED_TS_FIELDS includes columnLineage: true.

  • merge_column_lineage (Rust + TS): Annotation wins per output column; matched unit tests on both sides. Consistent behavior.

  • SQL AST inference (infer_column_lineage): Walks projections via ColumnIdentCollector (subquery depth-gated at query_depth > 0), resolves through the shared build_from_maps. 10 tests cover passthrough, computed, annotation override, wildcard, subquery avoidance, JOIN resolution, local staging, materialize-target scoping, and ambiguous multi-output drop.

  • colNodeId collision-proofing: JSON-encoded [kind, path, column] tuple — tested explicitly for # / : separators.

  • $bindable() pattern: inferredColumnLineage = $bindable() on the optional ColumnLineage[] prop in ScriptEditor.svelte — no default value, does not trigger the banned $bindable(default_value) pattern. No crypto.randomUUID() anywhere in frontend code.

  • Auth posture: No new public endpoints. column_lineage and materialize_target are additional fields on the existing asset_graph response, surfaced from metadata parsed from the user's own scripts. Read-only metadata, same auth as data_tests.

  • SQL injection: The modified query uses parameterized $1 and $2. No dynamic query construction.

  • SQLx offline cache: New .sqlx/query-*.json file present for the updated query (adding language column).

  • Live overlay behavior: applyLiveBufferOverlay only handles trigger overlays — consistent with data_tests. WASM-inferred column lineage is overlaid live via liveBodyAssets.columnLineage, merged with annotation-wins precedence in seedDraftOverlays. The // column annotation literal text is a draft-cycle, identical to // data_test — both drop stale state on pane teardown. Not a regression.

  • Performance gating: columnGraph derived is gated to selection?.kind === 'asset' && selection.asset_kind === 'ducklake', so buildColumnGraph doesn't run on every keystroke when the trace UI isn't shown.

  • Docs: docs/pipelines-vs-dbt.md updated to reflect shipped feature, WASM inference, annotation precedence, and graph surface. Consistent with the implementation.

Test coverage

  • Backend: 10 Rust unit tests for SQL inference (passthrough, computed, annotation override, wildcard, subquery avoidance, JOIN resolution, local staging, temp-under-USE, materialize-target scoping, ambiguous multi-output drop). 5 unit tests for annotation parsing (basic, schema-qualified, malformed-ref drop, merge precedence, fail-safe edges). Parity integration test against shared fixture corpus. SQLx cache updated. Adequate for parser and API changes.

  • Frontend: Parity test against shared fixture corpus including column_lineage assertions. columnLineageGraph.test.ts with 8 tests (transitive stitching, materialize_target anchoring, write-edge fallback, collision-proof IDs, unanchorable producer skip, bidirectional trace, sibling isolation, depth layout). parsePipelineAnnotations.test.ts with 2 tests for mergeColumnLineage. No Svelte component tests expected per repo policy.

  • Docs / CI / config: docs/pipelines-vs-dbt.md updated. No automated tests expected for docs changes.

  • Manual verification: The PR description confirms live testing on a deployed pipeline, and the fix-commit messages each describe live verification of their specific fix. No further manual verification needed.

@github-actions

Copy link
Copy Markdown
Contributor

Codex Review

Good to merge

No issues found in the current diff. Previously raised issues around parse-failure fallback, JOIN resolution, multi-output anchoring, deployed-view draft leakage, local/temp CTAS staging, and CREATE TEMP under USE are addressed in the current head.

Test coverage

Backend parser coverage looks solid for the changed logic: annotation parsing/parity, merge precedence, computed and passthrough inference, JOIN inputs, wildcard/subquery skipping, materialize-target scoping, ambiguous multi-output dropping, and temp staging under active USE.

Frontend pure-logic coverage covers parser parity and column graph stitching/trace behavior, including materialize-target anchoring and collision-proof node IDs. No Svelte component tests are expected by repo policy.

I did not run tests locally for this review. Manual verification should still include the deployed pipeline graph with a DuckDB materializer using ATTACH, USE, CREATE TEMP TABLE, and a final // materialize SELECT, confirming no phantom warehouse/tmp column lineage appears.

@rubenfiszel rubenfiszel merged commit 003a262 into main Jun 27, 2026
21 checks passed
@rubenfiszel rubenfiszel deleted the pipeline-column-lineage branch June 27, 2026 19:08
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 27, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant