Skip to content

[Bug] dataset_fields with NULL type causes explosion in column_lineage #3083

Description

@bsmin10010-cloud

Description

I have identified a critical bug where dataset_fields are duplicated indefinitely if the field type is null (or omitted). This behavior triggers a cascading, exponential explosion of rows in the column_lineage table, leading to severe database performance degradation.

Root Cause

The issue stems from the INSERT statement for dataset_fields within the PostgreSQL ON CONFLICT clause:

INSERT INTO dataset_fields ...
ON CONFLICT(dataset_uuid, name, type) DO UPDATE ...

In SQL standards, NULL != NULL. Therefore, if a field's type is provided as NULL, the Unique Constraint fails to identify the existing row. As a result, Marquez treats every incoming event for such fields as a "new" entry and performs an INSERT instead of an UPDATE.

Impact: Exponential Data Explosion

This bug turns a linear data ingestion process into an exponential one.

  1. Field Duplication:

    • A single field (e.g., modified_at) can create thousands of duplicate rows in the dataset_fields table over time if the type is consistently null.
  2. Lineage Explosion (The Multiplier Effect):

    • When a job runs to link this duplicated input field to an output field, Marquez generates lineage edges for ALL duplicates.
    • Expected: 1 Input Field UUID → 1 Output Field UUID = 1 row in column_lineage.
    • Actual: N Input Field UUIDs → 1 Output Field UUID = N rows in column_lineage.
  3. Consequence:

    • In my environment, I observed that the column_lineage table contained over 1,000x more rows than necessary.
    • Nearly 99.9% of the lineage data was garbage caused by this duplication.
    • This massive overhead causes API timeouts when querying lineage and significantly increases database storage usage.

Steps to Reproduce

  1. Emit an OpenLineage event for a dataset where a field's type is explicitly null.
  2. Emit the same event multiple times (e.g., 10 times).
  3. Check the dataset_fields table; you will see 10 duplicate rows for the same field name.
  4. Emit a columnLineage event involving that field.
  5. Check the column_lineage table; you will see 10 lineage rows created for what should be a single relationship.

Suggested Fix

Modify the INSERT query in the DAO layer to handle NULL types explicitly.

Using COALESCE(:type, 'UNKNOWN') (or a similar default value) ensures that the ON CONFLICT clause correctly matches existing rows and performs an update instead of a duplicate insert.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions