feat(DENG-11205): Productionizing Probe Fetcher Pipeline by phil-lee70 · Pull Request #9557 · mozilla/bigquery-etl

phil-lee70 · 2026-06-09T17:47:49Z

Description

Productionizes the lineage + probe-fetch portion of the schema_enricher metadata-completeness pipeline as two new weekly bqetl tasks under data_governance_metadata_derived, downstream of Gaurang's column_profiles_v1.

New tables

lineage_mapping_v1 — for each table profiled in column_profiles_v1, walks DataHub upstream lineage to identify the source telemetry ping (Glean or Legacy Telemetry). One row per source table.
probe_definitions_v1 — for each distinct ping in lineage_mapping_v1, fetches probe definitions from the Glean Dictionary (Glean) or the closest upstream _stable table's DataHub schema (Legacy Telemetry). One row per probe.

Both tasks run on the existing bqetl_data_governance_metadata DAG (weekly, Mondays 04:00 UTC). Dependency chain is expressed via referenced_tables in each metadata.yaml:

column_profiles_v1 → lineage_mapping_v1 → probe_definitions_v1

Design notes

Vendored logic. Both query.py files vendor the lineage-walk and probe-fetch helpers from data-shared-llm-agents/agents/schema_enricher/src/schema_enricher/tools/probe_fetcher.py so the scripts depend only on google-cloud-bigquery + requests. Helper functions (_select_ping_from_lineage, fetch_glean_probes, fetch_legacy_probes) are preserved byte-equivalent — including the multi-app aggregation branch for cross-app Glean tables.
Intentional omissions from the agent version. Per-call column-name filtering and the 500-probe cap from fetch_probe_definitions are not replicated — those exist to keep LLM context small at agent read time. The bqetl cache materializes everything; the agent filters at SELECT time via WHERE clauses. Documented in each module docstring.
Cache-on-error semantics. "No ping found" and "multi-app aggregation" are cached as NULL rows (semantic). DataHub transport errors are not cached — the worker skips the row so the next run retries. The 7-day TTL in the original agent cache is replaced by weekly partition overwrites with 90-day retention.
One pair of tables, not per-source-dataset. Lineage and probe data is small (one row per table, one row per probe), so the per-dataset split Gaurang used for column_profiles_v1 isn't needed here.
Date-partitioned on resolved_at / fetched_at with 90-day retention. Older partitions reflect DataHub/Glean state at the time of run, not historical state.

Verification

python -m py_compile clean on both query.py files.
bqetl dag generate bqetl_data_governance_metadata produces the expected set_upstream chain.
Both scripts support --dry-run and ad-hoc single-input flags (--tables on lineage, --pings on probes) for verification against the agent's probe_fetcher.py.

Related Tickets & Documents

DENG-11205

Reviewer, please follow this checklist

github-actions

This PR adds two new weekly query.py tasks under data_governance_metadata_derived — lineage_mapping_v1 (walks DataHub upstream lineage to resolve each profiled table's source ping) and probe_definitions_v1 (fetches probe definitions per ping from the Glean Dictionary or a _stable table's DataHub schema), chained downstream of column_profiles_v1 via referenced_tables.

The code closely follows the established column_profiles_v1 pattern: self-contained vendored helpers, explicit BigQuery schemas matching schema.yaml, partition-decorator WRITE_TRUNCATE loads for idempotency, and the secrets:/scheduling blocks match existing conventions. The requests dependency is fine — it's a pinned direct dep in requirements.txt, present in the runner image. Schemas, field modes, and partition/clustering config are all consistent between the Python schemas, schema.yaml, and metadata.yaml. The two inline notes are minor robustness/determinism observations, not blockers.

github-actions · 2026-06-09T17:53:21Z

+        SELECT
+          ping_platform,
+          source_ping,
+          ANY_VALUE(stable_urn) AS stable_urn


suggestion: ANY_VALUE(stable_urn) silently picks an arbitrary stable_urn when a legacy source_ping resolves to more than one stable table across the profiled derived tables. For legacy telemetry, source_ping is just the bare ping name (e.g. main), with no app prefix, so two derived tables that both trace back to a main ping but through different _stable tables would collapse into one group and the winning stable_urn is non-deterministic. Since fetch_legacy_probes reads the field schema of exactly that stable_urn, the probe set for the ping would then depend on which row ANY_VALUE happened to pick. If a single canonical stable table per ping name isn't guaranteed, consider grouping by stable_urn too (or MIN/MAX for determinism) so the choice is at least stable run-to-run.

This may be worth considering

agree, replaced with MAX(stable_urn) for deterministic answer

github-actions · 2026-06-09T17:53:29Z

+            f"Lineage failed for {project}.{dataset}.{table_name}: {e}. "
+            f"Skipping row — not caching as NULL."
+        )
+        return None


suggestion: On a DataHub transport error this skips the row, and save_lineage does a WRITE_TRUNCATE of the whole partition with only the successfully-resolved rows — so a skipped table is simply absent from this run's partition. The module docstring frames this as "the next run retries," but with the weekly cadence and probe_definitions_v1 consuming the same partition on the same ds, a transient blip drops that table's ping (and therefore its probes) for the entire week, not just for a moment. A bounded retry/backoff around the resolve_ping/_datahub_graphql call would close that gap and keep the "skip rather than poison the cache" semantics intact for genuinely persistent failures.

gkatre · 2026-06-10T17:06:48Z

+    if not rows:
+        logger.warning("No lineage rows produced; nothing to write.")
+        return


This appears to indicate a more serious error. Perhaps consider raising for errors when all attempts (retries) fail? (examples that i can think: DataHub outage, token expiration, etc)

Something similar to this

agreed, implemented fail loud pattern

gkatre · 2026-06-10T17:14:08Z

+    if not rows:
+        logger.warning("No probe rows produced; nothing to write.")
+        return


Similar to the other one...raising for serious total failure errors or every fetch returns empty.

agreed, implemented fail loud pattern

gkatre

@phil-lee70 Nicely done!

I left a couple of comments perhaps worth considering. I also see a couple comments from the bot. Otherwise this is good to go.

scholtzan · 2026-06-11T12:10:43Z

Integration report

View full diff, if available (SQL + DAGs) (requires private-bigquery-etl access)
Private CI run

phil-lee70 added 2 commits June 9, 2026 13:41

feat(DENG-11205): Productionizing Probe Fetcher Pipeline

893b5ff

Merge branch 'main' into deng-11205

e5478e5

github-actions Bot reviewed Jun 9, 2026

View reviewed changes

This comment has been minimized.

Sign in to view

gkatre reviewed Jun 10, 2026

View reviewed changes

gkatre approved these changes Jun 10, 2026

View reviewed changes

add fail loud pattern and replace any()

58de254

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(DENG-11205): Productionizing Probe Fetcher Pipeline#9557

feat(DENG-11205): Productionizing Probe Fetcher Pipeline#9557
phil-lee70 wants to merge 3 commits into
mainfrom
deng-11205

phil-lee70 commented Jun 9, 2026

Uh oh!

github-actions Bot left a comment

Uh oh!

github-actions Bot Jun 9, 2026

Uh oh!

gkatre Jun 10, 2026

Uh oh!

phil-lee70 Jun 11, 2026

Uh oh!

github-actions Bot Jun 9, 2026

Uh oh!

This comment has been minimized.

This comment has been minimized.

gkatre Jun 10, 2026

Uh oh!

phil-lee70 Jun 11, 2026

Uh oh!

gkatre Jun 10, 2026

Uh oh!

phil-lee70 Jun 11, 2026

Uh oh!

gkatre left a comment

Uh oh!

scholtzan commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

phil-lee70 commented Jun 9, 2026

Description

Design notes

Verification

Related Tickets & Documents

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

gkatre Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

phil-lee70 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

This comment has been minimized.

This comment has been minimized.

gkatre Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

phil-lee70 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gkatre Jun 10, 2026

Choose a reason for hiding this comment

Uh oh!

phil-lee70 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gkatre left a comment

Choose a reason for hiding this comment

Uh oh!

scholtzan commented Jun 11, 2026

Integration report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants