mozilla · akkomar · Apr 13, 2026 · Apr 16, 2026 · Apr 22, 2026 · May 5, 2026
@@ -6,6 +6,7 @@
 # the private file second). If you see dependency weirdness in the private
 # image, check for version overlaps between requirements.txt and
 # requirements-private.txt first.
+anthropic==0.109.2
 attrs==25.4.0
 authlib==1.6.12
 beautifulsoup4==4.14.3

@@ -140,10 +140,15 @@ annotated-types==0.7.0 \
     --hash=sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53 \
     --hash=sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89
     # via pydantic
+anthropic==0.109.2 \
+    --hash=sha256:d37db299597c7bc124b49b767ff135f1e6456b64af2b2fad4b63b2a1df333cf0 \
+    --hash=sha256:e0fb4ca5df0ed983248c9c6c3242adc81d9cfddb8725902da53698554117abac
+    # via -r requirements.in
 anyio==4.11.0 \
     --hash=sha256:0287e96f4d26d4149305414d4e3bc32f0dcd0862365a4bddea19d7a1ec38c4fc \
     --hash=sha256:82a8d0b81e318cc5ce71a5f1f8b5c4e63619620b63141ef8c995fa0db95a57c4
     # via
+    #   anthropic
     #   google-genai
     #   httpx
     #   openai
@@ -518,8 +523,13 @@ distro==1.9.0 \
     --hash=sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed \
     --hash=sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2
     # via
+    #   anthropic
     #   google-genai
     #   openai
+docstring-parser==0.18.0 \
+    --hash=sha256:292510982205c12b1248696f44959db3cdd1740237a968ea1e2e7a900eeb2015 \
+    --hash=sha256:b3fcbed555c47d8479be0796ef7e19c2670d428d72e96da63f3a40122860374b
+    # via anthropic
 exceptiongroup==1.3.1 \
     --hash=sha256:8b412432c6055b0b7d14c310000ae93352ed6754f70fa8f7c34141f91c4e3219 \
     --hash=sha256:a7a39a3bd276781e98394987d3a5701d0c4edffb633bb7a5144577f82c773598
@@ -1009,6 +1019,7 @@ httpx==0.28.1 \
     --hash=sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc \
     --hash=sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad
     # via
+    #   anthropic
     #   google-genai
     #   openai
 hyperframe==6.1.0 \
@@ -1054,9 +1065,7 @@ jaraco-classes==3.4.0 \
 jeepney==0.8.0 \
     --hash=sha256:5efe48d255973902f6badc3ce55e2aa6c5c3b3bc642059ef3a91247bcfcc5806 \
     --hash=sha256:c0a454ad016ca575060802ee4d590dd912e35c122fa04e70306de3d076cce755
-    # via
-    #   keyring
-    #   secretstorage
+    # via secretstorage
 jinja2==3.1.6 \
     --hash=sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d \
     --hash=sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67
@@ -1169,7 +1178,9 @@ jiter==0.11.1 \
     --hash=sha256:fa992af648fcee2b850a3286a35f62bbbaeddbb6dbda19a00d8fbc846a947b6e \
     --hash=sha256:fe04ea475392a91896d1936367854d346724a1045a247e5d1c196410473b8869 \
     --hash=sha256:fe4a431c291157e11cee7c34627990ea75e8d153894365a3bc84b7a959d23ca8
-    # via openai
+    # via
+    #   anthropic
+    #   openai
 jmespath==1.0.1 \
     --hash=sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980 \
     --hash=sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe
@@ -1923,6 +1934,7 @@ pydantic==2.9.1 \
     --hash=sha256:1363c7d975c7036df0db2b4a61f2e062fbc0aa5ab5f2772e0ffc7191a4f4bce2 \
     --hash=sha256:7aff4db5fdf3cf573d4b3c30926a510a10e19a0774d38fc4967f78beb6deb612
     # via
+    #   anthropic
     #   bigeye-sdk
     #   google-genai
     #   mozilla-nimbus-schemas
@@ -2590,9 +2602,7 @@ s3transfer==0.13.0 \
 secretstorage==3.3.3 \
     --hash=sha256:2403533ef369eca6d2ba81718576c5e0f564d5cca1b58f73a8b23e7d4eeebd77 \
     --hash=sha256:f356e6628222568e3af06f2eba8df495efa13b3b63081dafd4f7d9a7b7bc9f99
-    # via
-    #   bigeye-sdk
-    #   keyring
+    # via bigeye-sdk
 shellingham==1.5.4 \
     --hash=sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686 \
     --hash=sha256:8dbca0739d487e5bd35ab3ca4b36e11c4078f3a234bfce294b0a0291363404de
@@ -2625,6 +2635,7 @@ sniffio==1.3.1 \
     --hash=sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2 \
     --hash=sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc
     # via
+    #   anthropic
     #   anyio
     #   google-genai
     #   openai
@@ -2747,6 +2758,7 @@ typing-extensions==4.15.0 \
     --hash=sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548
     # via
     #   aiosignal
+    #   anthropic
     #   anyio
     #   beautifulsoup4
     #   cattrs

@@ -0,0 +1,219 @@
+# Metadata scripts - data classification PoC
+
+Tooling that profiles BigQuery columns and assigns each one a data-type label
+from Mozilla's data taxonomy (`classification/Taxonomy overview - Data
+Types.csv`). It reuses a field-profiling + telemetry-lineage pipeline to feed an
+LLM classifier.
+
+**Scope**: PoC. Speed over correctness. Single-table runs, eyeball the output,
+iterate.
+
+## Pipeline
+
+| Script | Phase | Output table (`mozdata-nonprod.analysis.*`) |
+|---|---|---|
+| `field_profiler.py` | **1.** Profile every column (null rate, distinct count, top values) and generate a pass-1 description from observed data. | `akomar_data_profiling_v1` |
+| `lineage_probe_fetcher.py` | **2.** Walk DataHub upstream lineage to the source ping, then fetch probe definitions (Glean Dictionary or legacy stable-table schema). Captures `data_sensitivity`, `send_in_pings`, and `tags` for Glean probes. | `akomar_metadata_phase2_table_pings_v1`, `akomar_metadata_phase2_ping_probes_v1` |
+| `description_reconciler.py` | **3.** *(original PoC, not used by the classifier)* Reconcile phase-1 data-driven descriptions with phase-2 probe intent. | `gkabbz_metadata_phase3_reconciled_v1` |
+| `field_classifier.py` | **4.** Classify each column against the taxonomy: primary + secondary labels, confidence, reasoning, data collection category. Reads phases 1 and 2 directly. | `akomar_field_classifications_v1` |
+
+All scripts accept `--table project.dataset.table` and are idempotent
+(already-processed rows are skipped on re-run).
+
+## Layout
+
+```
+script/metadata/
+  field_profiler.py          - Phase 1: profile every column + pass1 description
+  lineage_probe_fetcher.py   - Phase 2: resolve source ping + fetch probe defs
+                                        (Glean probes include data_sensitivity,
+                                         send_in_pings, tags)
+  description_reconciler.py  - original Phase 3 for descriptions (unused by the classifier)
+  field_classifier.py        - Phase 4: classify columns against the taxonomy
+  classify_table.sh          - wrapper: runs phases 1->2->4 per table, per model
+  classification/
+    Taxonomy overview - Data Types.csv   - source of truth (from legal/privacy)
+    build_taxonomy.py                    - CSV -> taxonomy.json
+    taxonomy.json                        - preprocessed, what the classifier reads
+    compare_models.py                    - diff Claude vs Gemini classifications
+    export_to_sheet.py                   - export classifications as CSV for Google Sheets
+```
+
+## Output tables (`mozdata-nonprod.analysis`)
+
+| Table | Written by | Contents |
+|---|---|---|
+| `akomar_data_profiling_v1` | `field_profiler.py` | One row per column: null rate, distinct count, top values, pass1 description |
+| `akomar_metadata_phase2_table_pings_v1` | `lineage_probe_fetcher.py` | Table -> source ping mapping (via DataHub lineage) |
+| `akomar_metadata_phase2_ping_probes_v1` | `lineage_probe_fetcher.py` | Probe defs per ping, incl. `data_sensitivity`, `send_in_pings`, `tags` for Glean |
+| `akomar_field_classifications_v1` | `field_classifier.py` | Final classification: `primary_label`, `secondary_labels`, `confidence`, `reasoning`, `needs_review`, `data_collection_category` (technical / interaction / web_activity / highly_sensitive), `model` (full model name) |
+
+## Setup (one-time)
+
+```bash
+pip install -r requirements.txt   # adds anthropic + google-genai to the venv
+export DATAHUB_GMS_TOKEN=...
+# Claude backend:
+export ANTHROPIC_API_KEY=...
+# Gemini backend - uses Vertex AI on the `mozdata` project:
+gcloud auth application-default login
+python script/metadata/classification/build_taxonomy.py   # regenerate taxonomy.json
+```
+
+## Usage
+
+**Classify one or more tables end-to-end** with the wrapper:
+
+```bash
+script/metadata/classify_table.sh \
+    moz-fx-data-shared-prod.search_derived.search_clients_daily_v8 \
+    moz-fx-data-shared-prod.ads_backend_stable.interaction_v1
+```
+
+It runs three phases per table (profile -> lineage/probes -> classify) with each
+model in `$MODELS`. Default is a single Gemini run
+(`gemini-3.1-flash-lite-preview`). To also classify with Claude and diff the two
+afterward:
+
+```bash
+MODELS="claude-sonnet-4-6 gemini-3.1-flash-lite-preview" \
+    script/metadata/classify_table.sh "$TABLE"
+python script/metadata/classification/compare_models.py --table "$TABLE"
+```
+
+**Or run the phases manually** (default model for `field_classifier.py` is
+`claude-sonnet-4-6`):
+
+```bash
+TABLE=moz-fx-data-shared-prod.search_derived.search_clients_daily_v8
+python script/metadata/field_profiler.py        --table "$TABLE"
+python script/metadata/lineage_probe_fetcher.py --table "$TABLE"
+python script/metadata/field_classifier.py      --table "$TABLE"
+python script/metadata/field_classifier.py      --table "$TABLE" --model gemini-3.1-flash-lite-preview
+python script/metadata/classification/compare_models.py --table "$TABLE"
+```
+
+**Inspect results:**
+
+```sql
+SELECT column_name, model, primary_label, confidence, needs_review, reasoning, data_collection_category
+FROM `mozdata-nonprod.analysis.akomar_field_classifications_v1`
+WHERE source_table = 'search_clients_daily_v8'
+ORDER BY column_name, model;
+```
+
+## Model selection
+
+`field_classifier.py --model <full-model-name>` picks the LLM; the backend is
+inferred from the name prefix:
+
+- `claude-*` -> Anthropic API (e.g. `claude-sonnet-4-6`, `claude-opus-4-7`).
+  Requires `ANTHROPIC_API_KEY`.
+- `gemini-*` -> Vertex AI on project `mozdata` (e.g.
+  `gemini-3.1-flash-lite-preview`). Requires application-default credentials.
+
+Default is `claude-sonnet-4-6`; names with any prefix other than `claude-` or
+`gemini-` are rejected. The destination table stores the full model name in a
+`model` column, and the idempotency key is
+`(project, dataset, table, column, model)`. Several models can therefore be run
+on the same table and every row is kept, which also allows version-to-version
+comparisons within a family.
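+
+A minimal sketch of the prefix dispatch and the idempotency key described
+above. The function names (`infer_backend`, `idempotency_key`, `is_done`) and
+the backend strings are hypothetical, chosen for illustration; they are not
+taken from `field_classifier.py`.
+
+```python
+# Sketch only: names are hypothetical, not taken from field_classifier.py.
+
+def infer_backend(model: str) -> str:
+    """Pick a backend from the model-name prefix; reject anything else."""
+    if model.startswith("claude-"):
+        return "anthropic"  # needs ANTHROPIC_API_KEY
+    if model.startswith("gemini-"):
+        return "vertex"  # needs application-default credentials
+    raise ValueError(f"unsupported model name: {model!r}")
+
+
+def idempotency_key(project: str, dataset: str, table: str,
+                    column: str, model: str) -> tuple:
+    """One row per column and model, so models never overwrite each other."""
+    return (project, dataset, table, column, model)
+
+
+def is_done(key: tuple, existing_keys: set) -> bool:
+    """A re-run skips any key already present in the destination table."""
+    return key in existing_keys
+```
+
+Because `model` is part of the key, classifying the same column with a second
+model yields a new key and therefore a new row.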
+
+## Comparing models
+
+After classifying a table with two models:
+
+```bash
+python script/metadata/classification/compare_models.py --table "$TABLE"
+```
+
+If the table has exactly two distinct `model` values they are auto-picked;
+otherwise pass them explicitly with `--left` / `--right`. The script prints:
+
+- agreement rate on `primary_label`
+- per-model confidence distribution (high/medium/low counts)
+- side-by-side reasoning for every disagreement, including each model's matched
+  probe
+
+Add `--show-agreements` to also dump the agreed rows. Omit `--table` to compare
+across all classified tables.
+
+## Exporting for Legal review
+
+`export_to_sheet.py` produces a CSV of classifications for manual paste into a
+Google Sheet. The source-table list and model are hardcoded at the top of the
+script; edit them for a different scope. (Writing directly to Sheets via the API
+is blocked by Mozilla's Workspace policy on the gcloud OAuth client's Sheets
+scope, so CSV-and-paste is the path of least resistance.)
+
+```bash
+python script/metadata/classification/export_to_sheet.py
+# writes script/metadata/classification/classifications.csv
+
+# or pipe straight to the macOS clipboard:
+python script/metadata/classification/export_to_sheet.py --stdout | pbcopy
+```
+
+In Google Sheets: open a fresh sheet, click cell A1, paste; Sheets auto-splits
+the CSV. Re-runs overwrite the file (gitignored), so just paste again to refresh.
+
+Output columns: `dataset, table, column_name, category, category_simple,
+data_collection_category, confidence, reasoning, needs_review`. `category_simple`
+rolls the assigned `primary_label` up to the closest "Data type" entry from the
+taxonomy (e.g. `user.behavior.search.term` -> `user.behavior.search`,
+`user.unique_id.client_id` -> `user.unique_id`).
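+
+The roll-up can be sketched as a walk up the dotted label until an entry with
+level `data_type` is found. This is an illustration under assumptions: the
+`roll_up` helper and the `levels` mapping (label -> level, as could be derived
+from `taxonomy.json`) are hypothetical, and the real logic in
+`export_to_sheet.py` may differ.
+
+```python
+# Sketch only: roll_up and the levels mapping are hypothetical.
+
+def roll_up(label: str, levels: dict) -> str:
+    """Return the closest ancestor (or the label itself) whose level is data_type."""
+    parts = label.split(".")
+    # Try the full label first, then progressively shorter prefixes.
+    for end in range(len(parts), 0, -1):
+        candidate = ".".join(parts[:end])
+        if levels.get(candidate) == "data_type":
+            return candidate
+    return label  # no data_type ancestor found: leave the label unchanged
+
+
+levels = {
+    "user": "subject",
+    "user.behavior.search": "data_type",
+    "user.behavior.search.term": "subcategory",
+    "user.unique_id": "data_type",
+    "user.unique_id.client_id": "subcategory",
+}
+rolled = roll_up("user.behavior.search.term", levels)
+```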
+
+## Taxonomy preprocessing
+
+`build_taxonomy.py` parses the CSV and normalizes it:
+
+- Strips blank/header-only rows.
+- Fixes typos: `user.behaviour.*` -> `user.behavior.*`,
+  `personnel.demographic.Marital_status _orientation` ->
+  `personnel.demographic.sexual_orientation`,
+  `personnel.human_resouces.*` -> `personnel.human_resources.*`.
+- Synthesizes top-level subject labels (`system`, `user`, `company`,
+  `personnel`, `jobapplicants`, `other`) from CSV section headers.
+
+Emits ~133 entries of `{label, parent, level, display_name, description,
+examples}`, where `level` is one of `subject` / `data_type` / `subcategory`
+(which CSV column the entry came from).
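+
+A rough sketch of the typo normalization listed above, assuming it is done with
+simple prefix and exact-match rewrites. The names `PREFIX_FIXES`, `EXACT_FIXES`
+and `normalize_label` are hypothetical; `build_taxonomy.py` may implement this
+differently.
+
+```python
+# Sketch only: table and function names are hypothetical.
+
+PREFIX_FIXES = {
+    "user.behaviour.": "user.behavior.",
+    "personnel.human_resouces.": "personnel.human_resources.",
+}
+EXACT_FIXES = {
+    "personnel.demographic.Marital_status _orientation":
+        "personnel.demographic.sexual_orientation",
+}
+
+
+def normalize_label(raw: str) -> str:
+    """Strip whitespace, then apply exact fixes before prefix fixes."""
+    label = raw.strip()
+    if label in EXACT_FIXES:
+        return EXACT_FIXES[label]
+    for bad, good in PREFIX_FIXES.items():
+        if label.startswith(bad):
+            return good + label[len(bad):]
+    return label
+```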
+
+## Classifier design
+
+For each profiled column:
+
+1. Fuzzy-match the column name against probes from the source ping (top 3).
+2. Build an LLM prompt with: column name, data type, null rate, pass1
+   description, matched probe (name + description + `data_sensitivity` + `tags`),
+   and the full `taxonomy.json`, compacted (~6k tokens, so it fits easily in a single prompt).
+3. The model returns JSON: `{primary_label, secondary_labels, confidence,
+   reasoning, needs_review, data_collection_category}`.
+   - `data_collection_category` is one of `technical` / `interaction` /
+     `web_activity` / `highly_sensitive` (Mozilla's [4 data collection
+     categories](https://wiki.mozilla.org/Data_Collection#Data_Collection_Categories),
+     the same scale as Glean's `data_sensitivity`). Always emitted, including
+     when no probe matched. The model is told to defer to a Glean-declared
+     `data_sensitivity` unless the column's observed content overrides it.
+4. Write to BQ.
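+
+Step 1 could look roughly like the sketch below. It uses the standard-library
+`difflib.get_close_matches` as a stand-in; the matcher, the normalization
+(lowercasing, `.` -> `_`) and the `top_probe_matches` helper are assumptions,
+not the logic in `field_classifier.py`.
+
+```python
+import difflib
+
+# Sketch only: difflib is a stand-in for whatever matcher the classifier uses.
+
+def top_probe_matches(column_name: str, probe_names: list, n: int = 3,
+                      cutoff: float = 0.5) -> list:
+    """Return up to n probe names most similar to the column name, best first."""
+    # Probe names often use dots where column names use underscores.
+    normalized = {p.lower().replace(".", "_"): p for p in probe_names}
+    hits = difflib.get_close_matches(
+        column_name.lower(), list(normalized), n=n, cutoff=cutoff
+    )
+    return [normalized[h] for h in hits]
+
+
+matches = top_probe_matches(
+    "search_count",
+    ["browser.search.count", "search.count", "gfx.status"],
+)
+```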
+
+## Explicit non-goals for the PoC
+
+- No ground-truth eval set, no accuracy measurement.
+- No reuse of Phase 3 description reconciliation - the classifier reads Phase 1 + Phase 2 directly.
+- No Tier 3 (`REPEATED RECORD`, `metrics STRUCT`) handling.
+- No writeback to `schema.yaml` / `global.yaml` / DataHub tags - output stays in BQ for review.
+- No batching / parallelism beyond what the existing scripts already do.
+- No retries on LLM JSON parse failures - log and skip.
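+
+The "log and skip" behavior in the last item could be sketched as follows. The
+`parse_classification` helper and its validation rules are hypothetical; only
+the key names and category values come from this README.
+
+```python
+import json
+import logging
+
+# Sketch only: parse_classification is a hypothetical helper.
+
+REQUIRED_KEYS = {
+    "primary_label", "secondary_labels", "confidence",
+    "reasoning", "needs_review", "data_collection_category",
+}
+CATEGORIES = {"technical", "interaction", "web_activity", "highly_sensitive"}
+
+
+def parse_classification(raw: str):
+    """Return the parsed dict, or None (after logging) if the reply is unusable."""
+    try:
+        data = json.loads(raw)
+    except json.JSONDecodeError as exc:
+        logging.warning("skipping column: invalid JSON from model (%s)", exc)
+        return None
+    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
+        logging.warning("skipping column: missing keys in model reply")
+        return None
+    if data["data_collection_category"] not in CATEGORIES:
+        logging.warning("skipping column: unknown data_collection_category")
+        return None
+    return data
+```
+
+A caller would skip the BQ write whenever `None` comes back, with no retry.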
+
+## Plans / in progress
+
+Forward-looking work (these supersede some of the design above as they land):
+
+- [`classification/profiler_productionization_plan.md`](classification/profiler_productionization_plan.md):
+  replace Phase 1 with the productionized profiler from bigquery-etl PR #9503,
+  feed raw profile stats into the classifier prompt, and make descriptions
+  optional. Revisits the "no Tier 3" non-goal (the production profiler adds
+  nested/array tiers).
+- [`classification/fxa_classification_plan.md`](classification/fxa_classification_plan.md):
+  test classifying all FxA (Mozilla Accounts) data. Revisits the
+  "no ground-truth eval" non-goal and raises restricted-PII handling.
@@ -0,0 +1,3 @@
+taxonomy.json
+Taxonomy*
+*.csv