1 change: 1 addition & 0 deletions requirements.in
@@ -6,6 +6,7 @@
# the private file second). If you see dependency weirdness in the private
# image, check for version overlaps between requirements.txt and
# requirements-private.txt first.
anthropic
attrs==25.4.0
authlib==1.6.12
beautifulsoup4==4.14.3
26 changes: 19 additions & 7 deletions requirements.txt
@@ -140,10 +140,15 @@ annotated-types==0.7.0 \
--hash=sha256:1f02e8b43a8fbbc3f3e0d4f0f4bfc8131bcb4eebe8849b8e5c773f3a1c582a53 \
--hash=sha256:aff07c09a53a08bc8cfccb9c85b05f1aa9a2a6f23728d790723543408344ce89
# via pydantic
anthropic==0.109.2 \
--hash=sha256:d37db299597c7bc124b49b767ff135f1e6456b64af2b2fad4b63b2a1df333cf0 \
--hash=sha256:e0fb4ca5df0ed983248c9c6c3242adc81d9cfddb8725902da53698554117abac
# via -r requirements.in
anyio==4.11.0 \
--hash=sha256:0287e96f4d26d4149305414d4e3bc32f0dcd0862365a4bddea19d7a1ec38c4fc \
--hash=sha256:82a8d0b81e318cc5ce71a5f1f8b5c4e63619620b63141ef8c995fa0db95a57c4
# via
# anthropic
# google-genai
# httpx
# openai
@@ -518,8 +523,13 @@ distro==1.9.0 \
--hash=sha256:2fa77c6fd8940f116ee1d6b94a2f90b13b5ea8d019b98bc8bafdcabcdd9bdbed \
--hash=sha256:7bffd925d65168f85027d8da9af6bddab658135b840670a223589bc0c8ef02b2
# via
# anthropic
# google-genai
# openai
docstring-parser==0.18.0 \
--hash=sha256:292510982205c12b1248696f44959db3cdd1740237a968ea1e2e7a900eeb2015 \
--hash=sha256:b3fcbed555c47d8479be0796ef7e19c2670d428d72e96da63f3a40122860374b
# via anthropic
exceptiongroup==1.3.1 \
--hash=sha256:8b412432c6055b0b7d14c310000ae93352ed6754f70fa8f7c34141f91c4e3219 \
--hash=sha256:a7a39a3bd276781e98394987d3a5701d0c4edffb633bb7a5144577f82c773598
@@ -1009,6 +1019,7 @@ httpx==0.28.1 \
--hash=sha256:75e98c5f16b0f35b567856f597f06ff2270a374470a5c2392242528e3e3e42fc \
--hash=sha256:d909fcccc110f8c7faf814ca82a9a4d816bc5a6dbfea25d6591d6985b8ba59ad
# via
# anthropic
# google-genai
# openai
hyperframe==6.1.0 \
@@ -1054,9 +1065,7 @@ jaraco-classes==3.4.0 \
jeepney==0.8.0 \
--hash=sha256:5efe48d255973902f6badc3ce55e2aa6c5c3b3bc642059ef3a91247bcfcc5806 \
--hash=sha256:c0a454ad016ca575060802ee4d590dd912e35c122fa04e70306de3d076cce755
# via
# keyring
# secretstorage
# via secretstorage
jinja2==3.1.6 \
--hash=sha256:0137fb05990d35f1275a587e9aee6d56da821fc83491a0fb838183be43f66d6d \
--hash=sha256:85ece4451f492d0c13c5dd7c13a64681a86afae63a5f347908daf103ce6d2f67
@@ -1169,7 +1178,9 @@ jiter==0.11.1 \
--hash=sha256:fa992af648fcee2b850a3286a35f62bbbaeddbb6dbda19a00d8fbc846a947b6e \
--hash=sha256:fe04ea475392a91896d1936367854d346724a1045a247e5d1c196410473b8869 \
--hash=sha256:fe4a431c291157e11cee7c34627990ea75e8d153894365a3bc84b7a959d23ca8
# via openai
# via
# anthropic
# openai
jmespath==1.0.1 \
--hash=sha256:02e2e4cc71b5bcab88332eebf907519190dd9e6e82107fa7f83b1003a6252980 \
--hash=sha256:90261b206d6defd58fdd5e85f478bf633a2901798906be2ad389150c5c60edbe
@@ -1923,6 +1934,7 @@ pydantic==2.9.1 \
--hash=sha256:1363c7d975c7036df0db2b4a61f2e062fbc0aa5ab5f2772e0ffc7191a4f4bce2 \
--hash=sha256:7aff4db5fdf3cf573d4b3c30926a510a10e19a0774d38fc4967f78beb6deb612
# via
# anthropic
# bigeye-sdk
# google-genai
# mozilla-nimbus-schemas
@@ -2590,9 +2602,7 @@ s3transfer==0.13.0 \
secretstorage==3.3.3 \
--hash=sha256:2403533ef369eca6d2ba81718576c5e0f564d5cca1b58f73a8b23e7d4eeebd77 \
--hash=sha256:f356e6628222568e3af06f2eba8df495efa13b3b63081dafd4f7d9a7b7bc9f99
# via
# bigeye-sdk
# keyring
# via bigeye-sdk
shellingham==1.5.4 \
--hash=sha256:7ecfff8f2fd72616f7481040475a65b2bf8af90a56c89140852d1120324e8686 \
--hash=sha256:8dbca0739d487e5bd35ab3ca4b36e11c4078f3a234bfce294b0a0291363404de
@@ -2625,6 +2635,7 @@ sniffio==1.3.1 \
--hash=sha256:2f6da418d1f1e0fddd844478f41680e794e6051915791a034ff65e5f100525a2 \
--hash=sha256:f4324edc670a0f49750a81b895f35c3adb843cca46f0530f79fc1babb23789dc
# via
# anthropic
# anyio
# google-genai
# openai
@@ -2747,6 +2758,7 @@ typing-extensions==4.15.0 \
--hash=sha256:f0fa19c6845758ab08074a0cfa8b7aecb71c999ca73d62883bc25cc018c4e548
# via
# aiosignal
# anthropic
# anyio
# beautifulsoup4
# cattrs
219 changes: 219 additions & 0 deletions script/metadata/README.md
@@ -0,0 +1,219 @@
# Metadata scripts - data classification PoC

Tooling that profiles BigQuery columns and assigns each one a data-type label
from Mozilla's data taxonomy (`classification/Taxonomy overview - Data
Types.csv`). It reuses a field-profiling + telemetry-lineage pipeline to feed an
LLM classifier.

**Scope**: PoC. Speed over correctness. Single-table runs, eyeball the output,
iterate.

## Pipeline

| Script | Phase | Output table (`mozdata-nonprod.analysis.*`) |
|---|---|---|
| `field_profiler.py` | **1.** Profile every column (null rate, distinct count, top values) and generate a pass-1 description from observed data. | `akomar_data_profiling_v1` |
| `lineage_probe_fetcher.py` | **2.** Walk DataHub upstream lineage to the source ping, then fetch probe definitions (Glean Dictionary or legacy stable-table schema). Captures `data_sensitivity`, `send_in_pings`, and `tags` for Glean probes. | `akomar_metadata_phase2_table_pings_v1`, `akomar_metadata_phase2_ping_probes_v1` |
| `description_reconciler.py` | **3.** *(original PoC, not used by the classifier)* Reconcile phase-1 data-driven descriptions with phase-2 probe intent. | `gkabbz_metadata_phase3_reconciled_v1` |
| `field_classifier.py` | **4.** Classify each column against the taxonomy: primary + secondary labels, confidence, reasoning, data collection category. Reads phases 1 and 2 directly. | `akomar_field_classifications_v1` |

All four pipeline scripts accept `--table project.dataset.table` and are
idempotent: rows that were already processed are skipped on re-run.
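
The skip-on-re-run behavior can be pictured as a set difference on a per-row
key. This is a minimal sketch, not the scripts' actual code: the helper name
`pending_columns`, the `(project, dataset, table, column)` key shape, and the
in-memory set are all hypothetical stand-ins for whatever lookup the scripts run
against their output table.

```python
from typing import Iterable

# Assumed key shape: (project, dataset, table, column).
Key = tuple[str, str, str, str]


def pending_columns(candidates: Iterable[Key], already_done: set[Key]) -> list[Key]:
    """Return the candidate keys that have no output row yet, in input order."""
    return [key for key in candidates if key not in already_done]


# Toy data: one column was processed on an earlier run, one was not.
done = {("proj", "ds", "tbl", "client_id")}
todo = pending_columns(
    [("proj", "ds", "tbl", "client_id"), ("proj", "ds", "tbl", "country")],
    done,
)
print(todo)
```

Because only missing keys are processed, re-running after a partial failure
would pick up where the previous run stopped.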

## Layout

```
script/metadata/
  field_profiler.py          - Phase 1: profile every column + pass1 description
  lineage_probe_fetcher.py   - Phase 2: resolve source ping + fetch probe defs
                               (Glean probes include data_sensitivity,
                               send_in_pings, tags)
  description_reconciler.py  - original Phase 3 for descriptions (unused by the classifier)
  field_classifier.py        - Phase 4: classify columns against the taxonomy
  classify_table.sh          - wrapper: runs phases 1->2->4 per table, per model
  classification/
    Taxonomy overview - Data Types.csv  - source of truth (from legal/privacy)
    build_taxonomy.py                   - CSV -> taxonomy.json
    taxonomy.json                       - preprocessed, what the classifier reads
    compare_models.py                   - diff Claude vs Gemini classifications
    export_to_sheet.py                  - export classifications as CSV for Google Sheets
```

## Output tables (`mozdata-nonprod.analysis`)

| Table | Written by | Contents |
|---|---|---|
| `akomar_data_profiling_v1` | `field_profiler.py` | One row per column: null rate, distinct count, top values, pass1 description |
| `akomar_metadata_phase2_table_pings_v1` | `lineage_probe_fetcher.py` | Table -> source ping mapping (via DataHub lineage) |
| `akomar_metadata_phase2_ping_probes_v1` | `lineage_probe_fetcher.py` | Probe defs per ping, incl. `data_sensitivity`, `send_in_pings`, `tags` for Glean |
| `akomar_field_classifications_v1` | `field_classifier.py` | Final classification: `primary_label`, `secondary_labels`, `confidence`, `reasoning`, `needs_review`, `data_collection_category` (technical / interaction / web_activity / highly_sensitive), `model` (full model name) |

## Setup (one-time)

```bash
pip install -r requirements.txt # adds anthropic + google-genai to the venv
export DATAHUB_GMS_TOKEN=...
# Claude backend:
export ANTHROPIC_API_KEY=...
# Gemini backend - uses Vertex AI on the `mozdata` project:
gcloud auth application-default login
python script/metadata/classification/build_taxonomy.py # regenerate taxonomy.json
```

## Usage

**Classify one or more tables end-to-end** with the wrapper:

```bash
script/metadata/classify_table.sh \
  moz-fx-data-shared-prod.search_derived.search_clients_daily_v8 \
  moz-fx-data-shared-prod.ads_backend_stable.interaction_v1
```

It runs three phases per table (profile -> lineage/probes -> classify) for each
model in `$MODELS`. The wrapper's default is a single Gemini run
(`gemini-3.1-flash-lite-preview`). To also classify with Claude and diff the two
afterward:

```bash
TABLE=moz-fx-data-shared-prod.search_derived.search_clients_daily_v8
MODELS="claude-sonnet-4-6 gemini-3.1-flash-lite-preview" \
  script/metadata/classify_table.sh "$TABLE"
python script/metadata/classification/compare_models.py --table "$TABLE"
```

**Or run the phases manually** (run directly, `field_classifier.py` defaults to
`claude-sonnet-4-6` rather than the wrapper's Gemini default):

```bash
TABLE=moz-fx-data-shared-prod.search_derived.search_clients_daily_v8
python script/metadata/field_profiler.py --table "$TABLE"
python script/metadata/lineage_probe_fetcher.py --table "$TABLE"
python script/metadata/field_classifier.py --table "$TABLE"
python script/metadata/field_classifier.py --table "$TABLE" --model gemini-3.1-flash-lite-preview
python script/metadata/classification/compare_models.py --table "$TABLE"
```

**Inspect results:**

```sql
SELECT column_name, model, primary_label, confidence, needs_review, reasoning, data_collection_category
FROM `mozdata-nonprod.analysis.akomar_field_classifications_v1`
WHERE source_table = 'search_clients_daily_v8'
ORDER BY column_name, model;
```

## Model selection

`field_classifier.py --model <full-model-name>` picks the LLM; the backend is
inferred from the name prefix:

- `claude-*` -> Anthropic API (e.g. `claude-sonnet-4-6`, `claude-opus-4-7`).
Requires `ANTHROPIC_API_KEY`.
- `gemini-*` -> Vertex AI on project `mozdata` (e.g.
`gemini-3.1-flash-lite-preview`). Requires application-default credentials.

The default is `claude-sonnet-4-6`. A name that starts with neither prefix is
rejected. The destination table stores the full model name in a `model` column,
and the idempotency key is `(project, dataset, table, column, model)`, so
several models can be run on the same table and every row is kept. This also
allows version-to-version comparisons within a model family.
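
The prefix dispatch described above can be sketched as follows. The function
name `infer_backend` and the backend labels `"anthropic"` and `"vertex"` are
hypothetical, chosen only for illustration; the real script may structure this
differently.

```python
def infer_backend(model: str) -> str:
    """Map a full model name to a backend label based on its prefix."""
    if model.startswith("claude-"):
        return "anthropic"  # hypothetical label for the Anthropic API backend
    if model.startswith("gemini-"):
        return "vertex"  # hypothetical label for the Vertex AI backend
    # Any other prefix is rejected rather than guessed at.
    raise ValueError(f"unsupported model prefix: {model!r}")


print(infer_backend("claude-sonnet-4-6"))
print(infer_backend("gemini-3.1-flash-lite-preview"))
```

Rejecting unknown prefixes up front means a mistyped model name fails before
any API call is made.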

## Comparing models

After classifying a table with two models:

```bash
python script/metadata/classification/compare_models.py --table "$TABLE"
```

If the table has exactly two distinct `model` values, they are picked
automatically; otherwise pass them explicitly with `--left` / `--right`. The
script prints:

- agreement rate on `primary_label`
- per-model confidence distribution (high/medium/low counts)
- side-by-side reasoning for every disagreement, including each model's matched
probe

Add `--show-agreements` to also dump the agreed rows. Omit `--table` to compare
across all classified tables.
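
As a rough illustration of the first two metrics, the sketch below computes an
agreement rate and a confidence distribution from in-memory toy data. The
function names and input shapes are assumptions for this example;
`compare_models.py` itself reads from BigQuery and may compute these
differently.

```python
from collections import Counter


def agreement_rate(left: dict[str, str], right: dict[str, str]) -> float:
    """Fraction of columns labeled by both models whose primary_label matches."""
    shared = left.keys() & right.keys()
    if not shared:
        return 0.0
    return sum(left[col] == right[col] for col in shared) / len(shared)


def confidence_counts(confidences: list[str]) -> dict[str, int]:
    """Count high/medium/low values, reporting zero for absent levels."""
    counts = Counter(confidences)
    return {level: counts[level] for level in ("high", "medium", "low")}


# Toy data: column name -> primary_label per model (illustrative only).
left_labels = {
    "client_id": "user.unique_id.client_id",
    "search_term": "user.behavior.search.term",
}
right_labels = {
    "client_id": "user.unique_id.client_id",
    "search_term": "user.behavior.search",
}
print(agreement_rate(left_labels, right_labels))
print(confidence_counts(["high", "high", "low"]))
```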

## Exporting for Legal review

`export_to_sheet.py` produces a CSV of classifications for manual paste into a
Google Sheet. The source-table list and model are hardcoded at the top of the
script; edit them for a different scope. (Writing directly to Sheets via the API
is blocked by Mozilla's Workspace policy on the gcloud OAuth client's Sheets
scope, so CSV-and-paste is the path of least resistance.)

```bash
python script/metadata/classification/export_to_sheet.py
# writes script/metadata/classification/classifications.csv

# or pipe straight to the macOS clipboard:
python script/metadata/classification/export_to_sheet.py --stdout | pbcopy
```

In Google Sheets: open a fresh sheet, click cell A1, paste; Sheets auto-splits
the CSV. Re-runs overwrite the file (gitignored), so just paste again to refresh.

Output columns: `dataset, table, column_name, category, category_simple,
data_collection_category, confidence, reasoning, needs_review`. `category_simple`
rolls the assigned `primary_label` up to the closest "Data type" entry from the
taxonomy (e.g. `user.behavior.search.term` -> `user.behavior.search`,
`user.unique_id.client_id` -> `user.unique_id`).
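
One way to picture the roll-up is a walk up the taxonomy's parent links until an
entry at the `data_type` level is reached. This sketch assumes each taxonomy
entry carries `parent` (a label string or `None`) and `level` fields; the
function name `roll_up` and the toy taxonomy are hypothetical, and the export
script may derive `category_simple` another way.

```python
from typing import Optional


def roll_up(label: str, taxonomy: dict[str, dict]) -> Optional[str]:
    """Return the nearest ancestor (or the label itself) at level 'data_type'."""
    current: Optional[str] = label
    seen: set[str] = set()  # guards against accidental parent cycles
    while current is not None and current in taxonomy and current not in seen:
        seen.add(current)
        entry = taxonomy[current]
        if entry["level"] == "data_type":
            return current
        current = entry.get("parent")
    return None


# Toy taxonomy keyed by label (illustrative only).
TOY_TAXONOMY = {
    "user": {"parent": None, "level": "subject"},
    "user.unique_id": {"parent": "user", "level": "data_type"},
    "user.unique_id.client_id": {"parent": "user.unique_id", "level": "subcategory"},
    "user.behavior.search": {"parent": "user", "level": "data_type"},
    "user.behavior.search.term": {"parent": "user.behavior.search", "level": "subcategory"},
}
print(roll_up("user.behavior.search.term", TOY_TAXONOMY))
print(roll_up("user.unique_id.client_id", TOY_TAXONOMY))
```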

## Taxonomy preprocessing

`build_taxonomy.py` parses the CSV and normalizes it:

- Strips blank/header-only rows.
- Fixes typos: `user.behaviour.*` -> `user.behavior.*`,
`personnel.demographic.Marital_status _orientation` ->
`personnel.demographic.sexual_orientation`,
`personnel.human_resouces.*` -> `personnel.human_resources.*`.
- Synthesizes top-level subject labels (`system`, `user`, `company`,
`personnel`, `jobapplicants`, `other`) from CSV section headers.

Emits ~133 entries of `{label, parent, level, display_name, description,
examples}`, where `level` is one of `subject` / `data_type` / `subcategory`
(which CSV column the entry came from).
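
The typo fixes listed above amount to a small rewrite table applied to each
label. The sketch below shows the idea; the names `PREFIX_FIXES`, `EXACT_FIXES`,
and `normalize_label` are hypothetical, and `build_taxonomy.py` may apply the
fixes differently.

```python
# Prefix rewrites taken from the list above: misspelled prefix -> corrected prefix.
PREFIX_FIXES = {
    "user.behaviour.": "user.behavior.",
    "personnel.human_resouces.": "personnel.human_resources.",
}

# Whole-label rewrite taken from the list above.
EXACT_FIXES = {
    "personnel.demographic.Marital_status _orientation": "personnel.demographic.sexual_orientation",
}


def normalize_label(raw: str) -> str:
    """Apply the exact-match fix first, then the first matching prefix fix."""
    if raw in EXACT_FIXES:
        return EXACT_FIXES[raw]
    for bad, good in PREFIX_FIXES.items():
        if raw.startswith(bad):
            return good + raw[len(bad):]
    return raw


print(normalize_label("user.behaviour.search.term"))
print(normalize_label("user.unique_id.client_id"))
```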

## Classifier design

For each profiled column:

1. Fuzzy-match the column name against probes from the source ping (top 3).
2. Build an LLM prompt with: column name, data type, null rate, pass1
   description, matched probe (name + description + `data_sensitivity` + `tags`),
   and the full `taxonomy.json` in compact form (~6k tokens, so it fits in the
   prompt easily).
3. The model returns JSON: `{primary_label, secondary_labels, confidence,
   reasoning, needs_review, data_collection_category}`.
   - `data_collection_category` is one of `technical` / `interaction` /
     `web_activity` / `highly_sensitive` (Mozilla's [4 data collection
     categories](https://wiki.mozilla.org/Data_Collection#Data_Collection_Categories),
     the same scale as Glean's `data_sensitivity`). It is always emitted, even
     when no probe matched. The model is told to defer to a Glean-declared
     `data_sensitivity` unless the column's observed content overrides it.
4. Write the result to BigQuery.
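
Step 1 can be approximated with the standard library. The README does not say
which fuzzy matcher the classifier uses, so this sketch substitutes
`difflib.get_close_matches` as a stand-in; the function name
`top_probe_matches`, the probe names, and the cutoff value are assumptions.

```python
import difflib


def top_probe_matches(
    column_name: str,
    probe_names: list[str],
    limit: int = 3,
    cutoff: float = 0.6,  # difflib's default similarity threshold, an assumption here
) -> list[str]:
    """Return up to `limit` probe names most similar to the column name, best first."""
    return difflib.get_close_matches(column_name, probe_names, n=limit, cutoff=cutoff)


# Toy probe names (illustrative only).
probes = ["browser.search.count", "search_count", "search.counts", "client_id", "os_version"]
matches = top_probe_matches("search_count", probes)
print(matches)
```

A lower `cutoff` returns candidates for columns whose names only loosely
resemble any probe, at the cost of weaker matches reaching the prompt.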

## Explicit non-goals for the PoC

- No ground-truth eval set, no accuracy measurement.
- No reuse of Phase 3 description reconciliation - the classifier reads Phase 1 + Phase 2 output directly.
- No Tier 3 (`REPEATED RECORD`, `metrics STRUCT`) handling.
- No writeback to `schema.yaml` / `global.yaml` / DataHub tags - output stays in BQ for review.
- No batching / parallelism beyond what the existing scripts already do.
- No retries on LLM JSON parse failures - log and skip.
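
The log-and-skip handling in the last bullet might look like the sketch below.
The function name `parse_classification` and the logger name are hypothetical,
and the extra checks on required keys and on the category value are assumptions
that go beyond what the bullet states.

```python
import json
import logging
from typing import Optional

logger = logging.getLogger("field_classifier_sketch")  # hypothetical logger name

REQUIRED_KEYS = {
    "primary_label", "secondary_labels", "confidence",
    "reasoning", "needs_review", "data_collection_category",
}
ALLOWED_CATEGORIES = {"technical", "interaction", "web_activity", "highly_sensitive"}


def parse_classification(raw: str, column: str) -> Optional[dict]:
    """Parse one model response; log and return None instead of retrying."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as exc:
        logger.warning("skipping %s: invalid JSON (%s)", column, exc)
        return None
    # Assumed extra validation: the response must be an object with every key.
    if not isinstance(parsed, dict) or not REQUIRED_KEYS.issubset(parsed):
        logger.warning("skipping %s: missing expected keys", column)
        return None
    if parsed["data_collection_category"] not in ALLOWED_CATEGORIES:
        logger.warning("skipping %s: unknown data_collection_category", column)
        return None
    return parsed


print(parse_classification("not json", "client_id"))
```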

## Plans / in progress

Forward-looking work (these supersede some of the design above as they land):

- [`classification/profiler_productionization_plan.md`](classification/profiler_productionization_plan.md)
  - replace Phase 1 with the productionized profiler from bigquery-etl PR #9503,
    feed raw profile stats into the classifier prompt, and make descriptions
    optional. Revisits the "no Tier 3" non-goal (the production profiler adds
    nested/array tiers).
- [`classification/fxa_classification_plan.md`](classification/fxa_classification_plan.md)
  - test classifying all FxA (Mozilla Accounts) data. Revisits the
    "no ground-truth eval" non-goal and raises restricted-PII handling.
3 changes: 3 additions & 0 deletions script/metadata/classification/.gitignore
@@ -0,0 +1,3 @@
taxonomy.json
Taxonomy*
*.csv