PoC Data classification by akkomar · Pull Request #9238 · mozilla/bigquery-etl

akkomar · 2026-04-22T16:18:58Z

Based on #9199

…eline

…ld_profiler Adds per-table performance logging (BQ job ID, table size, GB scanned, elapsed time) to a CSV so profiling cost and runtime can be tracked and extrapolated to the full warehouse. Also fixes a bug where leaf fields nested inside ARRAY columns were not detected as Tier 3 and caused profile queries to fail on tables with complex nested types.

akkomar · 2026-05-05T13:35:17Z

@claude

akkomar · 2026-05-06T19:40:34Z

@claude

github-actions · 2026-05-06T19:40:53Z

Claude finished @akkomar's task in 2m 23s

Code Review — Update for new commits (PoC Data Classification)

Reviewing the three new commits on top of my prior review:

9888d69 — Allow model choice (adds Gemini support + compare_models.py)
1bd398a — Add data_collection_category to classifier output (+ classify_table.sh)
ac0c5f6 — Add export_to_sheet.py

Critical / new issues introduced by these commits

`requirements.txt`: hash pinning removed (supply-chain regression)

The header changed from

pip-compile --allow-unsafe --generate-hashes

to

pip-compile requirements.in

…and all --hash=sha256:... lines were dropped from requirements.txt. This was almost certainly a side effect of running pip-compile locally to add anthropic and google-genai, not an intentional policy change. Please regenerate with --generate-hashes --allow-unsafe before merging — losing hash pinning weakens supply-chain integrity for the whole repo, well beyond the scope of this PoC.

Also: anthropic and google-genai are listed in requirements.in without version pins. Other entries in this file are pinned. Pin them for reproducible builds.
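One way to keep this from regressing silently is a small check that scans the compiled file for requirements without a `--hash=` option. The following is only a sketch, not something in this PR: `find_unhashed_requirements` is a hypothetical helper, the package versions in the sample are placeholders, and it assumes the usual pip-compile layout where hashed requirements continue across lines with a trailing backslash.

```python
def find_unhashed_requirements(text: str) -> list[str]:
    """Return requirement specifiers that carry no --hash= option."""
    unhashed = []
    logical = ""
    for raw in text.splitlines():
        line = raw.strip()
        # Blank lines and "# via ..." annotations carry no requirement data.
        if not line or line.startswith("#"):
            continue
        # Join backslash-continued lines into one logical requirement.
        if line.endswith("\\"):
            logical += line[:-1].strip() + " "
            continue
        logical += line
        # Lines starting with "-" are global options (e.g. --index-url).
        if not logical.startswith("-") and "--hash=" not in logical:
            unhashed.append(logical.split()[0])
        logical = ""
    return unhashed


# Placeholder content: one hashed requirement, one unhashed.
sample = (
    "anthropic==1.0 \\\n"
    "    --hash=sha256:abc\n"
    "    # via -r requirements.in\n"
    "google-genai==2.0\n"
    "    # via -r requirements.in\n"
)
print(find_unhashed_requirements(sample))
```

A check like this could fail CI whenever the returned list is non-empty.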

`compare_models.py:53-55` — same SQL injection pattern flagged before

The new _table_filter_clause interpolates table parts directly into a WHERE clause:

return (
    f"WHERE source_project = '{project}' "
    f"AND source_dataset = '{dataset}' "
    f"AND source_table = '{source_table}'"
)

Note that fetch_rows in the same file (and export_to_sheet.py) does this correctly with QueryJobConfig(query_parameters=...). The safe pattern already exists two functions away in this file, so please use parameterized queries for the WHERE values here too.

`field_classifier.py:256-264` — same SQL injection pattern, also unchanged

load_phase1 still builds the WHERE clause via f-string interpolation:

where += (
    f" AND source_project = '{project}'"
    f" AND source_dataset = '{dataset}'"
    f" AND source_table = '{table}'"
)

Switch to QueryJobConfig(query_parameters=...) — the pattern is now in three files (field_profiler.py, field_classifier.py, description_reconciler.py) plus compare_models.py. Worth a shared helper.
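A shared helper could look roughly like the sketch below. The function names `table_filter` and `to_job_config` and the parameter names are hypothetical, not taken from this PR; the only library calls used are `bigquery.QueryJobConfig(query_parameters=...)` and `bigquery.ScalarQueryParameter(name, type_, value)` from google-cloud-bigquery, imported lazily so the clause builder can be tested without the client library.

```python
def table_filter(project: str, dataset: str, table: str):
    """Return a WHERE clause with named placeholders plus its parameters."""
    clause = (
        "WHERE source_project = @source_project "
        "AND source_dataset = @source_dataset "
        "AND source_table = @source_table"
    )
    # (name, BigQuery type, value) tuples; values never touch the SQL text.
    params = [
        ("source_project", "STRING", project),
        ("source_dataset", "STRING", dataset),
        ("source_table", "STRING", table),
    ]
    return clause, params


def to_job_config(params):
    """Convert (name, type, value) tuples into a BigQuery QueryJobConfig."""
    # Lazy import: only needed when a query is actually submitted.
    from google.cloud import bigquery

    return bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(name, type_, value)
            for name, type_, value in params
        ]
    )
```

Each call site would then append `clause` to its query string and pass `to_job_config(params)` as the `job_config` argument of `client.query(...)`, so a quote character in a table name stays inside the parameter value.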

Correctness / code-quality issues in new code

`field_classifier.py:392, 401` — `lambda` assigned to a name

invoke_llm = lambda prompt: call_claude(claude_client, args.model, prompt)

flake8 E731. Use def:

def invoke_llm(prompt):
    return call_claude(claude_client, args.model, prompt)

Or use functools.partial(call_claude, claude_client, args.model).

`compare_models.py:76` — dead conditional

if len(found) > 2 and not (left and right):

At this point left and right is already known to be False (line 61 returned early when both were set), so not (left and right) is always True. The and not (...) clause is dead.

`compare_models.py:81-85` — `--left` / `--right` silently ignored when only one is provided

If only --left foo is set and exactly two models exist in the data, the function falls through to return found[0], found[1] — silently ignoring the user's --left foo. Either require both or neither, or honor a single override and pick the other from found. As written, the partial input is dropped without warning.
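As a sketch of the "honor a single override" option: the function below is hypothetical (its name, signature, and error messages are not from compare_models.py), and it assumes `found` is the list of distinct model names present in the data.

```python
def resolve_models(found, left=None, right=None):
    """Pick the (left, right) model pair, honoring partial overrides."""
    if left and right:
        return left, right
    if left or right:
        given = left or right
        # The other side must be unambiguous: exactly one remaining model.
        others = [m for m in found if m != given]
        if len(others) != 1:
            raise ValueError(
                f"Cannot infer the other model for {given!r} from {found!r}; "
                "pass both --left and --right."
            )
        return (given, others[0]) if left else (others[0], given)
    if len(found) != 2:
        raise ValueError(f"Expected exactly two models, found {found!r}")
    return found[0], found[1]
```

With this shape, a lone `--left` or `--right` either takes effect or fails loudly, and the always-true `not (left and right)` check is no longer needed.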

`export_to_sheet.py` — hardcoded constants make this single-use

MODEL = "gemini-3.1-flash-lite-preview" (line 37)
TABLES = [...] (lines 38-48) — hardcoded scope of 9 tables
DEST_TABLE = "mozdata-nonprod.analysis.akomar_field_classifications_v1" (line 32)

These should be CLI args or config before this can be reused. As-is, every new export requires editing the file. Reasonable for a one-off PoC export, but worth noting.
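If this does get reused, the constants could move to argparse with the current values as defaults. This is a minimal sketch; the flag names `--model`, `--dest-table`, and `--table` are assumptions for illustration, and the defaults are simply the values quoted above.

```python
import argparse


def parse_args(argv=None):
    """Parse export options; defaults mirror the current hardcoded values."""
    parser = argparse.ArgumentParser(description="Export classified fields.")
    parser.add_argument("--model", default="gemini-3.1-flash-lite-preview")
    parser.add_argument(
        "--dest-table",
        default="mozdata-nonprod.analysis.akomar_field_classifications_v1",
    )
    # Repeatable flag: --table a.b.c --table d.e.f
    parser.add_argument(
        "--table",
        dest="tables",
        action="append",
        required=True,
        help="Fully qualified table to include; repeat for multiple tables.",
    )
    return parser.parse_args(argv)
```

The existing nine-table scope could then live in a small wrapper script or config file rather than in the module itself.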

`export_to_sheet.py:115` — convoluted ternary for tri-state bool

"TRUE" if r.needs_review else "FALSE" if r.needs_review is False else "",

Cleaner as:

{True: "TRUE", False: "FALSE", None: ""}[r.needs_review]
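One caveat with the lookup form, sketched below with two hypothetical helper names: it matches the ternary for True, False, and None, but it raises KeyError for any other value and treats 0 as False (the ternary returns an empty string for 0). That should be fine if needs_review is a nullable BOOL that only ever holds those three values, which is an assumption here.

```python
def ternary_form(value):
    """The expression currently in export_to_sheet.py."""
    return "TRUE" if value else "FALSE" if value is False else ""


def lookup_form(value):
    """The suggested dict lookup; raises KeyError on unexpected values."""
    return {True: "TRUE", False: "FALSE", None: ""}[value]


# Identical results for the three expected states.
for state in (True, False, None):
    assert ternary_form(state) == lookup_form(state)
```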

`classify_table.sh` looks clean

set -euo pipefail, repo-root resolution, clear banners. Two minor nits:

Line 51: for MODEL in $MODELS relies on word-splitting $MODELS. That's intentional here, but a model name containing a space would break. Documented behavior in the header comment, so OK for this use.
No trap to print which table failed when one in a multi-table run blows up — easy add (trap 'echo "FAILED at table: $TABLE" >&2' ERR) if helpful.

`field_classifier.py:391` — `anthropic.Anthropic()` still uses default `max_retries=2`

Re-flagging from prior review — for any meaningful production volume, anthropic.Anthropic(max_retries=5) (or explicit backoff) avoids dropped rows on transient 429s. The new lambda/def rewrite is a good place to add this.
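For the "explicit backoff" option, a generic wrapper along these lines would work. This is a sketch under assumptions: `TransientError` is a hypothetical stand-in for whatever rate-limit exception the client raises (with the anthropic SDK that would likely be `anthropic.RateLimitError`), and `call_with_backoff` is not an existing function in this PR.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a rate-limit (HTTP 429) error."""


def call_with_backoff(fn, *, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            # Out of attempts: surface the error instead of dropping the row.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt, plus jitter to avoid lockstep retries.
            sleep(base_delay * 2**attempt + random.uniform(0, base_delay))
```

The `sleep` parameter is injectable so tests can skip the real delays. Setting `max_retries` on the client is the simpler route if the SDK's built-in retry behavior is enough.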

Status of issues from previous review

Most prior PoC-quality issues are unchanged — that's fine for a PoC, but flagging which ones are still open so they don't get lost when this productionizes:

Resolved / improved since last review

field_classifier.py and export_to_sheet.py now correctly use parameterized queries in some places — a good direction. Just not consistent yet.
data_collection_category is wired through schema, prompt, and the new comparison/export tools consistently.
classify_table.sh is a nice ergonomic improvement over the manual phase invocation.
README is much more useful, especially the multi-model + comparison flow.
New compare_models.py is well-structured and uses parameterized queries for its main fetch_pairs query.

Check what changed since last review
Review new/updated files
Post updated review
• Branch: data_classification_based_on_DENG-10944

chelseyklein · 2026-05-28T20:40:27Z

+SUBJECT_LABELS = {
+    "System": "system",
+    "User": "user",
+    "company": "company",


Suggested change

-    "company": "company",
+    "Company": "company",

scholtzan · 2026-06-16T18:31:02Z

Integration report

View full diff, if available (SQL + DAGs) (requires private-bigquery-etl access)
Private CI run

gkabbz and others added 3 commits April 13, 2026 10:43

feat: Add three phase BQ field profiling and metadata description pip…

66d4455

…eline

PoC data classification

6aed6aa

scholtzan and others added 3 commits May 5, 2026 13:38

Merge branch 'main' into data_classification_based_on_DENG-10944

906a45e

Allow model choice

9888d69

Add data collection category to classifier output

1bd398a

akkomar changed the title ~~DO NOT MERGE PoC Data classification~~ PoC Data classification May 6, 2026

Add script for exporting classified field list

ac0c5f6

chelseyklein reviewed May 28, 2026

akkomar added 2 commits June 16, 2026 17:56

docs: consolidate classification design into README; add productioniz…

ff4bdcc

…ation + FxA plans

Merge remote-tracking branch 'origin/main' into data_classification_b…

a0d2a32

…ased_on_DENG-10944

akkomar wants to merge 9 commits into main from data_classification_based_on_DENG-10944


3 participants
