Skip to content

PoC Data classification#9238

Draft
akkomar wants to merge 9 commits into
mainfrom
data_classification_based_on_DENG-10944
Draft

PoC Data classification#9238
akkomar wants to merge 9 commits into
mainfrom
data_classification_based_on_DENG-10944

Conversation

@akkomar

@akkomar akkomar commented Apr 22, 2026

Copy link
Copy Markdown
Member

Based on #9199

gkabbz and others added 3 commits April 13, 2026 10:43
…ld_profiler

Adds per-table performance logging (BQ job ID, table size, GB scanned, elapsed time) to a CSV so profiling cost and runtime can be tracked and extrapolated to the full warehouse. Also fixes a bug where leaf fields nested inside ARRAY columns were not detected as Tier 3 and caused profile queries to fail on tables with complex nested types.
@akkomar

akkomar commented May 5, 2026

Copy link
Copy Markdown
Member Author

@claude

@github-actions

This comment was marked as outdated.

@github-actions

This comment has been minimized.

@akkomar akkomar changed the title DO NOT MERGE PoC Data classification PoC Data classification May 6, 2026
@github-actions

This comment has been minimized.

@akkomar

akkomar commented May 6, 2026

Copy link
Copy Markdown
Member Author

@claude

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown
Contributor

Claude finished @akkomar's task in 2m 23s —— View job


Code Review — Update for new commits (PoC Data Classification)

Reviewing the three new commits on top of my prior review:

  • 9888d69 — Allow model choice (adds Gemini support + compare_models.py)
  • 1bd398a — Add data_collection_category to classifier output (+ classify_table.sh)
  • ac0c5f6 — Add export_to_sheet.py

Critical / new issues introduced by these commits

requirements.txt: hash pinning removed (supply-chain regression)

The header changed from

pip-compile --allow-unsafe --generate-hashes

to

pip-compile requirements.in

…and all --hash=sha256:... lines were dropped from requirements.txt. This was almost certainly a side effect of running pip-compile locally to add anthropic and google-genai, not an intentional policy change. Please regenerate with --generate-hashes --allow-unsafe before merging — losing hash pinning weakens supply-chain integrity for the whole repo, well beyond the scope of this PoC.

Also: anthropic and google-genai are listed in requirements.in without version pins. Other entries in this file are pinned. Pin them for reproducible builds.

compare_models.py:53-55 — same SQL injection pattern flagged before

The new _table_filter_clause interpolates table parts directly into a WHERE clause:

return (
    f"WHERE source_project = '{project}' "
    f"AND source_dataset = '{dataset}' "
    f"AND source_table = '{source_table}'"
)

Note that fetch_rows in the same file (and export_to_sheet.py) does this correctly with QueryJobConfig(query_parameters=...). The pattern is in this same file two functions apart — please use parameterized queries for the WHERE values here too.

field_classifier.py:256-264 — same SQL injection pattern, also unchanged

load_phase1 still builds the WHERE clause via f-string interpolation:

where += (
    f" AND source_project = '{project}'"
    f" AND source_dataset = '{dataset}'"
    f" AND source_table = '{table}'"
)

Switch to QueryJobConfig(query_parameters=...) — the pattern is now in three files (field_profiler.py, field_classifier.py, description_reconciler.py) plus compare_models.py. Worth a shared helper.


Correctness / code-quality issues in new code

field_classifier.py:392, 401lambda assigned to a name

invoke_llm = lambda prompt: call_claude(claude_client, args.model, prompt)

flake8 E731. Use def:

def invoke_llm(prompt):
    return call_claude(claude_client, args.model, prompt)

Or use functools.partial(call_claude, claude_client, args.model).

compare_models.py:76 — dead conditional

if len(found) > 2 and not (left and right):

At this point left and right is already known to be False (line 61 returned early when both were set), so not (left and right) is always True. The and not (...) clause is dead.

compare_models.py:81-85--left / --right silently ignored when only one is provided

If only --left foo is set and exactly two models exist in the data, the function falls through to return found[0], found[1] — silently ignoring the user's --left foo. Either require both or neither, or honor a single override and pick the other from found. As written, the partial input is dropped without warning.

export_to_sheet.py — hardcoded constants make this single-use

  • MODEL = "gemini-3.1-flash-lite-preview" (line 37)
  • TABLES = [...] (lines 38-48) — hardcoded scope of 9 tables
  • DEST_TABLE = "mozdata-nonprod.analysis.akomar_field_classifications_v1" (line 32)

These should be CLI args or config before this can be reused. As-is, every new export requires editing the file. Reasonable for a one-off PoC export, but worth noting.

export_to_sheet.py:115 — convoluted ternary for tri-state bool

"TRUE" if r.needs_review else "FALSE" if r.needs_review is False else "",

Cleaner as:

{True: "TRUE", False: "FALSE", None: ""}[r.needs_review]

classify_table.sh looks clean

set -euo pipefail, repo-root resolution, clear banners. Two minor nits:

  • Line 51: for MODEL in $MODELS relies on word-splitting $MODELS. That's intentional here, but a model name containing a space would break. Documented behavior in the header comment, so OK for this use.
  • No trap to print which table failed when one in a multi-table run blows up — easy add (trap 'echo "FAILED at table: $TABLE" >&2' ERR) if helpful.

field_classifier.py:391anthropic.Anthropic() still uses default max_retries=2

Re-flagging from prior review — for any meaningful production volume, anthropic.Anthropic(max_retries=5) (or explicit backoff) avoids dropped rows on transient 429s. The new lambda/def rewrite is a good place to add this.


Status of issues from previous review

Most prior PoC-quality issues are unchanged — that's fine for a PoC, but flagging which ones are still open so they don't get lost when this productionizes:

  • SQL injection across field_profiler.py, field_classifier.py, description_reconciler.py (and now compare_models.py) — switch all to parameterized queries
  • Storing raw PII in akomar_data_profiling_v1 (field_profiler.py:94-96)
  • Indirect prompt injection risk (column values fed to LLM unsanitized)
  • description_reconciler.py:165normalize_name crashes on None
  • field_profiler.py:343 — silent failure on profile error
  • Imports inside write_perf_row (field_profiler.py:549-551)
  • Debug __main__ fallback still hardcodes a table (field_profiler.py:607-614)
  • Duplicated normalize_name / find_matching_probes between field_classifier.py and description_reconciler.py
  • Dead bq_delete() in lineage_probe_fetcher.py:374-389
  • save_to_bq instantiates its own client in field_profiler.py:502
  • description_reconciler.py still references gkabbz_* tables — README marks Phase 3 as "not used by the classifier"; if it's truly unused, consider dropping it from the PR
  • DATAHUB_GMS_TOKEN raises KeyError
  • build_taxonomy.py:40-42 LABEL_FIXES case-mismatch
  • No tests

Resolved / improved since last review

  • field_classifier.py and export_to_sheet.py now correctly use parameterized queries in some places — a good direction. Just not consistent yet.
  • data_collection_category is wired through schema, prompt, and the new comparison/export tools consistently.
  • classify_table.sh is a nice ergonomic improvement over the manual phase invocation.
  • README is much more useful, especially the multi-model + comparison flow.
  • New compare_models.py is well-structured and uses parameterized queries for its main fetch_pairs query.

SUBJECT_LABELS = {
"System": "system",
"User": "user",
"company": "company",

@chelseyklein chelseyklein May 28, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"company": "company",
"Company": "company",

@scholtzan

Copy link
Copy Markdown
Collaborator

Integration report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants