Skip to content

fix(data): guard ingredient percent normalization#1168

Merged
ericsocrat merged 6 commits into
mainfrom
fix/data-guard-ingredient-percent-normalization
Jun 2, 2026
Merged

fix(data): guard ingredient percent normalization#1168
ericsocrat merged 6 commits into
mainfrom
fix/data-guard-ingredient-percent-normalization

Conversation

@ericsocrat

Copy link
Copy Markdown
Owner

Summary

Fix failed enrichment migration caused by out-of-range OFF ingredient percentages by hardening normalization at generation time, adding targeted tests, adding pre-apply guardrail validation, and sanitizing the already-generated migration artifact.

Root Cause

One OFF payload (EAN 4062300344877) contained an ingredient entry with percent=300, which violated product_ingredient.chk_percent_range during migration apply.

Changes

  1. enrich_ingredients.py
    • Added _normalize_percent_value(...) to parse and sanitize percent/percent_estimate values.
    • Added _find_invalid_percent_rows(...) to detect residual invalid rows before SQL emission.
    • Updated process_ingredients(...) to normalize both percent and percent_estimate.
    • Updated main() to collect anomalies and fail fast if invalid percent rows remain.
  2. test_enrichment_portability.py
    • Added tests for comma-decimal parsing, out-of-range nullification, anomaly logging, process_ingredients sanitization, and invalid-row detector.
  3. RUN_LOCAL.ps1
    • Added Step 2a pre-apply validation that scans generated migration literals and fails if any product_ingredient.percent is outside [0,100].
  4. supabase/migrations/20260601173035_populate_ingredients_allergens.sql
    • Sanitized existing generated artifact (sanitized_rows=1) to replace out-of-range literal with NULL::numeric.

Constraints Respected

  • Did not rerun full OFF enrichment.
  • Did not start a new full RUN_LOCAL.ps1 -Enrich run.
  • Did not weaken chk_percent_range.
  • Fixed generator/normalization path to prevent recurrence.

Verification

Commands

  • python -m pytest test_enrichment_portability.py -q
  • Percent literal scan on generated migration (bad_percent_rows=0, bad_percent_estimate_rows=0)
  • Apply only sanitized migration with ON_ERROR_STOP=1
  • Post-apply read-only metrics queries

Results

  • Tests: 26 passed.
  • Migration apply: success (COMMIT), plus refresh_all_materialized_views() completed.
  • Post-apply metrics:
    • active_products: 8673
    • nutrition_facts_rows: 9241
    • product_ingredient_rows: 94683
    • product_allergen_info_rows: 18929
    • products_with_ingredient_links: 7562
    • products_with_allergen_info: 5686
    • products_with_ingredient_concern_gt_0: 0
    • v_product_confidence_rows: 8673
    • confidence_band distribution: high=7562, medium=1111

Risk

Low-to-medium: behavior change is restricted to invalid percent input handling and adds conservative nullification + explicit validation.

Copilot AI review requested due to automatic review settings June 2, 2026 06:24
@vercel

vercel Bot commented Jun 2, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
tryvit Ready Ready Preview, Comment Jun 2, 2026 7:27am

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

tryvit/RUN_LOCAL.ps1

Lines 241 to 244 in e5e7353

if ($LASTEXITCODE -ne 0) {
Write-Host "WARNING: Enrichment script exited with errors (some products may not have been enriched)." -ForegroundColor DarkYellow
Write-Host " This is expected — not all products are on Open Food Facts." -ForegroundColor DarkGray
}

P2 Badge Stop after enrichment guard failures

With the new sys.exit(1) guard in enrich_ingredients.py, an invalid generated dataset is now reported via a nonzero exit before a fresh migration is written, but RUN_LOCAL.ps1 -Enrich still treats any enrichment-script failure as a warning and continues to select/apply the latest *populate_ingredients_allergens* migration. In that failure scenario the pipeline can apply a stale committed/generated migration instead of stopping, so the intended fail-fast guardrail does not protect local runs.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the OFF ingredient enrichment pipeline to prevent generated migrations from failing due to invalid ingredient percentage values (e.g., out-of-range percent=300) by normalizing inputs, adding validation, and covering the behavior with targeted tests.

Changes:

  • Add percent/percent_estimate normalization + residual invalid-row detection in enrich_ingredients.py, and fail fast before emitting SQL if invalid values remain.
  • Add unit tests around percent parsing/sanitization and invalid-row detection in test_enrichment_portability.py.
  • Add a pre-apply validation step in RUN_LOCAL.ps1 to scan the generated migration for out-of-range percent literals, and sanitize the already-generated migration artifact.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File Description
enrich_ingredients.py Normalizes percent fields, logs anomalies, and validates no invalid percent values remain before SQL emission.
test_enrichment_portability.py Adds tests for percent normalization behavior and invalid-row detection.
RUN_LOCAL.ps1 Adds a pre-apply migration scan to catch out-of-range percent literals.
supabase/migrations/20260601173035_populate_ingredients_allergens.sql Sanitizes an already-generated migration row to avoid apply-time constraint failures.
Comments suppressed due to low confidence (1)

enrich_ingredients.py:368

  • process_ingredients() docstring lists keys like ingredient_id/parent_ingredient_id, but the function actually returns name-based fields (ingredient_name, parent_ingredient_name). This mismatch can mislead callers and makes the new percent sanitation harder to reason about when debugging.
    """Extract ingredient rows for a product.

    Returns list of dicts with keys: country, ean, ingredient_id, position,
    percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
    """

Comment thread enrich_ingredients.py
Comment on lines +328 to +340
if parsed < 0 or parsed > 100:
anomalies.append(
{
"ean": ean,
"ingredient_name": ingredient_name,
"field": field_name,
"value": value,
"reason": "out_of_range",
}
)
return None

return round(parsed, 2)
Comment thread enrich_ingredients.py
Comment on lines +343 to +353
def _find_invalid_percent_rows(rows: list[dict]) -> list[dict]:
"""Return rows that still contain out-of-range percent values."""
bad: list[dict] = []
for r in rows:
pct = r.get("percent")
if pct is not None and (pct < 0 or pct > 100):
bad.append(r)
pct_est = r.get("percent_estimate")
if pct_est is not None and (pct_est < 0 or pct_est > 100):
bad.append(r)
return bad
Comment thread RUN_LOCAL.ps1
Comment on lines +256 to +258
line_re = re.compile(
r"\(\s*'(?P<country>[^']+)'\s*,\s*'(?P<ean>[^']+)'\s*,\s*'(?P<name>(?:[^']|'')*)'\s*,\s*(?P<position>\d+)\s*,\s*(?P<pct>NULL|-?\d+(?:\.\d+)?)::numeric\s*,\s*(?P<pct_est>NULL|-?\d+(?:\.\d+)?)::numeric"
)
Copilot AI review requested due to automatic review settings June 2, 2026 07:18

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

enrich_ingredients.py:368

  • The process_ingredients docstring describes ingredient_id/parent_ingredient_id, but the function now emits ingredient_name/parent_ingredient_name rows (IDs are resolved later via JOINs). Updating the docstring here will prevent confusion for callers/tests.
    """Extract ingredient rows for a product.

    Returns list of dicts with keys: country, ean, ingredient_id, position,
    percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
    """

Comment thread enrich_ingredients.py
)
return None

if parsed < 0 or parsed > 100:
Comment thread enrich_ingredients.py
Comment on lines +346 to +352
for r in rows:
pct = r.get("percent")
if pct is not None and (pct < 0 or pct > 100):
bad.append(r)
pct_est = r.get("percent_estimate")
if pct_est is not None and (pct_est < 0 or pct_est > 100):
bad.append(r)
Comment thread RUN_LOCAL.ps1
Comment on lines +260 to +278
bad = []
for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
m = line_re.search(line)
if not m:
continue
pct = m.group("pct")
if pct != "NULL":
val = float(pct)
if val < 0 or val > 100:
bad.append((lineno, m.group("country"), m.group("ean"), m.group("name"), val))

if bad:
print(f"Found {len(bad)} out-of-range product_ingredient.percent values in {path.name}", file=sys.stderr)
for row in bad[:5]:
print(
f" line {row[0]} country={row[1]} ean={row[2]} ingredient='{row[3]}' percent={row[4]}",
file=sys.stderr,
)
sys.exit(1)
Comment thread enrich_ingredients.py
Comment on lines 281 to 285
def _clamp_percent_estimate(pct_est: float | None) -> float | None:
"""Clamp negative percent_estimate to 0 and round to 2 decimals."""
if pct_est is None:
return None
return round(max(pct_est, 0), 2)
@ericsocrat ericsocrat merged commit ed9726f into main Jun 2, 2026
17 checks passed
@ericsocrat ericsocrat deleted the fix/data-guard-ingredient-percent-normalization branch June 2, 2026 07:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants