fix(data): guard ingredient percent normalization by ericsocrat · Pull Request #1168 · ericsocrat/tryvit

ericsocrat · 2026-06-02T06:24:47Z

Summary

Fix failed enrichment migration caused by out-of-range OFF ingredient percentages by hardening normalization at generation time, adding targeted tests, adding pre-apply guardrail validation, and sanitizing the already-generated migration artifact.

Root Cause

One OFF payload (EAN 4062300344877) contained an ingredient entry with percent=300, which violated product_ingredient.chk_percent_range during migration apply.

Changes

enrich_ingredients.py
- Added _normalize_percent_value(...) to parse and sanitize percent/percent_estimate values.
- Added _find_invalid_percent_rows(...) to detect residual invalid rows before SQL emission.
- Updated process_ingredients(...) to normalize both percent and percent_estimate.
- Updated main() to collect anomalies and fail fast if invalid percent rows remain.
test_enrichment_portability.py
- Added tests for comma-decimal parsing, out-of-range nullification, anomaly logging, process_ingredients sanitization, and invalid-row detector.
RUN_LOCAL.ps1
- Added Step 2a pre-apply validation that scans generated migration literals and fails if any product_ingredient.percent is outside [0,100].
supabase/migrations/20260601173035_populate_ingredients_allergens.sql
- Sanitized existing generated artifact (sanitized_rows=1) to replace out-of-range literal with NULL::numeric.

Constraints Respected

Did not rerun full OFF enrichment.
Did not start a new full RUN_LOCAL.ps1 -Enrich run.
Did not weaken chk_percent_range.
Fixed generator/normalization path to prevent recurrence.

Verification

Commands

python -m pytest test_enrichment_portability.py -q
Percent literal scan on generated migration (bad_percent_rows=0, bad_percent_estimate_rows=0)
Apply only sanitized migration with ON_ERROR_STOP=1
Post-apply read-only metrics queries

Results

Tests: 26 passed.
Migration apply: success (COMMIT), plus refresh_all_materialized_views() completed.
Post-apply metrics:
- active_products: 8673
- nutrition_facts_rows: 9241
- product_ingredient_rows: 94683
- product_allergen_info_rows: 18929
- products_with_ingredient_links: 7562
- products_with_allergen_info: 5686
- products_with_ingredient_concern_gt_0: 0
- v_product_confidence_rows: 8673
- confidence_band distribution: high=7562, medium=1111

Risk

Low-to-medium: behavior change is restricted to invalid percent input handling and adds conservative nullification + explicit validation.

vercel · 2026-06-02T06:24:53Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
tryvit	Ready	Preview, Comment	Jun 2, 2026 7:27am

chatgpt-codex-connector

💡 Codex Review

tryvit/RUN_LOCAL.ps1

Lines 241 to 244 in e5e7353

    
           if ($LASTEXITCODE -ne 0) { 
        
               Write-Host "WARNING: Enrichment script exited with errors (some products may not have been enriched)." -ForegroundColor DarkYellow 
        
               Write-Host "  This is expected — not all products are on Open Food Facts." -ForegroundColor DarkGray 
        
           }

Stop after enrichment guard failures

With the new sys.exit(1) guard in enrich_ingredients.py, an invalid generated dataset is now reported via a nonzero exit before a fresh migration is written, but RUN_LOCAL.ps1 -Enrich still treats any enrichment-script failure as a warning and continues to select/apply the latest *populate_ingredients_allergens* migration. In that failure scenario the pipeline can apply a stale committed/generated migration instead of stopping, so the intended fail-fast guardrail does not protect local runs.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Copilot

Pull request overview

This PR hardens the OFF ingredient enrichment pipeline to prevent generated migrations from failing due to invalid ingredient percentage values (e.g., out-of-range percent=300) by normalizing inputs, adding validation, and covering the behavior with targeted tests.

Changes:

Add percent/percent_estimate normalization + residual invalid-row detection in enrich_ingredients.py, and fail fast before emitting SQL if invalid values remain.
Add unit tests around percent parsing/sanitization and invalid-row detection in test_enrichment_portability.py.
Add a pre-apply validation step in RUN_LOCAL.ps1 to scan the generated migration for out-of-range percent literals, and sanitize the already-generated migration artifact.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.

File	Description
enrich_ingredients.py	Normalizes percent fields, logs anomalies, and validates no invalid percent values remain before SQL emission.
test_enrichment_portability.py	Adds tests for percent normalization behavior and invalid-row detection.
RUN_LOCAL.ps1	Adds a pre-apply migration scan to catch out-of-range percent literals.
supabase/migrations/20260601173035_populate_ingredients_allergens.sql	Sanitizes an already-generated migration row to avoid apply-time constraint failures.

Comments suppressed due to low confidence (1)

enrich_ingredients.py:368

process_ingredients() docstring lists keys like ingredient_id/parent_ingredient_id, but the function actually returns name-based fields (ingredient_name, parent_ingredient_name). This mismatch can mislead callers and makes the new percent sanitation harder to reason about when debugging.

    """Extract ingredient rows for a product.

    Returns list of dicts with keys: country, ean, ingredient_id, position,
    percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
    """

+    if parsed < 0 or parsed > 100:
+        anomalies.append(
+            {
+                "ean": ean,
+                "ingredient_name": ingredient_name,
+                "field": field_name,
+                "value": value,
+                "reason": "out_of_range",
+            }
+        )
+        return None
+
+    return round(parsed, 2)


+def _find_invalid_percent_rows(rows: list[dict]) -> list[dict]:
+    """Return rows that still contain out-of-range percent values."""
+    bad: list[dict] = []
+    for r in rows:
+        pct = r.get("percent")
+        if pct is not None and (pct < 0 or pct > 100):
+            bad.append(r)
+        pct_est = r.get("percent_estimate")
+        if pct_est is not None and (pct_est < 0 or pct_est > 100):
+            bad.append(r)
+    return bad


+line_re = re.compile(
+    r"\(\s*'(?P<country>[^']+)'\s*,\s*'(?P<ean>[^']+)'\s*,\s*'(?P<name>(?:[^']|'')*)'\s*,\s*(?P<position>\d+)\s*,\s*(?P<pct>NULL|-?\d+(?:\.\d+)?)::numeric\s*,\s*(?P<pct_est>NULL|-?\d+(?:\.\d+)?)::numeric"
+)


…1168)

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

enrich_ingredients.py:368

The process_ingredients docstring describes ingredient_id/parent_ingredient_id, but the function now emits ingredient_name/parent_ingredient_name rows (IDs are resolved later via JOINs). Updating the docstring here will prevent confusion for callers/tests.

    """Extract ingredient rows for a product.

    Returns list of dicts with keys: country, ean, ingredient_id, position,
    percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
    """

+        )
+        return None
+
+    if parsed < 0 or parsed > 100:


+    for r in rows:
+        pct = r.get("percent")
+        if pct is not None and (pct < 0 or pct > 100):
+            bad.append(r)
+        pct_est = r.get("percent_estimate")
+        if pct_est is not None and (pct_est < 0 or pct_est > 100):
+            bad.append(r)


+bad = []
+for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1):
+    m = line_re.search(line)
+    if not m:
+        continue
+    pct = m.group("pct")
+    if pct != "NULL":
+        val = float(pct)
+        if val < 0 or val > 100:
+            bad.append((lineno, m.group("country"), m.group("ean"), m.group("name"), val))
+
+if bad:
+    print(f"Found {len(bad)} out-of-range product_ingredient.percent values in {path.name}", file=sys.stderr)
+    for row in bad[:5]:
+        print(
+            f"  line {row[0]} country={row[1]} ean={row[2]} ingredient='{row[3]}' percent={row[4]}",
+            file=sys.stderr,
+        )
+    sys.exit(1)


 def _clamp_percent_estimate(pct_est: float | None) -> float | None:
    """Clamp negative percent_estimate to 0 and round to 2 decimals."""
    if pct_est is None:
        return None
    return round(max(pct_est, 0), 2)


fix(data): guard ingredient percent normalization

e5e7353

Copilot AI review requested due to automatic review settings June 2, 2026 06:24

Copilot started reviewing on behalf of ericsocrat June 2, 2026 06:24 View session

vercel Bot deployed to Preview June 2, 2026 06:26 View deployment

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Copilot AI reviewed Jun 2, 2026

View reviewed changes

chore(docs): update migration count 230→231 for enrichment migration (#…

030d271

…1168)

vercel Bot deployed to Preview June 2, 2026 06:55 View deployment

probe

b18924e

Copilot AI review requested due to automatic review settings June 2, 2026 07:18

Copilot started reviewing on behalf of ericsocrat June 2, 2026 07:18 View session

ericsocrat added 2 commits June 2, 2026 09:18

remove accidental probe file

14850b9

remove accidental probe file

b8b9f43

vercel Bot deployed to Preview June 2, 2026 07:21 View deployment

Copilot AI reviewed Jun 2, 2026

View reviewed changes

fix(migration): add required headers to enrichment migration (#1168)

f4ddd58

vercel Bot deployed to Preview June 2, 2026 07:27 View deployment

ericsocrat merged commit ed9726f into main Jun 2, 2026
17 checks passed

ericsocrat deleted the fix/data-guard-ingredient-percent-normalization branch June 2, 2026 07:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(data): guard ingredient percent normalization#1168

fix(data): guard ingredient percent normalization#1168
ericsocrat merged 6 commits into
mainfrom
fix/data-guard-ingredient-percent-normalization

ericsocrat commented Jun 2, 2026

Uh oh!

vercel Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	if ($LASTEXITCODE -ne 0) {
	Write-Host "WARNING: Enrichment script exited with errors (some products may not have been enriched)." -ForegroundColor DarkYellow
	Write-Host " This is expected — not all products are on Open Food Facts." -ForegroundColor DarkGray
	}

Conversation

ericsocrat commented Jun 2, 2026

Summary

Root Cause

Changes

Constraints Respected

Verification

Commands

Results

Risk

Uh oh!

vercel Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

vercel Bot commented Jun 2, 2026 •

edited

Loading