fix(data): guard ingredient percent normalization#1168
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
There was a problem hiding this comment.
💡 Codex Review
Lines 241 to 244 in e5e7353
With the new sys.exit(1) guard in enrich_ingredients.py, an invalid generated dataset is now reported via a nonzero exit before a fresh migration is written, but RUN_LOCAL.ps1 -Enrich still treats any enrichment-script failure as a warning and continues to select/apply the latest *populate_ingredients_allergens* migration. In that failure scenario the pipeline can apply a stale committed/generated migration instead of stopping, so the intended fail-fast guardrail does not protect local runs.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
There was a problem hiding this comment.
Pull request overview
This PR hardens the OFF ingredient enrichment pipeline to prevent generated migrations from failing due to invalid ingredient percentage values (e.g., out-of-range percent=300) by normalizing inputs, adding validation, and covering the behavior with targeted tests.
Changes:
- Add percent/percent_estimate normalization + residual invalid-row detection in
enrich_ingredients.py, and fail fast before emitting SQL if invalid values remain. - Add unit tests around percent parsing/sanitization and invalid-row detection in
test_enrichment_portability.py. - Add a pre-apply validation step in
RUN_LOCAL.ps1to scan the generated migration for out-of-range percent literals, and sanitize the already-generated migration artifact.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| enrich_ingredients.py | Normalizes percent fields, logs anomalies, and validates no invalid percent values remain before SQL emission. |
| test_enrichment_portability.py | Adds tests for percent normalization behavior and invalid-row detection. |
| RUN_LOCAL.ps1 | Adds a pre-apply migration scan to catch out-of-range percent literals. |
| supabase/migrations/20260601173035_populate_ingredients_allergens.sql | Sanitizes an already-generated migration row to avoid apply-time constraint failures. |
Comments suppressed due to low confidence (1)
enrich_ingredients.py:368
- process_ingredients() docstring lists keys like
ingredient_id/parent_ingredient_id, but the function actually returns name-based fields (ingredient_name,parent_ingredient_name). This mismatch can mislead callers and makes the new percent sanitation harder to reason about when debugging.
"""Extract ingredient rows for a product.
Returns list of dicts with keys: country, ean, ingredient_id, position,
percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
"""
| if parsed < 0 or parsed > 100: | ||
| anomalies.append( | ||
| { | ||
| "ean": ean, | ||
| "ingredient_name": ingredient_name, | ||
| "field": field_name, | ||
| "value": value, | ||
| "reason": "out_of_range", | ||
| } | ||
| ) | ||
| return None | ||
|
|
||
| return round(parsed, 2) |
| def _find_invalid_percent_rows(rows: list[dict]) -> list[dict]: | ||
| """Return rows that still contain out-of-range percent values.""" | ||
| bad: list[dict] = [] | ||
| for r in rows: | ||
| pct = r.get("percent") | ||
| if pct is not None and (pct < 0 or pct > 100): | ||
| bad.append(r) | ||
| pct_est = r.get("percent_estimate") | ||
| if pct_est is not None and (pct_est < 0 or pct_est > 100): | ||
| bad.append(r) | ||
| return bad |
| line_re = re.compile( | ||
| r"\(\s*'(?P<country>[^']+)'\s*,\s*'(?P<ean>[^']+)'\s*,\s*'(?P<name>(?:[^']|'')*)'\s*,\s*(?P<position>\d+)\s*,\s*(?P<pct>NULL|-?\d+(?:\.\d+)?)::numeric\s*,\s*(?P<pct_est>NULL|-?\d+(?:\.\d+)?)::numeric" | ||
| ) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
enrich_ingredients.py:368
- The
process_ingredientsdocstring describesingredient_id/parent_ingredient_id, but the function now emitsingredient_name/parent_ingredient_namerows (IDs are resolved later via JOINs). Updating the docstring here will prevent confusion for callers/tests.
"""Extract ingredient rows for a product.
Returns list of dicts with keys: country, ean, ingredient_id, position,
percent, percent_estimate, is_sub_ingredient, parent_ingredient_id
"""
| ) | ||
| return None | ||
|
|
||
| if parsed < 0 or parsed > 100: |
| for r in rows: | ||
| pct = r.get("percent") | ||
| if pct is not None and (pct < 0 or pct > 100): | ||
| bad.append(r) | ||
| pct_est = r.get("percent_estimate") | ||
| if pct_est is not None and (pct_est < 0 or pct_est > 100): | ||
| bad.append(r) |
| bad = [] | ||
| for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), start=1): | ||
| m = line_re.search(line) | ||
| if not m: | ||
| continue | ||
| pct = m.group("pct") | ||
| if pct != "NULL": | ||
| val = float(pct) | ||
| if val < 0 or val > 100: | ||
| bad.append((lineno, m.group("country"), m.group("ean"), m.group("name"), val)) | ||
|
|
||
| if bad: | ||
| print(f"Found {len(bad)} out-of-range product_ingredient.percent values in {path.name}", file=sys.stderr) | ||
| for row in bad[:5]: | ||
| print( | ||
| f" line {row[0]} country={row[1]} ean={row[2]} ingredient='{row[3]}' percent={row[4]}", | ||
| file=sys.stderr, | ||
| ) | ||
| sys.exit(1) |
| def _clamp_percent_estimate(pct_est: float | None) -> float | None: | ||
| """Clamp negative percent_estimate to 0 and round to 2 decimals.""" | ||
| if pct_est is None: | ||
| return None | ||
| return round(max(pct_est, 0), 2) |
Summary
Fix failed enrichment migration caused by out-of-range OFF ingredient percentages by hardening normalization at generation time, adding targeted tests, adding pre-apply guardrail validation, and sanitizing the already-generated migration artifact.
Root Cause
One OFF payload (EAN
4062300344877) contained an ingredient entry withpercent=300, which violatedproduct_ingredient.chk_percent_rangeduring migration apply.Changes
enrich_ingredients.py_normalize_percent_value(...)to parse and sanitize percent/percent_estimate values._find_invalid_percent_rows(...)to detect residual invalid rows before SQL emission.process_ingredients(...)to normalize bothpercentandpercent_estimate.main()to collect anomalies and fail fast if invalid percent rows remain.test_enrichment_portability.pyprocess_ingredientssanitization, and invalid-row detector.RUN_LOCAL.ps1product_ingredient.percentis outside[0,100].supabase/migrations/20260601173035_populate_ingredients_allergens.sqlsanitized_rows=1) to replace out-of-range literal withNULL::numeric.Constraints Respected
RUN_LOCAL.ps1 -Enrichrun.chk_percent_range.Verification
Commands
python -m pytest test_enrichment_portability.py -qbad_percent_rows=0,bad_percent_estimate_rows=0)ON_ERROR_STOP=1Results
26 passed.COMMIT), plusrefresh_all_materialized_views()completed.Risk
Low-to-medium: behavior change is restricted to invalid percent input handling and adds conservative nullification + explicit validation.