Summary
The evaluate-repair loop in RewriteWorkflow._run_evaluate_repair_loop runs up to max_repair_iterations rounds for any failing row regardless of whether each iteration is actually improving the rewrite. This burns LLM calls on rows that aren't getting better (or are getting worse) and silently overwrites better attempts with worse ones. This issue proposes adding lightweight per-row convergence tracking, early-stop conditions, and a "keep the best attempt" policy.
Current state
The relevant code lives in src/anonymizer/engine/rewrite/rewrite_workflow.py:278-368. Per failing row, each iteration:
- Calls the repair LLM → produces _rewritten_text_next.
- Promotes it: failing_rows[COL_REWRITTEN_TEXT] = failing_rows[COL_REWRITTEN_TEXT_NEXT] (line 348).
- Re-runs the 3 LLM evaluator calls (quality re-answer, privacy re-answer, quality compare) plus metric computation.
- Continues until either _needs_repair is False for all rows, or max_repair_iterations is reached.
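Condensed to control flow, the loop looks roughly like this. This is a simplified sketch: the real implementation operates on a DataFrame at rewrite_workflow.py:278-368, and the function/column names here are paraphrased from the description above, not copied from the source.

```python
# Simplified sketch of the current loop shape (names paraphrased; the real
# code works on pandas DataFrames rather than lists of dicts).
def run_evaluate_repair_loop(rows, max_repair_iterations, repair, evaluate):
    for iteration in range(max_repair_iterations):
        failing = [r for r in rows if r["needs_repair"]]
        if not failing:
            break  # stopping condition 1: all rows pass
        for row in failing:
            # 1 repair call: produce the next candidate rewrite, then
            # unconditionally promote it — last attempt wins, even if worse
            row["rewritten_text"] = repair(row)
            # 3 evaluator calls + metric computation; overwrites prior metrics
            row.update(evaluate(row))
            # only this counter survives across iterations
            row["repair_iterations"] = iteration + 1
    # stopping condition 2: max_repair_iterations reached
    return rows
```

Note that nothing in this shape compares the new candidate against earlier attempts before promoting it, which is exactly the gap this issue targets.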
Properties of the current behavior:
- No metric history is preserved. Each re-evaluation overwrites the previous utility_score, leakage_mass, weighted_leakage_rate, and any_high_leaked for that row. Only the _repair_iterations counter survives.
- Last attempt wins, even if worse. Line 348 unconditionally promotes the most recent repair result to _rewritten_text. If iteration 2 produced higher leakage or lower utility than iteration 1, the loop keeps iteration 2.
- No early stop on stagnation or regression. The only stopping conditions are "all rows pass" and "hit max_repair_iterations". A row whose leakage_mass plateaus or oscillates burns the full iteration budget.
- No per-row cost ceiling. Each repair iteration triggers 1 repair call + 3 evaluator calls. A row stuck near (but above) the threshold for 3 iterations burns 12 LLM calls and produces no improvement.
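The per-row cost claim works out directly from the call counts above (1 repair call plus 3 evaluator calls per iteration):

```python
# 1 repair call + 3 evaluator calls (quality re-answer, privacy re-answer,
# quality compare) per repair iteration, as described above.
CALLS_PER_ITERATION = 1 + 3

def repair_calls(iterations: int) -> int:
    """LLM calls spent on a single failing row across the repair loop."""
    return iterations * CALLS_PER_ITERATION

# A row stuck near (but above) the threshold for 3 iterations: 12 LLM calls,
# potentially zero improvement.
```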
Problems this causes
- Wasted LLM spend on rows that aren't converging — at typical configs each non-converging row costs ~4 LLM calls per iteration.
- Silent regressions. A user can get back a worse rewrite than was produced at iteration 1 with no indication that earlier attempts were better.
- No diagnostic signal. Users can't tell from the output whether the repair budget was useful or whether iterations were churning, so they can't tune max_repair_iterations or risk_tolerance based on evidence.
Proposed work
Engine changes (rewrite_workflow.py + evaluate.py)
- Track a per-row repair history: a list[dict] containing at minimum (iteration, utility_score, leakage_mass, weighted_leakage_rate, any_high_leaked) for each iteration, including iteration 0.
- Keep the best attempt: the iteration with the lowest leakage_mass (with a tie-break on highest utility_score). Promote that as the final _rewritten_text rather than always using the last iteration's output.
- Early-stop conditions, tunable via EvaluationCriteria:
  - Stagnation: leakage_mass did not decrease by more than min_leakage_improvement (default e.g. 0.05) for stagnation_patience (default 1) consecutive iterations.
  - Regression: leakage_mass increased relative to the best attempt so far AND the previous iteration also regressed.
  - Utility floor: utility_score drops below repair_utility_floor (default e.g. 0.3) — keep the best prior attempt and let the human-review flag handle it downstream.
- New output columns:
  - _repair_history — the per-iteration metric records (JSON-serializable).
  - _repair_stop_reason — enum: converged | max_iterations | stagnation | regression | utility_floor | budget.
  - _repair_best_iteration — index of the iteration whose rewrite was kept.
Config changes (config/rewrite.py)
- Extend EvaluationCriteria with the new tunables (min_leakage_improvement, stagnation_patience, repair_utility_floor, optional per_row_repair_call_budget).
- Wire defaults into the RiskTolerance bundles (minimal/low/moderate/high).
Reporting
- Aggregate loop statistics in _run_evaluate_repair_loop and log them: rows converged / rows stopped early by reason / rows that hit max_repair_iterations / total repair calls saved vs. the naive loop.
- Surface summary stats (_repair_stop_reason counts and average iterations used) on the returned RewriteResult so callers can introspect.
Tests (tests/engine/rewrite/)
Docs
- Update docs/concepts/rewrite.md (or wherever risk_tolerance is documented) with the new knobs, including a short "how to read _repair_history and _repair_stop_reason" section.
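One way the early-stop checks could fit together, as a sketch only: the parameter names and defaults follow the tunables proposed in this issue (min_leakage_improvement=0.05, stagnation_patience=1, repair_utility_floor=0.3), but the function itself and the exact regression test are illustrative, not a committed design.

```python
def stop_reason(history, min_leakage_improvement=0.05, stagnation_patience=1,
                repair_utility_floor=0.3):
    """Return an early-stop reason for a row, or None to keep iterating.

    history: per-iteration metric records, each carrying leakage_mass and
    utility_score; history[0] is the pre-repair (iteration 0) evaluation.
    """
    current = history[-1]
    best = min(h["leakage_mass"] for h in history)

    # Utility floor: the latest rewrite degraded utility too far.
    if current["utility_score"] < repair_utility_floor:
        return "utility_floor"

    # Regression: the last two iterations both sit above the best attempt so far.
    if len(history) >= 3 and all(h["leakage_mass"] > best for h in history[-2:]):
        return "regression"

    # Stagnation: no meaningful leakage improvement across the patience window.
    recent = history[-(stagnation_patience + 1):]
    if len(recent) == stagnation_patience + 1:
        improvements = [recent[i]["leakage_mass"] - recent[i + 1]["leakage_mass"]
                        for i in range(len(recent) - 1)]
        if all(imp <= min_leakage_improvement for imp in improvements):
            return "stagnation"

    return None
```

Whatever reason is returned would map onto _repair_stop_reason, with the caller then promoting the best attempt rather than the last one.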
Out of scope
- Changing the repair prompt itself or the underlying repair model selection.
- Changing the metrics or how leakage_mass / utility_score are computed.
- Changes to FinalJudgeWorkflow or _determine_needs_human_review (the human-review flag remains driven by the existing objective metrics, which now reflect the best iteration's metrics rather than the last one's).
- Cross-row scheduling / global budget management (separate concern).
Open questions
- Best-attempt definition. Should "best" be min(leakage_mass) only, or a weighted scalar like leakage_mass - α * utility_score? Starting with pure min(leakage_mass) with a utility_score tiebreak is the simplest semantics and matches what _needs_repair currently optimizes.
- Default for repair_utility_floor. Should this be wired per-risk_tolerance (stricter floors at minimal)? Likely yes.
- Should stagnation/regression detection consider weighted_leakage_rate instead of raw leakage_mass? Raw mass scales with the number of protected entities, which may produce noisier stagnation detection on rows with many entities.
- Backwards compatibility. Existing user configs don't set the new knobs. Defaults must reproduce today's behavior closely enough that users don't see surprise regressions in metrics on the same input.
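The "simplest semantics" option from the first question above pins down to a short comparator (hypothetical helper name; assumes the per-iteration records carry leakage_mass and utility_score, and returns the index that would populate _repair_best_iteration):

```python
def best_attempt_index(history):
    # Lowest leakage_mass wins; ties break toward the higher utility_score
    # (negated so that min() prefers larger utility).
    best_i, _ = min(
        enumerate(history),
        key=lambda pair: (pair[1]["leakage_mass"], -pair[1]["utility_score"]),
    )
    return best_i
```

Tuple ordering gives the tiebreak for free, which is one argument for starting with this definition before reaching for a weighted scalar.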
Why this matters
For users running long datasets, the repair loop is the most expensive stage of rewrite mode (each non-converging row costs ~4 LLM calls per iteration). Today there is no way for a row to "give up gracefully" on partial progress, no way to keep the best attempt when later iterations regress, and no diagnostic signal to tell whether the iteration budget is being well spent. This change is mostly surgical (one function gets smarter, plus a few config knobs and output columns) and has clear, measurable wins: lower LLM spend, no more silent regressions, and visibility into how well the loop is actually converging.
Related
- max_repair_iterations not respected at runtime