
feat(rewrite): cost-aware and convergence-aware repair loop #146

@memadi-nv


Summary

The evaluate-repair loop in RewriteWorkflow._run_evaluate_repair_loop runs up to max_repair_iterations rounds for any failing row regardless of whether each iteration is actually improving the rewrite. This burns LLM calls on rows that aren't getting better (or are getting worse) and silently overwrites better attempts with worse ones. This issue proposes adding lightweight per-row convergence tracking, early-stop conditions, and a "keep the best attempt" policy.

Current state

The relevant code lives in src/anonymizer/engine/rewrite/rewrite_workflow.py:278-368. Per failing row, each iteration:

  1. Calls the repair LLM → produces _rewritten_text_next.
  2. Promotes it: failing_rows[COL_REWRITTEN_TEXT] = failing_rows[COL_REWRITTEN_TEXT_NEXT] (line 348).
  3. Re-runs the 3 LLM evaluator calls (quality re-answer, privacy re-answer, quality compare) plus metric computation.
  4. Continues until either _needs_repair is False for all rows, or max_repair_iterations is reached.
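In simplified form, the current control flow looks roughly like this (a sketch based on the description above, not the actual implementation; helper names are illustrative):

```python
MAX_REPAIR_ITERATIONS = 3  # stands in for max_repair_iterations


def run_repair_loop(failing_rows, repair_llm, evaluate):
    """Sketch of today's loop: last attempt always wins, no history kept."""
    iteration = 0
    while failing_rows and iteration < MAX_REPAIR_ITERATIONS:
        iteration += 1
        for row in failing_rows:
            # 1. Repair LLM call produces the candidate rewrite.
            row["_rewritten_text_next"] = repair_llm(row)
            # 2. Unconditional promotion (the line-348 behavior):
            row["_rewritten_text"] = row["_rewritten_text_next"]
            # 3. Re-evaluation (3 evaluator calls + metrics) overwrites
            #    the previous scores in place.
            evaluate(row)
            row["_repair_iterations"] = iteration
        # 4. Only stop conditions: all rows pass or budget exhausted.
        failing_rows = [r for r in failing_rows if r["_needs_repair"]]
    return failing_rows
```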

Properties of the current behavior:

  • No metric history is preserved. Each re-evaluation overwrites the previous utility_score, leakage_mass, weighted_leakage_rate, any_high_leaked for that row. Only the _repair_iterations counter survives.
  • Last attempt wins, even if worse. Line 348 unconditionally promotes the most recent repair result to _rewritten_text. If iteration 2 produced higher leakage or lower utility than iteration 1, the loop keeps iteration 2.
  • No early stop on stagnation or regression. The only stopping conditions are "all rows pass" and "hit max_repair_iterations". A row whose leakage_mass plateaus or oscillates burns the full iteration budget.
  • No per-row cost ceiling. Each repair iteration triggers 1 repair call + 3 evaluator calls. A row stuck near (but above) the threshold for 3 iterations burns 12 LLM calls and produces no improvement.

Problems this causes

  1. Wasted LLM spend on rows that aren't converging — at typical configs each non-converging row costs ~4 LLM calls per iteration.
  2. Silent regressions. A user can get back a worse rewrite than was produced at iteration 1 with no indication that earlier attempts were better.
  3. No diagnostic signal. Users can't tell from the output whether the repair budget was useful or whether iterations were churning, so they have no evidence on which to tune max_repair_iterations or risk_tolerance.

Proposed work

Engine changes (rewrite_workflow.py + evaluate.py)

  • Track a per-row metric history: list[dict] containing at minimum (iteration, utility_score, leakage_mass, weighted_leakage_rate, any_high_leaked) for each iteration including iteration 0.
  • Implement "keep the best attempt" policy: at each iteration, retain the rewrite text + metrics from the iteration with the lowest leakage_mass (with a tie-break on highest utility_score). Promote that as the final _rewritten_text rather than always using the last iteration's output.
  • Add early-stop conditions, configurable via EvaluationCriteria:
    • Stagnation: stop iterating this row if leakage_mass did not decrease by more than min_leakage_improvement (default e.g. 0.05) for stagnation_patience (default 1) consecutive iterations.
    • Regression: stop iterating this row if leakage_mass increased relative to the best attempt so far AND the previous iteration also regressed.
    • Utility floor: stop iterating this row if utility_score drops below repair_utility_floor (default e.g. 0.3) — keep the best prior attempt and let the human-review flag handle it downstream.
    • Per-row LLM call budget (optional, off by default): hard cap on total repair-loop LLM calls per row.
  • Add new output columns alongside existing ones:
    • _repair_history — the per-iteration metric records (json-serializable).
    • _repair_stop_reason — enum: converged | max_iterations | stagnation | regression | utility_floor | budget.
    • _repair_best_iteration — index of the iteration whose rewrite was kept.
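A minimal sketch of the per-row tracking, the keep-the-best policy, and two of the stop conditions (stagnation and the utility floor; regression and the call budget follow the same pattern). Class and method names are illustrative, and the defaults mirror the proposal:

```python
from dataclasses import dataclass, field


@dataclass
class RowRepairState:
    """Illustrative per-row convergence tracker (names are hypothetical)."""
    history: list = field(default_factory=list)  # per-iteration metric dicts
    best_iteration: int = 0
    stop_reason: str = ""

    def record(self, iteration, text, metrics,
               min_leakage_improvement=0.05,
               stagnation_patience=1,
               repair_utility_floor=0.3):
        self.history.append({"iteration": iteration, "text": text, **metrics})
        # Keep-the-best: lowest leakage_mass, tie-break on highest utility.
        def key(h):
            return (h["leakage_mass"], -h["utility_score"])
        if key(self.history[-1]) < key(self.history[self.best_iteration]):
            self.best_iteration = len(self.history) - 1
        # Utility floor: stop and fall back to the best prior attempt.
        if metrics["utility_score"] < repair_utility_floor:
            self.stop_reason = "utility_floor"
            return
        # Stagnation: no improvement > threshold for `patience` rounds.
        if len(self.history) > stagnation_patience:
            window = self.history[-(stagnation_patience + 1):]
            if all(window[i]["leakage_mass"] - window[i + 1]["leakage_mass"]
                   <= min_leakage_improvement
                   for i in range(len(window) - 1)):
                self.stop_reason = "stagnation"

    @property
    def best_text(self):
        return self.history[self.best_iteration]["text"]
```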

Config changes (config/rewrite.py)

  • Extend EvaluationCriteria with the new tunables (min_leakage_improvement, stagnation_patience, repair_utility_floor, optional per_row_repair_call_budget).
  • Wire defaults into the existing RiskTolerance bundles (minimal/low/moderate/high).
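An illustrative shape for the config change, showing only the new tunables (the existing EvaluationCriteria fields and the RiskTolerance wiring are elided):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationCriteria:
    # ... existing criteria fields elided ...
    min_leakage_improvement: float = 0.05   # stagnation threshold
    stagnation_patience: int = 1            # consecutive flat iterations allowed
    repair_utility_floor: float = 0.3       # stop if utility drops below this
    per_row_repair_call_budget: Optional[int] = None  # off by default
```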

Reporting

  • Aggregate convergence stats at the end of _run_evaluate_repair_loop and log them: rows converged / rows stopped early by reason / rows that hit max_repair_iterations / total repair calls saved vs. naive loop.
  • Surface a small summary (counts per _repair_stop_reason and average iterations used) on the returned RewriteResult so callers can introspect.
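The aggregation itself can be small; a sketch assuming rows carry the proposed _repair_stop_reason column and the existing _repair_iterations counter:

```python
from collections import Counter


def summarize_repair_loop(rows):
    """Aggregate convergence stats for logging / RewriteResult (sketch)."""
    reasons = Counter(r["_repair_stop_reason"] for r in rows)
    avg_iters = (sum(r["_repair_iterations"] for r in rows) / len(rows)
                 if rows else 0.0)
    return {"stop_reasons": dict(reasons), "avg_iterations": avg_iters}
```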

Tests (tests/engine/rewrite/)

  • Unit tests with mocked evaluator/repair facades that exercise:
    • Convergence after 1 iteration.
    • Stagnation triggers stop and keeps the best attempt.
    • Regression triggers stop and keeps the best attempt (verify last attempt is not promoted).
    • Utility-floor stop preserves the best prior attempt.
    • Per-row budget stop.
    • Backwards-compatible default behavior is unchanged when the new knobs are at conservative defaults.
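For instance, the regression case could be pinned down by scripting the evaluator metrics and asserting on which attempt is kept (pick_best is a hypothetical stand-in for the keep-the-best selection):

```python
def pick_best(history):
    """Hypothetical helper: lowest leakage_mass, tie-break on utility."""
    return min(history, key=lambda h: (h["leakage_mass"], -h["utility_score"]))


def test_regression_keeps_best_attempt():
    # Scripted metrics standing in for mocked evaluator calls:
    # iteration 2 regresses, so the kept rewrite must come from iteration 1.
    history = [
        {"iteration": 0, "text": "orig", "leakage_mass": 0.6, "utility_score": 0.9},
        {"iteration": 1, "text": "v1", "leakage_mass": 0.1, "utility_score": 0.8},
        {"iteration": 2, "text": "v2", "leakage_mass": 0.4, "utility_score": 0.8},
    ]
    best = pick_best(history)
    assert best["iteration"] == 1  # the last attempt is NOT promoted
    assert best["text"] == "v1"
```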

Docs

  • Update docs/concepts/rewrite.md (or wherever risk_tolerance is documented) with the new knobs, including a short "how to read _repair_history and _repair_stop_reason" section.

Out of scope

  • Changing the repair prompt itself or the underlying repair model selection.
  • Changing the metrics or how leakage_mass / utility_score are computed.
  • Changes to FinalJudgeWorkflow or _determine_needs_human_review (the human-review flag remains driven by the existing objective metrics, which now reflect the best iteration's metrics rather than the last one).
  • Cross-row scheduling / global budget management (separate concern).

Open questions

  1. Best-attempt definition. Should "best" be min(leakage_mass) only, or a weighted scalar like leakage_mass - α * utility_score? Starting with pure min(leakage_mass) plus a utility_score tie-break gives the simplest semantics and matches what _needs_repair currently optimizes.
  2. Default for repair_utility_floor. Should this be wired per-risk_tolerance (stricter floors at minimal)? Likely yes.
  3. Should stagnation/regression detection consider weighted_leakage_rate instead of raw leakage_mass? Raw mass scales with the number of protected entities, which may produce noisier stagnation detection on rows with many entities.
  4. Backwards compatibility. Existing user configs don't set the new knobs. Defaults must reproduce today's behavior closely enough that users don't see surprise regressions in metrics on the same input.
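On question 1, the two candidate definitions can diverge in practice; a small illustration with made-up numbers and α = 0.1:

```python
ATTEMPTS = [
    {"iteration": 1, "leakage_mass": 0.20, "utility_score": 0.90},
    {"iteration": 2, "leakage_mass": 0.18, "utility_score": 0.40},
]

# Pure min(leakage_mass): picks iteration 2 despite the utility collapse.
pure = min(ATTEMPTS, key=lambda a: a["leakage_mass"])

# Weighted scalar leakage_mass - alpha * utility_score: the utility term
# outweighs the tiny leakage gain, so iteration 1 wins.
alpha = 0.1
weighted = min(ATTEMPTS,
               key=lambda a: a["leakage_mass"] - alpha * a["utility_score"])
```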

Why this matters

For users running long datasets, the repair loop is the most expensive stage of rewrite mode (each non-converging row costs ~4 LLM calls per iteration). Today there is no way for a row to "give up gracefully" on partial progress, no way to keep the best attempt when later iterations regress, and no diagnostic signal to tell whether the iteration budget is being well spent. This change is mostly surgical (one function gets smarter, plus a few config knobs and output columns) and has clear, measurable wins: lower LLM spend, no more silent regressions, and visibility into how well the loop is actually converging.
