Problem
PLAN.md specifies a "Statistical significance check" as step 5 of the optimization loop (Architecture → "EVALUATE & COMPARE"), gates Phase 1 "Done" on a ≥10% improvement with no regression, and defines a PR template reporting before/after scores on train, validation, AND holdout.
But the pipeline currently declares success on any positive raw delta. In evolution/skills/evolve_skill.py:
avg_baseline = sum(baseline_scores) / max(1, len(baseline_scores))
avg_evolved = sum(evolved_scores) / max(1, len(evolved_scores))
improvement = avg_evolved - avg_baseline
...
if improvement > 0:
    console.print("✓ Evolution improved skill ...")
GEPA is designed to run on as few as 3 holdout examples (per the plan). At that size a +0.02 average delta is indistinguishable from noise — yet the pipeline treats it as a win, saves "improved" metrics, and surfaces it for a PR. Nothing in the repo distinguishes a real improvement from sampling noise, so the Phase-1 gate ("≥10%, no regression, sensible to a human") can't actually be enforced.
Proposal
Add a purely local significance check over the paired holdout scores the pipeline already computes:
- an exact paired randomization test for the p-value (exact for the small holdout sets that are the norm),
- a bootstrap confidence interval + effect size for the magnitude,
- an accept/reject decision combining significance (α=0.05) with the plan's ≥10% effect gate,
and gate the success verdict on that instead of improvement > 0. No new API calls, no new dependencies.
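As a rough illustration of the three pieces above, here is a minimal standard-library sketch. It is not the final implementation. All names (`SignificanceResult`, `exact_paired_p_value`, `bootstrap_ci`, `check_significance`) are hypothetical. It assumes a one-sided sign-flip randomization test, a percentile bootstrap, and that the plan's "≥10%" means relative gain over the baseline mean; each of these choices is open to discussion.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass
class SignificanceResult:  # hypothetical container, not an existing repo type
    mean_delta: float
    relative_gain: float
    p_value: float
    ci_low: float
    ci_high: float
    accepted: bool


def exact_paired_p_value(diffs, max_n=20):
    """One-sided exact sign-flip test on paired differences.

    Enumerates all 2^n sign assignments and returns the fraction whose
    summed difference is at least as large as the observed one.
    """
    if len(diffs) > max_n:
        raise ValueError("too many pairs for exhaustive enumeration")
    observed = sum(diffs)
    hits = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        # Small tolerance so float rounding does not drop exact ties.
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / 2 ** len(diffs)


def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed keeps reports reproducible
    n = len(diffs)
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lo = means[round((1 - level) / 2 * n_resamples)]
    hi = means[round((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


def check_significance(baseline_scores, evolved_scores, alpha=0.05, min_gain=0.10):
    """Accept only if the gain is both significant and large enough."""
    if not baseline_scores or len(baseline_scores) != len(evolved_scores):
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [e - b for b, e in zip(baseline_scores, evolved_scores)]
    avg_baseline = sum(baseline_scores) / len(baseline_scores)
    mean_delta = sum(diffs) / len(diffs)
    # Assumption: the 10% gate is relative to the baseline mean.
    if avg_baseline > 0:
        relative_gain = mean_delta / avg_baseline
    else:
        relative_gain = float("inf") if mean_delta > 0 else 0.0
    p_value = exact_paired_p_value(diffs)
    ci_low, ci_high = bootstrap_ci(diffs)
    accepted = p_value < alpha and relative_gain >= min_gain
    return SignificanceResult(
        mean_delta, relative_gain, p_value, ci_low, ci_high, accepted
    )
```

In evolve_skill.py the verdict would then branch on `check_significance(baseline_scores, evolved_scores).accepted` rather than on `improvement > 0`. One consequence worth noting: with 3 paired examples there are only 2^3 = 8 sign assignments, so the smallest attainable one-sided p-value is 1/8 = 0.125 and the α=0.05 gate can never pass. Under this test at least 5 pairs are needed (1/32 ≈ 0.031), so the minimum holdout size or the α level may need revisiting.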