Evolution declares any positive holdout delta a "win": no statistical significance check #135

### Problem

`PLAN.md` specifies a **"Statistical significance check"** as step 5 of the optimization loop (Architecture → "EVALUATE & COMPARE"), gates Phase 1 "Done" on a **≥10% improvement with no regression**, and defines a PR template reporting **before/after scores on train, validation, AND holdout**.

But the pipeline currently declares success on any positive raw delta. In `evolution/skills/evolve_skill.py`:

```python
avg_baseline = sum(baseline_scores) / max(1, len(baseline_scores))
avg_evolved  = sum(evolved_scores) / max(1, len(evolved_scores))
improvement  = avg_evolved - avg_baseline
...
if improvement > 0:
    console.print("✓ Evolution improved skill ...")
```

GEPA is designed to run on **as few as 3 holdout examples** (per the plan). At that size a `+0.02` average delta is indistinguishable from noise — yet the pipeline treats it as a win, saves "improved" metrics, and surfaces it for a PR. Nothing in the repo distinguishes a real improvement from sampling noise, so the Phase-1 gate ("≥10%, no regression, sensible to a human") can't actually be enforced.
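One way to see why three examples cannot separate signal from noise: an exact paired sign-flip test (the kind proposed below) enumerates every way of flipping the sign of each per-example difference. Its one-sided p-value therefore can never drop below `1 / 2**n`. The helper name `min_one_sided_p` is hypothetical and only illustrates the arithmetic; it is not part of the repo.

```python
def min_one_sided_p(n_pairs: int) -> float:
    """Smallest one-sided p-value an exact paired sign-flip test can return.

    With n_pairs non-zero differences there are 2**n_pairs sign assignments,
    and the observed one always counts against itself.
    """
    return 1.0 / (2 ** n_pairs)


for n in (3, 5, 6):
    print(n, min_one_sided_p(n))
# 3 0.125
# 5 0.03125
# 6 0.015625
```

Under that test, a 3-example holdout cannot reach α=0.05 no matter how large the delta is; at least 5 paired examples would be needed.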

### Proposal

Add a purely local significance check over the **paired** holdout scores the pipeline already computes:

- an **exact paired randomization test** for the p-value (exact for the small holdout sets that are the norm),
- a bootstrap **confidence interval** + **effect size** for the magnitude,
- an **accept/reject** decision combining significance (α=0.05) with the plan's ≥10% effect gate,

and gate the success verdict on that instead of `improvement > 0`. No new API calls, no new dependencies.
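A minimal sketch of what such a check could look like, using only the standard library. All names here (`Verdict`, `exact_paired_p`, `bootstrap_ci`, `paired_effect_size`, `judge`) are hypothetical and not taken from the repo. The sketch also makes three assumptions: the plan's "≥10%" is read as a gain relative to the baseline mean, the test is one-sided, and the effect size is the paired Cohen's d.

```python
import itertools
import math
import random
import statistics
from dataclasses import dataclass


@dataclass
class Verdict:
    mean_delta: float
    relative_gain: float
    p_value: float
    ci: tuple
    effect_size: float
    accepted: bool


def exact_paired_p(diffs):
    """One-sided exact p-value: fraction of the 2**n sign assignments whose
    summed difference is at least as large as the observed sum."""
    observed = sum(diffs)
    hits = sum(
        1
        for signs in itertools.product((1, -1), repeat=len(diffs))
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12
    )
    return hits / 2 ** len(diffs)


def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)  # fixed seed keeps the report reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi


def paired_effect_size(diffs):
    """Cohen's d for paired samples: mean difference / sd of differences."""
    mean = statistics.fmean(diffs)
    sd = statistics.stdev(diffs) if len(diffs) > 1 else 0.0
    if sd == 0.0:
        return 0.0 if mean == 0 else math.copysign(math.inf, mean)
    return mean / sd


def judge(baseline_scores, evolved_scores, alpha=0.05, min_relative_gain=0.10):
    if not baseline_scores or len(baseline_scores) != len(evolved_scores):
        raise ValueError("expected two non-empty score lists of equal length")
    diffs = [e - b for b, e in zip(baseline_scores, evolved_scores)]
    mean_delta = statistics.fmean(diffs)
    avg_baseline = statistics.fmean(baseline_scores)
    # Assumption: ">=10%" means gain relative to the baseline mean.
    relative_gain = mean_delta / avg_baseline if avg_baseline > 0 else 0.0
    p_value = exact_paired_p(diffs)
    # Accept only if the gain is both significant and large enough.
    accepted = p_value <= alpha and relative_gain >= min_relative_gain
    return Verdict(mean_delta, relative_gain, p_value, bootstrap_ci(diffs),
                   paired_effect_size(diffs), accepted)


# Hypothetical scores, for illustration only.
verdict = judge([0.5, 0.5, 0.5], [0.7, 0.8, 0.9])
print(verdict.p_value, verdict.accepted)  # 0.125 False
```

In `evolve_skill.py`, the `if improvement > 0:` branch could then test something like `verdict.accepted` instead. Exact enumeration costs `2**n` evaluations, which is fine for small holdout sets; a larger holdout would probably need a Monte Carlo approximation of the same test.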

