Evolution declares any positive holdout delta a "win" — no statistical significance check #135

Description

@MaxFreedomPollard

Problem

PLAN.md specifies a "Statistical significance check" as step 5 of the optimization loop (Architecture → "EVALUATE & COMPARE"), gates Phase 1 "Done" on a ≥10% improvement with no regression, and defines a PR template reporting before/after scores on train, validation, AND holdout.

But the pipeline currently declares success on any positive raw delta. In evolution/skills/evolve_skill.py:

avg_baseline = sum(baseline_scores) / max(1, len(baseline_scores))
avg_evolved  = sum(evolved_scores) / max(1, len(evolved_scores))
improvement  = avg_evolved - avg_baseline
...
if improvement > 0:
    console.print("✓ Evolution improved skill ...")

GEPA is designed to run on as few as 3 holdout examples (per the plan). At that size a +0.02 average delta is indistinguishable from noise, yet the pipeline treats it as a win, saves "improved" metrics, and surfaces it for a PR. Nothing in the repo distinguishes a real improvement from sampling noise, so the Phase-1 gate ("≥10%, no regression, sensible to a human") can't actually be enforced.
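To make the noise argument concrete, here is a minimal simulation sketch. It assumes (purely for illustration) that baseline and evolved scores come from the same distribution, i.e. the evolved skill is no better at all, with a true score of 0.6 and Gaussian per-example noise of 0.05. The function name `false_win_rate` and all numbers are hypothetical and not taken from the repo.

```python
import random

def false_win_rate(n_holdout=3, noise_sd=0.05, trials=10_000, seed=0):
    """Fraction of runs where `improvement > 0` fires although there is no real effect.

    Assumption: both score sets share the same true mean (0.6) and differ only
    by independent Gaussian noise; the real score distribution may look different.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        baseline_scores = [0.6 + rng.gauss(0, noise_sd) for _ in range(n_holdout)]
        evolved_scores = [0.6 + rng.gauss(0, noise_sd) for _ in range(n_holdout)]
        # Same rule as the check quoted above: any positive raw delta counts as a win.
        improvement = (sum(evolved_scores) / n_holdout) - (sum(baseline_scores) / n_holdout)
        if improvement > 0:
            wins += 1
    return wins / trials

if __name__ == "__main__":
    print(f"false win rate at n=3: {false_win_rate():.2f}")
```

Under these assumptions the raw-delta rule reports a "win" in roughly half of all runs, which is what one would expect from a coin flip.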

Proposal

Add a purely local significance check over the paired holdout scores the pipeline already computes:

  • an exact paired randomization test for the p-value (exact for the small holdout sets that are the norm),
  • a bootstrap confidence interval + effect size for the magnitude,
  • an accept/reject decision combining significance (α=0.05) with the plan's ≥10% effect gate,

and gate the success verdict on that instead of improvement > 0. No new API calls, no new dependencies.
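The three pieces above could be sketched roughly as follows, using only the standard library. This is an illustration under stated assumptions, not the proposed implementation: the function names (`exact_paired_pvalue`, `bootstrap_ci`, `significance_verdict`) are hypothetical, the test is one-sided, the bootstrap is a plain percentile interval, and the plan's "≥10%" is read here as improvement relative to the baseline mean.

```python
import itertools
import random
from statistics import mean

def exact_paired_pvalue(baseline, evolved):
    """One-sided exact paired randomization (sign-flip) test on the summed differences."""
    diffs = [e - b for b, e in zip(baseline, evolved)]
    observed = sum(diffs)
    hits = total = 0
    # Enumerate all 2^n sign assignments; only feasible for small holdout sets.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / total

def bootstrap_ci(diffs, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean paired difference."""
    rng = random.Random(seed)
    means = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo = means[int((1 - level) / 2 * n_boot)]
    hi = means[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi

def significance_verdict(baseline, evolved, alpha=0.05, min_rel_gain=0.10):
    """Accept only if the gain is both significant and at least `min_rel_gain`."""
    if not baseline or len(baseline) != len(evolved):
        raise ValueError("need two non-empty, equally long paired score lists")
    diffs = [e - b for b, e in zip(baseline, evolved)]
    avg_baseline = mean(baseline)
    # Assumption: relative gain vs. baseline mean; treated as 0 if baseline is not positive.
    rel_gain = mean(diffs) / avg_baseline if avg_baseline > 0 else 0.0
    p_value = exact_paired_pvalue(baseline, evolved)
    return {
        "improvement": mean(diffs),
        "rel_gain": rel_gain,
        "p_value": p_value,
        "ci": bootstrap_ci(diffs),
        "accept": p_value <= alpha and rel_gain >= min_rel_gain,
    }

if __name__ == "__main__":
    print(significance_verdict([0.5] * 3, [0.52] * 3))
```

One property of this kind of test is worth keeping in mind when choosing the holdout size: with n pairs the smallest attainable one-sided p-value is 1/2^n, so at n=3 it cannot drop below 0.125, and α=0.05 only becomes reachable from 5 pairs upward.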
