Phase 4 verifier/fitness harness: sealed hidden evals, rollback traces, and adaptive-vs-frozen gates #118

@sunghunkwag

Related to #116, but this issue focuses on a different layer of Phase 4.

#116 is mainly about the code-evolution loop itself: select a target function, mutate it, run tests, and open a branch/PR. That loop is necessary, but I think Phase 4 also needs an explicit verifier/fitness harness around it. Without that layer, a mutate-and-pytest loop can easily produce patches that pass the visible checks without demonstrating any genuine improvement.

The specific missing layer is:

  • sealed hidden evaluations that candidate code cannot inspect
  • adaptive-vs-frozen baseline comparison
  • patch/revert traces for repo repair attempts
  • residue-driven curriculum from failed attempts
  • rejection caching so failed fixes are not repeatedly rediscovered
  • deterministic lineage/archive for audit and PR review
  • full-suite gating before any improvement claim is accepted
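
To make the rejection-caching item concrete, here is a minimal sketch, not code from the reference artifact. The names `patch_digest` and `RejectionCache` are hypothetical, and keying on a SHA-256 of the whitespace-normalized diff is an assumption about what "the same failed fix" should mean.

```python
import hashlib


def patch_digest(target: str, diff_text: str) -> str:
    """Return a stable digest for a candidate patch against one target."""
    # Normalize line endings and trailing whitespace so trivially different
    # renderings of the same patch map to the same key (an assumption).
    normalized = "\n".join(line.rstrip() for line in diff_text.splitlines())
    return hashlib.sha256(f"{target}\n{normalized}".encode("utf-8")).hexdigest()


class RejectionCache:
    """Remembers digests of patches that already failed a gate."""

    def __init__(self) -> None:
        self._rejected = {}  # digest -> rejection reason

    def reject(self, target: str, diff_text: str, reason: str) -> None:
        self._rejected[patch_digest(target, diff_text)] = reason

    def seen(self, target: str, diff_text: str) -> bool:
        return patch_digest(target, diff_text) in self._rejected

    def __len__(self) -> int:
        return len(self._rejected)
```

The mutation loop would call `seen(...)` before spending any evaluation budget on a candidate, and `len(cache)` gives the rejected candidate count for reporting.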

I have a bounded reference implementation of this pattern here:

https://github.com/sunghunkwag/rsi-metaforge-core

Main artifact:

https://github.com/sunghunkwag/rsi-metaforge-core/blob/main/rsi_levels_metaforge_unified.py

Important boundary: this is not a drop-in Hermes Phase 4 implementation. It is a reference implementation for the verifier/fitness harness that Phase 4 code evolution will likely need.

Why this matters for Phase 4

A simple code-evolution loop can answer:

Did the candidate still pass pytest?

But for tool implementation code, Phase 4 also needs to answer:

Did the candidate actually improve behavior on unseen cases?
Did it overfit visible reproduction tests?
Can the patch/revert process be audited?
Can failed attempts become useful curriculum without weakening gates?
Is the adaptive loop better than a frozen baseline?

Those are verifier questions, not mutation questions.
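
As a rough illustration of the overfitting question, the sketch below scores a candidate separately on visible and hidden cases and reports the gap. `pass_rate` and `overfit_gap` are hypothetical helpers, and the `(args, expected)` case format is an assumption, not something taken from the reference artifact.

```python
from typing import Callable, Iterable, Tuple

Case = Tuple[tuple, object]  # (positional args, expected return value)


def pass_rate(candidate: Callable, cases: Iterable[Case]) -> float:
    """Fraction of cases where the candidate returns the expected value."""
    cases = list(cases)
    if not cases:
        raise ValueError("need at least one case")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed case
    return passed / len(cases)


def overfit_gap(candidate: Callable, visible: Iterable[Case],
                hidden: Iterable[Case]) -> float:
    """Visible pass rate minus hidden pass rate; large values suggest overfitting."""
    return pass_rate(candidate, visible) - pass_rate(candidate, hidden)
```

In a real harness the hidden cases would live with the verifier process, outside anything the candidate can read. This sketch only shows the scoring side.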

Evidence from the reference artifact

Current validation results:

  • full test suite: 99 passed, 0 failed, 0 skipped
  • file/repo repair hidden evaluation:
    • adaptive mean: 1.000
    • frozen mean: 0.2037
    • repo repair: adaptive 1.000, frozen 0.7037
  • repair trace includes:
    • read_file -> run_visible_check -> apply_patch -> run_visible_check -> revert_file -> apply_patch -> ...
  • self-forge battery:
    • forged walls: 4/4
    • sealed-gate adopted: 2/4
    • downstream composed task solved with forged operations
  • out-of-closure probe is explicitly reported as not verified, rather than hidden or counted as success
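
One way a patch/revert trace like the one above could be recorded and fingerprinted for audit is sketched below. `RepairTrace` is a hypothetical name and the step schema is an assumption. The artifact's actual trace format may differ.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class RepairTrace:
    """Ordered log of repair actions, with a deterministic digest."""

    steps: list = field(default_factory=list)

    def record(self, action: str, **detail) -> None:
        # Each step is a plain dict so the trace serializes without custom code.
        self.steps.append({"action": action, **detail})

    def render(self) -> str:
        """Human-readable action chain, e.g. 'read_file -> apply_patch'."""
        return " -> ".join(step["action"] for step in self.steps)

    def digest(self) -> str:
        # Sorted keys and fixed separators make the JSON canonical, so the
        # same sequence of steps always yields the same digest.
        payload = json.dumps(self.steps, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the digest depends only on the recorded steps, replaying an accepted candidate should reproduce it exactly, which gives a cheap determinism check.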

The point is not that the monolithic file should be imported into Hermes. The point is that the verifier pattern is concrete and executable.

Proposed Phase 4 harness shape

A Hermes Phase 4 verifier/fitness harness could be structured as:

  1. select a target tool/file bug or improvement target
  2. create visible reproduction checks for the candidate loop
  3. keep hidden expectations outside the candidate workspace
  4. let the candidate perform patch/search/revert attempts
  5. evaluate adaptive repair against a frozen baseline on unseen eval seeds
  6. accept only if hidden performance improves and the full suite still passes
  7. preserve lineage, rejected candidates, and patch traces for audit/PR review
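
Steps 5 and 6 can be sketched as a single gate function. This is a minimal illustration under assumptions: `gate`, `GateResult`, and the `min_delta` threshold are hypothetical, and comparing per-seed hidden score means is just one possible definition of "hidden performance improves".

```python
from statistics import mean
from typing import NamedTuple, Sequence


class GateResult(NamedTuple):
    accepted: bool
    hidden_delta: float
    reason: str


def gate(adaptive: Sequence[float], frozen: Sequence[float],
         full_suite_passed: bool, min_delta: float = 0.0) -> GateResult:
    """Accept only if the adaptive arm beats the frozen arm on hidden
    evals AND the full test suite still passes."""
    if not adaptive or len(adaptive) != len(frozen):
        raise ValueError("both arms need scores for the same non-empty seed set")
    delta = mean(adaptive) - mean(frozen)
    if not full_suite_passed:
        return GateResult(False, delta, "full suite failed")
    if delta <= min_delta:
        return GateResult(False, delta, "no hidden improvement over frozen baseline")
    return GateResult(True, delta, "accepted")
```

The suite check comes first so that a large hidden delta can never override a regression. The rejection reason can be stored alongside the candidate for later audit.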

This makes Phase 4 safer than mutate-and-pytest alone. Pytest is a regression floor; it is not by itself evidence that a candidate learned a general repair strategy or improved tool behavior under unseen conditions.

Suggested acceptance criteria

A minimal Phase 4 verifier MVP could require:

  • one real tool/file target with visible reproduction checks
  • one hidden eval set inaccessible to the candidate
  • one frozen baseline arm and one adaptive arm
  • patch/revert trace logging
  • deterministic replay of accepted candidates
  • full test-suite pass after accepted patch
  • issue/PR summary containing:
    • hidden score delta
    • frozen vs adaptive comparison
    • patch trace digest
    • rejected candidate count
    • final diff summary

That would give Phase 4 a measurable safety and fitness layer before scaling up the mutation engine.
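
The issue/PR summary fields listed in the acceptance criteria could be carried in a small record like the one below. `VerifierSummary` and its field names are hypothetical. It only illustrates the shape of the report and is not an existing Hermes or artifact type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerifierSummary:
    adaptive_hidden_mean: float
    frozen_hidden_mean: float
    patch_trace_digest: str
    rejected_candidates: int
    diff_summary: str

    @property
    def hidden_score_delta(self) -> float:
        return self.adaptive_hidden_mean - self.frozen_hidden_mean

    def render(self) -> str:
        """Plain-text block suitable for pasting into an issue or PR body."""
        lines = [
            f"hidden score delta: {self.hidden_score_delta:+.4f}",
            f"frozen vs adaptive: {self.frozen_hidden_mean:.4f} vs "
            f"{self.adaptive_hidden_mean:.4f}",
            f"patch trace digest: {self.patch_trace_digest}",
            f"rejected candidate count: {self.rejected_candidates}",
            f"final diff summary: {self.diff_summary}",
        ]
        return "\n".join(lines)
```

Making the record frozen keeps the reported numbers from being changed after the gate decision has been made.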
