feat(rewrite): formal judge rubric selection and score distribution analysis #106

@lipikaramaswamy

Description

Background

The rewrite pipeline's final judge currently uses Option A: privacy, quality, naturalness on a 1–10 scale, adapted from an earlier research repo. PR #86 improved the judge prompt wording and iterated on scoring behavior as an interim fix, but the formal rubric decision was never made. This issue closes that out.

Closes #37.

Options under consideration

| Option | Dimensions | Scale | Notes |
| --- | --- | --- | --- |
| A (current) | privacy, quality, naturalness | 1–10 | Baseline; quality and naturalness overlap in practice |
| B | privacy, utility, naturalness | 1–10 with improved anchors | Renames quality → utility for clarity; adds explicit anchor descriptions per score band |
| C | privacy, utility, faithfulness, fluency | 1–5 | More granular separation of preservation (faithfulness) vs. readability (fluency); narrower scale may reduce score clustering |
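
For side-by-side evaluation, the three options could be captured as plain data so the same judge output check works for any of them. The sketch below is only an illustration: the `Rubric` class, the `RUBRICS` mapping, and the `validate` method are hypothetical names, not part of the existing pipeline, and Option B's anchor descriptions are left out because they have not been written yet.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical container: named dimensions scored on an inclusive integer scale."""

    name: str
    dimensions: tuple[str, ...]
    scale_min: int
    scale_max: int

    def validate(self, scores: dict[str, int]) -> None:
        """Raise ValueError if scores do not match this rubric exactly."""
        missing = set(self.dimensions) - set(scores)
        extra = set(scores) - set(self.dimensions)
        if missing or extra:
            raise ValueError(
                f"dimension mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
            )
        for dim, value in scores.items():
            if not self.scale_min <= value <= self.scale_max:
                raise ValueError(
                    f"{dim}={value} is outside {self.scale_min}-{self.scale_max}"
                )


# Dimensions and scales copied from the options listed above.
RUBRICS = {
    "A": Rubric("A", ("privacy", "quality", "naturalness"), 1, 10),
    "B": Rubric("B", ("privacy", "utility", "naturalness"), 1, 10),
    "C": Rubric("C", ("privacy", "utility", "faithfulness", "fluency"), 1, 5),
}

# Example: a judge response for Option C passes validation silently.
RUBRICS["C"].validate({"privacy": 5, "utility": 4, "faithfulness": 3, "fluency": 4})
```

Keeping the scale bounds next to the dimension names means a score of 7 is rejected under Option C but accepted under A or B, which helps catch prompt/rubric mismatches during the comparison runs.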

Scope of work

  • Run score distribution analysis on TAB and NVIDIA Synthetic Biographies using the current rubric (Option A): check for score clustering, ceiling effects, and correlation between dimensions
  • Evaluate Options B and C against the same datasets and compare differentiation
  • Document chosen rubric with justification and sample score distributions
  • Update anchor point descriptions for chosen option
  • Update FinalJudgeWorkflow implementation
  • Update test coverage
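
The three checks named in the first item (clustering, ceiling effects, correlation between dimensions) can be sketched with the standard library alone. This is a minimal illustration under assumptions: `dimension_stats` and `pearson` are hypothetical helper names, the indicators chosen (share of the most common score, share at the scale maximum, Pearson correlation) are one reasonable reading of those checks, and the scores at the bottom are toy placeholder values, not results from TAB or NVIDIA Synthetic Biographies.

```python
from collections import Counter
from math import sqrt


def dimension_stats(scores, scale_max):
    """Summarize one dimension's scores for a single rubric."""
    n = len(scores)
    counts = Counter(scores)
    return {
        "mean": sum(scores) / n,
        # Share of records that received the single most common score:
        # values near 1.0 suggest the judge is clustering on one value.
        "mode_share": counts.most_common(1)[0][1] / n,
        # Share of records at the top of the scale: a ceiling-effect indicator.
        "ceiling_share": counts[scale_max] / n,
    }


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one dimension never varies
    return cov / sqrt(var_x * var_y)


# Toy placeholder scores on the 1-10 scale (NOT real judge output).
quality = [8, 8, 9, 8, 7, 8]
naturalness = [8, 8, 9, 8, 8, 8]

print(dimension_stats(quality, scale_max=10))
print(pearson(quality, naturalness))
```

A high correlation between two dimensions would support the note that quality and naturalness overlap in practice. Because the scores are ordinal, a rank-based measure such as Spearman correlation may be a better fit than Pearson for the final analysis.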
