feat(rewrite): formal judge rubric selection and score distribution analysis #106

## Background

The rewrite pipeline's final judge currently uses Option A: **privacy, quality, naturalness** on a 1–10 scale, adapted from an earlier research repo. PR #86 improved the judge prompt wording and iterated on scoring behavior as an interim fix, but the formal rubric decision was never made. This issue closes that out.

Closes #37.

## Options under consideration

| Option | Dimensions | Scale | Notes |
|--------|-----------|-------|-------|
| **A (current)** | privacy, quality, naturalness | 1–10 | Baseline; `quality` and `naturalness` overlap in practice |
| **B** | privacy, utility, naturalness | 1–10 with improved anchors | Renames quality → utility for clarity; adds explicit anchor descriptions per score band |
| **C** | privacy, utility, faithfulness, fluency | 1–5 | More granular separation of preservation (faithfulness) vs. readability (fluency); narrower scale may reduce score clustering |
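
Because Option C uses a different scale from A and B, comparing the options on the same datasets likely needs scores mapped onto a common range first. A minimal sketch of how the three options could be represented for that purpose; the `Rubric` class, its `normalize` method, and the `OPTION_*` constants are hypothetical names for illustration, not part of the existing pipeline:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical description of one rubric option from the table above."""

    name: str
    dimensions: tuple[str, ...]
    scale_min: int
    scale_max: int

    def normalize(self, score: int) -> float:
        """Map a raw score onto [0, 1] so different scales can be compared."""
        if not self.scale_min <= score <= self.scale_max:
            raise ValueError(
                f"score {score} outside {self.scale_min}-{self.scale_max}"
            )
        return (score - self.scale_min) / (self.scale_max - self.scale_min)


# Dimensions and scales are taken from the options table.
OPTION_A = Rubric("A", ("privacy", "quality", "naturalness"), 1, 10)
OPTION_B = Rubric("B", ("privacy", "utility", "naturalness"), 1, 10)
OPTION_C = Rubric("C", ("privacy", "utility", "faithfulness", "fluency"), 1, 5)
```

Linear min-max normalization is only one choice; it treats the distance between adjacent score bands as equal, which may not hold once anchor descriptions differ between options.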

## Scope of work

- [ ] Run score distribution analysis on TAB and NVIDIA Synthetic Biographies using current rubric (Option A) — check for score clustering, ceiling effects, and correlation between dimensions
- [ ] Evaluate Options B and C against the same datasets and compare differentiation
- [ ] Document chosen rubric with justification and sample score distributions
- [ ] Update anchor point descriptions for chosen option
- [ ] Update `FinalJudgeWorkflow` implementation
- [ ] Update test coverage
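
The distribution analysis in the first item could start from something like the sketch below. It assumes judge output is available as a list of dicts mapping dimension name to integer score, which is an assumption about the data shape rather than the actual `FinalJudgeWorkflow` output; `summarize` and `pearson` are hypothetical helper names. It uses only the standard library.

```python
from collections import Counter
from itertools import combinations
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation; returns None if either series has zero variance."""
    mx, my = mean(xs), mean(ys)
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0 or sy == 0:
        return None  # undefined when a dimension never varies
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def summarize(records, dimensions, scale_max):
    """Per-dimension clustering and ceiling stats plus pairwise correlations.

    records: list of dicts such as {"privacy": 9, "quality": 7, ...}
             (assumed shape, not the pipeline's actual output format).
    """
    report = {"dimensions": {}, "correlations": {}}
    for dim in dimensions:
        scores = [r[dim] for r in records]
        hist = Counter(scores)
        _, mode_count = hist.most_common(1)[0]
        report["dimensions"][dim] = {
            "histogram": dict(sorted(hist.items())),
            "mean": mean(scores),
            "stdev": pstdev(scores),
            # Share of samples at the top of the scale (ceiling effect).
            "ceiling_fraction": hist[scale_max] / len(scores),
            # Share of samples on the single most common score (clustering).
            "mode_fraction": mode_count / len(scores),
        }
    for a, b in combinations(dimensions, 2):
        report["correlations"][(a, b)] = pearson(
            [r[a] for r in records], [r[b] for r in records]
        )
    return report
```

A high `mode_fraction` or `ceiling_fraction` would point to clustering or a ceiling effect, and a correlation near 1.0 between two dimensions (for example `quality` and `naturalness` under Option A) would support the overlap noted in the options table. What counts as "high" is left open here and would need to be decided as part of the analysis.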

## Related
- Supersedes #37
- See #98 for LLM-as-a-judge coverage for REPLACE mode
