Task 2 - Judge the Judge #3

Open
sposhiy33 wants to merge 2 commits into main from shrey-task-2

Conversation

@sposhiy33

Collaborator

Mainly used Claude Code for this.

  1. Sampled 1000 rows from the dataset instead of 500. The "unsure" category has very few samples (31), so the subsample includes all of them.
  2. Three judge models: Gemini 2.5 Pro, Claude Sonnet 4.6, GPT 5.0.
  3. Claude Sonnet 4.6 refused to answer 31 of the questions despite many retries, probably because it flagged them as unsafe. Correlations were run on the subset of questions that were answered. Note that these 31 are NOT the same as the 31 samples that Gemma labeled "unsure".
  4. Following the output directions in the document, the extra generation and OpenRouter scripts are not included; I can add them if necessary. Analysis code is not included either, just the main results.
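
Since the sampling script is not part of this PR, here is a minimal sketch of how a subsample that keeps every row of a rare category could be drawn. The function name `subsample_keep_rare`, the `verdict` key, and the synthetic pool of 5000 rows are all hypothetical and only illustrate the idea from item 1; they are not taken from the actual code.

```python
import random


def subsample_keep_rare(rows, n_total, rare_label, label_key="verdict", seed=0):
    """Draw n_total rows, always keeping every row labeled rare_label.

    The remaining slots are filled by uniform sampling without replacement
    from the other rows. All names here are illustrative assumptions.
    """
    rng = random.Random(seed)
    rare = [r for r in rows if r[label_key] == rare_label]
    rest = [r for r in rows if r[label_key] != rare_label]
    n_rest = n_total - len(rare)
    if n_rest < 0 or n_rest > len(rest):
        raise ValueError("n_total is incompatible with the pool sizes")
    return rare + rng.sample(rest, n_rest)


# Synthetic pool (hypothetical size): 31 "unsure" rows among 5000.
pool = [{"id": i, "verdict": "unsure" if i < 31 else "agree"} for i in range(5000)]
sample = subsample_keep_rare(pool, n_total=1000, rare_label="unsure")
n_unsure = sum(r["verdict"] == "unsure" for r in sample)
```

Fixing the seed keeps the subsample reproducible, which matters if the judge calls are rerun later.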

Copilot AI left a comment

Pull request overview

Adds the Task 2 “Judge the Judge” deliverables to the repo, summarizing how three additional LLM judges (Sonnet, GPT-5, Gemini) agree/disagree with Gemma on a 1000-sample subset and reporting inter-judge agreement statistics.

Changes:

  • Added per-judge verdict distribution counts for Gemma/Sonnet/GPT-5/Gemini.
  • Added pairwise Cohen’s κ matrix and overall Krippendorff’s α.
  • Added a markdown report with headline disagreement rates, topic breakdowns, and qualitative examples.
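
For context, pairwise Cohen's κ over two judges' verdicts can be sketched as below. This is an illustration only, not the PR's analysis code (which is not included). It assumes missing verdicts are encoded as `error_or_missing` and that a row is dropped from a pair when either judge is missing, which would explain a smaller `n` for pairs involving Sonnet. Krippendorff's α is not covered here.

```python
from collections import Counter

MISSING = "error_or_missing"  # assumed encoding of a missing verdict


def cohen_kappa(a, b, missing=MISSING):
    """Return (kappa, n) for two equal-length verdict lists.

    Rows where either judge has no verdict are dropped before computing
    observed agreement p_o and chance agreement p_e.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    n = len(pairs)
    if n == 0:
        raise ValueError("no rows with verdicts from both judges")
    p_o = sum(x == y for x, y in pairs) / n
    count_a = Counter(x for x, _ in pairs)
    count_b = Counter(y for _, y in pairs)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:  # both judges used a single identical label throughout
        return 1.0, n
    return (p_o - p_e) / (1.0 - p_e), n


# Toy data, not taken from the PR outputs.
gemma = ["agree", "agree", "disagree", "disagree", "unclear"]
sonnet = ["agree", "disagree", "disagree", MISSING, "unclear"]
kappa, n_pairs = cohen_kappa(gemma, sonnet)
```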

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| outputs/task_2_judgethejudge/per_judge_distribution.json | Stores verdict distribution counts per judge, including error/missing counts. |
| outputs/task_2_judgethejudge/agreement_matrix.json | Stores pairwise Cohen’s κ (with n) and Krippendorff’s α for the panel. |
| outputs/task_2_judgethejudge/disagreement_analysis.md | Human-readable summary of key metrics and sampled disagreement cases. |

| judge | verdict | reasoning |
|---|---|---|
| gemma | disagree | The answers disagree on the number of rotatable bonds, with Answer 1 stating there are 8 and Answer 2 stating there are 6. |
| sonnet | — | |

Copilot AI Apr 24, 2026

This example uses an em dash / empty cells for Sonnet’s missing verdict. To keep the qualitative examples consistent with the quantitative summaries (which use the "error_or_missing" category), please represent missing verdicts explicitly (e.g., "error_or_missing") and briefly note why it’s missing (refusal, parse error, etc.). This also matters if these examples are intended to reflect the same "majority of 3 new judges" criterion used in the headline stats.

Suggested change
| sonnet | | |
| sonnet | error_or_missing | No usable Sonnet verdict is present in this example excerpt; the specific cause of the missing verdict is not specified here. |
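
One way to apply this suggestion consistently is to render every table row through a single helper, so that a blank or dash-only verdict is always written out as `error_or_missing`. The helper below is a hypothetical sketch; `verdict_row` and its parameters are not part of the PR.

```python
MISSING = "error_or_missing"
PLACEHOLDERS = {"", "-", "\u2014"}  # empty cell, hyphen, or em dash


def verdict_row(judge, verdict=None, reasoning="", missing_reason="cause not recorded"):
    """Render one markdown table row, making a missing verdict explicit."""
    if verdict is None or verdict.strip() in PLACEHOLDERS:
        verdict = MISSING
        reasoning = f"No usable verdict ({missing_reason})."
    # Escape pipes so free-text reasoning cannot break the table layout.
    reasoning = reasoning.replace("|", "\\|")
    return f"| {judge} | {verdict} | {reasoning} |"


row_missing = verdict_row("sonnet", "\u2014")
row_present = verdict_row("gemma", "disagree", "Answers give 8 vs 6 rotatable bonds.")
```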

Comment on lines +73 to +78
**Phase-2 answer:** The compound, a simple thiol containing both hydroxyl and sulfhydryl groups, serves as a model thiol for studying cysteine proteinase active-site chemistry. Its reactivity profile with these enzymes provides insight into catalytic-site solvation properties, with the ratio of its reaction kinetics relative to other thiol probes varying markedly across different cysteine proteinases such as papaya peptidase A, actinidin, ficin, and papain. This variation reflects differences in the electrostatic e

| judge | verdict | reasoning |
|---|---|---|
| gemma | unclear | Answer 1 claims the compound acts as a stimulator/activator, whereas Answer 2 describes it as a model thiol used to study reaction kinetics and active-site chemistry without explicitly stating whether it activates or inhibits the enzymes. |
| sonnet | agree | Both answers agree that the compound interacts with cysteine proteinases through their active-site sulfhydryl (cysteine) groups. Answer 1 focuses on its role as an activator/stimulator for enzymes requiring reduced sulfhydryl groups, while Answer 2 provides more detail about its use as a model thiol probe to study active-site chemistry. Both are consistent in describing the compound's interaction |

Copilot AI Apr 24, 2026

Several Phase-2 answers appear to be truncated mid-word (e.g., ending with "electrostatic e") without an ellipsis or pointer to the full text. Please either include the complete answers or mark truncation explicitly (and ideally link/reference where the full text can be found, such as the corresponding row in samples.jsonl) so the qualitative examples are interpretable.

Suggested change
**Phase-2 answer:** The compound, a simple thiol containing both hydroxyl and sulfhydryl groups, serves as a model thiol for studying cysteine proteinase active-site chemistry. Its reactivity profile with these enzymes provides insight into catalytic-site solvation properties, with the ratio of its reaction kinetics relative to other thiol probes varying markedly across different cysteine proteinases such as papaya peptidase A, actinidin, ficin, and papain. This variation reflects differences in the electrostatic e
| judge | verdict | reasoning |
|---|---|---|
| gemma | unclear | Answer 1 claims the compound acts as a stimulator/activator, whereas Answer 2 describes it as a model thiol used to study reaction kinetics and active-site chemistry without explicitly stating whether it activates or inhibits the enzymes. |
| sonnet | agree | Both answers agree that the compound interacts with cysteine proteinases through their active-site sulfhydryl (cysteine) groups. Answer 1 focuses on its role as an activator/stimulator for enzymes requiring reduced sulfhydryl groups, while Answer 2 provides more detail about its use as a model thiol probe to study active-site chemistry. Both are consistent in describing the compound's interaction |
**Phase-2 answer:** The compound, a simple thiol containing both hydroxyl and sulfhydryl groups, serves as a model thiol for studying cysteine proteinase active-site chemistry. Its reactivity profile with these enzymes provides insight into catalytic-site solvation properties, with the ratio of its reaction kinetics relative to other thiol probes varying markedly across different cysteine proteinases such as papaya peptidase A, actinidin, ficin, and papain. This variation reflects differences in the electrostatic environment **[truncated in this report; see the corresponding row in `samples.jsonl` for the full text]**.
| judge | verdict | reasoning |
|---|---|---|
| gemma | unclear | Answer 1 claims the compound acts as a stimulator/activator, whereas Answer 2 describes it as a model thiol used to study reaction kinetics and active-site chemistry without explicitly stating whether it activates or inhibits the enzymes. |
| sonnet | agree | Both answers agree that the compound interacts with cysteine proteinases through their active-site sulfhydryl (cysteine) groups. Answer 1 focuses on its role as an activator/stimulator for enzymes requiring reduced sulfhydryl groups, while Answer 2 provides more detail about its use as a model thiol probe to study active-site chemistry. Both are consistent in describing the compound's interaction **[reasoning truncated in this report; see the corresponding row in `samples.jsonl` for the full text]**. |
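
If the report generator shortens long answers, the truncation can be made explicit at generation time. The sketch below is a hypothetical helper (the name `truncate_with_marker` and the marker text are assumptions): it cuts at a word boundary and appends a pointer to `samples.jsonl` instead of stopping mid-word.

```python
def truncate_with_marker(text, limit, marker=" [truncated; see samples.jsonl]"):
    """Shorten text to at most `limit` characters plus an explicit marker.

    Text that already fits is returned unchanged. Otherwise the cut is moved
    back to the last space so the excerpt never ends mid-word.
    """
    if len(text) <= limit:
        return text
    cut = text[:limit]
    head, sep, _ = cut.rpartition(" ")
    if sep:  # a space was found, so drop the partial trailing word
        cut = head
    return cut.rstrip() + marker


excerpt = truncate_with_marker("differences in the electrostatic environment", 34)
unchanged = truncate_with_marker("short answer", 34)
```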

@sposhiy33 add the code

Comment on lines +5 to +7
- Evaluable rows (all 4 judges voted + Gemma): **1000 / 1000**
- Rows where **majority of the 3 new judges disagrees with Gemma**: **47** (4.7 %)
- Verdict: **WITHIN** the 15 % threshold. **>15 % → gold subset noisier than claimed.**

Copilot AI Apr 24, 2026

The headline metrics are internally inconsistent with the rest of the report: it states all 4 judges voted for 1000/1000 rows, but later the Sonnet distribution shows 23 err/missing and the kappa tables use n=977 for any pair involving Sonnet. Please reconcile this by either (a) restricting the headline/majority-disagrees stats to the fully-answered subset (likely 977) or (b) explicitly defining how missing Sonnet votes are handled and adjusting the wording/denominators accordingly ("majority of the 3 new judges" is not true if Sonnet is missing).
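
The two options can be made concrete with a small sketch. It assumes that "disagrees with Gemma" means a new judge's verdict differs from Gemma's, that missing verdicts are encoded as `error_or_missing`, and that the judge keys are `sonnet`, `gpt5`, and `gemini`; all of these are assumptions, since the analysis code is not in the PR.

```python
MISSING = "error_or_missing"
NEW_JUDGES = ("sonnet", "gpt5", "gemini")  # assumed column names


def majority_disagrees(row):
    """True if at least 2 of the 3 new judges gave a verdict unlike Gemma's.

    A missing verdict never counts as a disagreeing vote.
    """
    differing = sum(row[j] != MISSING and row[j] != row["gemma"] for j in NEW_JUDGES)
    return differing >= 2


def headline(rows, complete_only=True):
    """Return (flagged, denominator).

    complete_only=True  -> option (a): only rows where all new judges voted.
    complete_only=False -> option (b): keep all rows, missing votes abstain.
    """
    if complete_only:
        rows = [r for r in rows if all(r[j] != MISSING for j in NEW_JUDGES)]
    return sum(majority_disagrees(r) for r in rows), len(rows)


# Toy rows, not taken from the PR outputs.
rows = [
    {"gemma": "agree", "sonnet": "agree", "gpt5": "agree", "gemini": "agree"},
    {"gemma": "agree", "sonnet": "disagree", "gpt5": "disagree", "gemini": "agree"},
    {"gemma": "disagree", "sonnet": MISSING, "gpt5": "agree", "gemini": "agree"},
    {"gemma": "unclear", "sonnet": MISSING, "gpt5": "unclear", "gemini": "agree"},
]
option_a = headline(rows, complete_only=True)
option_b = headline(rows, complete_only=False)
```

Whichever option is chosen, reporting the denominator next to the count keeps the headline percentage interpretable.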

Comment on lines +9 to +14
"sonnet": {
"agree": 499,
"disagree": 466,
"unclear": 12,
"error_or_missing": 23,
"total": 1000

Copilot AI Apr 24, 2026

PR description mentions Sonnet refused to answer 31 questions, but this output reports only 23 as "error_or_missing". Please update either the PR description or the outputs to reflect the same refusal/missing count so readers don’t misinterpret the evaluable subset size.
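
A quick consistency check over the distribution file would surface this kind of mismatch automatically. The sketch below uses the Sonnet counts quoted above and the 31 refusals stated in the PR description; the function name `evaluable_count` is hypothetical.

```python
def evaluable_count(dist):
    """Check that category counts sum to `total`, then return the number of
    rows with a usable verdict (total minus error_or_missing)."""
    counted = sum(v for k, v in dist.items() if k != "total")
    if counted != dist["total"]:
        raise ValueError(f"counts sum to {counted}, expected {dist['total']}")
    return dist["total"] - dist["error_or_missing"]


# Sonnet counts as shown in the per_judge_distribution.json excerpt.
sonnet = {
    "agree": 499,
    "disagree": 466,
    "unclear": 12,
    "error_or_missing": 23,
    "total": 1000,
}
n_evaluable = evaluable_count(sonnet)
claimed_refusals = 31  # figure stated in the PR description
mismatch = claimed_refusals != sonnet["error_or_missing"]
```

Here `n_evaluable` comes out to 977, which matches the `n` reported for κ pairs involving Sonnet, so the description's figure of 31 is the number that does not line up.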

@sposhiy33 add the code

@luistafoi if results are fine then comment "good to merge"

3 participants