Add Task13 disagreement findings and dataset exploration by DKundnani · Pull Request #7 · vinash85/Chem2TextQA

DKundnani · 2026-04-24T23:49:00Z

Task 13 — Disagreement deep-dive + Phase 1 vs Phase 2 head-to-head
Goal. Characterize the 21,164 Q&A pairs with verdict == "disagree": what kinds of disagreements are they, and which model (Phase 1 = Gemini 3 Flash or Phase 2 = Kimi K2.5) is more often the one that gets it right?

Stage 1 — Structural/statistical analysis:
• Exploratory first pass: build intuition for which buckets show more or less disagreement.
• Per-topic disagree rate: disagree_count / (agree + disagree) for each topic bucket (structural / functional / engineering / other). Expect functional > structural.
• Compound hotspots: which CIDs have >50% disagree rate? Correlate with MW, heavy-atom count, evidence volume, split.
• Length analysis: is there a correlation between answer-length asymmetry (P1 short vs P2 long, or vice versa) and disagreement?
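The hotspot and length-asymmetry bullets above could be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `qa_pairs` and `verdict` keys appear in the diff, but the `cid`, `p1_answer`, and `p2_answer` field names are assumptions, and `toy_records` is made-up data used only to exercise the functions.

```python
from collections import defaultdict
from statistics import mean


def hotspot_cids(records, threshold=0.5, min_pairs=1):
    """Return {cid: disagree_rate} for compounds whose rate exceeds threshold."""
    tallies = defaultdict(lambda: [0, 0])  # cid -> [agree, disagree]
    for rec in records:
        cid = rec.get("cid")  # assumed field name
        for qa in rec.get("qa_pairs", []):
            verdict = qa.get("verdict")
            if verdict == "agree":
                tallies[cid][0] += 1
            elif verdict == "disagree":
                tallies[cid][1] += 1
    rates = {}
    for cid, (agree, disagree) in tallies.items():
        total = agree + disagree
        if total >= min_pairs and disagree / total > threshold:
            rates[cid] = disagree / total
    return rates


def length_asymmetry(qa):
    """Absolute character-length gap between the two answers (assumed keys)."""
    return abs(len(qa.get("p1_answer") or "") - len(qa.get("p2_answer") or ""))


def pearson_with_flag(xs, flags):
    """Pearson correlation between numeric xs and 0/1 flags (point-biserial)."""
    mx, mf = mean(xs), mean(flags)
    cov = sum((x - mx) * (f - mf) for x, f in zip(xs, flags))
    vx = sum((x - mx) ** 2 for x in xs)
    vf = sum((f - mf) ** 2 for f in flags)
    if vx == 0 or vf == 0:
        return float("nan")  # correlation is undefined without variance
    return cov / (vx * vf) ** 0.5


# Toy data for illustration only; not taken from dataset_final.jsonl.
toy_records = [
    {"cid": 4, "qa_pairs": [{"verdict": "agree"}] + [{"verdict": "disagree"}] * 3},
    {"cid": 6, "qa_pairs": [{"verdict": "agree"}] * 2 + [{"verdict": "disagree"}]},
]
print(hotspot_cids(toy_records))
```

Passing each pair's `length_asymmetry` as `xs` and `1` for disagree / `0` for agree as `flags` gives a single correlation number for the length bullet. A `min_pairs` floor is probably worth setting above 1 in practice so that compounds with only a handful of Q&A pairs do not dominate the hotspot list.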

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

kushalviit · 2026-04-25T00:40:18Z

@luistafoi if results are fine then comment "good to merge"

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


+    "            {k:v for k,v in sorted(slices[\"topic_raw\"].items(),\n",
+    "                                   key=lambda kv:(kv[1][1]/(kv[1][0]+kv[1][1]) if sum(kv[1]) else 0))[:15]\n",
+    "             if sum(v)>=500}, sort=\"rate\")"
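One detail in the snippet above: the comprehension iterates over `sorted(...)[:15]` and only then applies `sum(v) >= 500`, so the volume filter runs after the top-15 cut and fewer than 15 topics may survive. If the intent is "the 15 lowest-rate topics among those with at least 500 pairs" (an assumption about the intent), filtering first would look roughly like the sketch below. It assumes each value is an `(agree, disagree)` pair, as the indexing suggests; `lowest_rate_topics` and `toy_counts` are hypothetical names, not part of the notebook.

```python
def disagree_rate(pair):
    """Disagree rate for an (agree, disagree) pair; 0.0 when the pair is empty."""
    agree, disagree = pair
    total = agree + disagree
    return disagree / total if total else 0.0


def lowest_rate_topics(topic_counts, min_total=500, top_n=15):
    """Filter by volume first, then rank ascending by rate, then truncate."""
    eligible = {k: v for k, v in topic_counts.items() if v[0] + v[1] >= min_total}
    ranked = sorted(eligible.items(), key=lambda kv: disagree_rate(kv[1]))
    return dict(ranked[:top_n])


# Toy counts for illustration only.
toy_counts = {"a": (400, 200), "b": (10, 1), "c": (900, 100), "d": (300, 300)}
print(lowest_rate_topics(toy_counts, min_total=500, top_n=2))
```

With this ordering the low-volume topic `b` is dropped before ranking rather than taking up one of the `top_n` slots.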


+11 train 1,2-Dichloroethane
+```
+
+**Analysis:** CIDs are non-contiguous (4, 6, 11 — no 5, 7, 8, 9, 10), confirming PubChem CIDs are preserved as-is rather than renumbered. All three are in the `train` split; the file is **not** sorted or grouped by split (cell 5 confirms all three splits are interleaved throughout).
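The two claims in the analysis above (gaps in the CID sequence, splits not grouped) can be checked mechanically. The following is a small sketch under assumptions: it takes plain lists of CIDs and split labels, and how those are pulled from each record (for example via `cid` and `split` keys) is not shown in the diff.

```python
from itertools import groupby


def missing_cids(cids):
    """CIDs absent between the smallest and largest observed CID."""
    present = set(cids)
    if not present:
        return []
    return [c for c in range(min(present), max(present) + 1) if c not in present]


def split_runs(splits):
    """Collapse a sequence of split labels into (label, run_length) runs."""
    return [(label, sum(1 for _ in group)) for label, group in groupby(splits)]


def is_grouped_by_split(splits):
    """True if every split label occupies exactly one contiguous run."""
    runs = split_runs(splits)
    return len(runs) == len({label for label, _ in runs})


print(missing_cids([4, 6, 11]))
print(is_grouped_by_split(["train", "val", "train", "test"]))
```

For the three CIDs quoted in the analysis, `missing_cids` returns the same gaps the text lists (5, 7, 8, 9, 10). An interleaved file makes `is_grouped_by_split` return `False`, because at least one label appears in more than one run.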


+    "from pathlib import Path\n",
+    "\n",
+    "DATASET_PATH = Path(\"dataset_final.jsonl\")\n",
+    "DATASET_PATH.exists(), DATASET_PATH.stat().st_size / 1e6"


+    "# Buckets: structural / functional / engineering / other.\n",
+    "# `engineering` is normally folded into `functional` by topic_bucket.py;\n",
+    "# we split it out here so the design-leverage QAs can be inspected separately.\n",
+    "from collections import Counter\n",


+    "# Per-topic disagree rate: disagree_count / (agree + disagree)\n",
+    "# Buckets: structural / functional / engineering / other.\n",
+    "from collections import Counter\n",
+    "from topic_bucket import bucket_topic\n",
+    "\n",
+    "ENGINEERING_KEYS = {\"engineering\", \"design_levers\", \"design\"}\n",
+    "ENGINEERING_SUBS = (\"engineer\", \"design\")\n",
+    "\n",
+    "def bucket4(topic):\n",
+    "    t = (topic or \"\").strip().lower().replace(\"-\", \"_\")\n",
+    "    if t in ENGINEERING_KEYS or any(s in t for s in ENGINEERING_SUBS):\n",
+    "        return \"engineering\"\n",
+    "    return bucket_topic(topic)\n",
+    "\n",
+    "BUCKETS = (\"structural\", \"functional\", \"engineering\", \"other\")\n",
+    "counts = {b: Counter() for b in BUCKETS}\n",
+    "\n",
+    "for rec in iter_records():\n",
+    "    for qa in rec.get(\"qa_pairs\", []):\n",
+    "        counts[bucket4(qa.get(\"topic\"))][qa.get(\"verdict\")] += 1\n",
+    "\n",
+    "print(f\"{'bucket':<12} {'agree':>8} {'disagree':>9} {'a+d':>8} {'disagree_rate':>14}\")\n",
+    "for b in BUCKETS:\n",
+    "    a = counts[b].get(\"agree\", 0)\n",
+    "    d = counts[b].get(\"disagree\", 0)\n",
+    "    denom = a + d\n",
+    "    rate = d / denom if denom else float(\"nan\")\n",
+    "    print(f\"{b:<12} {a:>8} {d:>9} {denom:>8} {rate:>14.4f}\")\n",
+    "\n",
+    "print(\"\\nfull verdict distribution per bucket:\")\n",
+    "for b in BUCKETS:\n",
+    "    print(f\"  {b}: {dict(counts[b])}\")"
+   ]


Add Task13 disagreement findings and dataset exploration

076b905

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DKundnani requested review from avi-lab and luistafoi April 24, 2026 23:49

kushalviit reviewed Apr 25, 2026


Comment thread Task13_disagreement/explore_dataset.ipynb

kushalviit Apr 25, 2026

@luistafoi if results are fine then comment "good to merge"

Macaulay001 requested a review from Copilot April 25, 2026 02:20

Copilot started reviewing on behalf of Macaulay001 April 25, 2026 02:21

Copilot AI reviewed Apr 25, 2026

Macaulay001 requested a review from Copilot April 30, 2026 15:42

Copilot started reviewing on behalf of Macaulay001 April 30, 2026 15:42

Copilot AI reviewed Apr 30, 2026

DKundnani wants to merge 1 commit into main from task13
