
Add Task13 disagreement findings and dataset exploration#7

Open
DKundnani wants to merge 1 commit into main from task13

Conversation

@DKundnani

Collaborator

Task 13 — Disagreement deep-dive + Phase 1 vs Phase 2 head-to-head
Goal. Characterize the 21,164 Q&A pairs with verdict == "disagree": what kinds of disagreements are they, and which model (Phase 1 = Gemini 3 Flash or Phase 2 = Kimi K2.5) more often gets it right?

Stage 1 — Structural/statistical analysis:
• Exploratory pass to build intuition about which buckets show more or less disagreement.
• Per-topic disagree rate: disagree_count / (agree + disagree) for each topic bucket (structural / functional / engineering / other). Expect functional > structural.
• Compound hotspots: which CIDs have >50% disagree rate? Correlate with MW, heavy-atom count, evidence volume, split.
• Length analysis: is there a correlation between answer-length asymmetry (P1 short vs P2 long, or vice versa) and disagreement?
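
The compound-hotspot and length-asymmetry bullets above could be sketched roughly as follows. This is a minimal illustration rather than the notebook's actual code: only `qa_pairs` and `verdict` appear in the diff, so the `cid` field, the `p1_answer` / `p2_answer` keys, and all helper names here are assumptions.

```python
from collections import Counter, defaultdict
from statistics import mean


def disagree_rate_by_cid(records, min_pairs=1):
    """Return {cid: disagree / (agree + disagree)} per compound.

    Assumes each record has a `cid` field (hypothetical name) and a
    `qa_pairs` list whose items carry a `verdict`.
    """
    counts = defaultdict(Counter)
    for rec in records:
        for qa in rec.get("qa_pairs", []):
            counts[rec.get("cid")][qa.get("verdict")] += 1

    rates = {}
    for cid, c in counts.items():
        denom = c["agree"] + c["disagree"]
        # Skip compounds with no judged pairs or too few to be meaningful.
        if denom and denom >= min_pairs:
            rates[cid] = c["disagree"] / denom
    return rates


def hotspots(rates, threshold=0.5):
    """CIDs whose disagree rate is strictly above the threshold."""
    return sorted(cid for cid, rate in rates.items() if rate > threshold)


def mean_length_gap_by_verdict(records, p1_key="p1_answer", p2_key="p2_answer"):
    """Mean absolute character-length gap between the two answers, per verdict.

    The answer keys are assumed names; adjust them to the real schema.
    """
    gaps = defaultdict(list)
    for rec in records:
        for qa in rec.get("qa_pairs", []):
            gap = abs(len(qa.get(p1_key) or "") - len(qa.get(p2_key) or ""))
            gaps[qa.get("verdict")].append(gap)
    return {verdict: mean(values) for verdict, values in gaps.items()}
```

A larger `min_pairs` is probably worth using in practice, since a compound with two Q&A pairs crosses the 50% line on a single disagreement.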

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DKundnani requested review from avi-lab and luistafoi on April 24, 2026, 23:49


@luistafoi, if the results look fine, please comment "good to merge".

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.



Comment on lines +723 to +725
" {k:v for k,v in sorted(slices[\"topic_raw\"].items(),\n",
" key=lambda kv:(kv[1][1]/(kv[1][0]+kv[1][1]) if sum(kv[1]) else 0))[:15]\n",
" if sum(v)>=500}, sort=\"rate\")"
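
One thing that may be worth checking in the snippet above: the `[:15]` slice is applied to the sorted list before the `if sum(v)>=500` condition runs, so the result can contain fewer than 15 topics, and zero-count topics (rate 0) sort to the front and use up slots. A possible alternative, sketched under the assumption that each value is an `(agree, disagree)` pair as the rate expression suggests, filters first and slices afterwards. The function name and parameters are hypothetical.

```python
def lowest_rate_topics(topic_counts, min_total=500, top_n=15):
    """Return the top_n lowest-disagree-rate topics among those with enough data.

    topic_counts maps topic -> (agree, disagree); this tuple layout is an
    assumption inferred from the rate expression in the notebook cell.
    """
    floor = max(min_total, 1)  # also guards the division below
    eligible = {k: v for k, v in topic_counts.items() if sum(v) >= floor}
    ranked = sorted(
        eligible.items(),
        key=lambda kv: kv[1][1] / (kv[1][0] + kv[1][1]),
    )
    return dict(ranked[:top_n])
```

With this ordering, low-volume topics never compete for the 15 slots, so the table shows the lowest-rate topics among those that pass the volume threshold.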
11 train 1,2-Dichloroethane
```

**Analysis:** CIDs are non-contiguous (4, 6, 11 — no 5, 7, 8, 9, 10), confirming PubChem CIDs are preserved as-is rather than renumbered. All three are in the `train` split; the file is **not** sorted or grouped by split (cell 5 confirms all three splits are interleaved throughout).
"from pathlib import Path\n",
"\n",
"DATASET_PATH = Path(\"dataset_final.jsonl\")\n",
"DATASET_PATH.exists(), DATASET_PATH.stat().st_size / 1e6"
"# Buckets: structural / functional / engineering / other.\n",
"# `engineering` is normally folded into `functional` by topic_bucket.py;\n",
"# we split it out here so the design-leverage QAs can be inspected separately.\n",
"from collections import Counter\n",
Comment on lines +269 to +301
"# Per-topic disagree rate: disagree_count / (agree + disagree)\n",
"# Buckets: structural / functional / engineering / other.\n",
"from collections import Counter\n",
"from topic_bucket import bucket_topic\n",
"\n",
"ENGINEERING_KEYS = {\"engineering\", \"design_levers\", \"design\"}\n",
"ENGINEERING_SUBS = (\"engineer\", \"design\")\n",
"\n",
"def bucket4(topic):\n",
" t = (topic or \"\").strip().lower().replace(\"-\", \"_\")\n",
" if t in ENGINEERING_KEYS or any(s in t for s in ENGINEERING_SUBS):\n",
" return \"engineering\"\n",
" return bucket_topic(topic)\n",
"\n",
"BUCKETS = (\"structural\", \"functional\", \"engineering\", \"other\")\n",
"counts = {b: Counter() for b in BUCKETS}\n",
"\n",
"for rec in iter_records():\n",
" for qa in rec.get(\"qa_pairs\", []):\n",
" counts[bucket4(qa.get(\"topic\"))][qa.get(\"verdict\")] += 1\n",
"\n",
"print(f\"{'bucket':<12} {'agree':>8} {'disagree':>9} {'a+d':>8} {'disagree_rate':>14}\")\n",
"for b in BUCKETS:\n",
" a = counts[b].get(\"agree\", 0)\n",
" d = counts[b].get(\"disagree\", 0)\n",
" denom = a + d\n",
" rate = d / denom if denom else float(\"nan\")\n",
" print(f\"{b:<12} {a:>8} {d:>9} {denom:>8} {rate:>14.4f}\")\n",
"\n",
"print(\"\\nfull verdict distribution per bucket:\")\n",
"for b in BUCKETS:\n",
" print(f\" {b}: {dict(counts[b])}\")"
]
