Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kushalviit
reviewed
Apr 25, 2026
@luistafoi if the results look fine, please comment "good to merge".
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
Comment on lines +723 to +725
| " {k:v for k,v in sorted(slices[\"topic_raw\"].items(),\n", | ||
| " key=lambda kv:(kv[1][1]/(kv[1][0]+kv[1][1]) if sum(kv[1]) else 0))[:15]\n", | ||
| " if sum(v)>=500}, sort=\"rate\")" |
| 11 train 1,2-Dichloroethane | ||
| ``` | ||
|
|
||
| **Analysis:** CIDs are non-contiguous (4, 6, 11 — no 5, 7, 8, 9, 10), confirming PubChem CIDs are preserved as-is rather than renumbered. All three are in the `train` split; the file is **not** sorted or grouped by split (cell 5 confirms all three splits are interleaved throughout). |
| "from pathlib import Path\n", | ||
| "\n", | ||
| "DATASET_PATH = Path(\"dataset_final.jsonl\")\n", | ||
| "DATASET_PATH.exists(), DATASET_PATH.stat().st_size / 1e6" |
| "# Buckets: structural / functional / engineering / other.\n", | ||
| "# `engineering` is normally folded into `functional` by topic_bucket.py;\n", | ||
| "# we split it out here so the design-leverage QAs can be inspected separately.\n", | ||
| "from collections import Counter\n", |
Comment on lines +269 to +301
| "# Per-topic disagree rate: disagree_count / (agree + disagree)\n", | ||
| "# Buckets: structural / functional / engineering / other.\n", | ||
| "from collections import Counter\n", | ||
| "from topic_bucket import bucket_topic\n", | ||
| "\n", | ||
| "ENGINEERING_KEYS = {\"engineering\", \"design_levers\", \"design\"}\n", | ||
| "ENGINEERING_SUBS = (\"engineer\", \"design\")\n", | ||
| "\n", | ||
| "def bucket4(topic):\n", | ||
| " t = (topic or \"\").strip().lower().replace(\"-\", \"_\")\n", | ||
| " if t in ENGINEERING_KEYS or any(s in t for s in ENGINEERING_SUBS):\n", | ||
| " return \"engineering\"\n", | ||
| " return bucket_topic(topic)\n", | ||
| "\n", | ||
| "BUCKETS = (\"structural\", \"functional\", \"engineering\", \"other\")\n", | ||
| "counts = {b: Counter() for b in BUCKETS}\n", | ||
| "\n", | ||
| "for rec in iter_records():\n", | ||
| " for qa in rec.get(\"qa_pairs\", []):\n", | ||
| " counts[bucket4(qa.get(\"topic\"))][qa.get(\"verdict\")] += 1\n", | ||
| "\n", | ||
| "print(f\"{'bucket':<12} {'agree':>8} {'disagree':>9} {'a+d':>8} {'disagree_rate':>14}\")\n", | ||
| "for b in BUCKETS:\n", | ||
| " a = counts[b].get(\"agree\", 0)\n", | ||
| " d = counts[b].get(\"disagree\", 0)\n", | ||
| " denom = a + d\n", | ||
| " rate = d / denom if denom else float(\"nan\")\n", | ||
| " print(f\"{b:<12} {a:>8} {d:>9} {denom:>8} {rate:>14.4f}\")\n", | ||
| "\n", | ||
| "print(\"\\nfull verdict distribution per bucket:\")\n", | ||
| "for b in BUCKETS:\n", | ||
| " print(f\" {b}: {dict(counts[b])}\")" | ||
| ] |
Task 13 — Disagreement deep-dive + Phase 1 vs Phase 2 head-to-head
Goal. Characterize the 21,164 Q&A pairs with verdict == "disagree": what kinds of disagreements are they, and which model (Phase 1 = Gemini 3 Flash, Phase 2 = Kimi K2.5) is more often the one that gets it right?
Stage 1 — Structural/statistical analysis:
• Exploratory pass: build intuition for which buckets show more or less disagreement.
• Per-topic disagree rate: disagree_count / (agree + disagree) for each topic bucket (structural / functional / engineering / other). Expect functional > structural.
• Compound hotspots: which CIDs have a disagree rate above 50%? Correlate with molecular weight (MW), heavy-atom count, evidence volume, and split.
• Length analysis: is there a correlation between answer-length asymmetry (P1 short vs P2 long, or vice versa) and disagreement?
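The compound-hotspot and length-asymmetry bullets above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `qa_pairs` and `verdict` keys come from the reviewed cell, while the `cid`, `p1_answer`, and `p2_answer` field names are hypothetical, and character count stands in for answer length. The toy records are made up purely to exercise the functions.

```python
from collections import defaultdict
from math import sqrt


def hotspot_cids(records, threshold=0.5):
    """Map each CID whose disagree / (agree + disagree) exceeds threshold to its rate."""
    tally = defaultdict(lambda: [0, 0])  # cid -> [agree, disagree]
    for rec in records:
        cid = rec.get("cid")  # ASSUMPTION: hypothetical field name for the PubChem CID
        for qa in rec.get("qa_pairs", []):
            verdict = qa.get("verdict")
            if verdict == "agree":
                tally[cid][0] += 1
            elif verdict == "disagree":
                tally[cid][1] += 1
    hot = {}
    for cid, (agree, disagree) in tally.items():
        denom = agree + disagree
        if denom and disagree / denom > threshold:
            hot[cid] = disagree / denom
    return hot


def pearson(xs, ys):
    """Plain Pearson correlation; NaN when it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def length_asymmetry_correlation(records):
    """Correlate |len(P1 answer) - len(P2 answer)| with a 0/1 disagree indicator."""
    gaps, disagreed = [], []
    for rec in records:
        for qa in rec.get("qa_pairs", []):
            verdict = qa.get("verdict")
            if verdict not in ("agree", "disagree"):
                continue  # other verdicts are outside the agree/disagree denominator
            # ASSUMPTION: hypothetical field names for the two phases' answers
            p1 = qa.get("p1_answer") or ""
            p2 = qa.get("p2_answer") or ""
            gaps.append(abs(len(p1) - len(p2)))
            disagreed.append(1.0 if verdict == "disagree" else 0.0)
    return pearson(gaps, disagreed)


# Toy records, invented for illustration only.
toy_records = [
    {"cid": 4, "qa_pairs": [
        {"verdict": "agree", "p1_answer": "yes", "p2_answer": "yes"},
        {"verdict": "disagree", "p1_answer": "no", "p2_answer": "a much longer answer"},
    ]},
    {"cid": 6, "qa_pairs": [
        {"verdict": "disagree", "p1_answer": "x" * 30, "p2_answer": "y" * 5},
        {"verdict": "disagree", "p1_answer": "short", "p2_answer": "z" * 15},
        {"verdict": "agree", "p1_answer": "same", "p2_answer": "same!"},
    ]},
]

hot = hotspot_cids(toy_records)
r = length_asymmetry_correlation(toy_records)
print(hot)
print(r)
```

The hotspot threshold is strict (`>`), matching the ">50%" wording, so a compound at exactly 50% is not flagged. In a real run a minimum pair count per CID would probably be worth adding, since a compound with a single disagreeing Q&A pair trivially has a 100% rate.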