Task 6: leakage classifier (evidence → answer text overlap) #2
david-arredondo wants to merge 1 commit into
Detects per-Q&A leakage across the full agree-only gold dataset (188,541 Q&A across 15,509 compounds) via three metrics: longest common contiguous token substring (LCS), 5-gram overlap, and cosine similarity of all-MiniLM-L6-v2 embeddings. LCS is reported as a threshold curve because the corpus-wide maximum is 19 tokens, so no Q&A has a 40+ token verbatim contiguous run copied from evidence. Full results, flagged examples, and reproducible scripts are included.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
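To make the two token-level metrics concrete, here is a minimal sketch of a longest common contiguous token run and a 5-gram overlap count. It is not the PR's implementation: the whitespace tokenizer, the helper names, and the choice to count distinct shared 5-grams are all assumptions made for illustration.

```python
from __future__ import annotations


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization; the PR's tokenizer may differ.
    return text.lower().split()


def longest_common_run(a: list[str], b: list[str]) -> int:
    """Length of the longest contiguous token run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        cur = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended at the previous token of both lists.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def ngrams(tokens: list[str], n: int = 5) -> set[tuple[str, ...]]:
    # Empty set when the text has fewer than n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(answer: list[str], evidence: list[list[str]], n: int = 5) -> int:
    """Assumed definition: distinct answer n-grams found in any evidence sentence."""
    ev: set[tuple[str, ...]] = set()
    for sent in evidence:
        ev |= ngrams(sent, n)
    return len(ngrams(answer, n) & ev)


answer = tokenize("The compound inhibits cyclooxygenase enzymes in platelets and reduces clotting")
evidence = [tokenize("Aspirin irreversibly inhibits cyclooxygenase enzymes in platelets and reduces inflammation")]
print(longest_common_run(answer, evidence[0]), ngram_overlap(answer, evidence))
```

The example sentences are made up; the shared run is seven tokens long, which contains three distinct 5-grams.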
Pull request overview
Adds a Task-6 “leakage classifier” audit pipeline to quantify evidence→answer textual overlap (LCS, 5-gram overlap, embedding cosine) and to generate reviewer-facing summary + example reports for the agree-only gold dataset.
Changes:
- Introduces a reproducible multi-step pipeline (sample → embed → compute_metrics → summarize) under task-6-leakage-classifier/scripts/.
- Commits generated analysis artifacts (leakage_summary.md, flagged_examples.md, per_qa_leakage.jsonl) plus a preserved 20k pilot archive.
- Documents methodology, thresholds, and headline results in README.md and PLAN.md.
Reviewed changes
Copilot reviewed 13 out of 15 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| task-6-leakage-classifier/scripts/config.py | Centralizes paths, thresholds, and run parameters for the pipeline. |
| task-6-leakage-classifier/scripts/sample.py | Streams gold JSONL and materializes per-QA rows (answer + parent evidence) for downstream scoring. |
| task-6-leakage-classifier/scripts/embed.py | Builds a text index and embeds unique answers/evidence sentences for cosine scoring. |
| task-6-leakage-classifier/scripts/compute_metrics.py | Computes LCS / 5-gram overlap / cosine per Q&A and writes per_qa_leakage.jsonl. |
| task-6-leakage-classifier/scripts/summarize.py | Aggregates stats and emits leakage_summary.md + flagged_examples.md. |
| task-6-leakage-classifier/scripts/setup_env.sh | Creates a conda env and installs required Python dependencies. |
| task-6-leakage-classifier/scripts/run.sh | Driver script to run all pipeline stages and capture logs. |
| task-6-leakage-classifier/README.md | Provides overview, key results, reproduction instructions, and metric definitions. |
| task-6-leakage-classifier/PLAN.md | Captures the “contract” for metric semantics, thresholds, and review procedure. |
| task-6-leakage-classifier/leakage_summary.md | Generated aggregate report for the full run. |
| task-6-leakage-classifier/flagged_examples.md | Generated top-ranked flagged examples for manual review. |
| task-6-leakage-classifier/archive_20k_sample/leakage_summary.md | Generated summary from the 20k pilot run (audit trail). |
| task-6-leakage-classifier/archive_20k_sample/flagged_examples.md | Generated flagged examples from the 20k pilot run (audit trail). |
| OUT=/home/dandreas/chem2text/outputs/leakage | ||
| LOG="$OUT/run.log" | ||
run.sh hard-codes the output directory (OUT=/home/dandreas/...) instead of deriving it from scripts/config.py. This breaks the documented “edit config.py to repoint paths” workflow because logs will still be written to the old location unless run.sh is edited too. Consider reading OUT_DIR/RUN_LOG from config.py (or accepting OUT/LOG via env vars with sensible defaults).
Suggested change:
| - OUT=/home/dandreas/chem2text/outputs/leakage | |
| - LOG="$OUT/run.log" | |
| + read_config_value () { | |
| +   python - "$HERE/config.py" "$1" <<'PY' | |
| + import importlib.util | |
| + import sys | |
| + config_path, attr_name = sys.argv[1], sys.argv[2] | |
| + try: | |
| +     spec = importlib.util.spec_from_file_location("run_config", config_path) | |
| +     if spec is None or spec.loader is None: | |
| +         raise RuntimeError("unable to load config") | |
| +     module = importlib.util.module_from_spec(spec) | |
| +     spec.loader.exec_module(module) | |
| +     value = getattr(module, attr_name, "") | |
| +     if value is None: | |
| +         value = "" | |
| +     print(value) | |
| + except Exception: | |
| +     print("") | |
| + PY | |
| + } | |
| + DEFAULT_OUT=/home/dandreas/chem2text/outputs/leakage | |
| + CONFIG_OUT="$(read_config_value OUT_DIR)" | |
| + CONFIG_LOG="$(read_config_value RUN_LOG)" | |
| + OUT="${OUT:-${CONFIG_OUT:-$DEFAULT_OUT}}" | |
| + LOG="${LOG:-${CONFIG_LOG:-$OUT/run.log}}" | |
| md.append("# Leakage Summary\n") | ||
| md.append( | ||
|     f"Source: `per_qa_leakage.jsonl` ({n} rows, sampled from " | ||
|     f"`dataset_gold.jsonl`; n={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n" | ||
The summary header prints n={SAMPLE_SIZE} as if that were the effective sample size, but when SAMPLE_SIZE is set larger than the dataset it’s actually a cap and the run covers all rows. This is misleading in the generated leakage_summary.md; consider emitting something like effective_n={n} (and optionally sample_size_cap={SAMPLE_SIZE}) instead.
Suggested change:
| -     f"`dataset_gold.jsonl`; n={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n" | |
| +     f"`dataset_gold.jsonl`; effective_n={n}, sample_size_cap={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n" | |
| SAMPLE_JSONL = BIG_DIR / "sample.jsonl" # ~1 GB | ||
| EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz" # ~1.6 GB | ||
| TEXT_INDEX_JSON = BIG_DIR / "text_index.json" # ~200 MB |
The inline size estimates for artifacts (sample.jsonl, embeddings.npz, text_index.json) don’t match the sizes documented in the README (e.g., config says sample.jsonl “~1 GB” while README mentions 9.1 GB). Consider either updating these comments to match observed sizes or removing the specific numbers to avoid misleading operators about disk requirements.
Suggested change:
| - SAMPLE_JSONL = BIG_DIR / "sample.jsonl" # ~1 GB | |
| - EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz" # ~1.6 GB | |
| - TEXT_INDEX_JSON = BIG_DIR / "text_index.json" # ~200 MB | |
| + # Large pipeline artifacts live in BIG_DIR to avoid home-directory quota issues. | |
| + SAMPLE_JSONL = BIG_DIR / "sample.jsonl" | |
| + EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz" | |
| + TEXT_INDEX_JSON = BIG_DIR / "text_index.json" | |
| """Produce leakage_summary.md + flagged_examples.md. | ||
| | ||
| summary: aggregate flag rates, metric distributions, intersections, | ||
| breakdown by topic and split. | ||
| flagged: up to EXAMPLES_PER_CATEGORY sampled cases per flag category | ||
| (lcs, ngram, cos, any), for user review. No judgments. | ||
| """ | ||
| from __future__ import annotations | ||
| | ||
| import json | ||
| import logging | ||
| import random | ||
| import sys | ||
| from collections import Counter, defaultdict | ||
| from pathlib import Path | ||
| from statistics import median | ||
| | ||
| sys.path.insert(0, str(Path(__file__).parent)) | ||
| from config import ( # noqa: E402 | ||
|     COSINE_THRESHOLD, | ||
|     EMBED_MODEL, | ||
|     EXAMPLES_PER_CATEGORY, | ||
|     EXAMPLES_SAMPLE_SEED, | ||
|     FLAGGED_MD, | ||
The module imports random and EXAMPLES_SAMPLE_SEED, and the header comment says examples are “sampled”, but the implementation deterministically takes the top-N after sorting. Either remove the unused imports/config knobs and update the header comment, or reintroduce seeded sampling if that’s the intended behavior.
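The two selection policies under discussion can be contrasted in a small sketch. The function `pick_examples` and its parameters are hypothetical, not taken from summarize.py; it only illustrates deterministic top-N selection versus seeded sampling.

```python
import random


def pick_examples(rows, key, cap, seed=None):
    """Return up to `cap` flagged rows (hypothetical helper).

    seed=None -> deterministic top-N, largest `key` value first.
    seed=int  -> seeded uniform random sample without replacement.
    """
    if seed is None:
        return sorted(rows, key=lambda r: r[key], reverse=True)[:cap]
    rng = random.Random(seed)  # local RNG, so global random state is untouched
    return rng.sample(rows, min(cap, len(rows)))


rows = [{"lcs_tokens": v} for v in (3, 19, 7, 12)]  # made-up values
top = pick_examples(rows, "lcs_tokens", cap=2)
sampled = pick_examples(rows, "lcs_tokens", cap=2, seed=42)
```

If the top-N policy is the intended one, the `random` import and the seed constant serve no purpose; if sampling is intended, a local `random.Random(seed)` keeps the draw reproducible.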
| # Collect evidence token-lists and embedding ids. | ||
| ev_token_lists: list[list[str]] = [] | ||
| ev_ngrams: set[tuple[str, ...]] = set() | ||
| ev_emb_ids: list[int] = [] | ||
| for e in evidence: | ||
|     etext = (e.get("text") or "").strip() | ||
|     if not etext: | ||
|         continue | ||
|     eid = text_to_id.get(etext, -1) | ||
|     if eid < 0: | ||
|         continue | ||
|     ev_token_lists.append(tokenized[eid]) | ||
|     ev_ngrams |= ngrams(tokenized[eid], 5) | ||
|     ev_emb_ids.append(eid) | ||
This loop recomputes the union of evidence 5-grams (ev_ngrams) and rebuilds ev_token_lists / ev_emb_ids for every Q&A row even though evidence is identical for all Q&A belonging to the same cid (because sample.py duplicates the compound’s evidence into every sampled row). This is a major source of redundant work and inflates runtime/IO; consider caching precomputed evidence features keyed by cid (token lists, 5-gram union, embedding id list) and reusing them across Q&A for that compound.
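One way to realize the suggested caching is sketched below. The names `EvidenceFeatures`, `build_features`, `features_for`, and `_cache` are hypothetical; the loop body mirrors the quoted hunk, and `ngrams` is re-declared here only so the sketch is self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EvidenceFeatures:
    token_lists: list[list[str]] = field(default_factory=list)
    ngrams5: set[tuple[str, ...]] = field(default_factory=set)
    emb_ids: list[int] = field(default_factory=list)


def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_features(evidence, text_to_id, tokenized) -> EvidenceFeatures:
    """Same per-evidence work as the quoted loop, done once per compound."""
    feats = EvidenceFeatures()
    for e in evidence:
        etext = (e.get("text") or "").strip()
        if not etext:
            continue
        eid = text_to_id.get(etext, -1)
        if eid < 0:
            continue
        feats.token_lists.append(tokenized[eid])
        feats.ngrams5 |= ngrams(tokenized[eid], 5)
        feats.emb_ids.append(eid)
    return feats


# Hypothetical module-level cache keyed by compound id.
_cache: dict[int, EvidenceFeatures] = {}


def features_for(cid, evidence, text_to_id, tokenized) -> EvidenceFeatures:
    if cid not in _cache:
        _cache[cid] = build_features(evidence, text_to_id, tokenized)
    return _cache[cid]
```

If rows for the same cid arrive contiguously, holding only the most recent compound's features would bound memory; an unbounded dict is simpler but grows with the number of compounds.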
| flag_lcs = lcs_val > LCS_TOKEN_THRESHOLD | ||
| flag_ngram = n_overlap > NGRAM5_OVERLAP_THRESHOLD | ||
| flag_cos = cos_max > COSINE_THRESHOLD | ||
| flag_any = flag_lcs or flag_ngram or flag_cos |
flag_any is currently defined as flag_lcs or flag_ngram or flag_cos, but the reporting/README treats “any” as flag_ngram ∨ flag_cos (since LCS is now reported as a curve). To avoid semantic drift (especially if LCS thresholds change later), consider either redefining flag_any here to match the reported rule, or dropping it from the JSONL and letting downstream code compute whatever combination it needs.
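If the combined flag is dropped from the JSONL, downstream code could derive it from the stored metrics. The sketch below assumes the thresholds quoted in the PR description (5-gram overlap > 3, cosine > 0.85) and the per-row field names `ngram5_overlap` and `cos_max`; the function name and the `flag_both` key are hypothetical.

```python
# Thresholds as quoted in the PR description; in the pipeline they live in config.py.
NGRAM5_OVERLAP_THRESHOLD = 3
COSINE_THRESHOLD = 0.85


def reported_flags(row: dict) -> dict:
    """Derive the reported flags from one per-Q&A row, ignoring LCS."""
    flag_ngram = row["ngram5_overlap"] > NGRAM5_OVERLAP_THRESHOLD
    flag_cos = row["cos_max"] > COSINE_THRESHOLD
    return {
        "flag_ngram": flag_ngram,
        "flag_cos": flag_cos,
        "flag_any": flag_ngram or flag_cos,    # the reported "any" rule
        "flag_both": flag_ngram and flag_cos,  # hypothetical name for the intersection
    }


# Made-up row: long LCS but neither reported flag fires.
print(reported_flags({"lcs_tokens": 19, "ngram5_overlap": 2, "cos_max": 0.5}))
```

Keeping the combination rule in one function means a later change to the LCS threshold cannot silently alter what "any" means in the report.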
| """Deterministically sample 20,000 Q&A from the gold dataset. | ||
| | ||
| Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds | ||
| with more Q&A are proportionally more likely to contribute rows, which is what | ||
| we want for an unbiased per-Q&A leakage-rate estimate. |
The docstring says this script samples “20,000 Q&A”, but the actual sample size is controlled by config.SAMPLE_SIZE (currently set to 500_000 to cover the full dataset). Updating the docstring to describe the SAMPLE_SIZE behavior (including the full-dataset short-circuit) would avoid confusion when reproducing.
Suggested change:
| - """Deterministically sample 20,000 Q&A from the gold dataset. | |
| - Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds | |
| - with more Q&A are proportionally more likely to contribute rows, which is what | |
| - we want for an unbiased per-Q&A leakage-rate estimate. | |
| + """Deterministically sample up to `SAMPLE_SIZE` Q&A from the gold dataset. | |
| + Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds | |
| + with more Q&A are proportionally more likely to contribute rows, which is what | |
| + we want for an unbiased per-Q&A leakage-rate estimate. If the dataset contains | |
| + `SAMPLE_SIZE` or fewer Q&A, the script keeps the full dataset instead of | |
| + drawing a smaller sample. | |
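The behavior the revised docstring describes can be sketched as follows. This is an illustration under assumptions, not the code in sample.py: `sample_keys` is a hypothetical helper, and sorting before the draw is assumed from the plan's "collect all (cid, qa_index) pairs sorted" wording.

```python
import random


def sample_keys(keys, sample_size, seed):
    """Pick up to `sample_size` (cid, qa_index) pairs, reproducibly."""
    ordered = sorted(keys)  # fixed order so the same seed gives the same draw
    if len(ordered) <= sample_size:
        # Short-circuit: the cap exceeds the dataset, so keep every row.
        return ordered
    rng = random.Random(seed)
    return sorted(rng.sample(ordered, sample_size))


keys = [(2, 0), (1, 1), (1, 0)]  # made-up (cid, qa_index) pairs
full = sample_keys(keys, sample_size=500_000, seed=42)
subset = sample_keys(keys, sample_size=2, seed=42)
```

With a cap of 500,000 and three keys, the full set comes back unchanged, which matches how the 188,541-row run covers the whole dataset.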
| | ||
| 5. **Answer field.** `phase1_answer` (the model under audit is the Phase-1 generator). | ||
| | ||
| 6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases") with random seed=42 within each category; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the sample. |
The plan says flagged examples are sampled with a random seed (“seed=42 within each category”), but scripts/summarize.py now selects “most egregious first” via deterministic sorting (and doesn’t use EXAMPLES_SAMPLE_SEED). Please update the plan to match the implemented selection policy so reviewers know whether the examples are a random sample or the top-ranked cases.
Suggested change:
| - 6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases") with random seed=42 within each category; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the sample. | |
| + 6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases"), selected deterministically by the summarizer as the top-ranked / most egregious flagged cases first rather than as a random sample; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the capped review set. | |
| 1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/leakage/setup_env.sh`. | ||
| 2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/leakage/sample.py` → `sample.jsonl`. | ||
| 3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/leakage/embed.py` → `embeddings.npz`, `text_index.json`. | ||
| 4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/leakage/compute_metrics.py` → `per_qa_leakage.jsonl`. | ||
| 5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/leakage/summarize.py`. | ||
| 6. **Driver.** `scripts/leakage/run.sh` runs 2–5 in order, assuming 1 is done. | ||
| | ||
| ## Reproducibility | ||
| | ||
| - All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used. | ||
| - `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/leakage/config.py`. |
The “Method” section references scripts under scripts/leakage/..., but in this repo they live directly under task-6-leakage-classifier/scripts/ (e.g., scripts/run.sh, scripts/sample.py). Updating these paths will prevent copy/paste reproduction errors.
Suggested change:
| - 1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/leakage/setup_env.sh`. | |
| - 2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/leakage/sample.py` → `sample.jsonl`. | |
| - 3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/leakage/embed.py` → `embeddings.npz`, `text_index.json`. | |
| - 4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/leakage/compute_metrics.py` → `per_qa_leakage.jsonl`. | |
| - 5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/leakage/summarize.py`. | |
| - 6. **Driver.** `scripts/leakage/run.sh` runs 2–5 in order, assuming 1 is done. | |
| - ## Reproducibility | |
| - - All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used. | |
| - - `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/leakage/config.py`. | |
| + 1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/setup_env.sh`. | |
| + 2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/sample.py` → `sample.jsonl`. | |
| + 3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/embed.py` → `embeddings.npz`, `text_index.json`. | |
| + 4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/compute_metrics.py` → `per_qa_leakage.jsonl`. | |
| + 5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/summarize.py`. | |
| + 6. **Driver.** `scripts/run.sh` runs 2–5 in order, assuming 1 is done. | |
| + ## Reproducibility | |
| + - All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used. | |
| + - `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/config.py`. | |
@luistafoi if results are fine then comment "good to merge"
Summary
Addresses critique C2 (does the Phase-1 model borrow text verbatim from evidence?). Audits the full agree-only gold dataset — 188,541 Q&A across 15,509 compounds — via three metrics, with staged review cases.
Metrics
- LCS (lcs_tokens): longest common contiguous token substring between the answer and the evidence, reported as a threshold curve.
- 5-gram overlap (ngram5_overlap): threshold > 3.
- Embedding cosine (cos_max; sentence-transformers/all-MiniLM-L6-v2, L2-normalized): threshold > 0.85.
Headline results
- lcs_tokens > 40
- lcs_tokens > 10
- flag_ngram (5-gram overlap > 3)
- flag_cos (cos > 0.85)
- flag_ngram ∨ flag_cos
- flag_ngram ∩ flag_cos (strongest signal)
No Q&A contains a 20+ token verbatim contiguous copy of any evidence sentence. Flags concentrate in topics that must draw on literature (therapeutic_use, mechanism, toxicity, drug_interactions, binding_mode, cell_biology, signaling_pathways, at 10–20%) and are near-zero on purely structural topics (functional_groups 0%, scaffold 0.1%, engineering 0.2%). This is the SMILES-derivable-vs-evidence-supported design rule working as intended.
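A threshold curve of the kind used for lcs_tokens can be computed in a few lines. This is a sketch with made-up values, not the PR's summarize.py; `threshold_curve` is a hypothetical helper.

```python
def threshold_curve(values, thresholds):
    """Fraction of values strictly above each threshold."""
    n = len(values)
    if n == 0:
        return {t: 0.0 for t in thresholds}
    return {t: sum(v > t for v in values) / n for t in thresholds}


# Illustrative lcs_tokens values only; not results from this PR.
curve = threshold_curve([0, 3, 11, 19], thresholds=(10, 20, 40))
print(curve)
```

Reporting the whole curve shows where the distribution ends, which a single fixed cutoff above the corpus maximum cannot do.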
Layout
All under task-6-leakage-classifier/:
- README.md: quick overview + headline numbers + how to reproduce
- PLAN.md: locked plan / contract (decisions, thresholds, sampling)
- leakage_summary.md: aggregate stats, LCS threshold curve, per-split / per-topic
- flagged_examples.md: top-20 per category (LCS descending, ngram, cos, co-flagged), each with answer + closest evidence sentence staged for review. No judgments written; those are left to the reviewer.
- per_qa_leakage.jsonl: 188,541 rows with lcs_tokens, ngram5_overlap, cos_max, and flag booleans
- scripts/: reproducible pipeline (setup_env.sh, sample.py, embed.py, compute_metrics.py, summarize.py, run.sh)
- archive_20k_sample/: initial 20K-Q&A pilot outputs (audit trail; the two runs tracked tightly)
Reproducibility
scripts/config.py has absolute paths for this machine (inputs at /data/luis/..., large outputs at /data/dandreas/...). Edit that file to repoint, then bash scripts/setup_env.sh && bash scripts/run.sh. End-to-end on one H100 takes ~25 min (embedding 1.28 M unique texts ~4.5 min, scoring ~9 min).
Test plan
- flagged_examples.md: does the sort by "most egregious" make sense? Any obvious false positives / false negatives?
Do not merge: per instructions, the author will coordinate the merge separately.
🤖 Generated with Claude Code