Task 6: leakage classifier (evidence → answer text overlap) #2

Open
david-arredondo wants to merge 1 commit into main from task-6-leakage-classifier
@david-arredondo

Collaborator

Summary

Addresses critique C2 (does the Phase-1 model borrow text verbatim from evidence?). Audits the full agree-only gold dataset — 188,541 Q&A across 15,509 compounds — via three metrics, with staged review cases.

Metrics

  • LCS (longest common contiguous token substring) — reported as a threshold curve rather than a single flag; max observed across the corpus is 19 tokens.
  • 5-gram overlap (word-level, answer ∩ union of parent-compound evidence 5-grams) — threshold > 3.
  • Cosine (max cosine between answer embedding and any evidence-sentence embedding, sentence-transformers/all-MiniLM-L6-v2, L2-normalized) — threshold > 0.85.
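The three metrics above can be sketched as follows. This is a minimal illustration, not the code in `compute_metrics.py`: the whitespace tokenizer, the helper names, and the choice to count distinct shared 5-grams are assumptions, and the cosine helper assumes the embeddings were already L2-normalized as described.

```python
import numpy as np


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokens; the PR does not state its tokenizer.
    return text.lower().split()


def lcs_tokens(a: list[str], b: list[str]) -> int:
    """Length of the longest contiguous token run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0] * (len(b) + 1)
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                # Extend the run that ended at the previous token of both sequences.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def ngrams(tokens: list[str], n: int = 5) -> set[tuple[str, ...]]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram5_overlap(answer: list[str], evidence: list[list[str]]) -> int:
    """Distinct answer 5-grams that also occur in the union of evidence 5-grams."""
    ev_union: set[tuple[str, ...]] = set()
    for sentence in evidence:
        ev_union |= ngrams(sentence, 5)
    return len(ngrams(answer, 5) & ev_union)


def cos_max(answer_vec: np.ndarray, evidence_vecs: np.ndarray) -> float:
    """Max cosine similarity; for unit-length vectors the dot product is the cosine."""
    return float(np.max(evidence_vecs @ answer_vec))


answer = tokenize("aspirin irreversibly inhibits cyclooxygenase enzymes in platelets")
evidence = [tokenize("Aspirin irreversibly inhibits cyclooxygenase enzymes and reduces fever")]
lcs_value = max(lcs_tokens(answer, ev) for ev in evidence)
overlap_value = ngram5_overlap(answer, evidence)
```

In this toy pair the shared run is the first five tokens, so the answer would stay under both the LCS curve's higher thresholds and the 5-gram flag (`> 3`).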

Headline results

| Signal | Count | Rate |
|---|---|---|
| lcs_tokens > 40 | 0 | 0.000% |
| lcs_tokens > 10 | 153 | 0.081% |
| flag_ngram (5-gram overlap > 3) | 1,821 | 0.97% |
| flag_cos (cos > 0.85) | 4,068 | 2.16% |
| flag_ngram ∨ flag_cos | 5,569 | 2.95% |
| flag_ngram ∧ flag_cos (strongest signal) | 320 | 0.17% |
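As a quick consistency check on the headline numbers, the union count should follow from inclusion-exclusion over the two flags, and each rate should be its count divided by 188,541. The snippet below only re-derives figures already reported; the variable and function names are illustrative.

```python
TOTAL_QA = 188_541

n_ngram = 1_821   # flag_ngram
n_cos = 4_068     # flag_cos
n_both = 320      # both flags set
n_lcs_gt_10 = 153

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
n_either = n_ngram + n_cos - n_both


def rate_pct(count: int, digits: int = 2) -> float:
    """Percentage of the full dataset, rounded to `digits` decimals."""
    return round(100 * count / TOTAL_QA, digits)


rates = {
    "flag_ngram": rate_pct(n_ngram),
    "flag_cos": rate_pct(n_cos),
    "either": rate_pct(n_either),
    "both": rate_pct(n_both),
    "lcs_gt_10": rate_pct(n_lcs_gt_10, digits=3),
}
```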

No Q&A contains a verbatim contiguous copy of 20 or more tokens from any evidence sentence (the corpus-wide LCS maximum is 19). Flags concentrate in topics that must draw on literature (therapeutic_use, mechanism, toxicity, drug_interactions, binding_mode, cell_biology, signaling_pathways; flag rates of 10–20%) and are near zero on purely structural topics (functional_groups 0%, scaffold 0.1%, engineering 0.2%). This is consistent with the SMILES-derivable-vs-evidence-supported design rule working as intended.

Layout

All under task-6-leakage-classifier/:

  • README.md — quick overview + headline numbers + how to reproduce
  • PLAN.md — locked plan / contract (decisions, thresholds, sampling)
  • leakage_summary.md — aggregate stats, LCS threshold curve, per-split / per-topic
  • flagged_examples.md — top-20 per category (LCS descending, ngram, cos, co-flagged), each with answer + closest evidence sentence staged for review. No judgments written — for the reviewer.
  • per_qa_leakage.jsonl — 188,541 rows with lcs_tokens, ngram5_overlap, cos_max, and flag booleans
  • scripts/ — reproducible pipeline (setup_env.sh, sample.py, embed.py, compute_metrics.py, summarize.py, run.sh)
  • archive_20k_sample/ — initial 20K-Q&A pilot outputs (audit trail; the pilot and full-run results agreed closely)

Reproducibility

scripts/config.py has absolute paths for this machine (inputs at /data/luis/..., large outputs at /data/dandreas/...). Edit that file to repoint, then bash scripts/setup_env.sh && bash scripts/run.sh. End-to-end on one H100 takes ~25 min (embedding 1.28 M unique texts ~4.5 min, scoring ~9 min).

Test plan

  • Spot-check ~20 flagged cases in flagged_examples.md — does the sort by "most egregious" make sense? Any obvious false positives / false negatives?
  • Validate that the LCS threshold curve gives an actionable operating point for filtering (if that's desired as a dataset-hygiene step).
  • Sanity-check per-topic rates vs the design-rule expectation (structural ≈ 0%, functional > 0%).
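For the second item, a reviewer could recompute the LCS threshold curve directly from `per_qa_leakage.jsonl`. The sketch below assumes each row carries an integer `lcs_tokens` field (as listed under Layout); the function names and the threshold grid are hypothetical and may differ from what `summarize.py` emits.

```python
import json
from typing import Iterable, Iterator


def read_lcs_values(path: str) -> Iterator[int]:
    """Yield lcs_tokens from a JSONL file, one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)["lcs_tokens"]


def lcs_threshold_curve(
    lcs_values: Iterable[int], thresholds: Iterable[int]
) -> list[tuple[int, int, float]]:
    """Return (threshold, count above threshold, percent above threshold) rows."""
    values = list(lcs_values)
    total = len(values)
    curve = []
    for t in thresholds:
        count = sum(1 for v in values if v > t)
        pct = 100 * count / total if total else 0.0
        curve.append((t, count, pct))
    return curve


# Toy values for illustration only; real input would come from read_lcs_values(...).
toy_curve = lcs_threshold_curve([0, 3, 11, 19, 7, 12], thresholds=(5, 10, 20, 40))
```

Reading the curve from the strictest threshold downward shows where the flagged count starts to grow, which is one way to pick an operating point if filtering is wanted.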

Do not merge — per instructions, the author will coordinate the merge separately.

🤖 Generated with Claude Code

Detects per-Q&A leakage across the full agree-only gold dataset (188,541 Q&A
across 15,509 compounds) via three metrics: longest common contiguous token
substring (LCS), 5-gram overlap, and cosine similarity of all-MiniLM-L6-v2
embeddings. LCS is reported as a threshold curve since the corpus-wide max
is 19 tokens — no Q&A has a 40+ token verbatim contiguous run copied from
evidence. Full results, flagged examples, and reproducible scripts included.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

Adds a Task-6 “leakage classifier” audit pipeline to quantify evidence→answer textual overlap (LCS, 5-gram overlap, embedding cosine) and to generate reviewer-facing summary + example reports for the agree-only gold dataset.

Changes:

  • Introduces a reproducible multi-step pipeline (sample → embed → compute_metrics → summarize) under task-6-leakage-classifier/scripts/.
  • Commits generated analysis artifacts (leakage_summary.md, flagged_examples.md, per_qa_leakage.jsonl) plus a preserved 20k pilot archive.
  • Documents methodology, thresholds, and headline results in README.md and PLAN.md.

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| task-6-leakage-classifier/scripts/config.py | Centralizes paths, thresholds, and run parameters for the pipeline. |
| task-6-leakage-classifier/scripts/sample.py | Streams gold JSONL and materializes per-QA rows (answer + parent evidence) for downstream scoring. |
| task-6-leakage-classifier/scripts/embed.py | Builds a text index and embeds unique answers/evidence sentences for cosine scoring. |
| task-6-leakage-classifier/scripts/compute_metrics.py | Computes LCS / 5-gram overlap / cosine per Q&A and writes per_qa_leakage.jsonl. |
| task-6-leakage-classifier/scripts/summarize.py | Aggregates stats and emits leakage_summary.md + flagged_examples.md. |
| task-6-leakage-classifier/scripts/setup_env.sh | Creates a conda env and installs required Python dependencies. |
| task-6-leakage-classifier/scripts/run.sh | Driver script to run all pipeline stages and capture logs. |
| task-6-leakage-classifier/README.md | Provides overview, key results, reproduction instructions, and metric definitions. |
| task-6-leakage-classifier/PLAN.md | Captures the “contract” for metric semantics, thresholds, and review procedure. |
| task-6-leakage-classifier/leakage_summary.md | Generated aggregate report for the full run. |
| task-6-leakage-classifier/flagged_examples.md | Generated top-ranked flagged examples for manual review. |
| task-6-leakage-classifier/archive_20k_sample/leakage_summary.md | Generated summary from the 20k pilot run (audit trail). |
| task-6-leakage-classifier/archive_20k_sample/flagged_examples.md | Generated flagged examples from the 20k pilot run (audit trail). |

Comment on lines +8 to +10
OUT=/home/dandreas/chem2text/outputs/leakage
LOG="$OUT/run.log"

Copilot AI Apr 24, 2026

run.sh hard-codes the output directory (OUT=/home/dandreas/...) instead of deriving it from scripts/config.py. This breaks the documented “edit config.py to repoint paths” workflow because logs will still be written to the old location unless run.sh is edited too. Consider reading OUT_DIR/RUN_LOG from config.py (or accepting OUT/LOG via env vars with sensible defaults).

Suggested change
- OUT=/home/dandreas/chem2text/outputs/leakage
- LOG="$OUT/run.log"
+ read_config_value () {
+   python - "$HERE/config.py" "$1" <<'PY'
+ import importlib.util
+ import sys
+ config_path, attr_name = sys.argv[1], sys.argv[2]
+ try:
+     spec = importlib.util.spec_from_file_location("run_config", config_path)
+     if spec is None or spec.loader is None:
+         raise RuntimeError("unable to load config")
+     module = importlib.util.module_from_spec(spec)
+     spec.loader.exec_module(module)
+     value = getattr(module, attr_name, "")
+     if value is None:
+         value = ""
+     print(value)
+ except Exception:
+     print("")
+ PY
+ }
+ DEFAULT_OUT=/home/dandreas/chem2text/outputs/leakage
+ CONFIG_OUT="$(read_config_value OUT_DIR)"
+ CONFIG_LOG="$(read_config_value RUN_LOG)"
+ OUT="${OUT:-${CONFIG_OUT:-$DEFAULT_OUT}}"
+ LOG="${LOG:-${CONFIG_LOG:-$OUT/run.log}}"

md.append("# Leakage Summary\n")
md.append(
    f"Source: `per_qa_leakage.jsonl` ({n} rows, sampled from "
    f"`dataset_gold.jsonl`; n={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n"

Copilot AI Apr 24, 2026

The summary header prints n={SAMPLE_SIZE} as if that were the effective sample size, but when SAMPLE_SIZE is set larger than the dataset it’s actually a cap and the run covers all rows. This is misleading in the generated leakage_summary.md; consider emitting something like effective_n={n} (and optionally sample_size_cap={SAMPLE_SIZE}) instead.

Suggested change
-     f"`dataset_gold.jsonl`; n={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n"
+     f"`dataset_gold.jsonl`; effective_n={n}, sample_size_cap={SAMPLE_SIZE}, seed={SAMPLE_SEED}).\n"

Comment on lines +19 to +21
SAMPLE_JSONL = BIG_DIR / "sample.jsonl" # ~1 GB
EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz" # ~1.6 GB
TEXT_INDEX_JSON = BIG_DIR / "text_index.json" # ~200 MB

Copilot AI Apr 24, 2026

The inline size estimates for artifacts (sample.jsonl, embeddings.npz, text_index.json) don’t match the sizes documented in the README (e.g., config says sample.jsonl “~1 GB” while README mentions 9.1 GB). Consider either updating these comments to match observed sizes or removing the specific numbers to avoid misleading operators about disk requirements.

Suggested change
- SAMPLE_JSONL = BIG_DIR / "sample.jsonl" # ~1 GB
- EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz" # ~1.6 GB
- TEXT_INDEX_JSON = BIG_DIR / "text_index.json" # ~200 MB
+ # Large pipeline artifacts live in BIG_DIR to avoid home-directory quota issues.
+ SAMPLE_JSONL = BIG_DIR / "sample.jsonl"
+ EMBEDDINGS_NPZ = BIG_DIR / "embeddings.npz"
+ TEXT_INDEX_JSON = BIG_DIR / "text_index.json"

Comment on lines +1 to +24
"""Produce leakage_summary.md + flagged_examples.md.

summary: aggregate flag rates, metric distributions, intersections,
breakdown by topic and split.
flagged: up to EXAMPLES_PER_CATEGORY sampled cases per flag category
(lcs, ngram, cos, any), for user review. No judgments.
"""
from __future__ import annotations

import json
import logging
import random
import sys
from collections import Counter, defaultdict
from pathlib import Path
from statistics import median

sys.path.insert(0, str(Path(__file__).parent))
from config import (  # noqa: E402
    COSINE_THRESHOLD,
    EMBED_MODEL,
    EXAMPLES_PER_CATEGORY,
    EXAMPLES_SAMPLE_SEED,
    FLAGGED_MD,

Copilot AI Apr 24, 2026

The module imports random and EXAMPLES_SAMPLE_SEED, and the header comment says examples are “sampled”, but the implementation deterministically takes the top-N after sorting. Either remove the unused imports/config knobs and update the header comment, or reintroduce seeded sampling if that’s the intended behavior.
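The two selection policies at issue behave differently, and the sketch below contrasts them. It is illustrative only: `select_top`, `select_seeded`, and the row shape are hypothetical, not the functions in `summarize.py`.

```python
import random


def select_top(rows: list[dict], key: str, n: int) -> list[dict]:
    """Deterministic policy: the n rows with the largest value of `key`."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]


def select_seeded(rows: list[dict], n: int, seed: int) -> list[dict]:
    """Seeded policy: a reproducible uniform sample of at most n rows."""
    if len(rows) <= n:
        return list(rows)
    # A private Random instance keeps the draw independent of global RNG state.
    return random.Random(seed).sample(rows, n)


flagged = [{"qa_id": i, "lcs_tokens": v} for i, v in enumerate([4, 19, 7, 12, 11, 3])]
top_two = select_top(flagged, "lcs_tokens", 2)
sampled_two = select_seeded(flagged, 2, seed=42)
```

Top-N surfaces the worst cases first but says nothing about the typical flagged row; a seeded sample gives an unbiased view of the flagged population but may miss the most egregious examples. Whichever is intended, the header comment and config should describe the same one.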

Comment on lines +113 to +126
# Collect evidence token-lists and embedding ids.
ev_token_lists: list[list[str]] = []
ev_ngrams: set[tuple[str, ...]] = set()
ev_emb_ids: list[int] = []
for e in evidence:
    etext = (e.get("text") or "").strip()
    if not etext:
        continue
    eid = text_to_id.get(etext, -1)
    if eid < 0:
        continue
    ev_token_lists.append(tokenized[eid])
    ev_ngrams |= ngrams(tokenized[eid], 5)
    ev_emb_ids.append(eid)

Copilot AI Apr 24, 2026

This loop recomputes the union of evidence 5-grams (ev_ngrams) and rebuilds ev_token_lists / ev_emb_ids for every Q&A row even though evidence is identical for all Q&A belonging to the same cid (because sample.py duplicates the compound’s evidence into every sampled row). This is a major source of redundant work and inflates runtime/IO; consider caching precomputed evidence features keyed by cid (token lists, 5-gram union, embedding id list) and reusing them across Q&A for that compound.
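One way to implement the suggested caching is sketched below. It reuses the names visible in the hunk (`text_to_id`, `tokenized`, `ngrams`), but `EvidenceFeatures`, `build_evidence_features`, and `features_for` are hypothetical helpers, and it assumes every row for a given `cid` carries the same evidence list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceFeatures:
    token_lists: tuple[tuple[str, ...], ...]
    ngrams5: frozenset
    emb_ids: tuple[int, ...]


def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_evidence_features(evidence, text_to_id, tokenized) -> EvidenceFeatures:
    """Same filtering as the per-row loop, run once per compound."""
    token_lists, emb_ids, grams = [], [], set()
    for e in evidence:
        etext = (e.get("text") or "").strip()
        eid = text_to_id.get(etext, -1) if etext else -1
        if eid < 0:
            continue
        token_lists.append(tuple(tokenized[eid]))
        grams |= ngrams(tokenized[eid], 5)
        emb_ids.append(eid)
    return EvidenceFeatures(tuple(token_lists), frozenset(grams), tuple(emb_ids))


def features_for(cid, evidence, text_to_id, tokenized, cache: dict) -> EvidenceFeatures:
    """Return cached features for cid, building them on first use."""
    feats = cache.get(cid)
    if feats is None:
        feats = build_evidence_features(evidence, text_to_id, tokenized)
        cache[cid] = feats
    return feats


# Toy inputs: one known evidence sentence, one empty, one missing from the index.
text_to_id = {"a b c d e f": 0}
tokenized = [["a", "b", "c", "d", "e", "f"]]
evidence = [{"text": " a b c d e f "}, {"text": ""}, {"text": "unknown"}]
cache: dict = {}
first = features_for(101, evidence, text_to_id, tokenized, cache)
second = features_for(101, evidence, text_to_id, tokenized, cache)
```

If rows are already grouped by `cid` in `sample.jsonl`, holding only the most recent compound's features would bound memory; a plain dict is used here for simplicity.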

Comment on lines +144 to +147
flag_lcs = lcs_val > LCS_TOKEN_THRESHOLD
flag_ngram = n_overlap > NGRAM5_OVERLAP_THRESHOLD
flag_cos = cos_max > COSINE_THRESHOLD
flag_any = flag_lcs or flag_ngram or flag_cos

Copilot AI Apr 24, 2026

flag_any is currently defined as flag_lcs or flag_ngram or flag_cos, but the reporting/README treats “any” as flag_ngram ∨ flag_cos (since LCS is now reported as a curve). To avoid semantic drift (especially if LCS thresholds change later), consider either redefining flag_any here to match the reported rule, or dropping it from the JSONL and letting downstream code compute whatever combination it needs.
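If the second option is taken, downstream code could derive the reported combination from the stored booleans, roughly as below. The JSONL key names (`flag_ngram`, `flag_cos`) are assumed to match the variable names in the hunk, and the helper names are hypothetical.

```python
from collections import Counter


def reported_any(row: dict) -> bool:
    """The combination used in the README headline: n-gram flag OR cosine flag."""
    return bool(row["flag_ngram"]) or bool(row["flag_cos"])


def count_flags(rows) -> Counter:
    counts: Counter = Counter()
    for row in rows:
        counts["total"] += 1
        counts["ngram_or_cos"] += reported_any(row)
        counts["ngram_and_cos"] += bool(row["flag_ngram"]) and bool(row["flag_cos"])
    return counts


# Toy rows; the first is flagged only by LCS and so is excluded from the reported rule.
rows = [
    {"flag_lcs": True, "flag_ngram": False, "flag_cos": False},
    {"flag_lcs": False, "flag_ngram": True, "flag_cos": False},
    {"flag_lcs": False, "flag_ngram": True, "flag_cos": True},
    {"flag_lcs": False, "flag_ngram": False, "flag_cos": False},
]
counts = count_flags(rows)
```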

Comment on lines +1 to +5
"""Deterministically sample 20,000 Q&A from the gold dataset.

Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds
with more Q&A are proportionally more likely to contribute rows, which is what
we want for an unbiased per-Q&A leakage-rate estimate.

Copilot AI Apr 24, 2026

The docstring says this script samples “20,000 Q&A”, but the actual sample size is controlled by config.SAMPLE_SIZE (currently set to 500_000 to cover the full dataset). Updating the docstring to describe the SAMPLE_SIZE behavior (including the full-dataset short-circuit) would avoid confusion when reproducing.

Suggested change
- """Deterministically sample 20,000 Q&A from the gold dataset.
- Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds
- with more Q&A are proportionally more likely to contribute rows, which is what
- we want for an unbiased per-Q&A leakage-rate estimate.
+ """Deterministically sample up to `SAMPLE_SIZE` Q&A from the gold dataset.
+ Sampling unit is (cid, qa_index). Draw is uniform across all Q&A — compounds
+ with more Q&A are proportionally more likely to contribute rows, which is what
+ we want for an unbiased per-Q&A leakage-rate estimate. If the dataset contains
+ `SAMPLE_SIZE` or fewer Q&A, the script keeps the full dataset instead of
+ drawing a smaller sample.
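The capped-sampling behavior described in the proposed docstring can be sketched as follows. This is an assumption about how `sample.py` behaves based on the comment above, not a copy of it; `sample_pairs` and the toy pairs are hypothetical.

```python
import random


def sample_pairs(
    pairs: list[tuple[int, int]], sample_size: int, seed: int
) -> list[tuple[int, int]]:
    """Deterministically pick up to sample_size (cid, qa_index) pairs."""
    ordered = sorted(pairs)  # fixed order so the seeded draw is reproducible
    if len(ordered) <= sample_size:
        # Short-circuit: sample_size acts as a cap, so keep the full dataset.
        return ordered
    return sorted(random.Random(seed).sample(ordered, sample_size))


pairs = [(2, 0), (1, 1), (1, 0), (3, 0)]
full = sample_pairs(pairs, sample_size=10, seed=42)
subset = sample_pairs(pairs, sample_size=2, seed=42)
```

With `SAMPLE_SIZE` set above the dataset size, as in the full run, the first branch is taken and the seed has no effect on which rows are kept.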


5. **Answer field.** `phase1_answer` (the model under audit is the Phase-1 generator).

6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases") with random seed=42 within each category; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the sample.

Copilot AI Apr 24, 2026

The plan says flagged examples are sampled with a random seed (“seed=42 within each category”), but scripts/summarize.py now selects “most egregious first” via deterministic sorting (and doesn’t use EXAMPLES_SAMPLE_SEED). Please update the plan to match the implemented selection policy so reviewers know whether the examples are a random sample or the top-ranked cases.

Suggested change
- 6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases") with random seed=42 within each category; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the sample.
+ 6. **Manual inspection.** We do not make judgments. We stage flagged cases (all three categories, plus intersections) in `flagged_examples.md` for user review. Cap at ~20 cases per category (as per the task spec's "Sample 20 flagged cases"), selected deterministically by the summarizer as the top-ranked / most egregious flagged cases first rather than as a random sample; the full flagged list is in `per_qa_leakage.jsonl` for anyone who wants to look past the capped review set.

Comment on lines +51 to +61
1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/leakage/setup_env.sh`.
2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/leakage/sample.py` → `sample.jsonl`.
3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/leakage/embed.py` → `embeddings.npz`, `text_index.json`.
4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/leakage/compute_metrics.py` → `per_qa_leakage.jsonl`.
5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/leakage/summarize.py`.
6. **Driver.** `scripts/leakage/run.sh` runs 2–5 in order, assuming 1 is done.

## Reproducibility

- All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used.
- `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/leakage/config.py`.

Copilot AI Apr 24, 2026

The “Method” section references scripts under scripts/leakage/..., but in this repo they live directly under task-6-leakage-classifier/scripts/ (e.g., scripts/run.sh, scripts/sample.py). Updating these paths will prevent copy/paste reproduction errors.

Suggested change
- 1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/leakage/setup_env.sh`.
- 2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/leakage/sample.py` → `sample.jsonl`.
- 3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/leakage/embed.py` → `embeddings.npz`, `text_index.json`.
- 4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/leakage/compute_metrics.py` → `per_qa_leakage.jsonl`.
- 5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/leakage/summarize.py`.
- 6. **Driver.** `scripts/leakage/run.sh` runs 2–5 in order, assuming 1 is done.
-
- ## Reproducibility
-
- - All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used.
- - `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/leakage/config.py`.
+ 1. **Env.** Create a fresh conda env `chem2text_leakage` with Python 3.11, torch (cu128), sentence-transformers, tqdm, numpy. Script: `scripts/setup_env.sh`.
+ 2. **Sample.** Stream `dataset_gold.jsonl`, collect all (cid, qa_index) pairs sorted, seed 42 draw of 20,000. For each, emit a sample row with the phase1_answer and parent compound's evidence sentences. Script: `scripts/sample.py` → `sample.jsonl`.
+ 3. **Embed.** Collect unique text strings (answers + evidence sentences) from the sample, encode on one GPU in batches, save `.npz` + id→row index map. Script: `scripts/embed.py` → `embeddings.npz`, `text_index.json`.
+ 4. **Metrics.** For each sampled Q&A, compute `lcs_tokens`, `ngram5_overlap`, `cos_max`. Script: `scripts/compute_metrics.py` → `per_qa_leakage.jsonl`.
+ 5. **Summarize.** Aggregate flag rates, metric distributions, co-flagging, write `leakage_summary.md` and `flagged_examples.md`. Script: `scripts/summarize.py`.
+ 6. **Driver.** `scripts/run.sh` runs 2–5 in order, assuming 1 is done.
+
+ ## Reproducibility
+
+ - All scripts take no positional arguments — all inputs/outputs are fixed paths or CLI-settable with the same defaults we used.
+ - `PLAN.md` records the thresholds and sample seed. The scripts read the same constants from a shared `scripts/config.py`.

@luistafoi if results are fine then comment "good to merge"
