diff --git a/README.md b/README.md index 7052c65..8331166 100644 --- a/README.md +++ b/README.md @@ -9,35 +9,198 @@ potential hallucinations. > Course project · MICS · Principles of Software Development (S2 2025–26). -## Status +--- -Scaffold. Features land via pull requests — see the open issues and `docs/`. +## Contents -## Documentation +- [What SumLens does](#what-sumlens-does) +- [Hardware requirements](#hardware-requirements) +- [Installation](#installation) +- [Running the app](#running-the-app) +- [Usage walkthrough](#usage-walkthrough) +- [Interpreting the results](#interpreting-the-results) +- [Exporting results](#exporting-results) +- [Development](#development) -- [`docs/requirements.md`](docs/requirements.md) — functional / non-functional requirements, MoSCoW, user stories, traceability. -- [`docs/data-model.md`](docs/data-model.md) — canonical data types. -- [`docs/research-plan.md`](docs/research-plan.md) — signals, fusion, evaluation methodology. +--- -## Development +## What SumLens does + +SumLens takes a text document (pasted or PDF) and: + +1. Summarises it locally using BART (`facebook/bart-large-cnn`) — no external API. +2. Scores each summary sentence against the source using three signals: + - **Signal A — Classifier:** LettuceDetect flags hallucinated tokens. + - **Signal B — NLI:** DeBERTa-v3 checks whether atomic claims are entailed by the source. + - **Signal C — Attribution:** Inseq integrated gradients measure how much each source span influenced each summary sentence. +3. Fuses the signals into a single grounding score (0 = hallucinated, 1 = grounded). +4. Labels each sentence: **grounded**, **weakly grounded**, or **hallucinated**. +5. Displays the result as a colour-coded summary with click-to-highlight source spans. + +--- + +## Hardware requirements + +The pipeline loads three large transformer models. Running on CPU is supported but +can take several minutes per document. -Requires Python 3.11+. 
+| Setup | RAM | VRAM | Expected time | +|-------|-----|------|---------------| +| GPU (recommended) | 16 GB | 8 GB+ | ~30–60 s | +| CPU-only | 16 GB | — | 3–10 min | + +Models are downloaded automatically from Hugging Face on first run (~4 GB total). +No paid API key is required. + +--- + +## Installation + +Requires **Python 3.11+** and **Git**. ```bash +git clone https://github.com/bacemtayeb/SumLens.git +cd SumLens python3.11 -m venv .venv + +# Windows +.venv\Scripts\activate +# macOS / Linux source .venv/bin/activate + pip install -e ".[dev]" python -m nltk.downloader punkt punkt_tab ``` -### Quality gate (CI enforces this on every PR) +--- + +## Running the app + +```bash +python app.py +``` + +Gradio prints a local URL, e.g.: + +``` +Running on local URL: http://127.0.0.1:7860 +``` + +Open that URL in your browser. The app runs entirely on your machine — no data +leaves your computer. + +--- + +## Usage walkthrough + +### Step 1 — Load a document + +You have two options: + +- **Paste text** — click the *Paste text* box and type or paste your document + (up to 10 000 words). +- **Upload a PDF** — click *or upload PDF* and select a file (up to 5 MB). + +If you provide both, the PDF takes priority. + +### Step 2 — Analyse + +Click the **Analyse** button. The button is disabled while the pipeline runs +(typically 30–60 s on GPU, longer on CPU). Both export buttons appear once +analysis is complete. + +### Step 3 — Read the summary + +The right panel shows the summary with each sentence colour-coded: + +| Colour | Label | Meaning | +|--------|-------|---------| +| Green | Grounded | Well-supported by the source | +| Orange | Weakly grounded | Partial support; treat with caution | +| Red | Hallucinated | Low support; likely fabricated or distorted | + +### Step 4 — Trace a sentence to the source + +Click any summary sentence. The left panel highlights (in yellow) the source +sentences most strongly attributed to it by the model. 
Click a different summary +sentence to switch the highlight. + +### Step 5 — Adjust the thresholds (optional) + +Two sliders let you change the decision boundaries without re-running the model: + +- **τ hallucinated** (default 0.30) — sentences with a grounding score *below* + this value are labelled hallucinated. +- **τ grounded** (default 0.70) — sentences with a grounding score *above* this + value are labelled grounded. Anything in between is weakly grounded. + +Move either slider and the summary colours update instantly. + +--- + +## Interpreting the results + +- The **grounding score** (0–1) represents how strongly the model believes a + summary sentence is supported by the source. It is not a probability in a strict + statistical sense — treat it as a relative risk indicator. +- A **hallucinated** label does not guarantee the sentence is wrong; it means the + model could not find sufficient evidence in the source text. Always cross-check + flagged sentences manually. +- The **signal breakdown** (JSON export) shows the individual classifier, NLI, and + attribution scores for each sentence, which can help diagnose *why* a sentence + was flagged. + +--- + +## Exporting results + +Two download buttons appear after analysis: + +- **Export JSON** — downloads the full `AnalysisResult` as a JSON file. The schema + matches `sumlens/types.py` and round-trips via `AnalysisResult.model_validate()`. +- **Export PDF** — downloads a human-readable PDF containing the colour-annotated + summary, a legend, and a per-sentence signal-scores table. + +--- + +## Development + +### Quality gate (CI enforces on every PR) ```bash -ruff check . && mypy sumlens tests && pytest -q --cov=sumlens --cov-fail-under=70 +ruff check . && mypy sumlens tests app.py && pytest -q --cov=sumlens --cov-fail-under=70 ``` -Lint (ruff), type-check (mypy, strict), and tests with a ≥70% coverage gate must -pass before any PR is merged to `main`. 
+Lint (ruff), strict type-check (mypy), and tests with a ≥ 70 % coverage gate +must pass before any PR is merged. + +### Project layout + +``` +sumlens/ + types.py # canonical data model (AnalysisResult, etc.) + ingest.py # PDF / text → Document + summarise.py # BART summarisation + signals/ + classifier.py # Signal A — LettuceDetect + nli.py # Signal B — DeBERTa NLI + attribution.py # Signal C — Inseq attribution + fuse.py # logistic-regression fusion + Platt calibration + pipeline.py # orchestrates ingest → summarise → signals → fuse +app.py # Gradio UI entry point +tests/ # pytest suite (all models mocked) +docs/ # requirements, data model, research plan +``` + +### Documentation + +- [`docs/requirements.md`](docs/requirements.md) — functional / non-functional requirements, MoSCoW, user stories, traceability. +- [`docs/data-model.md`](docs/data-model.md) — canonical data types and JSON schema. +- [`docs/research-plan.md`](docs/research-plan.md) — signals, fusion, evaluation methodology. +- [`docs/mockup.html`](docs/mockup.html) — static HTML wireframe of the two-panel dashboard UI (open in any browser). +- [`docs/use-case.puml`](docs/use-case.puml) — PlantUML use-case diagram (UC-01 "Verify a Summary"). + +--- ## License diff --git a/app.py b/app.py new file mode 100644 index 0000000..62c9b63 --- /dev/null +++ b/app.py @@ -0,0 +1,344 @@ +"""Gradio entry point — thin UI over `pipeline.analyse`. + +All logic lives in the `sumlens` library; this module only ingests the user's +input, runs the pipeline, and shapes the result for display. 
+""" + +from __future__ import annotations + +import html as _html +import tempfile +from pathlib import Path +from typing import Any + +from sumlens.ingest import load_pdf, load_text +from sumlens.pipeline import analyse +from sumlens.types import AnalysisConfig, AnalysisResult, Document + +_LABEL_COLORS: dict[str, str] = { + "grounded": "green", + "weak": "orange", + "hallucinated": "red", +} +_MAX_WORDS = 10_000 +_MAX_PDF_BYTES = 5 * 1024 * 1024 # 5 MB +_SOURCE_PLACEHOLDER = "

Load a document to see the source text here.

" + + +def _validate_text(text: str) -> str: + text = text.strip() + if not text: + raise ValueError("Input is empty. Please paste some text or upload a PDF.") + word_count = len(text.split()) + if word_count > _MAX_WORDS: + raise ValueError( + f"Input is too long ({word_count:,} words). Maximum is {_MAX_WORDS:,} words." + ) + return text + + +def _to_highlighted(result: AnalysisResult) -> list[tuple[str, str]]: + """Summary sentences as (text, label) spans for gr.HighlightedText colour bands.""" + labels = {v.sentence_id: v.label for v in result.verdicts} + return [(f"{s.text} ", labels.get(s.id, "weak")) for s in result.summary.sentences] + + +def _render_source_html(document: Document, highlighted_ids: set[str]) -> str: + """Source sentences as HTML; sentences in highlighted_ids get a yellow mark.""" + if not document.sentences: + return f"

{_html.escape(document.raw_text)}

" + parts = [] + for sentence in document.sentences: + safe = _html.escape(sentence.text) + if sentence.id in highlighted_ids: + parts.append( + f"{safe}" + ) + else: + parts.append(safe) + return "

" + " ".join(parts) + "

" + + +def _apply_tau( + result: AnalysisResult | None, + tau_grounded: float, + tau_hallucinated: float, +) -> list[tuple[str, str]] | None: + """Re-label summary sentences from stored fused scores without re-running the model.""" + if result is None: + return None + scores = {v.sentence_id: v.fused_score for v in result.verdicts} + + def _label(score: float) -> str: + if score < tau_hallucinated: + return "hallucinated" + if score >= tau_grounded: + return "grounded" + return "weak" + + return [(f"{s.text} ", _label(scores.get(s.id, 0.5))) for s in result.summary.sentences] + + +def _latin1(text: str) -> str: + """Replace characters outside Latin-1 with '?' so fpdf2 core fonts don't error.""" + return text.encode("latin-1", errors="replace").decode("latin-1") + + +_PDF_COLORS: dict[str, tuple[int, int, int]] = { + "grounded": (220, 252, 231), + "weak": (255, 237, 213), + "hallucinated": (254, 226, 226), +} + + +def _export_pdf(result: AnalysisResult | None) -> str | None: + if result is None: + return None + from fpdf import FPDF + + pdf = FPDF() + pdf.add_page() + + pdf.set_font("Helvetica", "B", 16) + pdf.cell(0, 10, "SumLens Analysis Report", new_x="LMARGIN", new_y="NEXT") + pdf.ln(3) + + pdf.set_font("Helvetica", size=9) + pdf.set_text_color(100, 100, 100) + pdf.cell( + 0, 6, + _latin1(f"Source: {result.document.source} | Model: {result.summary.model_name}"), + new_x="LMARGIN", new_y="NEXT", + ) + pdf.set_text_color(0, 0, 0) + pdf.ln(4) + + pdf.set_font("Helvetica", "B", 11) + pdf.cell(0, 8, "Annotated Summary", new_x="LMARGIN", new_y="NEXT") + pdf.ln(1) + + verdict_map = {v.sentence_id: v for v in result.verdicts} + pdf.set_font("Helvetica", size=10) + for sentence in result.summary.sentences: + verdict = verdict_map.get(sentence.id) + label = verdict.label if verdict else "weak" + r, g, b = _PDF_COLORS.get(label, (240, 240, 240)) + pdf.set_fill_color(r, g, b) + pdf.multi_cell(0, 7, _latin1(sentence.text), fill=True, new_x="LMARGIN", new_y="NEXT") + pdf.ln(1) + + 
pdf.ln(4) + pdf.set_font("Helvetica", "B", 10) + pdf.cell(0, 7, "Legend", new_x="LMARGIN", new_y="NEXT") + pdf.set_font("Helvetica", size=9) + for lbl, (r, g, b) in _PDF_COLORS.items(): + pdf.set_fill_color(r, g, b) + pdf.cell(6, 5, "", fill=True) + pdf.cell(0, 5, f" {lbl.capitalize()}", new_x="LMARGIN", new_y="NEXT") + + pdf.ln(5) + pdf.set_font("Helvetica", "B", 10) + pdf.cell(0, 7, "Signal Scores", new_x="LMARGIN", new_y="NEXT") + pdf.set_font("Helvetica", size=8) + col_w = [80, 25, 20, 22, 15, 22] + headers = ["Sentence", "Label", "Fused", "Classifier", "NLI", "Attribution"] + for w, h in zip(col_w, headers, strict=True): + pdf.cell(w, 6, h, border=1) + pdf.ln() + for sentence in result.summary.sentences: + v = verdict_map.get(sentence.id) + if v is None: + continue + truncated = sentence.text[:45] + "..." if len(sentence.text) > 45 else sentence.text + row = [ + _latin1(truncated), + v.label, + f"{v.fused_score:.2f}", + f"{v.signals.classifier:.2f}" if v.signals.classifier is not None else "-", + f"{v.signals.nli:.2f}" if v.signals.nli is not None else "-", + f"{v.signals.attribution:.2f}" if v.signals.attribution is not None else "-", + ] + for w, cell in zip(col_w, row, strict=True): + pdf.cell(w, 6, cell, border=1) + pdf.ln() + + tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) + try: + pdf.output(tmp.name) + finally: + tmp.close() + return tmp.name + + +def _export_json(result: AnalysisResult | None) -> str | None: + if result is None: + return None + tmp = tempfile.NamedTemporaryFile( + mode="w", suffix=".json", delete=False, encoding="utf-8" + ) + try: + tmp.write(result.model_dump_json(indent=2)) + finally: + tmp.close() + return tmp.name + + +def run( + text: str, + pdf_file: str | None, +) -> tuple[AnalysisResult, str, list[tuple[str, str]], dict[str, Any]]: + if pdf_file: + path = Path(pdf_file) + if path.stat().st_size > _MAX_PDF_BYTES: + raise ValueError( + f"PDF is too large ({path.stat().st_size / 1_048_576:.1f} MB). 
" + "Maximum is 5 MB." + ) + document = load_pdf(path) + else: + document = load_text(_validate_text(text)) + + result = analyse(document, AnalysisConfig()) + source_html = _render_source_html(document, set()) + return result, source_html, _to_highlighted(result), result.model_dump() + + +def build_app() -> Any: + import gradio as gr + + with gr.Blocks(title="SumLens", theme=gr.themes.Soft()) as demo: + gr.Markdown( + "# SumLens — Summary Faithfulness Dashboard\n" + "Paste text or upload a PDF. SumLens summarises it and flags sentences " + "that may be hallucinated.\n\n" + "**Green** = grounded · **Orange** = weakly grounded · **Red** = hallucinated \n" + "Click a summary sentence to highlight its attributed source spans." + ) + + result_state: gr.State = gr.State(value=None) + + with gr.Row(): + with gr.Column(): + gr.Markdown("### Source document") + source_html_out = gr.HTML(value=_SOURCE_PLACEHOLDER) + + with gr.Column(): + gr.Markdown("### Summary with faithfulness highlights") + summary_out = gr.HighlightedText( + label="Summary (click a sentence to highlight source spans)", + color_map=_LABEL_COLORS, + combine_adjacent=False, + show_legend=True, + ) + + with gr.Row(): + tau_h_slider = gr.Slider( + minimum=0.0, maximum=1.0, value=0.30, step=0.05, + label="τ hallucinated — below this → hallucinated (default 0.30)", + ) + tau_g_slider = gr.Slider( + minimum=0.0, maximum=1.0, value=0.70, step=0.05, + label="τ grounded — above this → grounded (default 0.70)", + ) + + with gr.Row(): + text_in = gr.Textbox( + label="Paste text", + lines=6, + placeholder="Paste your document here…", + ) + pdf_in = gr.File( + label="or upload PDF (≤ 5 MB)", + file_types=[".pdf"], + type="filepath", + ) + + with gr.Row(): + submit = gr.Button("Analyse", variant="primary") + json_dl = gr.DownloadButton("Export JSON", visible=False) + pdf_dl = gr.DownloadButton("Export PDF", visible=False) + + error_box = gr.Markdown(value="", visible=False) + + with gr.Accordion("Full result (JSON 
viewer)", open=False): + json_out = gr.JSON(label="AnalysisResult") + + def _handle( + text: str, pdf_file: str | None + ) -> tuple[Any, Any, Any, Any, Any, Any, Any, Any]: + try: + result, source_html, highlighted, payload = run(text, pdf_file) + json_path = _export_json(result) + pdf_path = _export_pdf(result) + return ( + result, + source_html, + highlighted, + payload, + gr.update(value=json_path, visible=True), + gr.update(value=pdf_path, visible=True), + gr.update(value="", visible=False), + gr.update(interactive=True), + ) + except ValueError as exc: + return ( + None, + _SOURCE_PLACEHOLDER, + None, + None, + gr.update(visible=False), + gr.update(visible=False), + gr.update(value=f"**Error:** {exc}", visible=True), + gr.update(interactive=True), + ) + + def _on_sentence_select( + evt: Any, result: AnalysisResult | None + ) -> str: + if result is None: + return _SOURCE_PLACEHOLDER + idx: int = int(evt.index) + sentences = result.summary.sentences + if idx >= len(sentences): + return _render_source_html(result.document, set()) + sentence_id = sentences[idx].id + verdict = next( + (v for v in result.verdicts if v.sentence_id == sentence_id), None + ) + highlighted = ( + set(verdict.evidence.top_source_sentence_ids) if verdict else set() + ) + return _render_source_html(result.document, highlighted) + + submit.click( + fn=lambda: gr.update(interactive=False), + inputs=[], + outputs=[submit], + ).then( + fn=_handle, + inputs=[text_in, pdf_in], + outputs=[ + result_state, source_html_out, summary_out, + json_out, json_dl, pdf_dl, error_box, submit, + ], + ) + + summary_out.select( + fn=_on_sentence_select, + inputs=[result_state], + outputs=[source_html_out], + ) + + for slider in (tau_h_slider, tau_g_slider): + slider.change( + fn=_apply_tau, + inputs=[result_state, tau_g_slider, tau_h_slider], + outputs=[summary_out], + ) + + return demo + + +if __name__ == "__main__": + build_app().launch() diff --git a/docs/data-model.md b/docs/data-model.md index 
81e2541..0a0317c 100644 --- a/docs/data-model.md +++ b/docs/data-model.md @@ -126,7 +126,7 @@ class AnalysisConfig(BaseModel): ## 3. Module interfaces — function signatures These are the only public functions each module exposes. Anything else is private. -Stick to these signatures; if Claude Code drifts, point it back here. +Stick to these signatures; if an implementation drifts, point it back here. ### `ingest.py` ```python @@ -233,7 +233,7 @@ This CSV is the centrepiece table of the report. --- -## 6. Order of implementation (Claude Code's worklist) +## 6. Order of implementation 1. `types.py` — write this first, completely. Everything else imports from it. 2. `ingest.py` + tests against a fixture PDF and a fixture string. @@ -248,4 +248,4 @@ This CSV is the centrepiece table of the report. 11. `scripts/train_fusion.py` — fits the LR, pickles it, replaces the identity fusion. 12. Real models swapped in last, one signal at a time, verifying on HPC. -**Rule for Claude Code: never run the real models in tests.** Mock at the module boundary. Real-model runs happen only in `scripts/evaluate.py` on HPC. +**Rule: never run the real models in tests.** Mock at the module boundary. Real-model runs happen only in `scripts/evaluate.py` on HPC. diff --git a/docs/mockup.html b/docs/mockup.html new file mode 100644 index 0000000..3679672 --- /dev/null +++ b/docs/mockup.html @@ -0,0 +1,334 @@ + + + + + + SumLens — UI Mockup + + + + + +
+

SumLens — Summary Faithfulness Dashboard

+

Paste text or upload a PDF · SumLens summarises it and flags sentences that may be hallucinated

+
+ Grounded + Weakly grounded + Hallucinated + + Click a summary sentence to highlight its source spans + +
+
+ + +
+ + +
+

Source document

+
+ The parliament met on Monday to discuss the proposed national budget for the + coming fiscal year. Lawmakers from every party debated the spending + priorities for several hours without reaching a clear consensus on the final + allocations. The finance minister presented projections covering health, + education, and transport infrastructure across the regions. Several members + raised concerns about the long-term sustainability of the proposed deficit + levels. No final figure for total expenditure was announced to the press by + the end of the day. +
+
↑ highlighted span: top attributed source sentence for selected summary sentence
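A minimal sketch of how the highlighted span could be produced, assuming source sentences are held in a plain `dict` keyed by sentence id. The function name `render_source` and the bare `<mark>` tag are illustrative assumptions, not the markup used by `app.py`; the only part taken from the diff is the idea of escaping each sentence and wrapping the attributed ones.

```python
import html


def render_source(sentences: dict[str, str], highlighted_ids: set[str]) -> str:
    """Join source sentences into one HTML string, marking the attributed ones.

    `sentences` maps a sentence id to its text (hypothetical shape).
    `highlighted_ids` holds the ids attributed to the selected summary sentence.
    """
    parts = []
    for sentence_id, text in sentences.items():
        safe = html.escape(text)  # never emit raw user text into HTML
        if sentence_id in highlighted_ids:
            parts.append(f"<mark>{safe}</mark>")  # assumed tag, for illustration
        else:
            parts.append(safe)
    return " ".join(parts)


if __name__ == "__main__":
    demo = {"s1": "Spending < revenue.", "s2": "No figure was announced."}
    print(render_source(demo, {"s2"}))
```

Escaping before wrapping keeps the highlight markup the only HTML in the output, so pasted text cannot inject tags of its own.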
+
+ + +
+

Summary — click a sentence to trace source spans

+
+ + Parliament debated the national budget on Monday. + + + The bill passed with a majority of 312 votes. + + + The finance minister outlined spending plans for key sectors. + + + Sustainability of deficit levels was questioned by members. + +
+
↑ blue outline = selected sentence · yellow = attributed source spans
+ + +
+ Signal breakdown — "The bill passed with a majority of 312 votes." + + + + + + + + + + +
| Signal | Score | Interpretation |
|--------|-------|----------------|
| Classifier (A) | 0.91 | high hallucination probability |
| NLI (B) | 0.14 | claim not entailed by source |
| Attribution (C) | 0.22 | low source attribution mass |
| Fused score | 0.12 | → hallucinated (below τ = 0.30) |
+
+
+ +
+ + +
+
+ + +
0.30
+
+
+ + +
0.70
+
+
+ Sliders re-flag sentences instantly · no model re-run +
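The re-flagging described above can be sketched as a pure function over stored scores, which is why no model re-run is needed. This is a minimal illustration that follows the comparison logic of `_apply_tau` in `app.py` (strictly below τ hallucinated, at or above τ grounded); the names `label_for` and `relabel` and the plain `dict` of scores are assumptions made for the example.

```python
def label_for(score: float, tau_hallucinated: float = 0.30, tau_grounded: float = 0.70) -> str:
    """Map one fused grounding score to a label using the two thresholds."""
    if score < tau_hallucinated:
        return "hallucinated"
    if score >= tau_grounded:
        return "grounded"
    return "weak"


def relabel(
    scores: dict[str, float], tau_hallucinated: float, tau_grounded: float
) -> dict[str, str]:
    """Re-label every sentence from its stored score (hypothetical helper)."""
    return {
        sentence_id: label_for(score, tau_hallucinated, tau_grounded)
        for sentence_id, score in scores.items()
    }


if __name__ == "__main__":
    stored = {"s1": 0.12, "s2": 0.55, "s3": 0.91}
    print(relabel(stored, 0.30, 0.70))  # default slider positions
    print(relabel(stored, 0.10, 0.70))  # lowering tau hallucinated re-flags s1
```

With the default thresholds, a fused score of 0.12 falls below 0.30 and is labelled hallucinated; moving the τ hallucinated slider to 0.10 moves the same sentence into the weakly grounded band.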
+
+ + +
+
+ + +
+
+ +
📄 Click to select a PDF
+
+
+ + +
+ + + + ⏳ Analysing… (≈ 30–60 s on GPU) + + Button disabled while pipeline runs · export buttons appear on completion + +
+ + + diff --git a/docs/use-case.puml b/docs/use-case.puml new file mode 100644 index 0000000..ea834b3 --- /dev/null +++ b/docs/use-case.puml @@ -0,0 +1,70 @@ +@startuml use-case +' UC-01 "Verify a Summary" + extensions +' Actors/use cases from requirements.md §7 + +left to right direction +skinparam packageStyle rectangle +skinparam usecase { + BackgroundColor White + BorderColor #555 + ArrowColor #555 +} + +' ── Actors ────────────────────────────────────────────────────────────────── +actor "Journalist\n(P1)" as P1 +actor "Policy Analyst\n(P2)" as P2 +actor "Financial Analyst\n(P3)" as P3 + +' ── System boundary ───────────────────────────────────────────────────────── +rectangle "SumLens" { + + ' Primary use case + usecase "UC-01\nVerify a Summary" as UC01 + + ' ── Included steps (always executed as part of UC-01) ──────────────────── + usecase "Submit Document\n(FR-01 / FR-02)" as UC_Submit + usecase "Validate Input\n(FR-01 / FR-02 / FR-04)" as UC_Validate + usecase "Summarise Document\n(FR-05 / FR-06)" as UC_Summarise + usecase "Compute & Fuse Signals\n(FR-07 / FR-08 / FR-09 / FR-10)" as UC_Signals + usecase "Render Annotated\nHeatmap\n(FR-12 / FR-14 / FR-15)" as UC_Heatmap + + ' ── Optional extensions (user-initiated after viewing results) ─────────── + usecase "Trace Source Spans\n(FR-13, US-06)" as UC_Trace + usecase "Adjust Threshold τ\n(FR-11, US-05)" as UC_Tau + usecase "Export JSON\n(FR-16, US-04)" as UC_JSON + usecase "Export PDF\n(FR-17, US-04)" as UC_PDF + + ' ── Error extensions ───────────────────────────────────────────────────── + usecase "Handle Invalid Input\n(FR-04, US-08)" as UC_ErrInput + usecase "Handle Model Failure\n(NFR-07)" as UC_ErrModel +} + +' ── Actor associations ─────────────────────────────────────────────────────── +P1 --> UC01 : upload PDF\n(US-01) +P1 --> UC_JSON : export result\n(US-04) +P1 --> UC_PDF : export result\n(US-04) + +P2 --> UC01 : paste text\n(US-07) +P2 --> UC_Trace : trace ignored spans\n(US-02) + +P3 --> UC01 
: verify figures\n(US-03) +P3 --> UC_Tau : tune sensitivity\n(US-05) + +' ── Include relationships (base case always invokes these) ────────────────── +UC01 ..> UC_Submit : <> +UC01 ..> UC_Validate : <> +UC01 ..> UC_Summarise : <> +UC01 ..> UC_Signals : <> +UC01 ..> UC_Heatmap : <> + +' ── Extend relationships (conditional / optional) ─────────────────────────── +UC_Trace ..> UC01 : <>\n[user clicks sentence] +UC_Tau ..> UC01 : <>\n[user adjusts slider] +UC_JSON ..> UC01 : <>\n[user requests export] +UC_PDF ..> UC01 : <>\n[user requests export] + +' ── Error extensions ──────────────────────────────────────────────────────── +UC_ErrInput ..> UC_Validate : <>\n[2a: invalid input] +UC_ErrModel ..> UC_Summarise : <>\n[3a: model failure] + +@enduml diff --git a/pyproject.toml b/pyproject.toml index 417b0be..8859c4b 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -15,6 +15,7 @@ dependencies = [ "gradio", "matplotlib", "scikit-learn", + "fpdf2", ] [project.optional-dependencies] @@ -23,7 +24,6 @@ dev = [ "mypy", "pytest", "pytest-cov", - "fpdf2", ] [build-system] @@ -56,6 +56,7 @@ module = [ "gradio", "gradio.*", "matplotlib", "matplotlib.*", "sklearn", "sklearn.*", + "fpdf", "fpdf.*", ] ignore_missing_imports = true diff --git a/scripts/evaluate.py b/scripts/evaluate.py index 5489990..293c47c 100644 --- a/scripts/evaluate.py +++ b/scripts/evaluate.py @@ -15,7 +15,7 @@ from sumlens.eval.ablation import ablation_table from sumlens.types import AnalysisConfig -_COLUMNS = ["condition", "precision", "recall", "f1", "ece"] +_COLUMNS = ["condition", "roc_auc", "pr_auc", "precision", "recall", "f1", "ece"] def _read(path: Path) -> list[dict[str, str]]: diff --git a/scripts/extract_features.py b/scripts/extract_features.py index e9a9c02..0e1503b 100644 --- a/scripts/extract_features.py +++ b/scripts/extract_features.py @@ -1,11 +1,12 @@ -"""Run signals A/B(/C) over a RAGTruth split and write a fusion features CSV. 
+"""Run signals A/B/C over a RAGTruth split and write a fusion features CSV. -For each summary sentence: classifier (A), NLI (B), and optionally attribution (C) -scores + the grounded gold label. Output feeds scripts/train_fusion.py. +For each summary sentence: classifier (A), NLI (B), and support attribution (C = +attr_conc + attr_loo) scores + the grounded gold label. Output feeds the ablation. -Attribution is off by default: RAGTruth summaries were not generated by our local -model, so Inseq attribution is not well-defined for them (see research-plan.md §8). -Enable with --with-attribution only when summaries come from our own summariser. +Signal C here is the generator-agnostic support attribution (signals/support.py), +derived from an NLI matrix, so it is well-defined for RAGTruth even though those +summaries were not generated by our local model (unlike Inseq attribution; see +research-plan.md §8). It therefore always runs. This runs the REAL models — launch on HPC (si-gpu / sbatch), not in CI. 
""" @@ -16,9 +17,9 @@ from sumlens.eval.features import FIELDNAMES, feature_rows from sumlens.eval.ragtruth import load_split -from sumlens.signals.attribution import attribute from sumlens.signals.classifier import classify from sumlens.signals.nli import entail, extract_claims +from sumlens.signals.support import support_attribution from sumlens.types import AnalysisConfig @@ -27,7 +28,6 @@ def main() -> None: parser.add_argument("--data-dir", type=Path, default=Path("data/ragtruth")) parser.add_argument("--split", default="train") parser.add_argument("--out", type=Path, default=Path("features.csv")) - parser.add_argument("--with-attribution", action="store_true") parser.add_argument("--limit", type=int, default=0, help="cap summaries (0 = all)") args = parser.parse_args() @@ -40,9 +40,9 @@ def main() -> None: for document, summary, hallucinated in examples: classifier_out = classify(document, summary, cfg) nli_out = entail(extract_claims(summary), document, cfg) - attribution_out = attribute(document, summary, cfg) if args.with_attribution else {} + support_out = support_attribution(document, summary, cfg) rows.extend( - feature_rows(summary, hallucinated, classifier_out, nli_out, attribution_out) + feature_rows(summary, hallucinated, classifier_out, nli_out, support_out) ) with args.out.open("w", encoding="utf-8", newline="") as fh: diff --git a/scripts/jobs/run_eval.sbatch b/scripts/jobs/run_eval.sbatch index efdbfc8..f68a47c 100755 --- a/scripts/jobs/run_eval.sbatch +++ b/scripts/jobs/run_eval.sbatch @@ -45,7 +45,9 @@ python -c "import torch; print('CUDA available:', torch.cuda.is_available())" # after a crash/timeout resumes instead of redoing finished work) --- echo ">>> extract features (train)"; [ -f features_train.csv ] || python scripts/extract_features.py --split train --data-dir data/ragtruth --out features_train.csv echo ">>> extract features (test)"; [ -f features_test.csv ] || python scripts/extract_features.py --split test --data-dir data/ragtruth 
--out features_test.csv -echo ">>> train fusion"; [ -f models/fusion.pkl ] || python scripts/train_fusion.py --features features_train.csv --out-dir models +# train_fusion (live model) is intentionally skipped: this experiment compares +# signals via the ablation, which fits its own per-subset models. Promote a live +# fusion model only after the ablation shows the new attribution signals help. echo ">>> ablation table"; [ -f ablation.csv ] || python scripts/evaluate.py --train features_train.csv --test features_test.csv --out ablation.csv echo "=== DONE $(date) ===" diff --git a/scripts/train_fusion.py b/scripts/train_fusion.py index 90eb67d..e3d693b 100644 --- a/scripts/train_fusion.py +++ b/scripts/train_fusion.py @@ -14,13 +14,19 @@ from sumlens.fuse import fit_fusion, fit_platt +def _num(value: str, impute: float = 0.5) -> float: + # A signal column is empty when that signal was off for the run (e.g. + # attribution is off for RAGTruth). Impute neutral, matching the ablation. + return impute if value == "" else float(value) + + def _read(path: Path) -> tuple[list[list[float]], list[int]]: features: list[list[float]] = [] grounded: list[int] = [] with path.open(encoding="utf-8") as fh: for row in csv.DictReader(fh): features.append( - [float(row["classifier"]), float(row["nli"]), float(row["attribution"])] + [_num(row["classifier"]), _num(row["nli"]), _num(row["attribution"])] ) grounded.append(int(row["grounded"])) return features, grounded diff --git a/sumlens/eval/ablation.py b/sumlens/eval/ablation.py index 25773a9..7daeb2e 100644 --- a/sumlens/eval/ablation.py +++ b/sumlens/eval/ablation.py @@ -1,22 +1,24 @@ """Ablation over signal subsets — the report's centrepiece table. 
-For each non-empty subset of {classifier (A), NLI (B), attribution (C)} we fit a -fusion LogisticRegression on the train rows (using only that subset's columns), -predict on the test rows, and report detection precision/recall/F1 (positive class -= hallucinated) plus the calibration error of the grounding probability. - -Rows are mappings with keys: classifier, nli, attribution (float or None/""), -and grounded (1 grounded / 0 hallucinated). Missing signal values are imputed. +For each non-empty subset of {classifier (A), NLI (B), attr_conc (C), attr_loo (D)} +we fit a fusion LogisticRegression on the train rows (using only that subset's +columns), predict on the test rows, and report detection precision/recall/F1 +(positive class = hallucinated) plus the calibration error of the grounding +probability. C and D are the two scalars of the generator-agnostic support +attribution (signals/support.py). + +Rows are mappings with keys: classifier, nli, attr_conc, attr_loo (float or +None/""), and grounded (1 grounded / 0 hallucinated). Missing values are imputed. """ from collections.abc import Mapping, Sequence from itertools import combinations -from sumlens.eval.metrics import expected_calibration_error +from sumlens.eval.metrics import expected_calibration_error, pr_auc, roc_auc from sumlens.fuse import fit_fusion -_SIGNALS = ("classifier", "nli", "attribution") -_LETTER = {"classifier": "A", "nli": "B", "attribution": "C"} +_SIGNALS = ("classifier", "nli", "attr_conc", "attr_loo") +_LETTER = {"classifier": "A", "nli": "B", "attr_conc": "C", "attr_loo": "D"} Row = Mapping[str, object] @@ -46,8 +48,15 @@ def _evaluate_combo( true_hallucinated = [1 - g for g in y_test] precision, recall, f1 = _prf(true_hallucinated, pred_hallucinated) + # Threshold-free detection metrics (positive class = hallucinated). f1 above + # is a single fixed-0.5 operating point and is misleading under the heavy + # hallucination class imbalance; roc_auc/pr_auc are the headline numbers. 
+ proba_hallucinated = [1.0 - p for p in grounded_proba] + return { "condition": "+".join(_LETTER[s] for s in combo), + "roc_auc": roc_auc(proba_hallucinated, true_hallucinated), + "pr_auc": pr_auc(proba_hallucinated, true_hallucinated), "precision": precision, "recall": recall, "f1": f1, diff --git a/sumlens/eval/features.py b/sumlens/eval/features.py index a9ca58d..3d56b75 100644 --- a/sumlens/eval/features.py +++ b/sumlens/eval/features.py @@ -1,16 +1,26 @@ """Assemble fusion training rows from per-sentence signal outputs + gold labels. -One row per summary sentence: the three signal scores (None if a signal did not -run for that sentence) and the grounded label (1 if grounded, 0 if the RAGTruth -gold marks the sentence hallucinated). This pure function is the testable core of -`scripts/extract_features.py`, which supplies the real signal outputs. +One row per summary sentence: the signal scores (None if a signal did not run for +that sentence) and the grounded label (1 if grounded, 0 if the RAGTruth gold marks +the sentence hallucinated). Signal C is the generator-agnostic support attribution +(`signals/support.py`), which yields two scalars per sentence: attr_conc (support +concentration) and attr_loo (best-supporter necessity margin). This pure function +is the testable core of `scripts/extract_features.py`. 
""" from collections.abc import Mapping from sumlens.types import Claim, Summary -FIELDNAMES = ["summary_id", "sentence_id", "classifier", "nli", "attribution", "grounded"] +FIELDNAMES = [ + "summary_id", + "sentence_id", + "classifier", + "nli", + "attr_conc", + "attr_loo", + "grounded", +] def feature_rows( @@ -18,7 +28,7 @@ def feature_rows( hallucinated_ids: list[str], classifier_out: dict[str, tuple[float, list[tuple[int, int]]]], nli_out: dict[str, tuple[float, list[Claim]]], - attribution_out: dict[str, tuple[float, list[str]]], + support_out: dict[str, tuple[float, float, list[str]]], ) -> list[dict[str, object]]: hallucinated = set(hallucinated_ids) rows: list[dict[str, object]] = [] @@ -27,15 +37,18 @@ def feature_rows( { "summary_id": summary.id, "sentence_id": sentence.id, - "classifier": _score(classifier_out, sentence.id), - "nli": _score(nli_out, sentence.id), - "attribution": _score(attribution_out, sentence.id), + "classifier": _at(classifier_out, sentence.id, 0), + "nli": _at(nli_out, sentence.id, 0), + "attr_conc": _at(support_out, sentence.id, 0), + "attr_loo": _at(support_out, sentence.id, 1), "grounded": 0 if sentence.id in hallucinated else 1, } ) return rows -def _score(signal_out: Mapping[str, tuple[float, object]], sentence_id: str) -> float | None: +def _at( + signal_out: Mapping[str, tuple[object, ...]], sentence_id: str, index: int +) -> float | None: entry = signal_out.get(sentence_id) - return entry[0] if entry is not None else None + return float(entry[index]) if entry is not None else None # type: ignore[arg-type] diff --git a/sumlens/eval/metrics.py b/sumlens/eval/metrics.py index 6a35054..a121fcb 100644 --- a/sumlens/eval/metrics.py +++ b/sumlens/eval/metrics.py @@ -22,6 +22,51 @@ def sentence_f1(preds: dict[str, set[str]], golds: dict[str, set[str]]) -> dict[ return {"precision": precision, "recall": recall, "f1": f1} +def roc_auc(scores: list[float], labels: list[int]) -> float: + """Threshold-free ROC-AUC (rank-based, ties 
averaged). `scores` rank the
+    positive class (label 1). Returns 0.0 if either class is absent."""
+    n_pos = sum(labels)
+    n_neg = len(labels) - n_pos
+    if not n_pos or not n_neg:
+        return 0.0
+    order = sorted(zip(scores, labels, strict=True), key=lambda p: p[0])
+    ranks = [0.0] * len(order)
+    i = 0
+    while i < len(order):
+        j = i
+        while j < len(order) and order[j][0] == order[i][0]:
+            j += 1
+        rank = (i + j - 1) / 2 + 1  # 1-based average rank for the tie group
+        for k in range(i, j):
+            ranks[k] = rank
+        i = j
+    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, order, strict=True) if label == 1)
+    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
+
+
+def pr_auc(scores: list[float], labels: list[int]) -> float:
+    """Average precision (area under precision-recall curve). `scores` rank the
+    positive class (label 1). Returns 0.0 if no positives. Better than ROC-AUC
+    under heavy class imbalance; chance level is the positive base rate."""
+    n_pos = sum(labels)
+    if not n_pos:
+        return 0.0
+    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
+    tp = fp = 0
+    ap = 0.0
+    prev_recall = 0.0
+    for i in order:
+        if labels[i] == 1:
+            tp += 1
+        else:
+            fp += 1
+        recall = tp / n_pos
+        precision = tp / (tp + fp)
+        ap += (recall - prev_recall) * precision
+        prev_recall = recall
+    return ap
+
+
 def expected_calibration_error(
     scores: list[float], labels: list[int], n_bins: int = 10
 ) -> float:
diff --git a/sumlens/signals/support.py b/sumlens/signals/support.py
new file mode 100644
index 0000000..1401dc4
--- /dev/null
+++ b/sumlens/signals/support.py
@@ -0,0 +1,51 @@
+"""Signal C (redesign) — generator-agnostic source attribution from an NLI matrix.
+
+Inseq attribution (`attribution.py`) is gradient-based and needs the *generating*
+model, so it is undefined for RAGTruth (external-model summaries). This signal
+derives attribution from entailment alone, so it is defined for any (source,
+summary) pair.
For each summary sentence ``s`` we score entailment against every +source sentence ``j``, ``M[s][j] = P(src_j entails s)``, then collapse the row: + +- ``attr_conc(s) = max_j M - mean_j M`` — support concentration. A grounded + sentence has one sharp supporter; a fabricated one has diffuse, flat-low support. +- ``attr_loo(s) = top1 - top2`` — necessity margin of the single best supporter. +- top-k source sentence ids — the UI heatmap (generator-free, no token offsets). + +Reuses signal B's NLI model and batched call. Pure given the NLI boundary, which +tests mock via `_get_nli`. Consumed by `scripts/extract_features.py`. +""" + +from sumlens.signals.nli import _entail_prob, _get_nli +from sumlens.types import AnalysisConfig, Document, Summary + +_BATCH_SIZE = 64 +_TOP_K = 5 + + +def support_attribution( + document: Document, summary: Summary, cfg: AnalysisConfig +) -> dict[str, tuple[float, float, list[str]]]: + """Per summary sentence: (attr_conc, attr_loo, top-k source sentence ids).""" + sources = document.sentences + sentences = summary.sentences + if not sentences or not sources: + return {s.id: (0.0, 0.0, []) for s in sentences} + + nli = _get_nli(cfg.nli_model) + pairs = [ + {"text": src.text, "text_pair": sent.text} for sent in sentences for src in sources + ] + batched = nli(pairs, top_k=None, batch_size=_BATCH_SIZE) + n = len(sources) + + results: dict[str, tuple[float, float, list[str]]] = {} + for i, sentence in enumerate(sentences): + row = [_entail_prob(scores) for scores in batched[i * n : (i + 1) * n]] + order = sorted(range(n), key=lambda j: row[j], reverse=True) + top1 = row[order[0]] + top2 = row[order[1]] if n > 1 else 0.0 + conc = top1 - sum(row) / n + loo = top1 - top2 + top_ids = [sources[j].id for j in order[:_TOP_K]] + results[sentence.id] = (conc, loo, top_ids) + return results diff --git a/tests/conftest.py b/tests/conftest.py index 72b4785..201e310 100644 --- a/tests/conftest.py +++ b/tests/conftest.py @@ -4,10 +4,14 @@ def 
_ensure_punkt() -> None: - try: - nltk.data.find("tokenizers/punkt/english.pickle") - except LookupError: - nltk.download("punkt", quiet=True) + for resource, package in [ + ("tokenizers/punkt/english.pickle", "punkt"), + ("tokenizers/punkt_tab/english/", "punkt_tab"), + ]: + try: + nltk.data.find(resource) + except LookupError: + nltk.download(package, quiet=True) _ensure_punkt() diff --git a/tests/test_ablation.py b/tests/test_ablation.py index 7e448f9..8fab62c 100644 --- a/tests/test_ablation.py +++ b/tests/test_ablation.py @@ -2,8 +2,8 @@ from sumlens.eval.ablation import ablation_table -_GROUNDED = {"classifier": 0.9, "nli": 0.8, "attribution": 0.7, "grounded": 1} -_HALLUCINATED = {"classifier": 0.1, "nli": 0.2, "attribution": 0.3, "grounded": 0} +_GROUNDED = {"classifier": 0.9, "nli": 0.8, "attr_conc": 0.7, "attr_loo": 0.6, "grounded": 1} +_HALLUCINATED = {"classifier": 0.1, "nli": 0.2, "attr_conc": 0.3, "attr_loo": 0.2, "grounded": 0} _ROWS = [_GROUNDED, _HALLUCINATED] * 10 @@ -11,19 +11,26 @@ def test_ablation_table_conditions_and_scores() -> None: table = ablation_table(_ROWS, _ROWS) conditions = {row["condition"] for row in table} - assert conditions == {"A", "B", "C", "A+B", "A+C", "B+C", "A+B+C"} + assert conditions == { + "A", "B", "C", "D", + "A+B", "A+C", "A+D", "B+C", "B+D", "C+D", + "A+B+C", "A+B+D", "A+C+D", "B+C+D", + "A+B+C+D", + } for row in table: - for key in ("precision", "recall", "f1", "ece"): + for key in ("roc_auc", "pr_auc", "precision", "recall", "f1", "ece"): assert isinstance(row[key], float) - fused = next(row for row in table if row["condition"] == "A+B+C") + fused = next(row for row in table if row["condition"] == "A+B+C+D") assert fused["f1"] == 1.0 # perfectly separable -> perfect detection + assert fused["roc_auc"] == 1.0 + assert fused["pr_auc"] == 1.0 def test_ablation_imputes_missing_attribution() -> None: - # attribution missing ("") on every row -> still runs via imputation - rows = [{**r, "attribution": ""} for r in 
_ROWS] + # attr_conc missing ("") on every row -> still runs via imputation + rows = [{**r, "attr_conc": ""} for r in _ROWS] table = ablation_table(rows, rows) c_only = next(row for row in table if row["condition"] == "C") assert isinstance(c_only["f1"], float) diff --git a/tests/test_app.py b/tests/test_app.py new file mode 100644 index 0000000..f14799c --- /dev/null +++ b/tests/test_app.py @@ -0,0 +1,326 @@ +"""App tests — display helpers and `run` with the pipeline mocked (no gradio, no weights).""" + +import json +from pathlib import Path + +import pytest + +import app as app_mod +from app import _apply_tau, _export_json, _export_pdf, _render_source_html, _to_highlighted, run +from sumlens.types import ( + AnalysisConfig, + AnalysisResult, + Document, + Evidence, + Sentence, + SentenceVerdict, + SignalScores, + Summary, +) + + +def _result() -> AnalysisResult: + document = Document( + id="doc-1", + raw_text="The bill passed. Budget is huge.", + sentences=[ + Sentence(id="src-0000", text="The bill passed.", char_start=0, char_end=16), + Sentence(id="src-0001", text="Budget is huge.", char_start=17, char_end=32), + ], + source="text", + ) + summary = Summary( + id="doc-1-summary", + document_id="doc-1", + text="Grounded one. 
Bad two.", + sentences=[ + Sentence(id="sum-0000", text="Grounded one.", char_start=0, char_end=13), + Sentence(id="sum-0001", text="Bad two.", char_start=14, char_end=22), + ], + model_name="m", + ) + verdicts = [ + SentenceVerdict( + sentence_id="sum-0000", + fused_score=0.9, + label="grounded", + signals=SignalScores(classifier=0.1, nli=0.9, attribution=None), + evidence=Evidence( + failed_claims=[], + top_source_sentence_ids=["src-0000"], + classifier_token_spans=[], + ), + ), + SentenceVerdict( + sentence_id="sum-0001", + fused_score=0.1, + label="hallucinated", + signals=SignalScores(classifier=0.9, nli=0.2, attribution=0.3), + evidence=Evidence( + failed_claims=[], + top_source_sentence_ids=["src-0001"], + classifier_token_spans=[], + ), + ), + ] + return AnalysisResult( + document=document, + summary=summary, + verdicts=verdicts, + config=AnalysisConfig(), + timings_ms={}, + ) + + +# --------------------------------------------------------------------------- +# _to_highlighted +# --------------------------------------------------------------------------- + + +def test_to_highlighted() -> None: + assert _to_highlighted(_result()) == [ + ("Grounded one. ", "grounded"), + ("Bad two. ", "hallucinated"), + ] + + +# --------------------------------------------------------------------------- +# _render_source_html +# --------------------------------------------------------------------------- + + +def test_render_source_html_no_highlights() -> None: + result = _result() + html = _render_source_html(result.document, set()) + assert "The bill passed." in html + assert "Budget is huge." in html + assert " None: + result = _result() + html = _render_source_html(result.document, {"src-0000"}) + assert " None: + doc = Document(id="d", raw_text="Raw text only.", sentences=[], source="text") + html = _render_source_html(doc, set()) + assert "Raw text only." 
in html + + +# --------------------------------------------------------------------------- +# run() +# --------------------------------------------------------------------------- + + +def test_run_text_input(monkeypatch: pytest.MonkeyPatch) -> None: + canned = _result() + text_doc = Document( + id="text", raw_text="Some pasted source text.", sentences=[], source="text" + ) + monkeypatch.setattr(app_mod, "load_text", lambda text: text_doc) + monkeypatch.setattr(app_mod, "analyse", lambda document, cfg: canned) + + result, source_html, highlighted, payload = run("Some pasted source text.", None) + + assert highlighted == [("Grounded one. ", "grounded"), ("Bad two. ", "hallucinated")] + assert payload == canned.model_dump() + assert "Some pasted source text" in source_html + assert result == canned + + +def test_run_prefers_pdf_when_given( + monkeypatch: pytest.MonkeyPatch, tmp_path: Path +) -> None: + fake_pdf = tmp_path / "report.pdf" + fake_pdf.write_bytes(b"%PDF-1.0") + + seen: dict[str, str] = {} + pdf_doc = Document(id="doc-1", raw_text="Source.", sentences=[], source="pdf") + + def _fake_load_pdf(path: Path) -> Document: + seen["path"] = str(path) + return pdf_doc + + monkeypatch.setattr(app_mod, "load_pdf", _fake_load_pdf) + monkeypatch.setattr(app_mod, "analyse", lambda document, cfg: _result()) + + run("ignored text", str(fake_pdf)) + + assert "report.pdf" in seen["path"] + + +def test_run_rejects_empty_text(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(app_mod, "analyse", lambda document, cfg: _result()) + with pytest.raises(ValueError, match="empty"): + run("", None) + + +def test_run_rejects_oversized_text(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(app_mod, "analyse", lambda document, cfg: _result()) + big_text = " ".join(["word"] * 10_001) + with pytest.raises(ValueError, match="too long"): + run(big_text, None) + + +def test_run_rejects_oversized_pdf( + monkeypatch: pytest.MonkeyPatch, tmp_path: Path +) -> None: + 
big_pdf = tmp_path / "big.pdf" + big_pdf.write_bytes(b"x" * (5 * 1024 * 1024 + 1)) + monkeypatch.setattr(app_mod, "load_pdf", lambda path: _result().document) + monkeypatch.setattr(app_mod, "analyse", lambda document, cfg: _result()) + with pytest.raises(ValueError, match="too large"): + run("", str(big_pdf)) + + +# --------------------------------------------------------------------------- +# F3 — click-to-highlight source spans +# --------------------------------------------------------------------------- + + +class _FakeSelectEvent: + """Minimal stand-in for gr.SelectData.""" + + def __init__(self, index: int) -> None: + self.index = index + + +def test_on_sentence_select_highlights_top_source_ids(monkeypatch: pytest.MonkeyPatch) -> None: + result = _result() + # Access the inner function via build_app — easier to test the logic directly + # by calling _render_source_html with the expected IDs (unit testing the helper). + verdict = result.verdicts[1] # hallucinated, top_source = ["src-0001"] + html = _render_source_html(result.document, set(verdict.evidence.top_source_sentence_ids)) + assert " None: + result = _result() + # Click sentence 0 → src-0000 highlighted + html0 = _render_source_html(result.document, {"src-0000"}) + # Click sentence 1 → src-0001 highlighted + html1 = _render_source_html(result.document, {"src-0001"}) + + assert "The bill passed." in html0 + assert html0.count(" None: + result = _result() + html = _render_source_html(result.document, set()) + assert " None: + assert _apply_tau(None, 0.70, 0.30) is None + + +def test_apply_tau_mirrors_fuse_label_logic() -> None: + result = _result() + # fused_score=0.9 → grounded at any reasonable tau + # fused_score=0.1 → hallucinated at any reasonable tau + spans = _apply_tau(result, 0.70, 0.30) + assert spans is not None + assert spans[0] == ("Grounded one. ", "grounded") + assert spans[1] == ("Bad two. 
", "hallucinated") + + +def test_apply_tau_relabels_without_model_rerun() -> None: + result = _result() + # Raise tau_grounded to 0.95 → fused_score=0.9 falls into "weak" + spans = _apply_tau(result, 0.95, 0.30) + assert spans is not None + assert spans[0][1] == "weak" # was grounded, now weak + assert spans[1][1] == "hallucinated" # unchanged + + +def test_apply_tau_boundary_score_lt_tau_h_is_hallucinated() -> None: + result = _result() + # fused_score=0.1 < tau_hallucinated=0.15 → hallucinated + spans = _apply_tau(result, 0.70, 0.15) + assert spans is not None + assert spans[1][1] == "hallucinated" + + +def test_apply_tau_boundary_score_gte_tau_g_is_grounded() -> None: + result = _result() + # fused_score=0.9 >= tau_grounded=0.85 → grounded + spans = _apply_tau(result, 0.85, 0.30) + assert spans is not None + assert spans[0][1] == "grounded" + + +# --------------------------------------------------------------------------- +# F5 — export JSON +# --------------------------------------------------------------------------- + + +def test_export_json_returns_none_without_result() -> None: + assert _export_json(None) is None + + +def test_export_json_creates_valid_file() -> None: + result = _result() + path = _export_json(result) + assert path is not None + data = json.loads(Path(path).read_text(encoding="utf-8")) + assert AnalysisResult.model_validate(data) == result + + +def test_export_json_schema_has_required_fields() -> None: + path = _export_json(_result()) + assert path is not None + data = json.loads(Path(path).read_text(encoding="utf-8")) + assert {"document", "summary", "verdicts", "config"} <= data.keys() + verdict = data["verdicts"][0] + assert {"sentence_id", "fused_score", "label", "signals", "evidence"} <= verdict.keys() + + +# --------------------------------------------------------------------------- +# F6 — export PDF +# --------------------------------------------------------------------------- + + +def test_export_pdf_returns_none_without_result() 
-> None: + assert _export_pdf(None) is None + + +def test_export_pdf_creates_pdf_file() -> None: + path = _export_pdf(_result()) + assert path is not None + content = Path(path).read_bytes() + assert content.startswith(b"%PDF"), "file must be a valid PDF" + + +def test_export_pdf_contains_summary_text() -> None: + import pdfplumber + + result = _result() + path = _export_pdf(result) + assert path is not None + with pdfplumber.open(path) as pdf: + text = " ".join(page.extract_text() or "" for page in pdf.pages) + for sentence in result.summary.sentences: + assert sentence.text[:10] in text diff --git a/tests/test_e2e.py b/tests/test_e2e.py new file mode 100644 index 0000000..dac0071 --- /dev/null +++ b/tests/test_e2e.py @@ -0,0 +1,140 @@ +"""End-to-end UI test — input → summary → export with ML models mocked at the +module boundary; no real weights loaded (FR-11/12/13). + +Mock seams: + summarise_mod._get_summariser + classifier_mod._get_detector + nli_mod._get_nli + attribution_mod._source_token_attributions +""" + +import json +from collections.abc import Callable +from pathlib import Path + +import pytest + +import app as app_mod +from app import _export_json, run +from sumlens import summarise as summarise_mod +from sumlens.signals import attribution as attribution_mod +from sumlens.signals import classifier as classifier_mod +from sumlens.signals import nli as nli_mod +from sumlens.types import AnalysisResult + +_RAW = "Parliament passed a bill on Monday. The budget is one trillion euros." +_SUMMARY = "Parliament passed the bill. The budget is one trillion euros." 
+ +_GROUNDED_TOKENS: list[dict[str, object]] = [{"token": "a", "pred": 0, "prob": 0.05}] +_HALLUCINATED_TOKENS: list[dict[str, object]] = [{"token": "a", "pred": 1, "prob": 0.95}] +_HALLUCINATED_SPANS: list[dict[str, object]] = [ + {"start": 0, "end": 3, "confidence": 0.9, "text": "x"} +] + + +class _FakeDetector: + def predict( + self, *, context: list[str], question: str, answer: str, output_format: str + ) -> list[dict[str, object]]: + grounded = answer.startswith("Parliament") + if output_format == "spans": + return [] if grounded else _HALLUCINATED_SPANS + return _GROUNDED_TOKENS if grounded else _HALLUCINATED_TOKENS + + +class _FakeNLI: + def __call__( + self, + pairs: list[dict[str, str]], + top_k: object = None, + batch_size: object = None, + ) -> list[list[dict[str, object]]]: + out: list[list[dict[str, object]]] = [] + for pair in pairs: + ent = 0.9 if "Parliament" in pair.get("text_pair", "") else 0.2 + out.append( + [ + {"label": "entailment", "score": ent}, + {"label": "neutral", "score": 1.0 - ent}, + ] + ) + return out + + +def _fake_summariser(model_name: str) -> Callable[..., list[dict[str, str]]]: + def _pipeline(text: str, **kwargs: object) -> list[dict[str, str]]: + return [{"summary_text": _SUMMARY}] + + return _pipeline + + +@pytest.fixture() +def mocked_pipeline(monkeypatch: pytest.MonkeyPatch) -> None: + """Install ML model mocks at the module boundary and a stub load_text.""" + from sumlens.ingest import load_text as real_load_text + + monkeypatch.setattr(summarise_mod, "_get_summariser", _fake_summariser) + monkeypatch.setattr(classifier_mod, "_get_detector", lambda model_path: _FakeDetector()) + monkeypatch.setattr(nli_mod, "_get_nli", lambda model_name: _FakeNLI()) + + def _fake_attr( + source_text: str, target_text: str, cfg: object + ) -> list[tuple[int, int, float]]: + return [(0, 10, 0.5), (11, 20, 0.3)] + + monkeypatch.setattr(attribution_mod, "_source_token_attributions", _fake_attr) + monkeypatch.setattr(app_mod, "load_text", 
real_load_text) + + +def test_e2e_run_returns_analysis_result(mocked_pipeline: None) -> None: + result, source_html, highlighted, payload = run(_RAW, None) + + assert isinstance(result, AnalysisResult) + assert len(result.summary.sentences) == 2 + assert len(highlighted) == 2 + + +def test_e2e_highlighted_colors_match_verdicts(mocked_pipeline: None) -> None: + result, _, highlighted, _ = run(_RAW, None) + verdict_map = {v.sentence_id: v.label for v in result.verdicts} + + for (_, label), sentence in zip(highlighted, result.summary.sentences, strict=True): + assert label == verdict_map.get(sentence.id, "weak") + + +def test_e2e_source_html_contains_document_text(mocked_pipeline: None) -> None: + _, source_html, _, _ = run(_RAW, None) + assert "Parliament" in source_html + assert "budget" in source_html + + +def test_e2e_export_json_round_trips(mocked_pipeline: None) -> None: + result, _, _, _ = run(_RAW, None) + path = _export_json(result) + + assert path is not None + data = json.loads(Path(path).read_text(encoding="utf-8")) + restored = AnalysisResult.model_validate(data) + assert restored == result + + +def test_e2e_no_real_model_weights_loaded(monkeypatch: pytest.MonkeyPatch) -> None: + """Guard: real model loaders must never be called during the test suite.""" + called: list[str] = [] + + def _guard(name: str) -> Callable[..., object]: + def _fail(*args: object, **kwargs: object) -> object: + called.append(name) + raise AssertionError(f"Real model loader invoked: {name}") + + return _fail + + monkeypatch.setattr(summarise_mod, "_get_summariser", _guard("_get_summariser")) + monkeypatch.setattr(classifier_mod, "_get_detector", _guard("_get_detector")) + monkeypatch.setattr(nli_mod, "_get_nli", _guard("_get_nli")) + monkeypatch.setattr( + attribution_mod, "_source_token_attributions", _guard("_source_token_attributions") + ) + + # Confirm none of the guards fired before any test action + assert called == [] diff --git a/tests/test_features.py 
b/tests/test_features.py index 41c5950..62c6d38 100644 --- a/tests/test_features.py +++ b/tests/test_features.py @@ -21,9 +21,10 @@ def test_feature_rows_labels_and_missing_signals() -> None: classifier_out = {"sum-0000": (0.1, []), "sum-0001": (0.9, [(0, 4)])} failed = Claim(id="c", sentence_id="sum-0001", text="x") nli_out = {"sum-0000": (0.8, []), "sum-0001": (0.2, [failed])} - attribution_out = {"sum-0001": (0.3, ["src-0000"])} # only the gated sentence has C + # support attribution: (attr_conc, attr_loo, top_source_ids); sum-0000 absent + support_out = {"sum-0001": (0.3, 0.15, ["src-0000"])} - rows = feature_rows(_summary(), ["sum-0001"], classifier_out, nli_out, attribution_out) + rows = feature_rows(_summary(), ["sum-0001"], classifier_out, nli_out, support_out) assert rows == [ { @@ -31,7 +32,8 @@ def test_feature_rows_labels_and_missing_signals() -> None: "sentence_id": "sum-0000", "classifier": 0.1, "nli": 0.8, - "attribution": None, # C did not run for this sentence + "attr_conc": None, # C did not run for this sentence + "attr_loo": None, "grounded": 1, }, { @@ -39,7 +41,8 @@ def test_feature_rows_labels_and_missing_signals() -> None: "sentence_id": "sum-0001", "classifier": 0.9, "nli": 0.2, - "attribution": 0.3, + "attr_conc": 0.3, + "attr_loo": 0.15, "grounded": 0, # marked hallucinated in gold }, ] diff --git a/tests/test_metrics.py b/tests/test_metrics.py index c030132..9a9d129 100644 --- a/tests/test_metrics.py +++ b/tests/test_metrics.py @@ -6,7 +6,9 @@ from sumlens.eval.metrics import ( expected_calibration_error, + pr_auc, reliability_diagram, + roc_auc, sentence_f1, ) @@ -41,6 +43,38 @@ def test_ece_empty() -> None: assert expected_calibration_error([], []) == 0.0 +def test_roc_auc_perfect_separation() -> None: + assert roc_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]) == 1.0 + + +def test_roc_auc_inverted_is_zero() -> None: + assert roc_auc([0.9, 0.8, 0.2, 0.1], [0, 0, 1, 1]) == 0.0 + + +def test_roc_auc_ties_give_half() -> None: + # all scores 
equal -> every pair tied -> AUC 0.5
+    assert roc_auc([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]) == 0.5
+
+
+def test_roc_auc_single_class_returns_zero() -> None:
+    assert roc_auc([0.1, 0.9], [1, 1]) == 0.0
+
+
+def test_pr_auc_perfect_separation() -> None:
+    assert pr_auc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]) == 1.0
+
+
+def test_pr_auc_floor_is_base_rate() -> None:
+    # best ranking: the only positive is first -> precision 1 at recall 1
+    assert pr_auc([0.4, 0.3, 0.2, 0.1], [1, 0, 0, 0]) == pytest.approx(1.0)
+    # worst ranking: the only positive is last -> precision 1/4 at recall 1
+    assert pr_auc([0.4, 0.3, 0.2, 0.1], [0, 0, 0, 1]) == pytest.approx(0.25)
+
+
+def test_pr_auc_no_positives_returns_zero() -> None:
+    assert pr_auc([0.1, 0.9], [0, 0]) == 0.0
+
+
 def test_reliability_diagram_writes_file(tmp_path: Path) -> None:
     out = tmp_path / "reliability.png"
     reliability_diagram([0.1, 0.4, 0.9, 0.95], [0, 0, 1, 1], out)
diff --git a/tests/test_support.py b/tests/test_support.py
new file mode 100644
index 0000000..d8ed9c0
--- /dev/null
+++ b/tests/test_support.py
@@ -0,0 +1,69 @@
+"""Support attribution (signal C) tests — NLI mocked at the `_get_nli` boundary."""
+
+import pytest
+
+from sumlens.signals import support as support_mod
+from sumlens.signals.support import support_attribution
+from sumlens.types import AnalysisConfig, Document, Sentence, Summary
+
+# Entailment lookup: (premise source sentence, hypothesis summary sentence) -> prob.
+_TABLE = {
+    ("Src A.", "Claim one."): 0.9,
+    ("Src B.", "Claim one."): 0.2,
+    ("Src C.", "Claim one."): 0.1,
+}
+
+
+class _FakeNLI:
+    def __call__(
+        self, pairs: list[dict[str, str]], top_k: object = None, batch_size: object = None
+    ) -> list[list[dict[str, object]]]:
+        return [
+            [
+                {"label": "entailment", "score": _TABLE[(p["text"], p["text_pair"])]},
+                {"label": "contradiction", "score": 0.0},
+            ]
+            for p in pairs
+        ]
+
+
+def _document() -> Document:
+    return Document(
+        id="doc-1",
+        raw_text="Src A. Src B.
Src C.", + sentences=[ + Sentence(id="src-0000", text="Src A.", char_start=0, char_end=6), + Sentence(id="src-0001", text="Src B.", char_start=7, char_end=13), + Sentence(id="src-0002", text="Src C.", char_start=14, char_end=20), + ], + source="text", + ) + + +def _summary() -> Summary: + return Summary( + id="doc-1-summary", + document_id="doc-1", + text="Claim one.", + sentences=[Sentence(id="sum-0000", text="Claim one.", char_start=0, char_end=10)], + model_name="m", + ) + + +def test_support_concentration_and_loo(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(support_mod, "_get_nli", lambda model_name: _FakeNLI()) + + result = support_attribution(_document(), _summary(), AnalysisConfig()) + + conc, loo, top_ids = result["sum-0000"] + # row = [0.9, 0.2, 0.1]: top1=0.9, top2=0.2, mean=0.4 + assert conc == pytest.approx(0.9 - 0.4) # peak minus mean + assert loo == pytest.approx(0.9 - 0.2) # best-supporter margin + assert top_ids[0] == "src-0000" # strongest supporting source first + + +def test_support_empty_source(monkeypatch: pytest.MonkeyPatch) -> None: + monkeypatch.setattr(support_mod, "_get_nli", lambda model_name: _FakeNLI()) + empty_doc = Document(id="d", raw_text="", sentences=[], source="text") + result = support_attribution(empty_doc, _summary(), AnalysisConfig()) + assert result == {"sum-0000": (0.0, 0.0, [])}
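
Reviewer note: the row collapse checked by `test_support_concentration_and_loo` can be reproduced without any NLI model. The sketch below is a standalone illustration, not part of the diff. `collapse_row` is a hypothetical helper that is assumed to mirror the arithmetic in `support_attribution` for one row of `P(src_j entails s)` values, and it returns source indices rather than sentence ids.

```python
def collapse_row(row: list[float], top_k: int = 5) -> tuple[float, float, list[int]]:
    """Hypothetical helper: collapse one row of entailment probabilities.

    Returns (attr_conc, attr_loo, indices of the top_k strongest supporters).
    """
    if not row:
        # No source sentences: same neutral fallback as the diff uses.
        return 0.0, 0.0, []
    n = len(row)
    # Source indices sorted from strongest to weakest supporter.
    order = sorted(range(n), key=lambda j: row[j], reverse=True)
    top1 = row[order[0]]
    top2 = row[order[1]] if n > 1 else 0.0
    conc = top1 - sum(row) / n  # peak minus mean: support concentration
    loo = top1 - top2           # margin of the single best supporter
    return conc, loo, order[:top_k]


conc, loo, top = collapse_row([0.9, 0.2, 0.1])
print(round(conc, 3), round(loo, 3), top)  # 0.5 0.7 [0, 1, 2]
```

With the test's row `[0.9, 0.2, 0.1]` the mean is 0.4, so `attr_conc` is 0.5 and `attr_loo` is 0.7, matching the assertions in `tests/test_support.py`.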