Release: full pipeline + evaluation + dashboard (v0.1.0) by bacemtayeb · Pull Request #16 · bacemtayeb/SumLens

bacemtayeb · 2026-06-17T12:59:52Z

First stable release: the full SumLens pipeline plus evaluation and UI.

Engine: ingestion, BART summariser, three faithfulness signals (classifier, NLI, generator-agnostic support attribution), logistic-regression fusion, and the end-to-end analyse() pipeline. All tests mock the model boundary, so CI runs without weights.

Evaluation: RAGTruth harness and a signal ablation over every non-empty subset of {A, B, C, D} (15 conditions), reported with threshold-free ROC-AUC / PR-AUC because F1 at a fixed 0.5 threshold is misleading at the ~5% hallucination base rate. On the held-out split, fusion raises ROC-AUC from 0.791 (best single signal) to 0.835.

UI slice (F2-F10): two-panel Gradio dashboard with faithfulness heatmap, click-to-highlight, adjustable threshold, JSON + PDF export, end-to-end UI test, README, use-case diagram, and wireframe mockup.
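The ablation grid can be enumerated in a few lines. This is a minimal sketch, not the code in eval/ablation.py; the helper name `non_empty_subsets` is hypothetical, and the letters follow the mapping given in the eval commit message (A classifier, B nli, C attr_conc, D attr_loo).

```python
from itertools import combinations

# Signal letters as named in the eval commit message:
# A classifier, B nli, C attr_conc, D attr_loo.
SIGNALS = ("A", "B", "C", "D")


def non_empty_subsets(signals):
    """Yield every non-empty subset of `signals`, smallest subsets first.

    Hypothetical helper for illustration only.
    """
    for size in range(1, len(signals) + 1):
        for subset in combinations(signals, size):
            yield subset


conditions = list(non_empty_subsets(SIGNALS))
print(len(conditions))  # 2**4 - 1 = 15 ablation conditions
```

Four signals give 2^4 - 1 = 15 conditions once the empty set is excluded, which matches the count in the eval commit.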

CI green (ruff + mypy + pytest, coverage gate) on every merged PR.

Closes #F2

- app.py: source/summary two-panel layout, per-sentence green/orange/red highlight via gr.HighlightedText, button loading state as progress indicator, accordion for ingested source + JSON export, user-facing error messages for empty/oversized/oversized-PDF input
- tests/test_app.py: 6 unit tests covering _to_highlighted, run() with text and PDF, empty/oversized text rejection, oversized PDF rejection; all models mocked at the analyse/load_text/load_pdf boundary, so no real weights are loaded

feat: two-panel Gradio UI with faithfulness heatmap (F2)

Closes #F3

- _render_source_html: renders source sentences as HTML; sentences whose IDs are in the highlighted set get a yellow <mark> span
- run() now returns the AnalysisResult object so build_app can store it in gr.State for the select handler
- summary_out.select() wired to _on_sentence_select: looks up the clicked sentence's verdict, extracts top_source_sentence_ids (≤5), and re-renders the source HTML with those spans highlighted; clicking another sentence switches the highlight
- Source panel promoted from accordion Textbox to always-visible gr.HTML in the left column so the two-panel layout is complete
- 6 new tests covering _render_source_html and the select highlight logic

html.escape() applied to sentence.text and document.raw_text before interpolation into the gr.HTML panel — user-supplied content (pasted text or PDF extract) could otherwise inject arbitrary HTML.
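A minimal sketch of the escape-then-mark pattern described here, using only the standard library. The function `render_source_html` and its `(sentence_id, text)` input shape are assumptions made for illustration; they are not the project's `_render_source_html`.

```python
import html


def render_source_html(sentences, highlighted_ids):
    """Render (sentence_id, text) pairs as HTML, marking highlighted ones.

    Hypothetical helper: the input shape is an assumption for this sketch.
    """
    parts = []
    for sentence_id, text in sentences:
        # Escape first, so user text can never introduce tags of its own.
        safe = html.escape(text)
        if sentence_id in highlighted_ids:
            # The only markup in the output is the markup added here.
            safe = f"<mark>{safe}</mark>"
        parts.append(safe)
    return " ".join(parts)


sentences = [(0, "Revenue rose 4%."), (1, "<script>alert(1)</script>")]
print(render_source_html(sentences, {1}))
# Revenue rose 4%. <mark>&lt;script&gt;alert(1)&lt;/script&gt;</mark>
```

Escaping before wrapping matters: wrapping first and escaping afterwards would also escape the `<mark>` tags.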

Closes #F5

- _export_json: writes result.model_dump_json() to a temp file, returns the path; returns None if no result is loaded yet
- gr.DownloadButton wired alongside Analyse; becomes visible and populated with the file path once analysis completes, so one click downloads
- JSON conforms to the AnalysisResult schema from sumlens/types.py (no redefinition); round-trips via AnalysisResult.model_validate()
- 3 new tests: None guard, round-trip validation, required field presence

feat: click-to-highlight source spans

feat: export annotated result as JSON (F5)

Closes #F7

- tests/test_e2e.py: 5 E2E tests driving app.run() through the full pipeline with ML models mocked at the module boundary (_get_summariser, _get_detector, _get_nli, _source_token_attributions); no real weights loaded. Covers: AnalysisResult shape, highlight colours matching verdicts, source HTML content, JSON export round-trip, model loader guard.
- tests/conftest.py: also downloads punkt_tab so NLTK works on newer versions that switched from punkt to punkt_tab format.

test: end-to-end UI test with model boundary mocks (F7)

Closes #F4

- _apply_tau: re-labels summary sentences from stored fused_score values using the same logic as fuse.label (score < tau_h → hallucinated, score >= tau_g → grounded); no model re-run
- Two sliders (tau_hallucinated / tau_grounded) added to the UI; each slider.change() calls _apply_tau and updates summary_out only, so the pipeline is never invoked again
- 5 new tests covering None guard, label parity with fuse.label, and boundary conditions at both thresholds
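The thresholding rule stated above can be sketched as a pure function over stored scores. This is an illustration under assumptions, not the project's `_apply_tau` or `fuse.label`: the middle-band label name `"uncertain"` and the helper names `label` and `relabel` are hypothetical, and the sketch assumes `tau_h <= tau_g`.

```python
def label(score, tau_h, tau_g):
    """Label one fused score using the rule from the commit message.

    score < tau_h   -> "hallucinated"
    score >= tau_g  -> "grounded"
    otherwise       -> "uncertain" (assumed name for the middle band)
    """
    if tau_h > tau_g:
        raise ValueError("expected tau_h <= tau_g")
    if score < tau_h:
        return "hallucinated"
    if score >= tau_g:
        return "grounded"
    return "uncertain"


def relabel(fused_scores, tau_h, tau_g):
    """Re-label stored scores without re-running any model."""
    return [label(s, tau_h, tau_g) for s in fused_scores]


scores = [0.1, 0.3, 0.5, 0.7, 0.9]
print(relabel(scores, tau_h=0.3, tau_g=0.7))
# ['hallucinated', 'uncertain', 'uncertain', 'grounded', 'grounded']
```

The boundaries are asymmetric: a score exactly equal to `tau_h` is not hallucinated, while a score exactly equal to `tau_g` is grounded.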

feat: adjustable threshold tau with client-side re-flagging (F4)

Closes #F6

- _export_pdf: renders summary sentences with colour-coded backgrounds (light green/orange/red per label) via fpdf2; includes model/source metadata, legend, and signal-scores table
- _latin1: sanitises text for Helvetica core font (Latin-1 only); avoids FPDFUnicodeEncodingException on non-Latin-1 characters
- gr.DownloadButton for PDF wired alongside JSON button; both become visible after analysis completes
- fpdf2 moved from dev to main dependencies (needed at app runtime)
- 3 new tests: None guard, valid PDF header, summary text verified via pdfplumber extraction
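One common way to make text safe for a Latin-1-only core font is an encode/decode round trip with replacement. The sketch below shows that approach with the standard library only; `to_latin1` is a hypothetical name, and the project's `_latin1` may handle unsupported characters differently (for example by transliterating them).

```python
def to_latin1(text: str) -> str:
    """Replace every character outside Latin-1 with '?'.

    Hypothetical helper: characters that Latin-1 can represent (such as
    'é') pass through unchanged; anything else becomes '?'.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")


print(to_latin1("café \u2013 ok"))  # café ? ok
```

The accented "é" survives because it is part of Latin-1, while the en dash (U+2013) is not and is replaced.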

feat: export annotated PDF (F6)

Closes #F8

Replaces scaffold placeholder with a full user guide covering: hardware requirements (GPU/CPU with expected times), installation, running the app, step-by-step usage (load → analyse → interpret → trace source spans → adjust tau → export), result interpretation notes, JSON/PDF export, and project layout.

docs: README + usage walkthrough (F8)

Closes #F9

docs/use-case.puml: PlantUML use-case diagram covering UC-01 'Verify a Summary' and all extensions from requirements.md §7:

- 3 actors: Journalist (P1), Policy Analyst (P2), Financial Analyst (P3)
- 5 included steps: Submit Document, Validate Input, Summarise, Compute & Fuse Signals, Render Heatmap
- 4 user-initiated extensions: Trace Source Spans, Adjust tau, Export JSON, Export PDF
- 2 error extensions: Handle Invalid Input (2a), Handle Model Failure (3a)
- all FRs and US IDs annotated on each use case

docs: UML use-case diagram (F9)

Static HTML mockup covering source panel, summary heatmap, threshold sliders, input area, Analyse/Export buttons, legend, and signal-scores table. README documentation section links to the new file.

docs: two-panel dashboard UI wireframe (F10)

…etrics

Replace the gradient-Inseq attribution (signal C), which is undefined for RAGTruth's external-model summaries, with a generator-agnostic support attribution derived from the sentence-level NLI matrix: attr_conc (support concentration) and attr_loo (best-supporter margin), as signals C and D. These are well-defined for any (source, summary) pair.

Add threshold-free roc_auc / pr_auc to the ablation so the report is not read off a single fixed-0.5 operating point, which is misleading under the ~5% hallucination base rate. The ablation now covers every non-empty subset of {A classifier, B nli, C attr_conc, D attr_loo}.

- sumlens/signals/support.py: support_attribution() over a shared NLI matrix
- eval/features.py: attr_conc/attr_loo columns (drop legacy "attribution")
- eval/ablation.py: 4 signals, 15 conditions, emit roc_auc/pr_auc
- eval/metrics.py: roc_auc (rank-based) + pr_auc (average precision)
- train_fusion.py: impute empty signal columns to neutral 0.5
- scripts/jobs/run_eval.sbatch: ablation fits its own per-subset models
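For reference, the two threshold-free metrics named above can be written in plain Python. This is a sketch of the standard definitions (rank-based ROC-AUC via the Mann-Whitney statistic with midranks for ties, and average precision), not the contents of eval/metrics.py. It assumes label 1 marks the positive class, and the average-precision loop does not group tied scores.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: probability a positive outranks a negative."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def pr_auc(labels, scores):
    """Average precision: mean of precision at each positive's rank."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))           # 0.75
print(round(pr_auc(labels, scores), 4))  # 0.8333
```

Neither function takes a threshold, which is why they are not tied to a single operating point such as 0.5.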

feat(eval): generator-agnostic attribution signals + threshold-free metrics

docs: tidy implementation-order wording in data-model

Davidpereira2803 and others added 23 commits June 8, 2026 18:50

Merge pull request #4 from bacemtayeb/feat/frontend-ui

f389c11

feat: two-panel Gradio UI with faithfulness heatmap (F2)

fix: escape user text in source HTML to prevent XSS

0eb39e6

html.escape() applied to sentence.text and document.raw_text before interpolation into the gr.HTML panel — user-supplied content (pasted text or PDF extract) could otherwise inject arbitrary HTML.

Merge pull request #5 from bacemtayeb/feat/frontend-ui

d2f8e8a

feat: click-to-highlight source spans

Merge pull request #6 from bacemtayeb/feat/export-json

e102059

feat: export annotated result as JSON (F5)

Merge pull request #7 from bacemtayeb/feat/e2e-test

2c77218

test: end-to-end UI test with model boundary mocks (F7)

Merge pull request #8 from bacemtayeb/feat/tau-slider

7ac7d6c

feat: adjustable threshold tau with client-side re-flagging (F4)

Merge pull request #9 from bacemtayeb/feat/export-pdf

7614e67

feat: export annotated PDF (F6)

Merge pull request #10 from bacemtayeb/feat/readme

d67da09

docs: README + usage walkthrough (F8)

Merge pull request #11 from bacemtayeb/feat/readme

cf1e12d

docs: UML use-case diagram (F9)

docs: two-panel dashboard UI wireframe (F10)

11c6989

Static HTML mockup covering source panel, summary heatmap, threshold sliders, input area, Analyse/Export buttons, legend, and signal-scores table. README documentation section links to the new file.

Merge pull request #12 from bacemtayeb/feat/mockup

1fc3d1d

docs: two-panel dashboard UI wireframe (F10)

docs: tidy implementation-order wording in data-model

f214d08

Merge pull request #14 from bacemtayeb/feat/eval-attribution-signals

f824c7e

feat(eval): generator-agnostic attribution signals + threshold-free metrics

Merge pull request #15 from bacemtayeb/docs/data-model-wording

9128e12

docs: tidy implementation-order wording in data-model

bacemtayeb merged commit a5fa233 into main Jun 17, 2026
1 check passed

bacemtayeb merged 23 commits into main from dev


2 participants
