Release: full pipeline + evaluation + dashboard (v0.1.0) #16
Merged
Closes #F2
- app.py: source/summary two-panel layout, per-sentence green/orange/red highlight via gr.HighlightedText, button loading state as progress indicator, accordion for ingested source + JSON export, user-facing error messages for empty, oversized, and oversized-PDF input
- tests/test_app.py: 6 unit tests covering _to_highlighted, run() with text and PDF, empty/oversized text rejection, and oversized PDF rejection; all models are mocked at the analyse/load_text/load_pdf boundary, so no real weights are loaded
feat: two-panel Gradio UI with faithfulness heatmap (F2)
Closes #F3
- _render_source_html: renders source sentences as HTML; sentences whose IDs are in the highlighted set get a yellow <mark> span
- run() now returns the AnalysisResult object so build_app can store it in gr.State for the select handler
- summary_out.select() wired to _on_sentence_select: looks up the clicked sentence's verdict, extracts top_source_sentence_ids (≤5), and re-renders the source HTML with those spans highlighted; clicking another sentence switches the highlight
- Source panel promoted from accordion Textbox to always-visible gr.HTML in the left column so the two-panel layout is complete
- 6 new tests covering _render_source_html and the select highlight logic
html.escape() is applied to sentence.text and document.raw_text before interpolation into the gr.HTML panel; user-supplied content (pasted text or PDF extract) could otherwise inject arbitrary HTML.
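The escape-then-mark order described above can be illustrated with a minimal sketch. It is not the project's `_render_source_html`; the function name, the `(id, text)` tuple shape, and the space-joined output are assumptions made for illustration.

```python
import html
from typing import Iterable


def render_source_html(
    sentences: list[tuple[int, str]],
    highlighted: Iterable[int] = (),
) -> str:
    """Render (sentence_id, text) pairs as HTML, wrapping highlighted IDs in <mark>.

    Hypothetical stand-in: the real function works on the project's own types.
    """
    marked = set(highlighted)
    parts = []
    for sentence_id, text in sentences:
        # Escape first, so user-supplied text can never contribute markup.
        safe = html.escape(text)
        if sentence_id in marked:
            # The <mark> tag is added after escaping, so it survives as real HTML.
            safe = f"<mark>{safe}</mark>"
        parts.append(safe)
    return " ".join(parts)
```

Escaping before wrapping matters: doing it the other way round would also escape the `<mark>` tags and the highlight would render as literal text.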
Closes #F5
- _export_json: writes result.model_dump_json() to a temp file, returns the path; returns None if no result is loaded yet
- gr.DownloadButton wired alongside Analyse; becomes visible and populated with the file path once analysis completes, so one click downloads
- JSON conforms to the AnalysisResult schema from sumlens/types.py (no redefinition); round-trips via AnalysisResult.model_validate()
- 3 new tests: None guard, round-trip validation, required field presence
feat: click-to-highlight source spans
feat: export annotated result as JSON (F5)
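The export pattern in the F5 commit (None guard, write to a temp file, return the path, round-trip on read) can be sketched with the standard library alone. The `Result` dataclass below is a hypothetical stand-in for `AnalysisResult`, and `json.dump`/`asdict` stand in for the Pydantic `model_dump_json()` / `model_validate()` calls named in the commit.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class Result:
    """Hypothetical stand-in for AnalysisResult (the real schema lives in sumlens/types.py)."""

    summary: str
    labels: list[str] = field(default_factory=list)


def export_json(result: Optional[Result]) -> Optional[str]:
    """Write the result to a temporary .json file and return its path."""
    if result is None:
        # Nothing analysed yet: signal "no file" instead of raising.
        return None
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as handle:
        json.dump(asdict(result), handle)
        return handle.name
```

`delete=False` keeps the file on disk after the handle closes, which is what a download widget needs; cleaning the file up afterwards is left to the caller in this sketch.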
Closes #F7
- tests/test_e2e.py: 5 E2E tests driving app.run() through the full pipeline with ML models mocked at the module boundary (_get_summariser, _get_detector, _get_nli, _source_token_attributions); no real weights loaded. Covers: AnalysisResult shape, highlight colours matching verdicts, source HTML content, JSON export round-trip, model loader guard.
- tests/conftest.py: also downloads punkt_tab so NLTK works on newer versions that switched from the punkt to the punkt_tab format.
test: end-to-end UI test with model boundary mocks (F7)
Closes #F4
- _apply_tau: re-labels summary sentences from stored fused_score values using the same logic as fuse.label (score < tau_h → hallucinated, score >= tau_g → grounded); no model re-run
- Two sliders (tau_hallucinated / tau_grounded) added to the UI; each slider.change() calls _apply_tau and updates summary_out only, so the pipeline is never invoked again
- 5 new tests covering None guard, label parity with fuse.label, and boundary conditions at both thresholds
feat: adjustable threshold tau with client-side re-flagging (F4)
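The re-labelling rule in the F4 commit can be sketched as below. The two comparisons come from the commit message; the function names and the `"uncertain"` name for the middle band are assumptions, since the commit names only the two outer labels.

```python
from typing import Sequence


def label(score: float, tau_h: float, tau_g: float) -> str:
    """Map one fused score to a label using the two thresholds."""
    if score < tau_h:
        return "hallucinated"
    if score >= tau_g:
        return "grounded"
    # Assumed name for the middle band (tau_h <= score < tau_g).
    return "uncertain"


def apply_tau(fused_scores: Sequence[float], tau_h: float, tau_g: float) -> list[str]:
    """Re-label stored scores; no model is invoked."""
    return [label(score, tau_h, tau_g) for score in fused_scores]
```

For example, `apply_tau([0.1, 0.3, 0.7], 0.3, 0.7)` labels the score exactly at `tau_h` as the middle band and the score exactly at `tau_g` as grounded, which is the boundary behaviour the commit's tests are said to cover.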
Closes #F6
- _export_pdf: renders summary sentences with colour-coded backgrounds (light green/orange/red per label) via fpdf2; includes model/source metadata, legend, and signal-scores table
- _latin1: sanitises text for Helvetica core font (Latin-1 only); avoids FPDFUnicodeEncodingException on non-Latin-1 characters
- gr.DownloadButton for PDF wired alongside JSON button; both become visible after analysis completes
- fpdf2 moved from dev to main dependencies (needed at app runtime)
- 3 new tests: None guard, valid PDF header, summary text verified via pdfplumber extraction
feat: export annotated PDF (F6)
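One way a Latin-1 sanitiser like `_latin1` might work is an encode/decode round trip with replacement. This is a sketch of the general technique, not the project's code; the commit does not say whether unsupported characters are replaced, dropped, or transliterated.

```python
def latin1(text: str) -> str:
    """Return text containing only Latin-1 characters.

    Assumption: characters outside Latin-1 become '?'; the real helper may
    choose a different strategy.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")
```

Accented Western European characters such as `é` pass through unchanged, while characters outside the range (CJK text, typographic dashes, emoji) each collapse to a single `?`.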
Closes #F8
Replaces scaffold placeholder with a full user guide covering: hardware requirements (GPU/CPU with expected times), installation, running the app, step-by-step usage (load → analyse → interpret → trace source spans → adjust tau → export), result interpretation notes, JSON/PDF export, and project layout.
docs: README + usage walkthrough (F8)
Closes #F9
docs/use-case.puml: PlantUML use-case diagram covering UC-01 'Verify a Summary' and all extensions from requirements.md §7:
- 3 actors: Journalist (P1), Policy Analyst (P2), Financial Analyst (P3)
- 5 included steps: Submit Document, Validate Input, Summarise, Compute & Fuse Signals, Render Heatmap
- 4 user-initiated extensions: Trace Source Spans, Adjust tau, Export JSON, Export PDF
- 2 error extensions: Handle Invalid Input (2a), Handle Model Failure (3a)
- all FRs and US IDs annotated on each use case
docs: UML use-case diagram (F9)
Static HTML mockup covering source panel, summary heatmap, threshold sliders, input area, Analyse/Export buttons, legend, and signal-scores table. README documentation section links to the new file.
docs: two-panel dashboard UI wireframe (F10)
Replace the gradient-Inseq attribution (signal C), which is undefined for
RAGTruth's external-model summaries, with a generator-agnostic support
attribution derived from the sentence-level NLI matrix: attr_conc (support
concentration) and attr_loo (best-supporter margin), as signals C and D.
These are well-defined for any (source, summary) pair.
Add threshold-free roc_auc / pr_auc to the ablation so the report is not
read off a single fixed-0.5 operating point, which is misleading under the
~5% hallucination base rate. The ablation now covers every non-empty subset
of {A classifier, B nli, C attr_conc, D attr_loo}.
- sumlens/signals/support.py: support_attribution() over a shared NLI matrix
- eval/features.py: attr_conc/attr_loo columns (drop legacy "attribution")
- eval/ablation.py: 4 signals, 15 conditions, emit roc_auc/pr_auc
- eval/metrics.py: roc_auc (rank-based) + pr_auc (average precision)
- train_fusion.py: impute empty signal columns to neutral 0.5
- scripts/jobs/run_eval.sbatch: ablation fits its own per-subset models
feat(eval): generator-agnostic attribution signals + threshold-free metrics
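The two threshold-free metrics and the 15-condition ablation grid named in this commit could look roughly like the following pure-Python sketch. It is not the code in eval/metrics.py or eval/ablation.py: the function names, the convention that label 1 marks the positive class, and the signal name strings are assumptions.

```python
from itertools import combinations
from typing import Sequence

# Assumed identifiers for signals A-D.
SIGNALS = ("classifier", "nli", "attr_conc", "attr_loo")


def signal_subsets(signals: Sequence[str] = SIGNALS) -> list[tuple[str, ...]]:
    """Every non-empty subset of the signals (2**4 - 1 = 15 for four signals)."""
    return [c for r in range(1, len(signals) + 1) for c in combinations(signals, r)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based ROC-AUC (Mann-Whitney U), using average ranks for tied scores."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def pr_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean precision at the rank of each positive.

    Simplification: tied scores are not grouped into a single threshold.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return total / hits
```

Both metrics depend only on the ordering of the scores, so they do not move when the decision threshold does, which is the property the commit relies on under the low hallucination base rate.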
docs: tidy implementation-order wording in data-model
First stable release: the full SumLens pipeline plus evaluation and UI.
Engine: ingestion, BART summariser, three families of faithfulness signals (classifier, NLI, and generator-agnostic support attribution, which contributes attr_conc and attr_loo), logistic-regression fusion, and the end-to-end analyse() pipeline. Tests mock every model at the module boundary, so CI runs without weights.
Evaluation: RAGTruth harness and signal ablation over every non-empty subset of {A,B,C,D}, reported with threshold-free ROC-AUC / PR-AUC (fixed-0.5 F1 is misleading at the ~5% hallucination base rate). On the held-out split, fusion raises ROC-AUC from 0.791 (best single signal) to 0.835.
UI slice (F2-F10): two-panel Gradio dashboard with faithfulness heatmap, click-to-highlight, adjustable threshold, JSON + PDF export, end-to-end UI test, README, use-case diagram, and wireframe mockup.
CI green (ruff + mypy + pytest, coverage gate) on every merged PR.