Release: full pipeline + evaluation + dashboard (v0.1.0)#16

Merged
bacemtayeb merged 23 commits into main from dev on Jun 17, 2026

@bacemtayeb

Owner

First stable release: the full SumLens pipeline plus evaluation and UI.

Engine: ingestion, BART summariser, three faithfulness signal families (classifier, NLI, and generator-agnostic support attribution, which contributes two signals, C and D), logistic-regression fusion, and the end-to-end analyse() pipeline. All tests mock the model boundary, so CI runs without weights.

Evaluation: RAGTruth harness and a signal ablation over every non-empty subset of {A,B,C,D}, reported with threshold-free ROC-AUC / PR-AUC (F1 at a fixed 0.5 threshold is misleading at the ~5% hallucination base rate). On the held-out split, fusion improves on the best single signal, from 0.791 to 0.835 ROC-AUC.

UI slice (F2-F10): two-panel Gradio dashboard with faithfulness heatmap, click-to-highlight, adjustable threshold, JSON + PDF export, end-to-end UI test, README, use-case diagram, and wireframe mockup.

CI green (ruff + mypy + pytest, coverage gate) on every merged PR.

Davidpereira2803 and others added 23 commits June 8, 2026 18:50
Closes #F2

- app.py: source/summary two-panel layout, per-sentence green/orange/red
  highlight via gr.HighlightedText, button loading state as progress indicator,
  accordion for ingested source + JSON export, user-facing error messages for
  empty text, oversized text, and oversized PDF input
- tests/test_app.py: 6 unit tests covering _to_highlighted, run() with text
  and PDF, empty/oversized text rejection, oversized PDF rejection; all models
  mocked at the analyse/load_text/load_pdf boundary — no real weights loaded
feat: two-panel Gradio UI with faithfulness heatmap (F2)
Closes #F3

- _render_source_html: renders source sentences as HTML; sentences whose IDs
  are in the highlighted set get a yellow <mark> span
- run() now returns the AnalysisResult object so build_app can store it in
  gr.State for the select handler
- summary_out.select() wired to _on_sentence_select: looks up the clicked
  sentence's verdict, extracts top_source_sentence_ids (≤5), and re-renders
  the source HTML with those spans highlighted; clicking another sentence
  switches the highlight
- Source panel promoted from accordion Textbox to always-visible gr.HTML in
  the left column so the two-panel layout is complete
- 6 new tests covering _render_source_html and the select highlight logic
html.escape() applied to sentence.text and document.raw_text before
interpolation into the gr.HTML panel — user-supplied content (pasted
text or PDF extract) could otherwise inject arbitrary HTML.
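
The escaping and highlighting described above can be sketched as follows. This is a minimal illustration, not the code in app.py: the function name `render_source_html` and the `(sentence_id, text)` input shape are assumptions made for the example.

```python
import html


def render_source_html(sentences, highlighted_ids=frozenset()):
    """Render source sentences as an HTML string.

    `sentences` is an iterable of (sentence_id, text) pairs; this shape is
    an assumption for the sketch, not the project's Document type.
    """
    parts = []
    for sentence_id, text in sentences:
        # Escape first, so user-supplied text cannot inject markup.
        safe = html.escape(text)
        if sentence_id in highlighted_ids:
            # Only our own, trusted <mark> tag is added after escaping.
            safe = f"<mark>{safe}</mark>"
        parts.append(safe)
    return " ".join(parts)
```

Escaping before wrapping matters: if the `<mark>` tag were added first, `html.escape` would neutralise it along with the user content.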
Closes #F5

- _export_json: writes result.model_dump_json() to a temp file, returns
  the path; returns None if no result is loaded yet
- gr.DownloadButton wired alongside Analyse; becomes visible and populated
  with the file path once analysis completes — one click downloads
- JSON conforms to the AnalysisResult schema from sumlens/types.py (no
  redefinition); round-trips via AnalysisResult.model_validate()
- 3 new tests: None guard, round-trip validation, required field presence
feat: click-to-highlight source spans
feat: export annotated result as JSON (F5)
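
A rough sketch of the export path described in the F5 commit, using only the standard library. `ResultStandIn` is a hypothetical stand-in for AnalysisResult (which the commit describes as exposing `model_dump_json()`); its fields are invented for illustration and do not reflect the real schema in sumlens/types.py.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class ResultStandIn:
    """Hypothetical stand-in for AnalysisResult; fields are illustrative only."""
    summary: str
    labels: list = field(default_factory=list)

    def model_dump_json(self):
        # Mirrors the method name used in the commit message.
        return json.dumps(asdict(self))


def export_json(result):
    """Write the result to a temp .json file and return its path.

    Returns None when no result is loaded yet, matching the guard
    described in the commit.
    """
    if result is None:
        return None
    with tempfile.NamedTemporaryFile(
        "w", suffix=".json", delete=False, encoding="utf-8"
    ) as fh:
        fh.write(result.model_dump_json())
        return fh.name
```

The file is created with `delete=False` so the path remains valid after the handle closes, which a download widget would need.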
Closes #F7

- tests/test_e2e.py: 5 E2E tests driving app.run() through the full
  pipeline with ML models mocked at the module boundary (_get_summariser,
  _get_detector, _get_nli, _source_token_attributions); no real weights
  loaded. Covers: AnalysisResult shape, highlight colours matching verdicts,
  source HTML content, JSON export round-trip, model loader guard.
- tests/conftest.py: also downloads punkt_tab so NLTK works on newer
  versions that switched from punkt to punkt_tab format.
test: end-to-end UI test with model boundary mocks (F7)
Closes #F4

- _apply_tau: re-labels summary sentences from stored fused_score values
  using the same logic as fuse.label (score < tau_h → hallucinated,
  score >= tau_g → grounded); no model re-run
- Two sliders (tau_hallucinated / tau_grounded) added to the UI; each
  slider.change() calls _apply_tau and updates summary_out only — the
  pipeline is never invoked again
- 5 new tests covering None guard, label parity with fuse.label, and
  boundary conditions at both thresholds
feat: adjustable threshold tau with client-side re-flagging (F4)
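
The re-labelling rule in the F4 commit can be sketched as below. The two comparisons come from the commit message; the name of the middle band ("uncertain") and the assumption that tau_h <= tau_g are mine, and the real `fuse.label` may differ.

```python
def label(score, tau_h, tau_g):
    """Map a fused score to a label using two thresholds.

    Assumes tau_h <= tau_g. The middle label name is an assumption;
    the commit only states the two outer rules.
    """
    if score < tau_h:
        return "hallucinated"
    if score >= tau_g:
        return "grounded"
    return "uncertain"  # assumed name for the band between the thresholds


def apply_tau(fused_scores, tau_h, tau_g):
    """Re-label stored scores without re-running any model.

    Returns None when there is no stored result yet.
    """
    if fused_scores is None:
        return None
    return [label(s, tau_h, tau_g) for s in fused_scores]
```

Because only stored scores are re-thresholded, moving a slider costs a single pass over the summary sentences.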
Closes #F6

- _export_pdf: renders summary sentences with colour-coded backgrounds
  (light green/orange/red per label) via fpdf2; includes model/source
  metadata, legend, and signal-scores table
- _latin1: sanitises text for Helvetica core font (Latin-1 only);
  avoids FPDFUnicodeEncodingException on non-Latin-1 characters
- gr.DownloadButton for PDF wired alongside JSON button; both become
  visible after analysis completes
- fpdf2 moved from dev to main dependencies (needed at app runtime)
- 3 new tests: None guard, valid PDF header, summary text verified via
  pdfplumber extraction
feat: export annotated PDF (F6)
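
One simple way to make text safe for a Latin-1-only core font is an encode/decode round trip with replacement. This is a sketch of the idea only; the actual `_latin1` helper may substitute characters differently.

```python
def latin1(text: str) -> str:
    """Replace every character outside Latin-1 with '?'.

    Characters that Latin-1 can represent (for example 'é') pass through
    unchanged.
    """
    return text.encode("latin-1", errors="replace").decode("latin-1")
```

For example, `latin1("café €5")` yields `"café ?5"`, since the euro sign is not part of Latin-1.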
Closes #F8

Replaces scaffold placeholder with a full user guide covering:
hardware requirements (GPU/CPU with expected times), installation,
running the app, step-by-step usage (load → analyse → interpret →
trace source spans → adjust tau → export), result interpretation
notes, JSON/PDF export, and project layout.
docs: README + usage walkthrough (F8)
Closes #F9

docs/use-case.puml — PlantUML use-case diagram covering UC-01
'Verify a Summary' and all extensions from requirements.md §7:
- 3 actors: Journalist (P1), Policy Analyst (P2), Financial Analyst (P3)
- 5 included steps: Submit Document, Validate Input, Summarise,
  Compute & Fuse Signals, Render Heatmap
- 4 user-initiated extensions: Trace Source Spans, Adjust tau,
  Export JSON, Export PDF
- 2 error extensions: Handle Invalid Input (2a), Handle Model Failure (3a)
- all FRs and US IDs annotated on each use case
docs: UML use-case diagram (F9)
Static HTML mockup covering source panel, summary heatmap, threshold
sliders, input area, Analyse/Export buttons, legend, and signal-scores
table. README documentation section links to the new file.
docs: two-panel dashboard UI wireframe (F10)
…etrics

Replace the gradient-Inseq attribution (signal C), which is undefined for
RAGTruth's external-model summaries, with a generator-agnostic support
attribution derived from the sentence-level NLI matrix: attr_conc (support
concentration) and attr_loo (best-supporter margin), as signals C and D.
These are well-defined for any (source, summary) pair.

Add threshold-free roc_auc / pr_auc to the ablation so the report is not
read off a single fixed-0.5 operating point, which is misleading under the
~5% hallucination base rate. The ablation now covers every non-empty subset
of {A classifier, B nli, C attr_conc, D attr_loo}.
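
Enumerating every non-empty subset of four signals gives 2^4 - 1 = 15 ablation conditions. A minimal sketch, assuming the signal names below (taken from the description above) are used as column keys; the function name is hypothetical:

```python
from itertools import combinations

# Signal names as listed in the commit message.
SIGNALS = ("classifier", "nli", "attr_conc", "attr_loo")


def signal_subsets(signals=SIGNALS):
    """Return every non-empty subset of `signals`, smallest subsets first."""
    return [
        combo
        for size in range(1, len(signals) + 1)
        for combo in combinations(signals, size)
    ]
```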

- sumlens/signals/support.py: support_attribution() over a shared NLI matrix
- eval/features.py: attr_conc/attr_loo columns (drop legacy "attribution")
- eval/ablation.py: 4 signals, 15 conditions, emit roc_auc/pr_auc
- eval/metrics.py: roc_auc (rank-based) + pr_auc (average precision)
- train_fusion.py: impute empty signal columns to neutral 0.5
- scripts/jobs/run_eval.sbatch: ablation fits its own per-subset models
feat(eval): generator-agnostic attribution signals + threshold-free metrics
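
The two threshold-free metrics can be sketched in plain Python as follows. This is an illustration of the standard definitions (rank-based ROC-AUC via the Mann-Whitney statistic, and average precision), not the code in eval/metrics.py. It assumes labels are 0/1 with 1 as the positive class and that a higher score means more likely positive.

```python
def roc_auc(y_true, scores):
    """Rank-based ROC-AUC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the end of the group of tied scores starting at position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    rank_sum = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def pr_auc(y_true, scores):
    """Average precision: mean precision at the rank of each positive.

    Tied scores are ordered arbitrarily here, a simplification.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision needs at least one positive")
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos
```

Neither function takes a threshold, which is the point: both summarise ranking quality across all operating points rather than at a single fixed cut-off.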
docs: tidy implementation-order wording in data-model
@bacemtayeb bacemtayeb merged commit a5fa233 into main Jun 17, 2026
1 check passed
2 participants