Corpus Scribe is a Markdown-first research ingestion and corpus tool for web articles and PDFs.
It captures messy publisher pages and source PDFs, normalizes them into clean, durable article bundles, preserves local assets, generates citation metadata, and optionally writes companion notes. The main artifact is always the saved Markdown. Reader views, notes, and reading PDFs are derived layers around that corpus.
It includes:
- a Chrome extension for one-click capture
- a Zotero plugin for sending library items with metadata and notes
- a Dockerized Flask backend
- a browser reader for search, reading, highlights, notes, and bibliography review
Corpus Scribe is built for people who want a private local corpus that can be:
- read comfortably in a browser
- searched and indexed locally
- processed by scripts, local tools, and LLM workflows
- preserved in a format that stays useful outside this app
The core job is not "save pages to read later." The core job is:
- capture technical content from the web or a source PDF
- normalize it into a high-quality Markdown bundle
- keep provenance and companion artifacts beside it
- make that corpus usable for reading, annotation, and later machine-assisted work
- Build a personal research corpus in plain Markdown instead of scattered bookmarks, PDFs, and clipped HTML.
- Keep technical documents usable after capture: equations, tables, code blocks, images, bibliography, and notes survive much better than in generic clipping workflows.
- Create a corpus that is easier for local agents and LLM workflows to read, search, summarize, and link.
- Keep both provenance and usability in the same bundle:
- original source PDF when applicable
- canonical
.md - optional
.notes.md - optional reading PDF for Kindle or offline reading
Corpus Scribe is a good fit if you want to:
- build a personal knowledge base from articles, blog posts, papers, and technical documentation
- save papers into labeled folders like
ml,agents,biology, orideasand query them later with local tools - keep a private markdown library with stable frontmatter, bibliography files, notes, and an index that scripts or agents can consume
- generate cleaner reading copies from difficult source PDFs without making the PDF workflow the center of the system
It is less about social bookmarking or casual read-it-later queues, and more about building a durable technical corpus.
- Save web articles and PDFs from Chrome into labeled bundles under
output/<label>/<article>/ - Normalize publisher HTML into readable Markdown with equations, tables, code, figures, metadata, and local assets
- Preserve bibliography as a sibling
*.bibfile - For source PDFs:
- save the original file as
*.source.pdf - extract Markdown with Mistral OCR when configured
- fall back to
pdftotextwhen OCR is unavailable or fails - generate a separate
*.reading.pdffrom cleaned Markdown
- save the original file as
- Browser-based reader and library workbench optimized for WSL + Windows Chrome
- Library search with local indexing and article quality stars
- Open-document switching, close current document, and persisted UI state
- Focus mode, dark mode, adjustable font size, and persisted reading position
- TOC popup for long documents
- External links open in a new tab
- Direct link to the original source page
- Download linked PDFs from the reader
- Select text in the article and save highlights without reloading the document
- Inline visible highlights in the reader
- Highlight panel with list mode and single-item mode with next/previous navigation
- Click a highlight to center it in the reader
- Add comments to highlights and copy them to the clipboard
- Mark sections as
noisewith strikethrough styling and optionally hide them in the reader - Optional
References = noisemode for cleaner reading and lower-token LLM workflows - Notes pane with two modes:
MarkdowneditingPreviewrich rendered view
- Markdown syntax coloring in the notes editor
- Typora-style keyboard shortcuts for fast markdown editing
- Generate companion notes for an existing article bundle from the UI
- Copy code blocks
- Copy equations as LaTeX
- Copy tables as CSV
- Open images in full resolution
- Copy images to the clipboard
- Markdown is the canonical artifact
- Annotation layers such as highlights, notes, and noise stay attached to the article bundle instead of replacing the source
- Stable YAML frontmatter for indexing and downstream processing
- Generated
index.jsonlat the corpus root - Companion artifacts around the main article:
*.notes.md*.bib*.highlights.json*.source.pdf*.reading.pdf
corpus-scribe/
docker-compose.yml
output/ # saved PDFs and Markdown
backend/
Dockerfile
main.py # Flask API
article_extractor.py # article extraction + HTML/MD/PDF pipeline
send_email_gmail.py # Gmail API (Kindle delivery)
requirements.txt
test_article_extractor.py # end-to-end extractor regression tests
extension/
manifest.json # Chrome Manifest V3
popup.html / popup.js # extension UI
background.js
assets/
zotero-plugin/
manifest.json # Zotero WebExtension manifest
bootstrap.js # lifecycle hooks (startup/shutdown)
prefs.js # default preferences
content/
corpus-scribe.js # plugin logic (menu, upload, metadata)
preferences.xhtml # settings pane
desktop/
src/ # browser reader + notes/highlights UI
vite.config.ts
./scripts/compose-with-home-env.sh up -d --build backendThe backend starts on http://localhost:5000. Saved files appear in ./output/<label>/<article>/.
The canonical artifact is the saved .md file. Notes, bibliography, source PDFs, and reading PDFs are derived artifacts around that markdown core.
The backend container runs as ${UID:-1000}:${GID:-1000} by default so files created under ./output/ stay owned by your host user.
If you want markdown and image assets to open cleanly in Windows apps from outside WSL, point the output mount at a Windows-backed folder.
Create a .env file from .env.example and set:
HOST_OUTPUT_DIR=/mnt/c/Users/<you>/Documents/corpus-scribeThen restart the backend with:
./scripts/compose-with-home-env.sh up -d --build backendInside the container the backend still writes to /output, but Docker will map that to your configured host folder.
- Open
chrome://extensions/ - Enable Developer mode
- Click Load unpacked and select the
extension/directory - Navigate to any article or PDF and click the extension icon
- Choose or create a Label in the popup
- The popup will show what it detected:
Detected: Web article- or for PDFs:
Detected: PDFExtraction: Mistral OCR- or
Extraction: pdftotext fallback
- Choose Save PDF + MD or Send to Kindle
The primary workflow is:
- capture article or PDF
- store it under a label
- keep the markdown as the main artifact
- optionally generate notes and reading PDFs
Kindle delivery is supported, but it is an export path, not the center of the system.
The Zotero plugin sends library items directly from Zotero to the Corpus Scribe backend. It works with Zotero 7 and 8.
What it does:
- Right-click one or more Zotero items and choose Send to Corpus Scribe
- The plugin reads the PDF attachment, extracts Zotero metadata (title, authors, DOI, date, publication, abstract, etc.), and uploads everything to the backend
- If the Zotero item has child notes, they are converted from HTML to Markdown and sent as working notes
- The backend processes the PDF through the standard extraction pipeline (Mistral OCR or pdftotext fallback) with the Zotero metadata taking precedence over OCR-extracted metadata
- A label prompt lets you file the item into an existing or new corpus label
Installation:
- Build or download
zotero_plugin.zipfrom thezotero-plugin/directory - In Zotero, go to Tools → Add-ons
- Click the gear icon → Install Add-on From File and select the
.zip - Open the plugin preferences (Zotero Settings → Corpus Scribe) and configure:
- API Base URL — the backend address (default
http://127.0.0.1:5000) - API Key — must match the backend
API_KEY(defaultapi-key-1234) - Page Size — reading PDF page size (A4, A5, or Letter)
- API Base URL — the backend address (default
There is now a browser reader in desktop/ that acts as a workbench over the saved corpus:
- library browser on the left
- markdown reader in the center
- notes, highlights, and bibliography on the right
- local indexed search over article, notes, and highlight text
- article quality stars directly in the library
- independently scrolling panes
- draggable splitters for resizing the side panels
- persisted UI state such as selected label, display mode, and view preferences
It reads the existing Corpus Scribe bundle layout directly instead of introducing a new storage model. The reader is intentionally downstream of the saved corpus: if you stop using the UI, the Markdown bundles remain usable on their own. The reader runs as a normal web app and talks to the Flask backend through local /desktop/* endpoints.
From the repo root:
./scripts/compose-with-home-env.sh up -d --build backend
cd desktop
npm install
npm run devThen open:
http://localhost:1420/
On WSL, opening that URL in Windows Chrome is the recommended path. This is the primary supported reader workflow.
The compose wrapper above sources ~/.env first, so backend-only secrets such as NOTES_LLM_API_KEY and MISTRAL_API_KEY can stay in your home environment instead of the repo-local .env.
The current browser UI supports:
- browse and search the corpus
- open multiple documents and switch between them quickly
- read the full article without an extra "load full article" step
- create visible inline highlights
- mark low-value sections as
noise - write Markdown notes and switch to rich preview
- generate notes from the current article on demand
- inspect bibliography and copy citation data
- jump through highlights and back to the relevant article location
Corpus Scribe includes a Model Context Protocol (MCP) server that exposes the library to any MCP-compatible host — Claude Desktop, Claude Code, Cursor, or custom agents. The server lives in scripts/scribe.py and runs as a stdio transport, so the host spawns it directly rather than connecting to an HTTP endpoint.
The MCP layer turns the corpus into a tool-equipped knowledge base that an LLM can navigate during conversation. Without it, the LLM would need the full text of every relevant article pasted into context. With it, the LLM can search, triage, and selectively read only the parts it needs — the same way a human skims a table of contents before reading a section.
Add a server entry to your MCP host configuration. For Claude Desktop or Claude Code, add to the MCP settings:
{
"mcpServers": {
"scribe-mcp": {
"command": "python",
"args": ["/path/to/send-to-scribe/scripts/scribe.py", "mcp-server"],
"env": {
"SCRIBE_API_BASE": "http://127.0.0.1:5000"
}
}
}
}The mcp Python package must be installed in the environment the host uses to spawn the server (pip install mcp or uv tool install mcp). The Flask backend must be running.
The server exposes 13 tools. They fall into four groups:
Context and navigation
get_current_context— what the desktop reader currently has open (focused document, open tabs, label filter)list_labels— top-level corpus folderslist_documents— compact metadata list, optionally filtered by labelsearch— PubMed-style field queries ((CSD[Title]) AND (Tournier JD[Author]))
Reading
read_document— full article body with noise and references stripped by defaultlist_sections— heading outline with per-section character countsread_section— a single section by heading, case-insensitive match
Highlights and notes
get_highlights— user-saved highlights with comments (noise-variant excluded)read_notes— companion.notes.mdbodyappend_notes— additive write to notes without replacing existing contentupdate_notes— whole-file replace of notes
Related documents
get_related— user-curated related-document links for cross-paper navigation
Academic papers in the corpus can be 50k–150k characters. Loading full documents into an LLM's context window is expensive and often unnecessary — most tasks only need a few sections. The MCP server provides a progressive-disclosure reading pattern that keeps context token usage low:
-
list_sectionsfetches the document outline without any body text. Each entry includes the heading name, nesting level, and character count, so the LLM can see the structure of a 100-page paper in a few hundred tokens and decide which parts matter. -
read_sectionfetches a single section by heading (case-insensitive, with prefix-match fallback). Only the targeted slice — including nested subsections — enters the context. A typical section is 2k–10k characters instead of the full 100k. -
read_documentwithstripNoise=true(default) removes content the user has marked as noise — author affiliations, page headers, boilerplate. SettingstripReferences=true(also default) drops the references section, which in academic papers can be 20–40% of the total length. Together these flags can cut a document's token footprint in half before the LLM sees it.
The recommended workflow for a long paper is: list_sections to survey, then read_section for the parts that matter. Reserve read_document for short articles or when the user explicitly asks for the full body.
For cross-paper research, search and list_documents return compact metadata (title, authors, rating, DOI, excerpt) — enough to triage dozens of candidates without reading any bodies. Follow up with get_highlights on promising hits: highlights are typically a few hundred tokens and represent what the user already considered important.
The same scripts/scribe.py file doubles as a CLI that mirrors the core MCP tools:
python scripts/scribe.py search "diffusion models"
python scripts/scribe.py read /output/ml/Paper/Paper.md
python scripts/scribe.py context
python scripts/scribe.py update-notes /output/ml/Paper/Paper.md --from-file notes.mdPass --json for machine-readable output. The CLI uses only the Python standard library and does not require the mcp package.
All endpoints accept JSON with an apiKey field for authentication (default: api-key-1234, configurable via API_KEY env var).
Extract article and save PDF + Markdown to the output directory.
{
"apiKey": "api-key-1234",
"label": "Machine Learning",
"url": "https://example.com/article",
"html": "<html>...</html>",
"cookies": {"domain": {"name": "value"}}
}Response:
{
"success": true,
"title": "Article Title",
"label": "Machine Learning",
"primary": "/output/Machine Learning/Article Title/Article Title.md",
"pdf": "/output/Machine Learning/Article Title/Article Title.pdf",
"bib": "/output/Machine Learning/Article Title/Article Title.bib",
"notes": "/output/Machine Learning/Article Title/Article Title.notes.md",
"pdfAvailable": true,
"notesAvailable": true,
"md": "/output/Machine Learning/Article Title/Article Title.md",
"metadata": {
"source_site": "example.com",
"citation_key": "doe2026articletitle",
"doi": "10.1000/example",
"language": "en",
"word_count": 1234,
"image_count": 4,
"ingested_at": "2026-04-12T10:00:00+00:00"
}
}Optional notes request overrides for local saves:
{
"notes": {
"provider": "openai",
"model": "gpt-4.1-mini",
"base_url": "https://api.openai.com/v1",
"api_key": "sk-..."
}
}Download a source PDF, save the original PDF locally, extract markdown from the PDF with Mistral OCR when configured, and optionally generate companion notes. If OCR is unavailable or fails, the backend falls back to pdftotext.
For PDF-source saves, the backend keeps two PDF artifacts:
sourcePdf: the original downloaded PDF for archival fidelitypdf: a regenerated reading PDF created from the cleaned markdown for better Kindle readingbib: a generated bibliography entry for citation workflows
{
"apiKey": "api-key-1234",
"label": "Papers",
"url": "https://example.com/paper.pdf",
"sourceName": "paper.pdf",
"cookies": {"domain": {"name": "value"}}
}Response fields of interest:
{
"pdf": "/output/Papers/Article Title/Article Title.reading.pdf",
"sourcePdf": "/output/Papers/Article Title/Article Title.source.pdf",
"bib": "/output/Papers/Article Title/Article Title.bib",
"pdfAvailable": true,
"sourcePdfAvailable": true,
"primary": "/output/Papers/Article Title/Article Title.md"
}Upload a PDF file directly with optional metadata overrides and working notes. This is the endpoint used by the Zotero plugin.
Accepts multipart/form-data with these fields:
| Field | Required | Description |
|---|---|---|
file |
yes | The PDF file |
apiKey |
yes | API authentication key |
label |
yes | Corpus label to file under |
pageSize |
no | Reading PDF page size (a4, a5, letter; default a5) |
sourceName |
no | Original filename |
metadata |
no | JSON string with citation overrides (title, author, doi, date, url, container_title, publisher, volume, issue, pages, abstract) |
note |
no | Markdown text to save as a .notes.md companion file |
Response:
{
"success": true,
"title": "Article Title",
"label": "Papers",
"primary": "/output/Papers/Article Title/Article Title.md"
}List existing output labels so the extension can offer them in the save dialog.
Query params:
apiKey=...
Return lightweight backend capability information used by the popup.
Query params:
apiKey=...
Response:
{
"success": true,
"pdfOcr": {
"available": true,
"engine": "mistral",
"fallback": "pdftotext"
}
}Extract article and send PDF to Kindle via Gmail.
{
"apiKey": "api-key-1234",
"url": "https://example.com/article",
"html": "<html>...</html>",
"cookies": {},
"email": "you@gmail.com",
"kindleEmail": "you@kindle.com"
}Returns {"status": "ok"}.
To use the Send to Kindle feature, you need Google OAuth credentials:
- Create a project in Google Cloud Console
- Enable the Gmail API
- Create OAuth 2.0 credentials (Desktop app type)
- Download
credentials.json - Uncomment the credential mounts in
docker-compose.yml:volumes: - ./output:/output - ./credentials.json:/app/credentials.json:ro - ./token.json:/app/token.json
- On first run, the backend will open a browser for OAuth consent and save
token.json
Also add your Gmail address to the Approved Personal Document E-mail List in your Kindle settings.
| Env variable | Default | Description |
|---|---|---|
API_KEY |
api-key-1234 |
API authentication key |
OUTPUT_DIR |
/output |
Directory for saved PDFs and Markdown |
HOST_OUTPUT_DIR |
./output |
Host-side bind mount target for /output in docker-compose.yml |
UID |
1000 |
UID used to run the backend container |
GID |
1000 |
GID used to run the backend container |
MISTRAL_API_KEY |
empty | Enables Mistral OCR as the primary PDF-to-markdown extractor |
MISTRAL_BASE_URL |
https://api.mistral.ai |
Base URL for Mistral OCR API |
MISTRAL_OCR_MODEL |
mistral-ocr-latest |
Mistral OCR model used for source PDFs |
- The Chrome extension captures the current page's full HTML and cookies
- The backend extracts the main article content, preferring a real
<article>tag and falling back to Readability (readability-lxml) when needed - Images are downloaded with the captured browser cookies so paywalled/CDN-backed assets can still be resolved
- Publisher-specific math wrappers are normalized before conversion:
- PubMed/PMC display equations stored in
table.disp-formulaare unwrapped so they are treated as equations, not tables - ScienceDirect/MathJax blocks are normalized from embedded MathML or MathJax markup
- real MathML is preserved whenever available
- PubMed/PMC display equations stored in
- Presentation-only HTML is cleaned before markdown generation so the
.mdoutput is not polluted with layout wrappers and raw HTML blocks - Lists, code-listing tables, and figure wrappers are normalized before conversion so markdown readers do not get empty bullets, broken code blocks, or layout-only captions
- Markdown is generated from normalized HTML with Pandoc, not
html2text, so equations, tables, and code survive conversion much more reliably - The markdown file is written as the canonical saved artifact with structured YAML frontmatter for downstream indexing and LLM use
- A sibling
*.bibfile is generated from the extracted citation metadata so each saved document has an immediately usable bibliography entry - Markdown is post-processed to remove same-document fragment links and other raw HTML artifacts that are harmless in browsers but noisy in markdown readers
- Local saves can also generate a companion
*.notes.mdfile using a local OpenAI-compatible model endpoint - PDF generation uses a straightforward markdown-first path:
- Pandoc converts the generated markdown to HTML
- the backend wraps that HTML in an academic print stylesheet
- headless Chromium renders the final self-contained PDF without browser print headers or footers
- PDF tabs use a native PDF source path instead of HTML extraction:
- the backend downloads the source PDF with the captured browser cookies
- Mistral OCR is the primary extractor for PDF-to-markdown conversion when
MISTRAL_API_KEYis configured - the raw OCR response is cached as
ocr_response.jsonbeside the extracted files pdftotextis only the fallback path for OCR-disabled or OCR-failed PDFs- OCR-derived markdown goes through a PDF-specific cleanup pass to unescape OCR HTML entities, normalize LaTeX spacing, and tighten inline math in prose
- the original PDF is saved as
*.source.pdf - a separate reading PDF is regenerated from the cleaned markdown and saved as
*.reading.pdf
- Local saves are written under
/output/<label>/<article>/ - Local saves are markdown-primary: if PDF rendering fails, the markdown still remains saved successfully
- Files are either saved locally or emailed to Kindle via Gmail API
Saved articles include YAML frontmatter intended to make the markdown corpus easier to index and query later. Current fields include:
doc_iddoc_typetitleauthordateurlcanonical_urlsource_sitelabellanguagedescriptioncitation_keydoiarxiv_idsource_formatsource_fileocr_enginepage_countword_countimage_countingested_atnotes_filenotes_doc_idbib_file
For companion notes, the frontmatter also includes:
source_articlesource_doc_id
For local saves, the backend can generate a sibling *.notes.md file using a configurable LLM provider. The current default is Anthropic:
NOTES_LLM_PROVIDERdefault:anthropicNOTES_LLM_MODELdefault:claude-sonnet-4-20250514NOTES_LLM_API_KEYmust be set for Anthropic notes generation
The notes file is a best-effort derived artifact intended for later review and LLM workflows. The canonical source remains the extracted article markdown.
Supported providers:
openai_compatibleUses the OpenAI-stylePOST /chat/completionsAPI. This works with local endpoints such as LM Studio.openaiAlso usesPOST /chat/completions, but you typically pointNOTES_LLM_BASE_URLathttps://api.openai.com/v1.anthropicUsesPOST https://api.anthropic.com/v1/messageswithNOTES_LLM_API_KEYandNOTES_ANTHROPIC_VERSION.
Configuration can come from either:
- environment variables in
.env - per-request
notesoverrides sent toPOST /save_local
Local saves also maintain a generated index.jsonl file at the output root:
output/index.jsonl
It contains one JSON object per saved article or notes file, including:
doc_idtypetitlelabelpathpdf_pathsource_pdf_pathbib_pathnotes_pathsource_article_pathsource_doc_idurlcanonical_urlsource_sitesource_formatocr_enginecitation_keydoiarxiv_idlanguageword_countimage_countingested_at
The index is deterministic and path-based, so repeated saves update records instead of blindly appending duplicates.
- Math is not reconstructed from hardcoded Unicode replacement tables.
- The pipeline keeps MathML whenever possible and only uses fallback MathJax conversion when the source page no longer exposes real MathML.
- Display-formula wrapper tables are treated as equation containers and removed before markdown/PDF conversion.
- Real content tables remain tables and are converted by Pandoc into markdown tables and styled HTML tables.
- Syntax-highlighter layout tables are converted into semantic fenced/indented code blocks before markdown and PDF generation.
- Loose publisher list markup is normalized so
ulandolitems do not turn into blank bullets or extra spacing.
- The PDF engine is headless Chromium.
- The backend renders the saved markdown through a fixed academic HTML/CSS template before printing to PDF.
- PDFs are self-contained because images are resolved to local assets before browser rendering.
- Browser print headers and footers are disabled, so the PDF does not include
file://paths, timestamps, or page chrome. - Unicode text is handled through generic normalization and cleanup, not symbol-by-symbol mappings.
- Code blocks are preserved from markdown and rendered with monospace print styling rather than being flattened into body text.
- For source PDFs saved via
POST /save_pdf, the original document is preserved separately as*.source.pdf. - The default
pdfreturned byPOST /save_pdfis the regenerated reading copy*.reading.pdf. - For source PDFs, markdown quality is best when
MISTRAL_API_KEYis configured and the backend can use Mistral OCR. - The popup surfaces the planned PDF extraction path before save and the actual OCR engine after save.
- Every saved article also gets a sibling
*.bibfile generated from the extracted citation metadata.
The extractor has a regression test that exercises:
- PubMed-style display math wrappers
- ScienceDirect-style embedded MathML
- Unicode-heavy prose
- HTML table conversion
- PDF text extraction checks to verify equations are rendered instead of showing literal
$$
Run it in the backend container:
docker compose exec -T backend python -m unittest -v test_article_extractor
