A skeleton for building your own auditable academic knowledge base (KB) for a project or team. Clone it, point an AI coding agent at it, and start turning PDFs into a cross-referenced research wiki where every substantive claim traces back to a source you actually read.
This repo is the generic skeleton. A filled-in project vault — real papers, real cross-references — is the worked example you build on top of it.
It is a concrete, auditability-first implementation of the LLM-maintained wiki
pattern Karpathy sketches in llm-wiki:
immutable raw sources, an LLM-maintained wiki of interlinked markdown pages, and a
schema document that governs how the agent maintains it — here raw/pdfs/,
knowledge/, and the knowledge/AGENTS.md contract, respectively.
An LLM answering from its training weights is fluent and fast — and unverifiable. Worse, it is confidently wrong exactly where you can least afford it: on well-known material. The famous theorem, the canonical equation, the landmark paper everyone cites — these are where a model reconstructs from a blurry memory of a thousand paraphrases and hands you a wrong sign, a fabricated "convergence theorem," a flipped benchmark result, a dropped second author. It sounds authoritative because it has seen the topic a million times. That is the trap.
A knowledge base does not fix this by being perfect. It earns trust by being auditable. Every page in it can be traced to a specific PDF, specific pages, and a specific read on a specific date; otherwise it is flagged as unsourced. You do not have to believe the summary; you can check it.
Three mechanisms, working together, enforce that:
- A plain-text CONTRACT: the READ PROTOCOL. No PDF, no body. The rule, in knowledge/AGENTS.md, is non-negotiable: an agent may write a paper's Summary, equations, or theorem statements only after a real PDF is on disk and was actually rendered and read this session. If there is no readable PDF, the agent writes a status: to-read stub and stops. Reconstructing a body from memory is never allowed, however famous the paper.
- A source_trace record on every page. Each ingested paper page carries a source_trace block in its frontmatter: which pages_read, what transcription_method (read-tool | pdftotext | none), what date_read. This is the audit trail. It turns "trust me" into "here is exactly how this page was produced."
- A DETERMINISTIC lint check: plain Python, zero LLM calls. scripts/lint_kb.py reads the frontmatter and the filesystem and mechanically catches the failures the prose contract is meant to prevent: an ingested page with an empty source_trace, a page that claims it was read with the Read tool but whose PDF is not on disk (it could not have been read), a "to-read" stub that nonetheless contains equation- or theorem-grade prose. No LLM grades another LLM here. The checks are arithmetic on files: reproducible, fast, CI-able.
Together: the contract sets the rule, source_trace records compliance, and the
lint enforces it without trusting anyone's word — including the model's.
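The interplay between the contract and the source_trace record can be sketched in a few lines of Python. This is an illustration only, not code from the repo: the helper name plan_paper_frontmatter is hypothetical, and the shape of a stub's source_trace, the string format of pages_read, and the ISO date format are all assumptions. Only the field names and the three transcription_method values come from the text above.

```python
from datetime import date
from pathlib import Path

# The three methods named above; "none" marks a page that was never read.
ALLOWED_METHODS = {"read-tool", "pdftotext", "none"}


def plan_paper_frontmatter(pdf_path: Path, pages_read: str, method: str, read_on: date) -> dict:
    """Decide what an agent may record for one paper (hypothetical helper)."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown transcription_method: {method!r}")

    if not pdf_path.is_file() or method == "none":
        # No readable PDF, or nothing was read: write a stub and stop. No body.
        # The stub's source_trace layout here is an assumption.
        return {
            "status": "to-read",
            "pdf": str(pdf_path),
            "source_trace": {
                "pages_read": "",
                "transcription_method": "none",
                "date_read": None,
            },
        }

    # A PDF is on disk and was read: record exactly how the page was produced.
    return {
        "status": "ingested",
        "pdf": str(pdf_path),
        "source_trace": {
            "pages_read": pages_read,
            "transcription_method": method,
            "date_read": read_on.isoformat(),
        },
    }
```

The point of the sketch is the ordering: the filesystem check comes first, and the body-permitting status is only reachable once a file exists. Whether the PDF was actually read in the session is something only the agent's process can guarantee, which is why the lint checks the record afterwards.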
research-kb/
├── README.md # this file — the integrity pitch + quick start
├── docs/
│ ├── QUICK-START.md # first paper, end to end (fetch → ingest → lint)
│ └── use-cases.md # when a KB beats RAG; team use; learning with it
├── knowledge/ # the skeleton vault (you fill this in)
│ ├── AGENTS.md # THE CONTRACT — the READ PROTOCOL, agent-neutral
│ ├── papers/ concepts/ methods/ theorems/ authors/ ... # page-type dirs
│ ├── templates/ # frontmatter templates per page type
│ └── raw/pdfs/ # downloaded PDFs (gitignored) + manifest.md
├── scripts/
│ ├── lint-kb.sh # wrapper → runs lint_kb.py on knowledge/
│ ├── lint_kb.py # the deterministic checker (plain Python)
│ ├── fetch_pdf.py # download + VALIDATE a PDF (no silent failures)
│ └── requirements.txt # PyYAML, pypdf
└── .claude/skills/knowledge-base/ # the vendored knowledge-base skill
    ├── SKILL.md # scaffold / ingest / query / lint capabilities
    └── references/ # frontmatter schemas, delegation prompt, lessons
The knowledge/ tree ships empty (just the directory skeleton and the contract).
You populate it — either by hand from knowledge/templates/, or by running the
vendored skill's SCAFFOLD step to generate a domain-tailored schema first.
- Clone and install the two Python deps.

      git clone https://github.com/lruthotto/research-kb.git
      cd research-kb
      python3 -m pip install -r scripts/requirements.txt

- Read the contract. Open knowledge/AGENTS.md. This is the READ PROTOCOL (no PDF, no body) and the page schema your agent must follow. Everything else is downstream of it.
- Pick how to start filling knowledge/:
  - Scaffold (recommended for a new topic). Open an AI coding agent in this repo and invoke the vendored knowledge-base skill's SCAFFOLD capability (Capability 1 in .claude/skills/knowledge-base/SKILL.md). It interviews you for the topic and page types, then writes a domain-tailored schema and navigation files into knowledge/.
  - Or by hand. Copy a template from knowledge/templates/ into the right page-type directory and fill it in.
- Ingest your first paper with an AI coding agent (Claude Code, or a CLI agent) under the READ PROTOCOL. Fetch a PDF, then have the agent read it from disk and write a sourced page with a source_trace. Step-by-step: docs/QUICK-START.md.
- Run the lint.

      bash scripts/lint-kb.sh knowledge

  It must be clean of HARD issues. Warnings (empty dirs, orphan pages on a fresh vault) are expected early on.
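The ingest step above depends on the fetched file being a real PDF rather than, say, an HTML error page saved with a .pdf extension. A minimal sketch of such a check follows. It is not the logic of scripts/fetch_pdf.py, which may well parse the file more thoroughly; the function name looks_like_pdf and the 1024-byte tail window are assumptions made for illustration.

```python
from pathlib import Path


def looks_like_pdf(path: Path, tail_bytes: int = 1024) -> bool:
    """Cheap structural check: PDF header at the start, %%EOF marker near the end."""
    if not path.is_file():
        return False
    size = path.stat().st_size
    with path.open("rb") as fh:
        head = fh.read(5)                      # a PDF starts with b"%PDF-"
        fh.seek(max(0, size - tail_bytes))     # then inspect only the tail
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

A check like this catches the common silent failure (a login page or 404 body written to disk) before any page claims to have been read from it. It does not prove the PDF is complete or renderable.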
scripts/lint-kb.sh is a thin wrapper around scripts/lint_kb.py. The checker
is plain Python with zero LLM calls — it never asks a model to judge whether a
page is "good." It parses each page's YAML frontmatter (via PyYAML) and walks the
filesystem, then applies fixed rules. The ones that enforce auditability:
- source_trace present. Any paper page with status: ingested must have a non-empty source_trace (pages_read, transcription_method, date_read). Empty → HARD failure: the body was not verifiably sourced.
- transcription_method is a strict enum. read-tool | pdftotext | none. Anything else (read-tool-with-fallback, pymupdf, …) → HARD failure.
- The fabrication signature. A page claiming transcription_method: read-tool whose pdf: does not resolve on disk → HARD failure: it could not have been read, so the body was almost certainly reconstructed from memory.
- No body on an unread stub. A status: to-read page with substantive content in equation/theorem-grade sections → HARD failure (descriptive prose → warning): a page never read can only have been drafted from memory.
- Plus structural hygiene: dead links, missing frontmatter type, PDF-on-disk vs. raw/pdfs/manifest.md drift, BibTeX sync, orphan pages, empty directories.
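To make the first three rules concrete, here is a rough sketch of how they could be applied to one page's already-parsed frontmatter. It is not the actual scripts/lint_kb.py: the function name check_paper_page, the message wording, and the assumption that pdf: is a path relative to the vault root are all hypothetical. The stub-body rule and the structural hygiene checks are left out because they depend on page layout details not described here.

```python
from pathlib import Path

ALLOWED_METHODS = {"read-tool", "pdftotext", "none"}
REQUIRED_TRACE_KEYS = ("pages_read", "transcription_method", "date_read")


def check_paper_page(frontmatter: dict, vault_root: Path) -> list[str]:
    """Return HARD-failure messages for one paper page (illustrative only)."""
    failures = []
    trace = frontmatter.get("source_trace")
    if not isinstance(trace, dict):
        trace = {}  # treat a missing or malformed block as empty

    # Rule 1: an ingested page needs a non-empty source_trace.
    if frontmatter.get("status") == "ingested":
        missing = [key for key in REQUIRED_TRACE_KEYS if not trace.get(key)]
        if missing:
            failures.append("empty source_trace, missing: " + ", ".join(missing))

    # Rule 2: transcription_method is a strict enum.
    method = trace.get("transcription_method")
    if method is not None and method not in ALLOWED_METHODS:
        failures.append(f"invalid transcription_method: {method!r}")

    # Rule 3: the fabrication signature (read-tool claimed, PDF not on disk).
    # Assumes pdf: is relative to the vault root.
    if method == "read-tool":
        pdf = frontmatter.get("pdf")
        if not pdf or not (vault_root / pdf).is_file():
            failures.append("claims read-tool but pdf is not on disk")

    return failures
```

Every branch is a dictionary lookup, a set membership test, or a filesystem stat, so two runs over the same vault produce the same list of failures.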
Because the rules are arithmetic on files and frontmatter — not a model's opinion — they are deterministic and reproducible. The same vault lints the same way on any machine, in CI, today and next year. That is the whole point: the thing checking the model is not another model.
For the full set of capabilities (scaffold, ingest, query, evolve, connect) and the
complete READ PROTOCOL with its documented failure history, see
.claude/skills/knowledge-base/SKILL.md.