A chat widget that answers questions about your documents, shows its sources, and says "I don't know" instead of making something up. The thing every business actually wants when they ask for "an AI chatbot for our site", built so you can trust what it says.
A real captured run with Claude. The answer is written only from the retrieved passages, the citation chip links to the source it used, and the sources panel shows what the assistant looked at.
Most chatbot demos will confidently invent a refund policy that does not exist. This one answers only from the documents you give it, cites the passage behind every claim, and refuses when the answer is not there.
- A real chat UI, not a curl example. Answers streamed token by token, clickable citations, a sources panel, and an honest "demo mode" badge when it runs without a model. One HTML file, one CSS file, one JS file, no build step.
- Grounded answers with citations. The model is given the retrieved passages and told to ground every claim in them and cite by number. The citation chips in the answer link to the exact source passage.
- An honest "I don't know." Ask it something outside the docs and it says so, instead of guessing. This is the feature that makes a support bot safe to put in front of customers.
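A minimal sketch of the grounding step behind the citation chips, assuming passages are numbered from 1 and cited as `[n]`. The function names and the prompt wording are illustrative assumptions, not the repo's actual prompt:

```python
import re


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number the passages and tell the model to answer only from them."""
    numbered = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using ONLY the numbered passages below.\n"
        "Cite the passage behind every claim by its number, like [1].\n"
        "If the passages do not contain the answer, say you don't know.\n\n"
        f"Passages:\n{numbered}\n\n"
        f"Question: {question}"
    )


def cited_numbers(answer: str, passage_count: int) -> list[int]:
    """Return the distinct, in-range passage numbers cited in an answer."""
    seen: list[int] = []
    for match in re.finditer(r"\[(\d+)\]", answer):
        n = int(match.group(1))
        # Ignore citations that point at passages the model was never given.
        if 1 <= n <= passage_count and n not in seen:
            seen.append(n)
    return seen
```

Dropping out-of-range numbers means a chip is only ever rendered for a passage that was actually sent to the model.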
A question goes from the browser to a FastAPI endpoint. A pure-Python BM25 retriever (no embeddings server, no GPU, no API key) finds the most relevant passages from your documents. Then the engine decides whether to answer at all, and if it does, Claude writes a grounded answer that streams back to the UI as Server-Sent Events, with the sources sent first so they appear the moment the answer starts.
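For readers who have not seen BM25 without a library, here is a rough pure-Python sketch of the scoring idea. The tokenizer, the `k1` and `b` defaults, and the class shape are assumptions for illustration and may differ from the retriever in this repo, so absolute scores will not match the ones quoted in this README:

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(len(docs), 1)
        # Document frequency: in how many passages each term appears.
        df: Counter = Counter()
        for tokens in self.doc_tokens:
            df.update(set(tokens))
        n = len(docs)
        # Smoothed IDF, which stays positive even for very common terms.
        self.idf = {
            term: math.log(1 + (n - f + 0.5) / (f + 0.5)) for term, f in df.items()
        }

    def score(self, query: str, index: int) -> float:
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue  # terms absent from the passage contribute nothing
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def top_k(self, query: str, k: int = 3) -> list[tuple[int, float]]:
        """Return (passage index, score) pairs, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Because the score is a sum over query terms that occur in the passage, a question sharing no words with the corpus scores exactly zero, and one sharing only common words scores close to it.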
The honest "I don't know" is two checks, not one, because one is not enough:
- A cheap retrieval floor. If the best passage's BM25 score is below a threshold, the assistant abstains and the model is never called. A weather question against a help center scores near zero and costs zero tokens.
- The grounded prompt. A question can share words with the docs without being answerable from them: "what is your stock price" shares "price" with the pricing page and clears the cheap floor. So the model is the second layer: it is told to answer only from the passages and to refuse otherwise, and it catches what the lexical floor cannot.
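The two layers above can be sketched as a single gate. This is an illustration under assumptions: `answer_or_abstain`, the `ask_model` callable and the refusal text are hypothetical names, not the engine's real interface:

```python
from typing import Callable

ABSTAIN = "I don't know. That isn't covered in the documents I have."


def answer_or_abstain(
    question: str,
    ranked: list[tuple[str, float]],  # (passage, BM25 score), best first
    floor: float,
    ask_model: Callable[[str, list[str]], str],
) -> tuple[str, bool]:
    """Return (answer, model_was_called)."""
    # Layer 1: cheap retrieval floor. Below it, the model is never called.
    if not ranked or ranked[0][1] < floor:
        return ABSTAIN, False
    # Layer 2: the grounded prompt. The model sees only the passages and
    # may still refuse if they do not actually answer the question.
    passages = [text for text, _score in ranked]
    return ask_model(question, passages), True
```

Returning whether the model was called makes the "costs zero tokens" property easy to assert in a test.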
Field note from building this: I measured the real BM25 scores on the bundled corpus before setting the floor. The natural questions I measured scored from about 4 (a short one like "do you support SSO") up to 10 (a distinctive one like "is there a Linux desktop app"), while an off-topic question like the weather scored 0.35. Shorter questions score lower, so the floor sits at 2.5, with margin below the weakest measured question rather than right under it. That margin is the point: the cheap floor should clear junk without clipping a terse real question, and the grounded prompt is there to catch anything lexically close that slips through. Point the app at your own documents and you retune that one number once, against your own measured scores.
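One way to make that retuning repeatable is to check a candidate floor against your measured scores. The sketch below uses the approximate figures from the field note; `floor_margins` is a hypothetical helper and the exact wording of the weather question is an assumption:

```python
def floor_margins(
    floor: float,
    genuine: dict[str, float],
    off_topic: dict[str, float],
) -> dict[str, float]:
    """Check that the floor separates the two groups and report the margins."""
    weakest_genuine = min(genuine.values())
    strongest_junk = max(off_topic.values())
    if not strongest_junk < floor < weakest_genuine:
        raise ValueError(
            f"floor {floor} does not sit between {strongest_junk} and {weakest_genuine}"
        )
    return {
        "below_weakest_genuine": weakest_genuine - floor,
        "above_strongest_junk": floor - strongest_junk,
    }


# Approximate scores quoted in the field note above.
genuine = {"do you support SSO": 4.0, "is there a Linux desktop app": 10.0}
off_topic = {"what is the weather": 0.35}

margins = floor_margins(2.5, genuine, off_topic)
```

Run it against your own corpus with a handful of real and off-topic questions; a `ValueError` means the single floor cannot separate them and the grounded prompt will have to carry more of the load.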
pip install -r requirements.txt
# key-free: runs the whole UI with a labelled extractive fallback
LLM_MODE=demo uvicorn app.main:app --port 8000
# the real thing
export ANTHROPIC_API_KEY=sk-...
uvicorn app.main:app --port 8000
Open http://localhost:8000 and ask about plans, refunds, SSO or limits. The bundled knowledge base is the help center of a fictional file-storage product, with real facts to retrieve (an 8-dollar Team plan, a 14-day annual refund window, SSO on Business) and gaps to abstain on (there is no Linux desktop app, and it will tell you).
Demo mode needs no key: it answers by surfacing the most on-topic sentence from the top passage and labels itself clearly. It exists so the whole UI, streaming, citations and abstention can be tried for free. It does no synthesis; that is Claude's job.
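A rough sketch of what such an extractive fallback could look like, assuming "most on-topic" means the sentence sharing the most words with the question. The helper names and the label text are illustrative, not taken from the app:

```python
import re


def words(text: str) -> set[str]:
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def most_on_topic_sentence(question: str, passage: str) -> str:
    """Pick the sentence that shares the most words with the question."""
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()
    ]
    if not sentences:
        return ""
    question_words = words(question)
    # On a tie, max() keeps the earliest sentence.
    return max(sentences, key=lambda s: len(question_words & words(s)))


def demo_answer(question: str, top_passage: str) -> str:
    """Surface one sentence verbatim, clearly labelled, citing passage 1."""
    sentence = most_on_topic_sentence(question, top_passage)
    return f"[demo mode, no model] {sentence} [1]"
```

The fallback only quotes; it never rephrases or combines sentences, which is why the label matters.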
Drop your .md files into data/docs/ and restart. Chunk size, how many
passages are retrieved, and the abstention floor are three numbers in
app/config.py. Nothing else changes: the same retriever, engine, UI and
tests work on any corpus. PDFs and web pages enter as text through whatever
extractor you already use.
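To make the three knobs concrete, here is a hypothetical sketch of a settings object and a paragraph-packing chunker. The field names and the `chunk_words` and `top_k` values are placeholders, not the contents of app/config.py; only the 2.5 floor comes from this README:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    chunk_words: int = 120      # placeholder value, an assumption
    top_k: int = 3              # placeholder value, an assumption
    abstain_floor: float = 2.5  # the floor described in the field note


def chunk_markdown(text: str, max_words: int) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk is closed before it would exceed max_words. A single paragraph
    longer than the limit is kept whole as its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries keeps each citation pointing at a passage that reads as a complete thought.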
python -m pytest
22 tests, no API key and no network. They cover chunking and BM25 ranking on the real corpus, the abstention floor measured against genuine and off-topic questions, the engine grounding an answer and streaming sources before tokens (checked on the serialized wire, not just in a list), the abstention path refusing without calling the model, and the grounding instruction the second layer depends on. End to end through the web layer, they also check that a mid-stream model failure degrades to a visible message, with the partial answer preserved and no raw exception leaked to the browser.
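The "checked on the serialized wire" idea can be illustrated with a small stdlib-only sketch. The event names (`sources`, `token`, `error`, `done`) and the helper functions are assumptions for illustration, not the app's actual wire format:

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: object) -> str:
    """Serialize one Server-Sent Event; JSON keeps the data on a single line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(sources: list[dict], tokens: Iterable[str]) -> Iterator[str]:
    # Sources go first so the panel can render as the answer starts.
    yield sse_event("sources", sources)
    try:
        for token in tokens:
            yield sse_event("token", token)
    except Exception:
        # Tokens already sent stay on the wire; the browser gets a generic
        # message instead of the raw exception text.
        yield sse_event("error", "The model stopped responding. The partial answer is kept.")
    yield sse_event("done", None)


def event_names(wire: str) -> list[str]:
    """Parse the serialized stream back into its ordered event names."""
    prefix = "event: "
    return [line[len(prefix):] for line in wire.splitlines() if line.startswith(prefix)]
```

Asserting on `event_names("".join(stream_answer(...)))` tests the order the browser actually sees, rather than the order of items in an in-memory list.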
This is the retrieval-and-answer layer with its UI, not a full support desk: it does not manage tickets, accounts or conversations across sessions, and it keeps no history server-side, which is what makes every answer reproducible from the question in front of it. BM25 is the right retriever for help-desk content, where the words people type are the words in the docs; for corpora full of paraphrase, the retriever is one swap behind the same interface (see the companion repo rag-quality, which measures exactly that trade-off).


