vinimabreu/rag-chat


rag-chat

A chat widget that answers questions about your documents, shows its sources, and says "I don't know" instead of making something up. The thing every business actually wants when they ask for "an AI chatbot for our site", built so you can trust what it says.

[Image: the chat answering from the docs, with a citation and sources]

A real captured run with Claude. The answer is written only from the retrieved passages, the citation chip links to the source it used, and the sources panel shows what the assistant looked at.

Most chatbot demos will confidently invent a refund policy that does not exist. This one answers only from the documents you give it, cites the passage behind every claim, and refuses when the answer is not there.

What it does

  • A real chat UI, not a curl example. Answers streamed token by token, clickable citations, a sources panel, and an honest "demo mode" badge when it runs without a model. One HTML file, one CSS file, one JS file, no build step.
  • Grounded answers with citations. The model is given the retrieved passages and told to ground every claim in them and cite by number. The citation chips in the answer link to the exact source passage.
  • An honest "I don't know." Ask it something outside the docs and it says so, instead of guessing. This is the feature that makes a support bot safe to put in front of customers.
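
As a rough illustration of "cite by number", the sketch below numbers the retrieved passages in the prompt and then reads the `[n]` markers back out of the answer so each one can be linked to its passage. The function names and the prompt wording are assumptions made for this example, not the repository's actual prompt.

```python
import re


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Number each passage so the model can cite it as [1], [2], ..."""
    numbered = "\n\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the numbered passages below.\n"
        "Cite the passage behind each claim as [n].\n"
        "If the passages do not contain the answer, say you don't know.\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}"
    )


def cited_passages(answer: str, passages: list[str]) -> list[int]:
    """Return the distinct, in-range passage numbers cited in the answer, in order."""
    seen: list[int] = []
    for match in re.findall(r"\[(\d+)\]", answer):
        n = int(match)
        # Ignore citations that point at a passage the model was never given.
        if 1 <= n <= len(passages) and n not in seen:
            seen.append(n)
    return seen
```

Dropping out-of-range numbers means a citation chip can only ever point at a passage that was actually retrieved.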

[Image: it abstains instead of guessing]

How it works

[Image: architecture]

A question goes from the browser to a FastAPI endpoint. A pure-Python BM25 retriever (no embeddings server, no GPU, no API key) finds the most relevant passages from your documents. Then the engine decides whether to answer at all, and if it does, Claude writes a grounded answer that streams back to the UI as Server-Sent Events, with the sources sent first so they appear the moment the answer starts.
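
To make "pure-Python BM25" concrete, here is a minimal sketch of Okapi BM25 scoring over an in-memory list of passages. It is not the repository's retriever: the class name, the tokenizer, and the `k1` and `b` defaults are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    """Okapi BM25 over a small in-memory list of passages."""

    def __init__(self, passages: list[str], k1: float = 1.5, b: float = 0.75):
        self.passages = passages
        self.k1 = k1
        self.b = b
        self.docs = [Counter(tokenize(p)) for p in passages]
        self.lengths = [sum(d.values()) for d in self.docs]
        # Fall back to 1.0 so an empty corpus never divides by zero.
        self.avg_len = (sum(self.lengths) / len(self.lengths)) if any(self.lengths) else 1.0
        # Document frequency: in how many passages each term appears.
        df = Counter(term for d in self.docs for term in d)
        n = len(passages)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, index: int) -> float:
        doc, length = self.docs[index], self.lengths[index]
        # Length normalisation: long passages need more matches to score the same.
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = doc.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query: str, top_k: int = 3) -> list[tuple[float, str]]:
        """Return the top_k (score, passage) pairs, best first."""
        scored = [(self.score(query, i), p) for i, p in enumerate(self.passages)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]
```

A query that shares no terms with any passage scores exactly zero in this sketch, which is the property the abstention floor relies on.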

Abstention is two layers, and that is on purpose

The honest "I don't know" is not one check but two, because one is not enough:

  1. A cheap retrieval floor. If the best passage's BM25 score is below a threshold, the assistant abstains and the model is never called. A weather question against a help center scores near zero and costs zero tokens.
  2. The grounded prompt. A question can share words with the docs without being answerable from them: "what is your stock price" shares "price" with the pricing page and clears the cheap floor. So the model is the second layer: it is told to answer only from the passages and to refuse otherwise, and it catches what the lexical floor cannot.
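
The two layers above can be sketched as a single function. The names here (`answer_or_abstain`, `call_model`, `ABSTAIN_MESSAGE`) are hypothetical, and `call_model` stands in for whatever sends the grounded prompt to the model. The second layer lives inside that call, in the instruction to refuse when the passages do not answer the question.

```python
from typing import Callable

ABSTAIN_MESSAGE = "I don't know. That is not covered in the documents I have."


def answer_or_abstain(
    question: str,
    scored_passages: list[tuple[float, str]],
    floor: float,
    call_model: Callable[[str, list[str]], str],
) -> str:
    """Apply the cheap retrieval floor, then defer to the grounded model call."""
    # Layer 1: if nothing scores above the floor, refuse without calling the model.
    if not scored_passages or max(score for score, _ in scored_passages) < floor:
        return ABSTAIN_MESSAGE
    # Layer 2: the model sees only the passages and is told to refuse otherwise.
    passages = [passage for _, passage in scored_passages]
    return call_model(question, passages)
```

Passing the model in as a callable makes it easy to check that the abstention path never reaches it.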

Field note from building this: I measured the real BM25 scores on the bundled corpus before setting the floor. The natural questions I measured scored from about 4 (a short one like "do you support SSO") up to 10 (a distinctive one like "is there a Linux desktop app"), while an off-topic question like the weather scored 0.35. Shorter questions score lower, so the floor sits at 2.5, with margin below the weakest measured question rather than right under it. That margin is the point: the cheap floor should clear junk without clipping a terse real question, and the grounded prompt is there to catch anything lexically close that slips through. Point the app at your own documents and you retune that one number once, against your own measured scores.
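
One way to repeat that measurement on your own corpus is to collect top-passage scores for a handful of genuine and off-topic questions, then check where a candidate floor sits between them. The helper below is a hypothetical sketch, and the example scores are the approximate figures quoted above rather than fresh measurements.

```python
def check_floor(
    floor: float,
    genuine_scores: list[float],
    off_topic_scores: list[float],
) -> dict:
    """Report whether a floor separates genuine from off-topic scores, and by how much."""
    weakest_genuine = min(genuine_scores)
    strongest_junk = max(off_topic_scores)
    return {
        "separates": strongest_junk < floor < weakest_genuine,
        "margin_below_weakest": weakest_genuine - floor,
        "margin_above_junk": floor - strongest_junk,
    }


# Figures from the field note: genuine questions at about 4 to 10, weather at 0.35.
report = check_floor(2.5, genuine_scores=[4.0, 10.0], off_topic_scores=[0.35])
print(report)
```

If `separates` comes back false on your documents, the floor needs moving before the numbers in this README mean anything for your corpus.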

Run it

pip install -r requirements.txt

# key-free: runs the whole UI with a labelled extractive fallback
LLM_MODE=demo uvicorn app.main:app --port 8000

# the real thing
export ANTHROPIC_API_KEY=sk-...
uvicorn app.main:app --port 8000

Open http://localhost:8000 and ask about plans, refunds, SSO or limits. The bundled knowledge base is the help center of a fictional file-storage product, with real facts to retrieve (an 8-dollar Team plan, a 14-day annual refund window, SSO on Business) and gaps to abstain on (there is no Linux desktop app, and it will tell you).

Demo mode needs no key: it answers by surfacing the most on-topic sentence from the top passage and labels itself clearly. It exists so the whole UI, streaming, citations and abstention can be tried for free. It does no synthesis; that is Claude's job.
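
A minimal sketch of what such an extractive fallback could look like, assuming "most on-topic" means the sentence sharing the most words with the question. The function name, the label text, and the overlap heuristic are all assumptions, and the repository may rank sentences differently.

```python
import re

DEMO_LABEL = "[demo mode]"


def words(text: str) -> set[str]:
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def extractive_answer(question: str, top_passage: str) -> str:
    """Return the sentence of the passage that overlaps most with the question."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", top_passage) if s.strip()]
    if not sentences:
        return f"{DEMO_LABEL} No passage available."
    question_words = words(question)
    # max() keeps the earliest sentence on ties, so the result is deterministic.
    best = max(sentences, key=lambda s: len(question_words & words(s)))
    return f"{DEMO_LABEL} {best}"
```

Nothing here rewrites or combines sentences, which is why this mode is labelled as a fallback and not as an answer from a model.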

Point it at your own documents

Drop your .md files into data/docs/ and restart. Chunk size, how many passages are retrieved, and the abstention floor are three numbers in app/config.py. Nothing else changes: the same retriever, engine, UI and tests work on any corpus. PDFs and web pages enter as text through whatever extractor you already use.
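
For orientation, the three numbers and the loading step might look roughly like the sketch below. The field names and the first two default values are placeholders invented for this example (only the 2.5 floor comes from this README), and the word-count chunker is a simplification, so check app/config.py for the real names and values.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Settings:
    chunk_words: int = 120      # placeholder: words per chunk
    top_k: int = 3              # placeholder: passages retrieved per question
    abstain_floor: float = 2.5  # the floor value quoted in this README


def chunk_text(text: str, chunk_words: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_words words."""
    if chunk_words <= 0:
        raise ValueError("chunk_words must be positive")
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_words]) for i in range(0, len(tokens), chunk_words)]


def load_chunks(docs_dir: str, settings: Settings) -> list[str]:
    """Read every .md file in docs_dir, in name order, and chunk its text."""
    chunks: list[str] = []
    for path in sorted(Path(docs_dir).glob("*.md")):
        chunks.extend(chunk_text(path.read_text(encoding="utf-8"), settings.chunk_words))
    return chunks
```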

Tests

python -m pytest

22 tests, no API key and no network. They cover:

  • chunking and BM25 ranking on the real corpus;
  • the abstention floor, measured against genuine and off-topic questions;
  • the engine grounding an answer and streaming sources before tokens (checked on the serialized wire, not just in a list);
  • the abstention path refusing without calling the model;
  • the grounding instruction the second layer depends on;
  • a mid-stream model failure degrading to a visible message (with the partial answer preserved and no raw exception leaked to the browser), end to end through the web layer.
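
To show what "checked on the serialized wire" can mean, the sketch below serializes a stream as Server-Sent Events and exposes the two properties worth asserting: the sources event precedes every token, and a failure mid-stream becomes a generic error event without the exception text. The event names (`sources`, `token`, `error`, `done`) and the payload shapes are assumptions for this example, not the repository's wire format.

```python
import json
from typing import Iterable, Iterator


def sse_event(event: str, data: dict) -> str:
    """Serialize one Server-Sent Event; json.dumps keeps the data on one line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(sources: list[dict], tokens: Iterable[str]) -> Iterator[str]:
    """Yield the sources first, then tokens, then a closing event."""
    yield sse_event("sources", {"sources": sources})
    try:
        for token in tokens:
            yield sse_event("token", {"text": token})
    except Exception:
        # Tokens already sent stay on the wire; the exception text does not.
        yield sse_event("error", {"message": "The answer was interrupted. Please try again."})
    yield sse_event("done", {})


def failing_tokens() -> Iterator[str]:
    """Simulate a model that fails after one token."""
    yield "Partial"
    raise RuntimeError("secret detail")


wire = "".join(stream_answer([{"id": 1, "title": "Pricing"}], ["The ", "Team ", "plan"]))
broken_wire = "".join(stream_answer([], failing_tokens()))
```

In a FastAPI app, a generator like `stream_answer` would typically be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`.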

Scope notes

This is the retrieval-and-answer layer with its UI, not a full support desk: it does not manage tickets, accounts or conversations across sessions, and it keeps no history server-side, which is what makes every answer reproducible from the question in front of it. BM25 is the right retriever for help-desk content, where the words people type are the words in the docs; for corpora full of paraphrase, the retriever is one swap behind the same interface (see the companion repo rag-quality, which measures exactly that trade-off).

About

Chat-with-your-docs widget: grounded answers from your documents with clickable citations, an honest "I don't know", and a real chat UI. Pure-Python retrieval, runs key-free in demo mode.
