A local-first, fully-free retrieval-augmented Q&A pipeline over your own documents — production-shaped (FastAPI + Docker + structured logs + cache + streaming + eval harness) with an optional QLoRA fine-tuning track on a free Colab GPU.
Runs on a MacBook (Apple Silicon Metal) at ~2.8 s end-to-end, with retrieval at MRR = 1.0 and 100% refusal accuracy on out-of-corpus questions on the bundled sample corpus.
data/raw/*.{pdf,md,html,txt}
│
▼ ingest
recursive chunker ──► MiniLM embeddings ──► Chroma (+ BM25 sidecar)
│
▼ query
hybrid retrieve (dense + BM25 + RRF)
│
▼
cross-encoder rerank ──► top-5
│
▼
confidence gate → refuse if score < threshold
│
▼
Ollama (qwen2.5:3b) ──► answer + [n] citations
- Hybrid retrieval: dense (sentence-transformers) + sparse (BM25), fused with reciprocal rank fusion. Closes the keyword-recall gap on technical docs.
- Cross-encoder reranking: `ms-marco-MiniLM-L-6-v2` reranks top-50 → top-5.
- Two-layer hallucination guard: refuses if the reranker score is below a threshold or if the LLM emits the canonical refusal phrase. Citations are attached structurally (chunk → source) rather than asked of the model.
- Conversation memory: summary-buffer per `session_id`, keeps the context window small for the 3B model.
- Streaming: Server-Sent Events on `/chat/stream`.
- QLoRA track: Colab notebook fine-tunes Qwen2.5-3B in ~15 min on a free T4, producing a ~50 MB adapter targeting answer style (always-cite, refuse cleanly), separate from the API.
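To make the fusion step in the hybrid-retrieval bullet concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists of chunk IDs. The function name, the chunk IDs, and the constant `k=60` (a commonly used default for RRF) are assumptions for illustration; the project's own implementation and constant may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default, not a value
    taken from this repo.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the dense and BM25 retrievers.
dense = ["c3", "c1", "c7"]
bm25 = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense, bm25])
print(fused)
```

Chunks that appear near the top of both lists (`c1`, `c3`) outrank chunks that only one retriever found, which is what lets exact keyword matches from BM25 surface alongside dense hits.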
Run with python eval/run_eval.py against the bundled 14-question golden set:
| Stage | Metric | Value |
|---|---|---|
| Retrieval | precision@5 | 0.564 |
| | recall@5 | 1.000 |
| | MRR | 1.000 |
| Answer | token F1 vs gold | 0.562 |
| | citation_validity | 1.000 |
| | faithfulness_proxy | 0.839 |
| | refusal_accuracy | 1.000 |
| Latency p50 | retrieve | 1 ms |
| | rerank | 63 ms |
| | generate | 2,571 ms |
| | total | 2.8 s |
Reproduce these numbers locally with PYTHONPATH=. python eval/run_eval.py --out eval/report.json (writes a per-question JSON report; gitignored so it stays local).
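For readers unfamiliar with the retrieval metrics in the table above, the following sketch shows one common way to compute them for a single question. The function names are hypothetical, and two details are assumptions rather than facts about `eval/run_eval.py`: precision divides by `k`, and "binary" recall means "at least one relevant chunk appears in the top k".

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def binary_recall_at_k(retrieved, relevant, k):
    """1.0 if any relevant ID appears in the top k, else 0.0 (assumed definition)."""
    return 1.0 if any(cid in relevant for cid in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant ID, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved, start=1):
        if cid in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical single-question example.
retrieved = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "d"}
p5 = precision_at_k(retrieved, relevant, 5)
r5 = binary_recall_at_k(retrieved, relevant, 5)
rr = reciprocal_rank(retrieved, relevant)
print(p5, r5, rr)
```

MRR is the mean of `reciprocal_rank` over all questions in the golden set, so an MRR of 1.000 means a relevant chunk ranked first for every question.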
Prereqs: Python 3.10+, Ollama running locally.
# 1. one-time setup
ollama pull qwen2.5:3b
python -m venv .venv && source .venv/bin/activate
pip install -e .
cp .env.example .env
# 2. ingest the bundled sample docs
python scripts/ingest.py data/samples
# 3. ask
python scripts/query.py "What is scaled dot-product attention?"
Or run the HTTP API:
python -m rag.api.main
# in another shell:
curl -s http://localhost:8000/health | python3 -m json.tool
curl -s -X POST http://localhost:8000/query \
-H 'Content-Type: application/json' \
-d '{"question":"What is LoRA?"}' | python3 -m json.tool
# interactive docs: http://localhost:8000/docs
Or with Docker:
docker compose build
docker compose run --rm ingest # one-time
docker compose up -d
.
├── data/samples/ # 3 sample MD docs (attention, BERT, LoRA)
├── eval/ # golden set + retrieval/answer metrics + runner
├── finetune/ # Colab QLoRA notebook + training/eval scripts + sample data
├── scripts/ # CLI: ingest, query, compare, merge_lora, benchmark
├── src/rag/
│ ├── api/ # FastAPI app, routes, schemas, response cache
│ ├── chain/ # prompts, memory, pipeline orchestrator
│ ├── embeddings/ # sentence-transformers wrapper (cached)
│ ├── ingest/ # loaders + recursive chunker
│ ├── llm/ # base protocol + Ollama + hf_local backends
│ ├── retrieval/ # dense + BM25 + RRF + cross-encoder rerank
│ ├── vectorstore/ # Chroma facade
│ ├── utils/ # logging + diskcache
│ └── config.py
├── tests/ # 37 unit/API tests, no models or network required
├── Dockerfile
└── docker-compose.yml
All knobs live in .env (loaded by pydantic-settings). Key vars:
| Var | Default | What |
|---|---|---|
| `LLM_BACKEND` | `ollama` | `ollama` or `hf_local` |
| `OLLAMA_MODEL` | `qwen2.5:3b` | Model tag for Ollama |
| `HF_MODEL_NAME` | `Qwen/Qwen2.5-3B-Instruct` | Base model for `hf_local` |
| `LORA_ADAPTER_PATH` | (empty) | Optional LoRA adapter dir (Phase 4) |
| `EMBED_MODEL` | `all-MiniLM-L6-v2` | Sentence-transformer for embeddings |
| `EMBED_DEVICE` | `cpu` | `cpu`, `mps`, or `cuda` |
| `RETRIEVE_TOP_K` | 20 | Candidates pulled from each retriever |
| `RERANK_TOP_N` | 5 | Final top-N after cross-encoder |
| `USE_RERANKER` | true | Toggle reranking |
| `CONFIDENCE_THRESHOLD` | -5.0 | Reranker logit floor for refusal |
| `LLM_TEMPERATURE` | 0.2 | |
| `LLM_MAX_TOKENS` | 512 | |
See .env.example for the full list.
| Endpoint | Method | Body | Returns |
|---|---|---|---|
| `/health` | GET | — | `{status, llm_backend, llm_model, chunk_count, …}` |
| `/query` | POST | `{question, top_k?, use_reranker?}` | `{answer, citations, refused, refusal_reason, timings_ms, cache_hit}` |
| `/chat` | POST | `{session_id, message}` | same shape as `/query` |
| `/chat/stream` | POST | `{session_id, message}` | SSE: `meta` → `token`* → `done` |
| `/chat/{session_id}` | DELETE | — | 204 (clears memory) |
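A client consuming `/chat/stream` has to split the Server-Sent Events stream into events. The sketch below parses already-decoded SSE lines using the standard `event:` / `data:` framing. It assumes the server labels events with `event: meta`, `event: token`, and `event: done`; the payloads in the sample are invented for illustration and are not the API's documented schema.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # The SSE format strips at most one leading space after the colon.
            data.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream contents: one meta event, two tokens, then done.
sample = [
    "event: meta", 'data: {"citations": 2}', "",
    "event: token", "data: Lo", "",
    "event: token", "data: RA", "",
    "event: done", "data: {}", "",
]
events = list(parse_sse(sample))
answer = "".join(data for event, data in events if event == "token")
print(answer)
```

In practice the lines would come from a streaming HTTP response rather than a list; any client that can iterate over response lines can feed this parser.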
Refusals carry refusal_reason: "low_confidence" (no useful chunks retrieved)
or "generator" (model emitted the refusal phrase).
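The two refusal reasons correspond to two checks made at different points in the pipeline. The sketch below shows the control flow under stated assumptions: the gate compares the best reranker score against the threshold, the refusal phrase shown is a placeholder (the real phrase lives in the project's prompts), and `Result` and `answer_with_guard` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Placeholder phrase for illustration only.
REFUSAL_PHRASE = "I don't know based on the provided documents."


@dataclass
class Result:
    answer: str
    refused: bool
    refusal_reason: Optional[str] = None


def answer_with_guard(
    rerank_scores: Sequence[float],
    generate: Callable[[], str],
    threshold: float = -5.0,
) -> Result:
    # Layer 1: refuse before calling the LLM if retrieval looks weak.
    if not rerank_scores or max(rerank_scores) < threshold:
        return Result(REFUSAL_PHRASE, True, "low_confidence")
    # Layer 2: the LLM itself may decline after reading the chunks.
    text = generate()
    if REFUSAL_PHRASE.lower() in text.lower():
        return Result(text, True, "generator")
    return Result(text, False)


weak = answer_with_guard([-7.2, -6.1], lambda: "never called")
declined = answer_with_guard([3.4], lambda: REFUSAL_PHRASE)
ok = answer_with_guard([3.4], lambda: "LoRA adds low-rank adapters [1].")
print(weak.refusal_reason, declined.refusal_reason, ok.refused)
```

Because the first check runs before generation, a low-confidence refusal costs no LLM call.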
/query responses are cached (1 h TTL, keyed by question + retrieval params +
LLM model). Refusals are not cached.
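One way to build a cache key from those ingredients is to hash a canonical JSON encoding of them. This is a sketch, not the project's code: `cache_key` and `should_cache` are hypothetical helpers, and the whitespace and case normalization of the question is an assumption.

```python
import hashlib
import json


def cache_key(question, top_k, use_reranker, llm_model):
    """Stable key over the question, retrieval params, and LLM model."""
    payload = {
        "question": " ".join(question.lower().split()),  # assumed normalization
        "top_k": top_k,
        "use_reranker": use_reranker,
        "llm_model": llm_model,
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def should_cache(response):
    """Refusals are skipped so they are re-evaluated after a re-ingest."""
    return not response.get("refused", False)


k1 = cache_key("What is LoRA?", 20, True, "qwen2.5:3b")
k2 = cache_key("  what is   lora? ", 20, True, "qwen2.5:3b")
k3 = cache_key("What is LoRA?", 20, True, "my-rag-model")
print(k1 == k2, k1 == k3)
```

Including the model tag in the key means switching `OLLAMA_MODEL` never serves answers generated by the previous model.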
# basic
python scripts/query.py "What is scaled dot-product attention?"
# multi-turn (notice the second resolves "its" via summary memory)
python scripts/query.py "What is BERT?" --session demo
python scripts/query.py "How big is its large variant?" --session demo
# streaming
python scripts/query.py "Explain LoRA briefly" --stream
# out-of-corpus → refused, no hallucination
python scripts/query.py "What is the population of France?"
More worked examples with expected output: examples/queries.md.
# retrieval-only (fast, ~2 s — no LLM call)
PYTHONPATH=. python eval/run_eval.py --skip-generation
# full eval (retrieval + answer + latency)
PYTHONPATH=. python eval/run_eval.py --out eval/report.json
# latency only
PYTHONPATH=. python scripts/benchmark.py --n 20
Metrics implemented (no paid LLM judge required):
- Retrieval: precision@k, binary recall@k, MRR
- Answer: token F1, citation_validity, refusal_correctness, faithfulness_proxy (n-gram overlap with retrieved chunks)
The faithfulness proxy is intentionally lexical — it'll miss paraphrased hallucinations. RAGAS-with-LLM-judge is the documented paid upgrade.
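The answer-side metrics can be approximated in a few lines, which also shows why the faithfulness proxy misses paraphrases. The sketch below assumes simple lowercase word tokenization and bigram overlap; the function names, the tokenizer, and `n=2` are illustrative choices, not necessarily what the eval harness uses.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens (assumed tokenizer)."""
    return re.findall(r"\w+", text.lower())


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall against the gold answer."""
    pred, ref = tokens(prediction), tokens(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def ngrams(toks, n):
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def faithfulness_proxy(answer, chunks, n=2):
    """Share of the answer's n-grams that also occur in the retrieved chunks."""
    answer_grams = ngrams(tokens(answer), n)
    if not answer_grams:
        return 0.0
    context_grams = set()
    for chunk in chunks:
        context_grams |= ngrams(tokens(chunk), n)
    return len(answer_grams & context_grams) / len(answer_grams)


f1 = token_f1("LoRA adds low rank matrices",
              "LoRA adds trainable low rank matrices")
faith = faithfulness_proxy("attention uses scaled dot products",
                           ["Scaled dot products are used in attention."])
print(round(f1, 3), faith)
```

In the second example the answer is fully supported by the chunk, yet only half of its bigrams match because the wording was reordered. The same blind spot applies in reverse to a paraphrased hallucination.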
The fine-tuning track is decoupled from the API. The API stays clean of
bitsandbytes/CUDA; fine-tuning runs in Colab, produces a small LoRA
adapter, and Phase 4 scripts merge it back into the base model for local use.
See finetune/README.md for the full walkthrough.
1. Open finetune/colab_notebook.ipynb in Colab → T4 GPU
2. Run cells top to bottom (~15-25 min)
3. Download the adapter zip
4. python scripts/merge_lora.py --base Qwen/Qwen2.5-3B-Instruct \
--adapter <unzipped-dir> --out outputs/qwen2_5_3b_merged
5. Convert to GGUF + register with Ollama (commands in finetune/README.md)
6. python scripts/compare.py "What is LoRA?" \
--base-backend ollama --base-model qwen2.5:3b \
--adapter-backend ollama --adapter-model my-rag-model
The fine-tuning targets answer style — always-cite [n], refuse cleanly —
not domain knowledge. It's defensible regardless of corpus.
pytest -q
# 37 passed in ~3s
No models or network needed: API tests use dependency_overrides with a fake pipeline; metric and chain tests use plain dataclasses.
| Decision | Why |
|---|---|
| Chroma over FAISS as default | Persistent + metadata filters out of the box; FAISS exposed as a swappable backend |
| Hybrid retrieval (dense + BM25) | ~30 LOC for measurable recall gain on technical-doc keywords (torch.nn.Embedding, error codes) |
| Cross-encoder rerank by default | ~60 ms tax on M4; toggleable via USE_RERANKER=false |
| Citations attached structurally, not generated | Eliminates a whole class of citation hallucinations |
| Two-layer refusal | One layer guards retrieval gaps, the other guards generator confabulation |
| Memory = summary buffer | Keeps context tight for a 3B model on a 4–8K window |
| Ollama as primary serving runtime on M-series | Avoids bitsandbytes pain on Apple Silicon; uses Metal-optimized GGUF |
| Disk-backed response cache | 1 h TTL, refusals excluded so re-ingest re-evaluates them |
| Models not baked into the Docker image | Image is ~700 MB instead of multiple GB; first request lazy-loads |
| Fine-tuning targets style, not knowledge | The adapter remains useful regardless of which docs you swap in |
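The "citations attached structurally" row in the table above can be illustrated as follows: chunks are numbered when the prompt is built, and after generation each `[n]` marker is mapped back to the chunk it names, so a source can only come from what was actually retrieved. The `Chunk` class, helper names, and file names are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str


def build_context(chunks):
    """Number the chunks so the model can refer to them as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))


def attach_citations(answer, chunks):
    """Resolve [n] markers to sources; out-of-range markers are dropped."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [
        {"n": n, "source": chunks[n - 1].source}
        for n in cited
        if 1 <= n <= len(chunks)
    ]


# Hypothetical retrieved chunks and model output.
chunks = [
    Chunk("LoRA freezes the base weights.", "lora.md"),
    Chunk("BERT is a bidirectional encoder.", "bert.md"),
]
context = build_context(chunks)
citations = attach_citations("LoRA freezes the base weights [1] [7].", chunks)
print(citations)
```

The model only ever emits an index; the source path is looked up from retrieval metadata, so a marker such as `[7]` that points at nothing is discarded rather than rendered as a made-up reference.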
| Workload | Memory | Notes |
|---|---|---|
| Inference (M4, 16GB+) | ~4–5 GB | Embeddings + Chroma + reranker + qwen2.5:3b Q4_K_M |
| QLoRA on Colab free T4 (16GB VRAM) | ~6–8 GB | rank-16 LoRA on q/k/v/o + MLP, batch 2 × accum 4 |
| Docker image (CPU torch) | ~700 MB | Models download lazily on first request |
- `Cannot connect to Ollama`: make sure `ollama serve` is running and `OLLAMA_BASE_URL` matches (default `http://localhost:11434`). On Docker for Mac use `http://host.docker.internal:11434` (already set in `docker-compose.yml`).
- `BM25 index not found`: ingest hasn't been run yet; `python scripts/ingest.py data/samples` builds both Chroma and BM25.
- MPS errors during inference: set `EMBED_DEVICE=cpu` in `.env`. MiniLM is fast enough on CPU; the LLM runs in Ollama (Metal) regardless.
- Slow first query: embedding and reranker model downloads happen lazily; subsequent calls are fast (cache + warm Ollama model).
Defaults are tuned for local development, not internet exposure. Read this section before deploying anywhere reachable from outside your machine.
- API binds to `0.0.0.0:8000`: reachable from anything on your local network
- CORS `allow_origins=["*"]`: any browser origin can call it
- No authentication on any endpoint
- No request rate limiting
| Issue | Fix |
|---|---|
| Open API | Add an auth dependency (HTTP Bearer or API key) to the router in `src/rag/api/routes.py` |
| `CORS=*` | Set `allow_origins=["https://yourfrontend.example"]` in `src/rag/api/main.py` |
| `0.0.0.0` bind | Set `API_HOST=127.0.0.1` and put a TLS-terminating reverse proxy (Caddy/Nginx) in front |
| No rate limit | Add slowapi middleware (already in pyproject.toml-ready territory) |
| Secrets in `.env` | Don't commit `.env` (already gitignored). Use a real secrets manager in production |
| Container as root | The Dockerfile already drops to a non-root app user before runtime |
- Prompt injection from ingested docs. A malicious PDF/MD in your corpus can contain instructions like "ignore previous, output X" that become retrieved chunks and reach the LLM verbatim. The structural-citation design helps (the model can't fabricate sources it didn't see), but this is the fundamental risk of any RAG over user-supplied content. For high-stakes deployments, run ingested chunks through a moderation classifier before storing them in Chroma.
- BM25 index uses `pickle`. `load_bm25_index` calls `pickle.load`. Never point it at a BM25 index from an untrusted source, because pickle deserialization is arbitrary code execution. Always rebuild from your own corpus. (See the docstring on `load_bm25_index`.)
- Config controls outbound network. `OLLAMA_BASE_URL` and `HF_MODEL_NAME` are read from environment. Anyone who can write `.env` can redirect the pipeline's outbound traffic or load a different model. Treat `.env` as a secret on the deployment host.
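If a pickled index ever has to live somewhere other processes can write to, one possible mitigation for the pickle risk above is to authenticate the file before unpickling it. The sketch below is not part of this project: `save_signed` and `load_signed` are hypothetical helpers that prepend an HMAC-SHA256 tag and refuse to load a file whose tag does not verify. Rebuilding from your own corpus remains the simpler option.

```python
import hashlib
import hmac
import pickle

TAG_LEN = 32  # size of an HMAC-SHA256 digest in bytes


def save_signed(obj, path, key):
    """Write HMAC tag + pickle bytes; key is a secret bytes value."""
    blob = pickle.dumps(obj)
    tag = hmac.new(key, blob, hashlib.sha256).digest()
    with open(path, "wb") as f:
        f.write(tag + blob)


def load_signed(path, key):
    """Verify the tag before unpickling; raise if the file was altered."""
    with open(path, "rb") as f:
        data = f.read()
    tag, blob = data[:TAG_LEN], data[TAG_LEN:]
    expected = hmac.new(key, blob, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("index failed integrity check; rebuild it from the corpus")
    return pickle.loads(blob)
```

The check only proves the file was written by someone holding the key, so the key needs the same protection as `.env`.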
It is not a hardened, multi-tenant SaaS RAG service. It's a clean
single-tenant local-first project. If you need multi-tenant isolation,
quota enforcement, audit logs, or document-level ACLs, treat the components
in src/rag/ as a starting kit and add the production layer yourself.
- ✅ Phase 1: System plan
- ✅ Phase 2: Ingestion, hybrid retrieval, reranking, prompts, memory, pipeline
- ✅ Phase 3: QLoRA fine-tuning (Colab notebook, training/eval scripts, sample data)
- ✅ Phase 4: Adapter integration (`merge_lora.py`, `compare.py`) + two-layer confidence fallback
- ✅ Phase 5: FastAPI (query/chat/stream) + Docker + caching + structured logs
- ✅ Phase 6: Eval harness (golden set + retrieval/answer metrics) + latency benchmark
- ✅ Phase 7: This README + examples + upgrade paths
Paid upgrade paths (managed services, judges, larger models) are catalogued in UPGRADES.md. Each has a free fallback already wired in.