RAG Q&A System with QLoRA Fine-Tuning

A local-first, fully-free retrieval-augmented Q&A pipeline over your own documents — production-shaped (FastAPI + Docker + structured logs + cache + streaming + eval harness) with an optional QLoRA fine-tuning track on a free Colab GPU.

Runs on a MacBook (Apple Silicon Metal) at ~2.8 s end-to-end, with retrieval at MRR = 1.0 and 100% refusal accuracy on out-of-corpus questions on the bundled sample corpus.

What it does

data/raw/*.{pdf,md,html,txt}
    │
    ▼ ingest
recursive chunker  ──►  MiniLM embeddings  ──►  Chroma  (+ BM25 sidecar)
                                                  │
                                                  ▼ query
                       hybrid retrieve (dense + BM25 + RRF)
                                                  │
                                                  ▼
                       cross-encoder rerank   ──►  top-5
                                                  │
                                                  ▼
                  confidence gate → refuse if score < threshold
                                                  │
                                                  ▼
                       Ollama (qwen2.5:3b)  ──►  answer + [n] citations

Hybrid retrieval — dense (sentence-transformers) + sparse (BM25), fused with reciprocal rank fusion. Closes the keyword-recall gap on technical docs.
Cross-encoder reranking — ms-marco-MiniLM-L-6-v2 reranks top-50 → top-5.
Two-layer hallucination guard — refuses if the reranker score is below a threshold or if the LLM emits the canonical refusal phrase. Citations are attached structurally (chunk → source) rather than asked of the model.
Conversation memory — summary-buffer per session_id, keeps the context window small for the 3B model.
Streaming — Server-Sent Events on /chat/stream.
QLoRA track — Colab notebook fine-tunes Qwen2.5-3B in ~15-25 min on a free T4, producing a ~50 MB adapter targeting answer style (always-cite, refuse cleanly), separate from the API.
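
To make the fusion step of hybrid retrieval concrete, here is a minimal sketch of reciprocal rank fusion over two ranked lists of chunk ids. The function name `rrf_fuse`, the constant `k=60` (a value commonly used with RRF), and the sample ids are assumptions for illustration, not the project's actual code.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=None):
    """Fuse ranked lists of ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n is not None else fused


dense_hits = ["c3", "c1", "c7"]  # hypothetical dense-retriever order
bm25_hits = ["c1", "c9", "c3"]   # hypothetical BM25 order
print(rrf_fuse([dense_hits, bm25_hits]))
```

Ids that rank well in both lists (`c1`, `c3`) rise above ids found by only one retriever, which is how the sparse side can recover keyword matches the dense side misses.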

Live eval results (M4, qwen2.5:3b via Ollama)

Run with PYTHONPATH=. python eval/run_eval.py against the bundled 14-question golden set:

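| Stage | Metric | Value |
|---|---|---|
| Retrieval | precision@5 | 0.564 |
| Retrieval | recall@5 | 1.000 |
| Retrieval | MRR | 1.000 |
| Answer | token F1 vs gold | 0.562 |
| Answer | citation_validity | 1.000 |
| Answer | faithfulness_proxy | 0.839 |
| Answer | refusal_accuracy | 1.000 |
| Latency p50 | retrieve | 1 ms |
| Latency p50 | rerank | 63 ms |
| Latency p50 | generate | 2,571 ms |
| Latency p50 | total | 2.8 s |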

Reproduce these numbers locally with PYTHONPATH=. python eval/run_eval.py --out eval/report.json (writes a per-question JSON report; gitignored so it stays local).

Quickstart

Prereqs: Python 3.10+, Ollama running locally.

# 1. one-time setup
ollama pull qwen2.5:3b
python -m venv .venv && source .venv/bin/activate
pip install -e .
cp .env.example .env

# 2. ingest the bundled sample docs
python scripts/ingest.py data/samples

# 3. ask
python scripts/query.py "What is scaled dot-product attention?"

Or run the HTTP API:

python -m rag.api.main
# in another shell:
curl -s http://localhost:8000/health | python3 -m json.tool
curl -s -X POST http://localhost:8000/query \
     -H 'Content-Type: application/json' \
     -d '{"question":"What is LoRA?"}' | python3 -m json.tool
# interactive docs:  http://localhost:8000/docs
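
The same /query call can be made from Python. The sketch below uses only the standard library; the helper names `build_query_request` and `ask` are hypothetical, and the request shape simply mirrors the curl example above.

```python
import json
import urllib.request


def build_query_request(question, base_url="http://localhost:8000"):
    """Build a POST request for /query with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question, base_url="http://localhost:8000", timeout=60):
    """Send the question and return the decoded JSON response."""
    req = build_query_request(question, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires the API to be running):
#   print(ask("What is LoRA?")["answer"])
```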

Or with Docker:

docker compose build
docker compose run --rm ingest    # one-time
docker compose up -d

Project layout

.
├── data/samples/           # 3 sample MD docs (attention, BERT, LoRA)
├── eval/                   # golden set + retrieval/answer metrics + runner
├── finetune/               # Colab QLoRA notebook + training/eval scripts + sample data
├── scripts/                # CLI: ingest, query, compare, merge_lora, benchmark
├── src/rag/
│   ├── api/                # FastAPI app, routes, schemas, response cache
│   ├── chain/              # prompts, memory, pipeline orchestrator
│   ├── embeddings/         # sentence-transformers wrapper (cached)
│   ├── ingest/             # loaders + recursive chunker
│   ├── llm/                # base protocol + Ollama + hf_local backends
│   ├── retrieval/          # dense + BM25 + RRF + cross-encoder rerank
│   ├── vectorstore/        # Chroma facade
│   ├── utils/              # logging + diskcache
│   └── config.py
├── tests/                  # 37 unit/API tests, no models or network required
├── Dockerfile
└── docker-compose.yml

Configuration

All knobs live in .env (loaded by pydantic-settings). Key vars:

| Var | Default | What |
|---|---|---|
| `LLM_BACKEND` | `ollama` | `ollama` or `hf_local` |
| `OLLAMA_MODEL` | `qwen2.5:3b` | Model tag for Ollama |
| `HF_MODEL_NAME` | `Qwen/Qwen2.5-3B-Instruct` | Base model for hf_local |
| `LORA_ADAPTER_PATH` | (empty) | Optional LoRA adapter dir (Phase 4) |
| `EMBED_MODEL` | `all-MiniLM-L6-v2` | Sentence-transformer for embeddings |
| `EMBED_DEVICE` | `cpu` | `cpu`, `mps`, or `cuda` |
| `RETRIEVE_TOP_K` | 20 | Candidates pulled from each retriever |
| `RERANK_TOP_N` | 5 | Final top-N after cross-encoder |
| `USE_RERANKER` | true | Toggle reranking |
| `CONFIDENCE_THRESHOLD` | -5.0 | Reranker logit floor for refusal |
| `LLM_TEMPERATURE` | 0.2 | Sampling temperature for generation |
| `LLM_MAX_TOKENS` | 512 | Maximum tokens generated per answer |

See .env.example for the full list.

API reference

| Endpoint | Method | Body | Returns |
|---|---|---|---|
| `/health` | GET | (none) | `{status, llm_backend, llm_model, chunk_count, …}` |
| `/query` | POST | `{question, top_k?, use_reranker?}` | `{answer, citations, refused, refusal_reason, timings_ms, cache_hit}` |
| `/chat` | POST | `{session_id, message}` | same shape as `/query` |
| `/chat/stream` | POST | `{session_id, message}` | SSE: `meta` → `token`* → `done` |
| `/chat/{session_id}` | DELETE | (none) | 204 (clears memory) |
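
A client of /chat/stream has to split the Server-Sent Events stream into events before it can use the `meta`, `token`, and `done` messages. The parser below is a small sketch that follows the standard SSE line format. The payloads in `sample` are invented for illustration; the actual `data:` contents of each event are not specified here.

```python
def parse_sse(lines):
    """Yield (event, data) pairs from an iterable of SSE text lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":                       # a blank line ends one event
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):           # comment or keep-alive line
            continue
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Only one leading space is stripped, so tokens that begin
            # with a space keep it.
            data.append(value[1:] if value.startswith(" ") else value)


# Hypothetical stream: one meta event, two tokens, then done.
sample = [
    "event: meta", 'data: {"refused": false}', "",
    "event: token", "data: LoRA", "",
    "event: token", "data:  adapts", "",
    "event: done", "data: {}", "",
]
answer = "".join(d for e, d in parse_sse(sample) if e == "token")
print(answer)
```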

Refusals carry refusal_reason: "low_confidence" (the best reranker score fell below CONFIDENCE_THRESHOLD, so no useful chunks were retrieved) or "generator" (the model emitted the refusal phrase).
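
The two refusal paths can be sketched as a guard around the generation call. This is an illustrative outline only: `answer_with_guard`, the `generate` callable, and the wording of `REFUSAL_PHRASE` are assumptions; the default threshold of -5.0 comes from the Configuration table.

```python
# Hypothetical wording; the project's canonical refusal phrase may differ.
REFUSAL_PHRASE = "I don't know based on the provided documents."


def answer_with_guard(rerank_scores, generate, threshold=-5.0):
    """Apply both refusal layers around a generate() callable."""
    # Layer 1: refuse before calling the LLM if retrieval looks too weak.
    if not rerank_scores or max(rerank_scores) < threshold:
        return {"answer": REFUSAL_PHRASE, "refused": True,
                "refusal_reason": "low_confidence"}
    answer = generate()
    # Layer 2: the model itself emitted the refusal phrase.
    if REFUSAL_PHRASE.lower() in answer.lower():
        return {"answer": answer, "refused": True,
                "refusal_reason": "generator"}
    return {"answer": answer, "refused": False, "refusal_reason": None}


print(answer_with_guard([-9.1, -7.3], lambda: "never called")["refusal_reason"])
```

Because the first layer runs before generation, a low-confidence refusal costs no LLM call.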

/query responses are cached (1 h TTL, keyed by question + retrieval params + LLM model). Refusals are not cached.
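
One plausible way to build such a cache key is to hash a canonical JSON encoding of the inputs. The sketch below is an assumption about the approach, not the project's code: the field names, the whitespace and case normalization of the question, and the helper `should_cache` are all hypothetical.

```python
import hashlib
import json

CACHE_TTL_SECONDS = 3600  # 1 h TTL, as stated above


def cache_key(question, top_k, use_reranker, llm_model):
    """Derive a stable key from the question, retrieval params, and model."""
    payload = {
        "q": " ".join(question.lower().split()),  # assumed normalization
        "top_k": top_k,
        "use_reranker": use_reranker,
        "model": llm_model,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def should_cache(response):
    """Refusals are not cached, so a later re-ingest can change the outcome."""
    return not response["refused"]


print(cache_key("What is LoRA?", 20, True, "qwen2.5:3b")[:12])
```

Including the model tag in the key means switching `OLLAMA_MODEL` never serves an answer generated by the previous model.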

Example queries

# basic
python scripts/query.py "What is scaled dot-product attention?"

# multi-turn (notice the second resolves "its" via summary memory)
python scripts/query.py "What is BERT?" --session demo
python scripts/query.py "How big is its large variant?" --session demo

# streaming
python scripts/query.py "Explain LoRA briefly" --stream

# out-of-corpus → refused, no hallucination
python scripts/query.py "What is the population of France?"

More worked examples with expected output: examples/queries.md.
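
The multi-turn example relies on summary-buffer memory: recent turns are kept verbatim and older ones are folded into a running summary. A minimal sketch of that idea follows. The class name, the `max_turns` parameter, and the stand-in summarizer are assumptions; a real summarizer would presumably call the LLM.

```python
class SummaryBufferMemory:
    """Keep the last max_turns turns verbatim; summarize everything older."""

    def __init__(self, summarize, max_turns=4):
        self.summarize = summarize  # callable(summary, overflow_turns) -> str
        self.max_turns = max_turns
        self.summary = ""
        self.turns = []             # list of (role, text)

    def add(self, role, text):
        self.turns.append((role, text))
        if len(self.turns) > self.max_turns:
            overflow = self.turns[:-self.max_turns]
            self.turns = self.turns[-self.max_turns:]
            self.summary = self.summarize(self.summary, overflow)

    def context(self):
        """Render the summary plus recent turns for inclusion in a prompt."""
        lines = [f"Summary: {self.summary}"] if self.summary else []
        lines += [f"{role}: {text}" for role, text in self.turns]
        return "\n".join(lines)


def naive_summarize(summary, overflow):
    """Stand-in summarizer that just concatenates the overflowed turns."""
    extra = " ".join(f"{role} said: {text}" for role, text in overflow)
    return (summary + " " + extra).strip()


sessions = {}  # one memory object per session_id
mem = sessions.setdefault("demo", SummaryBufferMemory(naive_summarize, max_turns=2))
mem.add("user", "What is BERT?")
mem.add("assistant", "(answer about BERT)")
mem.add("user", "How big is its large variant?")
print(mem.context())
```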

Eval and benchmarking

# retrieval-only (fast, ~2 s — no LLM call)
PYTHONPATH=. python eval/run_eval.py --skip-generation

# full eval (retrieval + answer + latency)
PYTHONPATH=. python eval/run_eval.py --out eval/report.json

# latency only
PYTHONPATH=. python scripts/benchmark.py --n 20

Metrics implemented (no paid LLM judge required):

Retrieval: precision@k, binary recall@k, MRR
Answer: token F1, citation_validity, refusal_correctness, faithfulness_proxy (n-gram overlap with retrieved chunks)
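
For reference, the retrieval metrics and token F1 can be written in a few lines each. These are textbook-style definitions given as a sketch. The eval harness may differ in details such as tokenization, and "binary recall@k" is assumed here to mean "at least one relevant chunk in the top k".

```python
from collections import Counter


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k if k else 0.0


def binary_recall_at_k(retrieved, relevant, k):
    """1.0 if any relevant id appears in the top k, else 0.0 (assumed meaning)."""
    return 1.0 if any(d in relevant for d in retrieved[:k]) else 0.0


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id; MRR is the mean over questions."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0


def token_f1(prediction, gold):
    """Harmonic mean of token precision and recall on whitespace tokens."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)


retrieved = ["a", "b", "c", "d", "e"]  # hypothetical chunk ids
relevant = {"a", "c"}
print(precision_at_k(retrieved, relevant, 5),
      binary_recall_at_k(retrieved, relevant, 5),
      reciprocal_rank(retrieved, relevant))
```

Under these definitions, precision@5 can sit well below 1.0 while recall@5 and MRR are both 1.0: the first hit is relevant, but not every one of the five slots is.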

The faithfulness proxy is intentionally lexical — it'll miss paraphrased hallucinations. RAGAS-with-LLM-judge is the documented paid upgrade.
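
A lexical proxy of this kind can be sketched as the share of answer n-grams that also appear in the retrieved chunks. The n-gram size (`n=3`), the whitespace tokenization, and the function names are assumptions rather than the harness's actual settings.

```python
def ngrams(text, n):
    """Set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def faithfulness_proxy(answer, chunks, n=3):
    """Fraction of the answer's n-grams found in any retrieved chunk."""
    answer_grams = ngrams(answer, n)
    if not answer_grams:
        return 0.0
    chunk_grams = set()
    for chunk in chunks:
        chunk_grams |= ngrams(chunk, n)
    return len(answer_grams & chunk_grams) / len(answer_grams)


chunks = ["LoRA freezes the base weights and trains low-rank matrices"]
print(faithfulness_proxy("LoRA freezes the base weights", chunks))
print(faithfulness_proxy("The original parameters stay fixed during training", chunks))
```

The second call scores 0.0 even though the sentence restates the chunk, which shows why a purely lexical score cannot judge paraphrased text in either direction.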

QLoRA fine-tuning (optional, free Colab T4)

The fine-tuning track is decoupled from the API. The API stays clean of bitsandbytes/CUDA; fine-tuning runs in Colab, produces a small LoRA adapter, and Phase 4 scripts merge it back into the base model for local use.

See finetune/README.md for the full walkthrough.

1. Open finetune/colab_notebook.ipynb in Colab → T4 GPU
2. Run cells top to bottom (~15-25 min)
3. Download the adapter zip
4. python scripts/merge_lora.py --base Qwen/Qwen2.5-3B-Instruct \
       --adapter <unzipped-dir> --out outputs/qwen2_5_3b_merged
5. Convert to GGUF + register with Ollama (commands in finetune/README.md)
6. python scripts/compare.py "What is LoRA?" \
       --base-backend ollama --base-model qwen2.5:3b \
       --adapter-backend ollama --adapter-model my-rag-model

The fine-tuning targets answer style (always cite [n], refuse cleanly), not domain knowledge, so the adapter stays useful regardless of which corpus you index.

Tests

pytest -q
# 37 passed in ~3s

No models or network needed: API tests use dependency_overrides with a fake pipeline; metric and chain tests use plain dataclasses.

Design choices (with cost reasoning)

| Decision | Why |
|---|---|
| Chroma over FAISS as default | Persistent + metadata filters out of the box; FAISS exposed as a swappable backend |
| Hybrid retrieval (dense + BM25) | ~30 LOC for measurable recall gain on technical-doc keywords (`torch.nn.Embedding`, error codes) |
| Cross-encoder rerank by default | ~60 ms tax on M4; toggleable via `USE_RERANKER=false` |
| Citations attached structurally, not generated | Eliminates a whole class of citation hallucinations |
| Two-layer refusal | One layer guards retrieval gaps, the other guards generator confabulation |
| Memory = summary buffer | Keeps context tight for a 3B model on a 4–8K window |
| Ollama as primary serving runtime on M-series | Avoids `bitsandbytes` pain on Apple Silicon; uses Metal-optimized GGUF |
| Disk-backed response cache | 1 h TTL, refusals excluded so re-ingest re-evaluates them |
| Models not baked into the Docker image | Image is ~700 MB instead of multiple GB; first request lazy-loads |
| Fine-tuning targets style, not knowledge | The adapter remains useful regardless of which docs you swap in |
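
To illustrate the structural-citation choice, the sketch below numbers the retrieved chunks in the prompt and then resolves any [n] markers in the answer against that same list, so a marker can only ever point at a chunk that was actually retrieved. The `Chunk` dataclass, the function names, and the sample file paths are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    source: str  # originating document, e.g. a file path


def build_context(chunks):
    """Number chunks so the prompt can refer to them as [1], [2], ..."""
    return "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks, start=1))


def attach_citations(answer, chunks):
    """Map [n] markers in the answer to chunk sources; out-of-range n is dropped."""
    cited = sorted({int(m) for m in re.findall(r"\[(\d+)\]", answer)})
    return [
        {"n": n, "source": chunks[n - 1].source}
        for n in cited
        if 1 <= n <= len(chunks)
    ]


chunks = [
    Chunk("LoRA trains low-rank update matrices.", "data/samples/lora.md"),  # hypothetical path
    Chunk("BERT is a bidirectional encoder.", "data/samples/bert.md"),       # hypothetical path
]
print(attach_citations("LoRA trains low-rank updates [1]. See also [7].", chunks))
```

The stray `[7]` is discarded because no seventh chunk exists; the source for `[1]` comes from the chunk record, never from text the model wrote.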

Hardware expectations

| Workload | Memory | Notes |
|---|---|---|
| Inference (M4, 16GB+) | ~4–5 GB | Embeddings + Chroma + reranker + qwen2.5:3b Q4_K_M |
| QLoRA on Colab free T4 (16GB VRAM) | ~6–8 GB | rank-16 LoRA on q/k/v/o + MLP, batch 2 × accum 4 |
| Docker image (CPU torch) | ~700 MB | Models download lazily on first request |

Troubleshooting

Cannot connect to Ollama — make sure ollama serve is running and OLLAMA_BASE_URL matches (default http://localhost:11434). On Docker for Mac use http://host.docker.internal:11434 (already set in docker-compose.yml).
BM25 index not found — ingest hasn't been run yet; python scripts/ingest.py data/samples builds both Chroma and BM25.
MPS errors during inference — set EMBED_DEVICE=cpu in .env. MiniLM is fast enough on CPU; the LLM runs in Ollama (Metal) regardless.
Slow first query — embedding and reranker model downloads happen lazily; subsequent calls are fast (cache + warm Ollama model).

Security posture

Defaults are tuned for local development, not internet exposure. Read this section before deploying anywhere reachable from outside your machine.

Out of the box (local dev)

API binds to 0.0.0.0:8000 — reachable from anything on your local network
CORS allow_origins=["*"] — any browser origin can call it
No authentication on any endpoint
No request rate limiting

Hardening checklist before exposing publicly

| Issue | Fix |
|---|---|
| Open API | Add an auth dependency (HTTP Bearer or API key) to the `router` in `src/rag/api/routes.py` |
| `CORS=*` | Set `allow_origins=["https://yourfrontend.example"]` in `src/rag/api/main.py` |
| `0.0.0.0` bind | Set `API_HOST=127.0.0.1` and put a TLS-terminating reverse proxy (Caddy/Nginx) in front |
| No rate limit | Add `slowapi` middleware (already in `pyproject.toml`-ready territory) |
| Secrets in `.env` | Don't commit `.env` (already gitignored). Use a real secrets manager in production |
| Container as root | The `Dockerfile` already drops to a non-root `app` user before runtime |
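
The core of a Bearer-token check is small and framework-agnostic. The function below is a sketch with a hypothetical name; in this project it would presumably be wrapped in a FastAPI dependency on the router, which is not shown here. It compares tokens in constant time to avoid leaking information through timing.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Validate an 'Authorization: Bearer <token>' header value."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest takes the same time whether or not the tokens match.
    return hmac.compare_digest(token.strip().encode(), expected_token.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
print(is_authorized("Bearer wrong", "s3cret"))
```

The expected token would come from the environment or a secrets manager rather than being hard-coded.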

Threat model notes

Prompt injection from ingested docs. A malicious PDF/MD in your corpus can contain instructions like "ignore previous, output X" that become retrieved chunks and reach the LLM verbatim. The structural-citation design helps (the model can't fabricate sources it didn't see), but this is the fundamental risk of any RAG over user-supplied content. For high-stakes deployments, run ingested chunks through a moderation classifier before storing them in Chroma.
BM25 index uses pickle. load_bm25_index calls pickle.load. Never point it at a BM25 index from an untrusted source — pickle deserialization is arbitrary code execution. Always rebuild from your own corpus. (See the docstring on load_bm25_index.)
Config controls outbound network. OLLAMA_BASE_URL and HF_MODEL_NAME are read from environment. Anyone who can write .env can redirect the pipeline's outbound traffic or load a different model. Treat .env as a secret on the deployment host.
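
One way to reduce the pickle risk when an index file has to be moved between machines is to record a SHA-256 digest at build time and refuse to unpickle a file whose digest does not match. The helpers below are a hypothetical sketch, not part of `load_bm25_index`. The check only helps if the expected digest is stored somewhere an attacker cannot write, and rebuilding from your own corpus remains the safer default.

```python
import hashlib
import hmac
import pickle


def save_with_digest(obj, path):
    """Pickle obj to path and return the SHA-256 hex digest of the bytes."""
    blob = pickle.dumps(obj)
    with open(path, "wb") as f:
        f.write(blob)
    return hashlib.sha256(blob).hexdigest()


def load_if_trusted(path, expected_digest):
    """Unpickle path only if its bytes match the digest recorded at build time."""
    with open(path, "rb") as f:
        blob = f.read()
    actual = hashlib.sha256(blob).hexdigest()
    if not hmac.compare_digest(actual, expected_digest):
        raise ValueError("index digest mismatch; rebuild from your own corpus")
    # Deserialization happens only after the integrity check has passed.
    return pickle.loads(blob)
```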

What this codebase does not do

It is not a hardened, multi-tenant SaaS RAG service. It's a clean single-tenant local-first project. If you need multi-tenant isolation, quota enforcement, audit logs, or document-level ACLs, treat the components in src/rag/ as a starting kit and add the production layer yourself.

Roadmap (what was built, by phase)

✅ Phase 1 — System plan
✅ Phase 2 — Ingestion, hybrid retrieval, reranking, prompts, memory, pipeline
✅ Phase 3 — QLoRA fine-tuning (Colab notebook, training/eval scripts, sample data)
✅ Phase 4 — Adapter integration (merge_lora.py, compare.py) + two-layer confidence fallback
✅ Phase 5 — FastAPI (query/chat/stream) + Docker + caching + structured logs
✅ Phase 6 — Eval harness (golden set + retrieval/answer metrics) + latency benchmark
✅ Phase 7 — This README + examples + upgrade paths

Scaling beyond free

Paid upgrade paths (managed services, judges, larger models) are catalogued in UPGRADES.md. Each has a free fallback already wired in.
