RAG Context Pipeline

A Retrieval-Augmented Generation (RAG) pipeline over the three-volume Environment Protection and Biodiversity Conservation Act 1999. It loads the PDFs, splits them into chunks, embeds each chunk locally, and stores the vectors in Postgres + pgvector. To answer a question, it retrieves the most relevant chunks and sends them to an OpenAI model, and each answer cites the volume and page its facts came from.

This umbrella project has been split into four independent repos, one per concern, each nested here as its own git repo (with its own remote). The umbrella itself now holds only this documentation; all code lives in the sub-repos.

Repositories

Repo	Owns	Entry point
`vector-db-rag-context-pipeline/`	Postgres + pgvector (Docker)	`docker compose up -d`
`indexing-rag-context-pipeline/`	PDFs → chunks → embeddings → `chunks` table; the source PDFs	`python build_index.py`
`engine-rag-context-pipeline/`	query engine (retriever + LCEL chain), the REPL, and the retrieval eval	`python ask.py` / `python eval/run_eval.py`
`backend-rag-context-pipeline/`	HTTP API (FastAPI) over the engine	`uvicorn api.main:app`

Each sub-repo has its own README; the three Python repos (indexing, engine, backend) also ship a requirements.txt and .env.example, while the Docker-only vector-db repo needs neither.

How it works

In the diagram (diagram.png in this repo), the dashed edge is a build/import-time relationship, not a network call: the backend imports the engine's leaf modules, so at runtime only the database and OpenAI are remote.

The repos communicate only through the Postgres chunks table. Indexing writes the chunks, their embeddings, the source page and volume, and the embedding_model name; the engine retrieves the top-k via a SQL cosine-distance search (ORDER BY embedding <=> query LIMIT k) and composes the answer as a LangChain LCEL chain (retriever | prompt | llm | parser). The backend exposes that chain over HTTP — blocking JSON (/ask) and a streamed chat endpoint (/chat). The embedding_model column keeps the query side embedding with the same model the index was built with.
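To make the ordering concrete, here is a minimal pure-Python sketch of what `ORDER BY embedding <=> query LIMIT k` computes. pgvector's `<=>` operator is cosine distance (1 minus cosine similarity), and the k rows with the smallest distance are returned. The function names and the toy two-dimensional rows are illustrative assumptions; the real search runs inside Postgres.

```python
import math


def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> defines it: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """Return the k (chunk_id, distance) pairs closest to the query vector."""
    scored = [(chunk_id, cosine_distance(vec, query)) for chunk_id, vec in rows]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-d "embeddings"; real ones come from the embedding model.
rows = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.7, 0.7])]
nearest = top_k([1.0, 0.1], rows, k=2)
print([chunk_id for chunk_id, _ in nearest])  # ['a', 'c']
```

Because the ranking depends only on the angle between vectors, the query has to be embedded with the same model as the index, which is what the embedding_model column guards.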

Both of those endpoints (/ask and /chat) can sit behind an optional API-key header; leave the key unset for keyless local dev. GET /health stays open either way.
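The "optional" behaviour of the gate can be sketched as a small pure function. This is an assumption about the shape of the check, not the backend's actual code: the name `is_authorized` and its arguments are hypothetical, and the real header name and setting are documented in the backend repo's README.

```python
import hmac


def is_authorized(provided_key, configured_key):
    """Decide whether a request may pass the optional API-key gate.

    configured_key: the server-side secret, or None/"" when the gate is off.
    provided_key: the value the client sent in the header, or None.
    """
    if not configured_key:
        # Gate unset: keyless local dev, every request passes.
        return True
    if provided_key is None:
        return False
    # Constant-time comparison avoids leaking the key through timing.
    return hmac.compare_digest(provided_key.encode(), configured_key.encode())
```

An endpoint such as GET /health would simply never call the check, which is how it can stay open whether or not a key is configured.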

The engine wraps raw pgvector SQL in a LangChain BaseRetriever rather than LangChain's PGVector vectorstore, so the chunks schema stays under the project's control. Embeddings are local sentence-transformers (BAAI/bge-small-en-v1.5, no API key); only gpt-5.4-nano is called through LangChain (ChatOpenAI).
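A rough sketch of the raw-SQL retrieval step, under stated assumptions: the column names `content`, `volume`, and `page` are guesses at the chunks schema, `QUERY_PREFIX` is assumed to be the query instruction from the BGE model card, and `embed` / `run_query` are hypothetical injected callables standing in for the sentence-transformers model and a psycopg cursor. In the engine, logic like this would sit inside a BaseRetriever subclass's `_get_relevant_documents` method.

```python
# Assumed value: the English query instruction documented for BGE v1.5 models.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

# Column names are assumptions; only `embedding` and `chunks` come from the README.
TOP_K_SQL = """
SELECT content, volume, page
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(values):
    """Format floats in pgvector's text form, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def retrieve(question, embed, run_query, k=6):
    """Embed the prefixed question and fetch the k nearest chunks.

    embed: callable(str) -> sequence of floats.
    run_query: callable(sql, params) -> list of rows.
    """
    vector = embed(QUERY_PREFIX + question)
    return run_query(TOP_K_SQL, (to_vector_literal(vector), k))
```

Keeping the SQL in plain sight is the point of the design choice described above: the schema and the query stay under the project's control instead of being generated by a vectorstore wrapper.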

Extraction fit. PyMuPDF4LLM extracts text, tables, and headings well — a strong match for a legal Act, which is text and tables with no chart data to lose. It also preserves the Act's Chapter / Part / Division / Section hierarchy in a per-page running header, so retrieved chunks carry their structural context inline. (Pages restart per volume, which is why citations are [Volume N, p.M] rather than a bare page number.)
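As an illustration of the citation format, the sketch below renders retrieved chunks into a context block in which every chunk is tagged `[Volume N, p.M]`. The function name and the dictionary keys (`volume`, `page`, `text`) are hypothetical; the engine's actual prompt assembly may differ.

```python
def format_context(chunks):
    """Render retrieved chunks as a context block with citation tags."""
    blocks = []
    for chunk in chunks:
        # Page numbers restart per volume, so the tag carries both.
        tag = f"[Volume {chunk['volume']}, p.{chunk['page']}]"
        blocks.append(f"{tag}\n{chunk['text'].strip()}")
    return "\n\n".join(blocks)


# Placeholder chunk text, not quoted from the Act.
example = format_context(
    [
        {"volume": 1, "page": 12, "text": "First example chunk. "},
        {"volume": 2, "page": 12, "text": "Second example chunk."},
    ]
)
print(example)
```

Both example chunks sit on page 12, and only the volume number tells them apart, which is the case a bare page number could not handle.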

Full local setup

Prerequisites: Python 3.9+, Docker, and an OpenAI API key.

Each sub-repo loads its own .env first, falling back to an umbrella .env at this project root — so the quickest path is to create one shared .env here and let all repos read it:

echo "OPENAI_API_KEY=sk-..." > .env
echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env
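The "own .env first, umbrella .env as fallback" precedence can be sketched with the standard library alone. This is a simplified stand-in for python-dotenv, not the repos' actual loader: `parse_env_file` and `load_layered_env` are hypothetical names, and the parser only handles plain `KEY=VALUE` lines. With python-dotenv itself, calling `load_dotenv` on the repo file and then on the umbrella file should give the same precedence, because its `override` parameter defaults to `False`.

```python
import tempfile
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_layered_env(paths, environ):
    """Apply env files in priority order; earlier files and existing values win."""
    for path in paths:
        if Path(path).is_file():
            for key, value in parse_env_file(path).items():
                environ.setdefault(key, value)
    return environ


# Demo with throwaway files: the repo-level file wins over the umbrella file.
with tempfile.TemporaryDirectory() as root:
    repo_env = Path(root) / "repo.env"
    umbrella_env = Path(root) / "umbrella.env"
    repo_env.write_text("DATABASE_URL=postgresql://repo-specific\n")
    umbrella_env.write_text(
        'OPENAI_API_KEY="sk-placeholder"\nDATABASE_URL=postgresql://shared\n'
    )
    settings = load_layered_env([repo_env, umbrella_env], {})

print(settings)
```

Passing a plain dictionary instead of `os.environ` keeps the demo free of side effects; a real loader would pass `os.environ`.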

Then, from each repo (each with its own venv + pip install -r requirements.txt):

# 1. Database
cd vector-db-rag-context-pipeline && docker compose up -d && cd -

# 2. Build the index (first run downloads the ~130MB embedding model)
cd indexing-rag-context-pipeline && python build_index.py && cd -

# 3. Ask questions: interactive REPL, or the retrieval eval
cd engine-rag-context-pipeline && python ask.py && cd -
# alternative to ask.py: python eval/run_eval.py   # hit-rate@k / MRR (no API key)

# 4. Or serve the HTTP API (interactive docs at http://localhost:8000/docs)
cd backend-rag-context-pipeline && uvicorn api.main:app --reload
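For readers unfamiliar with the two metrics the retrieval eval in step 3 reports, the following is a minimal sketch of hit-rate@k and MRR. It assumes each eval question has a single expected item (for example a volume/page identifier) and that retrieval returns a ranked list of such identifiers; how run_eval.py actually defines a hit is not stated here, so the identifiers and function names are hypothetical.

```python
def hit_rate_at_k(ranked_results, expected, k):
    """Fraction of questions whose expected item appears in the top k."""
    hits = sum(
        1 for ranked, gold in zip(ranked_results, expected) if gold in ranked[:k]
    )
    return hits / len(expected)


def mean_reciprocal_rank(ranked_results, expected):
    """Average of 1/rank of the expected item (0 when it is missing)."""
    total = 0.0
    for ranked, gold in zip(ranked_results, expected):
        if gold in ranked:
            total += 1.0 / (ranked.index(gold) + 1)
    return total / len(expected)


# Three hypothetical questions with their ranked retrieval results.
ranked_results = [
    ["v1p10", "v1p12", "v2p3"],  # expected item at rank 2
    ["v2p7", "v3p1"],            # expected item at rank 1
    ["v1p1", "v1p2"],            # expected item not retrieved
]
expected = ["v1p12", "v2p7", "v3p9"]

print(hit_rate_at_k(ranked_results, expected, k=2))
print(mean_reciprocal_rank(ranked_results, expected))  # 0.5
```

Both metrics depend only on the retriever's ranking, which is consistent with the eval needing no OpenAI key.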

To index a different set of PDFs, drop them in a folder and point PDF_DIR in indexing-rag-context-pipeline/build_index.py at it (it globs *.pdf and derives each volume label from the filename).
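A hedged sketch of the glob-and-label step: the README says only that build_index.py globs *.pdf and derives the volume label from the filename, so the rule below (look for "vol" or "volume" followed by a number, else fall back to the file stem) is an assumption, as are the function names and example filenames.

```python
import re
from pathlib import Path


def volume_label(pdf_path):
    """Derive a label such as 'Volume 2' from a filename (hypothetical rule)."""
    stem = Path(pdf_path).stem
    match = re.search(r"vol(?:ume)?[\s_-]*(\d+)", stem, flags=re.IGNORECASE)
    return f"Volume {int(match.group(1))}" if match else stem


def discover_pdfs(pdf_dir):
    """Map each PDF filename in pdf_dir to its derived volume label."""
    return {
        path.name: volume_label(path)
        for path in sorted(Path(pdf_dir).glob("*.pdf"))
    }


print(volume_label("epbc-act-volume-2.pdf"))  # Volume 2
```

When indexing your own PDFs, check the labelling rule in build_index.py against your filenames, since the labels end up in every citation.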

To put the HTTP API behind the optional API-key gate, see the backend repo's README; leave it unset for keyless local dev.

Tests

The engine and backend repos each carry an offline unit-test suite (run with python -m pytest from the repo root) plus a GitHub Actions workflow (on push to main and PRs). The suites need no database, no OpenAI key, and no model download — the leaves take their collaborators as arguments, so the tests swap in fakes.

cd engine-rag-context-pipeline  && python -m pytest && cd -   # 12 tests (retriever / chain / load_index)
cd backend-rag-context-pipeline && python -m pytest && cd -   # 19 tests (ask+health / chat / auth)
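The "collaborators as arguments" pattern that keeps these suites offline might look like the sketch below. The `answer` function and its parameters are hypothetical stand-ins for the real leaf modules; the point is only that a function receiving its retriever and LLM as arguments can be tested with plain fakes.

```python
def answer(question, retrieve, llm):
    """Compose an answer from injected collaborators (no globals, no network)."""
    chunks = retrieve(question)
    context = "\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)


def test_answer_uses_retrieved_context():
    seen_prompts = []

    def fake_retrieve(question):
        # Stands in for the database-backed retriever.
        return ["[Volume 1, p.3] Example chunk text."]

    def fake_llm(prompt):
        # Stands in for the OpenAI call; records what it was asked.
        seen_prompts.append(prompt)
        return "stub answer"

    assert answer("What is covered?", fake_retrieve, fake_llm) == "stub answer"
    assert "[Volume 1, p.3]" in seen_prompts[0]
    assert "What is covered?" in seen_prompts[0]
```

pytest collects the `test_` function automatically, and nothing in it touches Postgres, OpenAI, or the embedding model.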

Stack

Layer	Choice	Repo
PDF extraction	`pymupdf4llm` (per-page Markdown)	indexing
Chunking	`RecursiveCharacterTextSplitter` (`langchain-text-splitters`)	indexing
Embeddings	`sentence-transformers` + `BAAI/bge-small-en-v1.5` (local)	indexing + engine
Vector store	Postgres + `pgvector` (Dockerized), via `psycopg`	vector-db
Retrieval + orchestration	LangChain LCEL — `langchain-core`	engine
Answer generation	OpenAI (`gpt-5.4-nano`) via `langchain-openai` `ChatOpenAI`	engine
HTTP API	`fastapi` + `uvicorn`, pooled via `psycopg-pool`	backend
Env loading	`python-dotenv`	all

Tuning

The knobs worth experimenting with, documented in each repo's README:

Constant	Repo / file	Default
`CHUNK_SIZE`, `CHUNK_OVERLAP`	`indexing-rag-context-pipeline/build_index.py`	1200 / 150 chars
`EMBEDDING_MODEL`	`indexing-rag-context-pipeline/build_index.py`	`BAAI/bge-small-en-v1.5`
`TOP_K`, `OPENAI_MODEL`	`engine-rag-context-pipeline/ask.py`	6 / `gpt-5.4-nano`
`QUERY_PREFIX`	`engine-rag-context-pipeline/retriever.py`	BGE instruction prefix
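To get a feel for what the 1200 / 150 defaults mean, the sketch below computes fixed-size windows over a text of a given length. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph boundaries, so real chunk boundaries will differ. Only the size/overlap arithmetic carries over, and the function name is hypothetical.

```python
def chunk_windows(text_length, chunk_size=1200, overlap=150):
    """Return (start, end) character offsets for fixed windows with overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if text_length <= 0:
        return []
    step = chunk_size - overlap  # each new window advances by this many chars
    windows = []
    start = 0
    while True:
        end = min(start + chunk_size, text_length)
        windows.append((start, end))
        if end == text_length:
            break
        start += step
    return windows


print(chunk_windows(3000))  # [(0, 1200), (1050, 2250), (2100, 3000)]
```

With these defaults each window advances 1050 characters, so raising `CHUNK_OVERLAP` increases the number of chunks (and rows to embed) for the same text.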

Changing EMBEDDING_MODEL requires re-running build_index.py — the rebuild drops and recreates the chunks table with the new VECTOR(dim) and records the new model name, so the engine picks it up automatically.
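One way the query side could pick up the recorded model is sketched below: read the distinct embedding_model values from the chunks table and refuse to proceed if the index is empty or mixes models. This is an assumption about the mechanism, not the engine's actual load_index code, and the function name is hypothetical.

```python
def resolve_embedding_model(stored_models):
    """Pick the query-side model from the names recorded in the index."""
    distinct = set(stored_models)
    if not distinct:
        raise ValueError("index is empty: run build_index.py first")
    if len(distinct) > 1:
        raise ValueError(f"index mixes embedding models: {sorted(distinct)}")
    return distinct.pop()


# Values as they might come back from a query over the embedding_model column.
print(resolve_embedding_model(["BAAI/bge-small-en-v1.5"] * 3))
```

Failing loudly on a mixed index is a design choice in this sketch: vectors from different models are not comparable, so a silent fallback would return meaningless rankings.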
