A Retrieval-Augmented Generation (RAG) pipeline over the three-volume Environment Protection and Biodiversity Conservation Act 1999. The pipeline loads the PDFs, splits them into chunks, embeds each chunk locally, stores the vectors in Postgres + pgvector, and answers questions by retrieving the most relevant chunks and sending them to an OpenAI model. Answers cite the volume and page each fact came from.
This umbrella project has been split into four independent repos, one per concern, each nested here as its own git repo (with its own remote). The umbrella itself now holds only this documentation; all code lives in the sub-repos.
| Repo | Owns | Entry point |
|---|---|---|
| vector-db-rag-context-pipeline/ | Postgres + pgvector (Docker) | docker compose up -d |
| indexing-rag-context-pipeline/ | PDFs → chunks → embeddings → chunks table; the source PDFs | python build_index.py |
| engine-rag-context-pipeline/ | query engine (retriever + LCEL chain), the REPL, and the retrieval eval | python ask.py / python eval/run_eval.py |
| backend-rag-context-pipeline/ | HTTP API (FastAPI) over the engine | uvicorn api.main:app |

Each sub-repo has its own README; the three Python repos (indexing, engine, backend)
also ship a requirements.txt and .env.example, while the Docker-only vector-db
repo needs neither.
The dashed edge is a build/import-time relationship, not a network call: the backend imports the engine's leaf modules, so at runtime only the database and OpenAI are remote.
The repos communicate only through the Postgres chunks table. Indexing writes
the chunks, their embeddings, the source page and volume, and the embedding_model
name; the engine retrieves the top-k via a SQL cosine-distance search
(ORDER BY embedding <=> query LIMIT k) and composes the answer as a LangChain LCEL
chain (retriever | prompt | llm | parser). The backend exposes that chain over
HTTP — blocking JSON (/ask) and a streamed chat endpoint (/chat). The
embedding_model column keeps the query side embedding with the same model the
index was built with.
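
To make the ranking concrete, here is a minimal pure-Python sketch of what `ORDER BY embedding <=> query LIMIT k` computes: pgvector's `<=>` operator returns cosine distance (1 minus cosine similarity), and the k smallest distances win. The function names and the toy two-dimensional vectors are illustrative assumptions, not code from the engine repo, which runs the search in SQL.

```python
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity), the quantity pgvector's <=> returns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Rank (chunk_id, embedding) rows by ascending distance and keep the first k."""
    scored = [(chunk_id, cosine_distance(emb, query)) for chunk_id, emb in rows]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy data: three "chunks" with hypothetical 2-d embeddings.
rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [1.0, 1.0]),
]
nearest = top_k([1.0, 0.0], rows, k=2)
```

In the real table the same ordering is produced inside Postgres, so only the k winning rows cross the wire.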
Both write endpoints can sit behind an optional API-key header; leave it unset for
keyless local dev. GET /health stays open either way.
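
The optional gate can be pictured as a single check: if no key is configured, every request passes; otherwise the header value must match. This is a stdlib-only sketch under assumptions; the function name and the way the key is passed in are hypothetical, and the backend repo's README is the authority on the actual header and setting names.

```python
import hmac
from typing import Optional


def is_authorized(provided_key: Optional[str], expected_key: Optional[str]) -> bool:
    """Return True when the request may proceed.

    expected_key is the configured secret (hypothetically read from the
    environment); None or "" means the gate is disabled.
    """
    if not expected_key:
        return True  # keyless local dev: nothing to check
    if provided_key is None:
        return False  # gate enabled but the header is missing
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(provided_key.encode(), expected_key.encode())
```

A route such as GET /health would simply never call the check, which keeps it open in both modes.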
The engine wraps raw pgvector SQL in a LangChain BaseRetriever rather than
LangChain's PGVector vectorstore, so the chunks schema stays under the project's
control. Embeddings are local sentence-transformers (BAAI/bge-small-en-v1.5, no
API key); only gpt-5.4-nano is called through LangChain (ChatOpenAI).
Extraction fit. PyMuPDF4LLM extracts text, tables, and headings well — a strong
match for a legal Act, which is text and tables with no chart data to lose. It also
preserves the Act's Chapter / Part / Division / Section hierarchy in a per-page
running header, so retrieved chunks carry their structural context inline. (Pages
restart per volume, which is why citations are [Volume N, p.M] rather than a bare
page number.)
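
As an illustration of the citation format, the sketch below labels each retrieved chunk with its `[Volume N, p.M]` tag before the chunks are joined into prompt context. The dictionary keys (`volume`, `page`, `text`) and the sample text are assumptions made for this example, not the engine's actual row shape.

```python
from typing import Iterable, Mapping


def format_citation(volume: int, page: int) -> str:
    """Pages restart per volume, so a citation needs both numbers."""
    return f"[Volume {volume}, p.{page}]"


def format_context(chunks: Iterable[Mapping]) -> str:
    """Prefix every chunk with its citation so the model can quote it back."""
    return "\n\n".join(
        f"{format_citation(chunk['volume'], chunk['page'])}\n{chunk['text']}"
        for chunk in chunks
    )


# Hypothetical retrieved chunks: same page number, different volumes.
context = format_context(
    [
        {"volume": 1, "page": 12, "text": "First sample passage."},
        {"volume": 2, "page": 12, "text": "Second sample passage."},
    ]
)
```

The two sample chunks share a page number, which is exactly the case a bare page citation could not disambiguate.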
Prerequisites: Python 3.9+, Docker, and an OpenAI API key.
Each sub-repo loads its own .env first, falling back to an umbrella .env at this
project root — so the quickest path is to create one shared .env here and let all
repos read it:

    echo "OPENAI_API_KEY=sk-..." > .env
    echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env

Then, from each repo (each with its own venv + pip install -r requirements.txt):

    # 1. Database
    cd vector-db-rag-context-pipeline && docker compose up -d && cd -

    # 2. Build the index (first run downloads the ~130MB embedding model)
    cd indexing-rag-context-pipeline && python build_index.py && cd -

    # 3. Ask questions: REPL, or the retrieval eval
    cd engine-rag-context-pipeline && python ask.py   # interactive REPL
    # python eval/run_eval.py                         # hit-rate@k / MRR (no API key)

    # 4. Or serve the HTTP API (interactive docs at http://localhost:8000/docs)
    cd backend-rag-context-pipeline && uvicorn api.main:app --reload

To index a different set of PDFs, drop them in a folder and point PDF_DIR in
indexing-rag-context-pipeline/build_index.py at it (it globs *.pdf and derives
each volume label from the filename).
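
For a sense of what "globs *.pdf and derives each volume label from the filename" could look like, here is a hedged sketch. The filename pattern (a `vol` or `volume` token followed by a number) and both function names are assumptions; build_index.py may use a different rule, so check it before renaming files.

```python
import re
from pathlib import Path
from typing import Dict


def volume_label(pdf_path: str) -> str:
    """Derive a label such as 'Volume 2' from a filename (assumed pattern)."""
    stem = Path(pdf_path).stem
    match = re.search(r"vol(?:ume)?[\s_-]*(\d+)", stem, flags=re.IGNORECASE)
    if match:
        return f"Volume {int(match.group(1))}"
    return stem  # no volume number found: fall back to the bare filename


def label_pdfs(pdf_dir: str) -> Dict[str, str]:
    """Map every *.pdf in pdf_dir to its derived label, in sorted order."""
    return {
        path.name: volume_label(path.name)
        for path in sorted(Path(pdf_dir).glob("*.pdf"))
    }
```

Falling back to the filename stem keeps unrelated PDFs indexable, at the cost of less uniform citations.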
To put the HTTP API behind the optional API-key gate, see the backend repo's README; leave the key unset for keyless local dev.
The engine and backend repos each carry an offline unit-test suite (run with
python -m pytest from the repo root) plus a GitHub Actions workflow (on push to
main and PRs). The suites need no database, no OpenAI key, and no model download —
the leaves take their collaborators as arguments, so the tests swap in fakes.
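
The "collaborators as arguments" idea can be sketched as follows: the retrieval function receives its embedder and its search callable, so a test passes in-memory fakes instead of a model and a database. All names here are hypothetical stand-ins and do not mirror the repos' real signatures.

```python
from typing import Callable, List, Sequence, Tuple


def retrieve(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], List[str]],
    k: int = 6,
) -> List[str]:
    """Embed the question, then delegate the top-k lookup to `search`."""
    return search(embed(question), k)


def fake_embed(text: str) -> List[float]:
    """Stands in for the sentence-transformers model: no download needed."""
    return [1.0, 0.0]


class FakeSearch:
    """Stands in for the pgvector query and records how it was called."""

    def __init__(self, rows: List[str]) -> None:
        self.rows = rows
        self.calls: List[Tuple[List[float], int]] = []

    def __call__(self, vector: Sequence[float], k: int) -> List[str]:
        self.calls.append((list(vector), k))
        return self.rows[:k]


fake_search = FakeSearch(["chunk-1", "chunk-2", "chunk-3"])
result = retrieve("sample question", fake_embed, fake_search, k=2)
```

Because nothing inside `retrieve` reaches for a global connection or model, the same function runs unchanged in production and in the offline suite.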

    cd engine-rag-context-pipeline && python -m pytest    # 12 tests (retriever / chain / load_index)
    cd backend-rag-context-pipeline && python -m pytest   # 19 tests (ask+health / chat / auth)

| Layer | Choice | Repo |
|---|---|---|
| PDF extraction | pymupdf4llm (per-page Markdown) | indexing |
| Chunking | RecursiveCharacterTextSplitter (langchain-text-splitters) | indexing |
| Embeddings | sentence-transformers + BAAI/bge-small-en-v1.5 (local) | indexing + engine |
| Vector store | Postgres + pgvector (Dockerized), via psycopg | vector-db |
| Retrieval + orchestration | LangChain LCEL (langchain-core) | engine |
| Answer generation | OpenAI (gpt-5.4-nano) via langchain-openai ChatOpenAI | engine |
| HTTP API | fastapi + uvicorn, pooled via psycopg-pool | backend |
| Env loading | python-dotenv | all |

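
The env-loading order described in the setup steps (a repo's own .env first, then the umbrella .env as a fallback) comes down to "first file to define a key wins". The sketch below shows that rule with the standard library only; it is a simplified stand-in for python-dotenv, and the function name and file names are assumptions.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterable, MutableMapping


def load_env_files(
    paths: Iterable[Path],
    environ: MutableMapping[str, str] = os.environ,
) -> MutableMapping[str, str]:
    """Read KEY=VALUE files in priority order; earlier files win."""
    for path in paths:
        path = Path(path)
        if not path.is_file():
            continue  # a missing .env is not an error
        for raw_line in path.read_text().splitlines():
            line = raw_line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault never overwrites, so already-set keys keep priority.
            environ.setdefault(key.strip(), value.strip().strip('"'))
    return environ


# Demo with throwaway files: the repo file overrides the umbrella file.
with tempfile.TemporaryDirectory() as tmp:
    repo_env = Path(tmp) / "repo.env"
    umbrella_env = Path(tmp) / "umbrella.env"
    repo_env.write_text("DATABASE_URL=repo-db\n")
    umbrella_env.write_text("DATABASE_URL=umbrella-db\nOPENAI_API_KEY=abc\n")
    loaded = load_env_files([repo_env, umbrella_env], environ={})
```

Variables already present in the process environment also take priority under this rule, since `setdefault` leaves them alone.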
The knobs worth experimenting with live in each repo's README:
| Constant | Repo / file | Default |
|---|---|---|
| CHUNK_SIZE, CHUNK_OVERLAP | indexing-rag-context-pipeline/build_index.py | 1200 / 150 chars |
| EMBEDDING_MODEL | indexing-rag-context-pipeline/build_index.py | BAAI/bge-small-en-v1.5 |
| TOP_K, OPENAI_MODEL | engine-rag-context-pipeline/ask.py | 6 / gpt-5.4-nano |
| QUERY_PREFIX | engine-rag-context-pipeline/retriever.py | BGE instruction prefix |

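
To build intuition for how CHUNK_SIZE and CHUNK_OVERLAP interact, here is a deliberately simplified fixed-window splitter. It is not RecursiveCharacterTextSplitter, which also tries to break on separators such as paragraphs and sentences; it only shows the arithmetic. With the defaults, each window starts 1200 - 150 = 1050 characters after the previous one.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 150) -> List[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# 3000 characters with a repeating alphabet so overlaps are checkable.
sample = "".join(chr(97 + i % 26) for i in range(3000))
pieces = chunk_text(sample)
```

A larger overlap makes it less likely that a sentence is cut between two chunks, at the cost of more rows to embed and store.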
Changing EMBEDDING_MODEL requires re-running build_index.py — the rebuild drops
and recreates the chunks table with the new VECTOR(dim) and records the new model
name, so the engine picks it up automatically.
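
The rebuild-and-pick-up behaviour can be sketched in two small helpers: one that generates the drop-and-recreate DDL for a given embedding dimension, and one that reads back the single model name recorded in the table. Column names other than embedding and embedding_model are guesses, and both helper names are hypothetical; 384 is the output dimension of BAAI/bge-small-en-v1.5.

```python
from typing import Iterable

BGE_SMALL_DIM = 384  # embedding size of BAAI/bge-small-en-v1.5


def rebuild_chunks_sql(dim: int) -> str:
    """DDL that drops and recreates chunks for vectors of length dim."""
    dim = int(dim)  # guards the f-string against non-numeric input
    if dim <= 0:
        raise ValueError("dim must be a positive integer")
    return (
        "DROP TABLE IF EXISTS chunks;\n"
        "CREATE TABLE chunks (\n"
        "    id BIGSERIAL PRIMARY KEY,\n"      # assumed column
        "    volume TEXT NOT NULL,\n"          # assumed column
        "    page INTEGER NOT NULL,\n"         # assumed column
        "    content TEXT NOT NULL,\n"         # assumed column
        "    embedding_model TEXT NOT NULL,\n"
        f"    embedding VECTOR({dim}) NOT NULL\n"
        ");"
    )


def recorded_model(model_names: Iterable[str]) -> str:
    """Return the one model name stored in the index, or fail loudly."""
    distinct = sorted(set(model_names))
    if len(distinct) != 1:
        raise ValueError(f"expected exactly one embedding model, found {distinct}")
    return distinct[0]


ddl = rebuild_chunks_sql(BGE_SMALL_DIM)
```

Failing when the table holds zero or several model names is a design choice for this sketch: a mixed index would otherwise compare vectors from different embedding spaces.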
