obj809/rag-context-pipeline

RAG Context Pipeline

A Retrieval-Augmented Generation (RAG) pipeline over the three-volume Environment Protection and Biodiversity Conservation Act 1999. It loads the PDFs, splits them into chunks, embeds each chunk locally, and stores the vectors in Postgres + pgvector. To answer a question, it retrieves the most relevant chunks and sends them to an OpenAI model, which returns an answer citing the volume and page each fact came from.

This umbrella project has been split into four independent repos, one per concern, each nested here as its own git repo (with its own remote). The umbrella itself now holds only this documentation; all code lives in the sub-repos.

Repositories

| Repo | Owns | Entry point |
| --- | --- | --- |
| vector-db-rag-context-pipeline/ | Postgres + pgvector (Docker) | docker compose up -d |
| indexing-rag-context-pipeline/ | PDFs → chunks → embeddings → chunks table; the source PDFs | python build_index.py |
| engine-rag-context-pipeline/ | query engine (retriever + LCEL chain), the REPL, and the retrieval eval | python ask.py / python eval/run_eval.py |
| backend-rag-context-pipeline/ | HTTP API (FastAPI) over the engine | uvicorn api.main:app |

Each sub-repo has its own README; the three Python repos (indexing, engine, backend) also ship a requirements.txt and .env.example, while the Docker-only vector-db repo needs neither.

How it works

[Figure: architecture diagram. Indexing writes chunks to Postgres + pgvector; the engine retrieves top-k chunks and composes a volume/page-cited answer via OpenAI; the backend exposes it over HTTP.]

The dashed edge is a build/import-time relationship, not a network call: the backend imports the engine's leaf modules, so at runtime only the database and OpenAI are remote.

The repos communicate only through the Postgres chunks table. Indexing writes the chunks, their embeddings, the source page and volume, and the embedding_model name. The engine retrieves the top-k chunks via a SQL cosine-distance search (ORDER BY embedding <=> query LIMIT k) and composes the answer as a LangChain LCEL chain (retriever | prompt | llm | parser). The backend exposes that chain over HTTP as a blocking JSON endpoint (/ask) and a streamed chat endpoint (/chat). The embedding_model column ensures the engine embeds queries with the same model the index was built with.
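To make the retrieval step concrete, here is a minimal in-memory sketch of what the cosine-distance search computes. pgvector's <=> operator returns cosine distance (1 minus cosine similarity); the row shape and the top_k helper below are illustrative assumptions, not the engine's actual code, which runs this ordering inside Postgres.

```python
import math


def cosine_distance(a, b):
    """Cosine distance as pgvector's <=> operator defines it: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, rows, k):
    """In-memory stand-in for: ORDER BY embedding <=> query LIMIT k.

    rows is a list of (chunk_id, embedding) pairs (a hypothetical shape).
    """
    return sorted(rows, key=lambda row: cosine_distance(row[1], query))[:k]


# Toy 2-dimensional embeddings; the real index uses the model's full dimension.
rows = [
    ("a", [1.0, 0.0]),
    ("b", [0.0, 1.0]),
    ("c", [0.9, 0.1]),
]
nearest = [chunk_id for chunk_id, _ in top_k([1.0, 0.0], rows, k=2)]
```

Smaller distance means more similar, so the ascending sort puts the best matches first, which is why the SQL needs no DESC.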

Both question endpoints (/ask and /chat) can sit behind an optional API-key header; leave it unset for keyless local dev. GET /health stays open either way.
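The optional gate could be sketched as a small framework-independent check. The header name x-api-key, the function name, and the lowercase-key header mapping are all assumptions made for illustration; the backend repo's README describes the real configuration.

```python
import hmac
from typing import Mapping, Optional

API_KEY_HEADER = "x-api-key"  # hypothetical header name, not taken from the backend


def is_authorized(headers: Mapping[str, str], expected_key: Optional[str]) -> bool:
    """Return True when the request may proceed.

    expected_key is None or empty when the gate is not configured (keyless local dev).
    headers is assumed to be a mapping with lowercase header names.
    """
    if not expected_key:
        return True  # gate disabled: every request passes
    supplied = headers.get(API_KEY_HEADER, "")
    # Constant-time comparison avoids leaking key contents through timing.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

An open route such as GET /health would simply never call this check.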

The engine wraps raw pgvector SQL in a LangChain BaseRetriever rather than LangChain's PGVector vectorstore, so the chunks schema stays under the project's control. Embeddings are local sentence-transformers (BAAI/bge-small-en-v1.5, no API key); only gpt-5.4-nano is called through LangChain (ChatOpenAI).

Extraction fit. PyMuPDF4LLM extracts text, tables, and headings well — a strong match for a legal Act, which is text and tables with no chart data to lose. It also preserves the Act's Chapter / Part / Division / Section hierarchy in a per-page running header, so retrieved chunks carry their structural context inline. (Pages restart per volume, which is why citations are [Volume N, p.M] rather than a bare page number.)
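Because pages restart per volume, each chunk has to carry both numbers into the prompt. A minimal sketch of that labelling follows; the Chunk dataclass and the format_context helper are hypothetical names, and only the [Volume N, p.M] format comes from this README.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    volume: int
    page: int
    text: str


def citation(chunk: Chunk) -> str:
    """Render the citation tag in the [Volume N, p.M] form."""
    return f"[Volume {chunk.volume}, p.{chunk.page}]"


def format_context(chunks: List[Chunk]) -> str:
    """Prefix every retrieved chunk with its citation so the model can quote it."""
    return "\n\n".join(f"{citation(c)}\n{c.text}" for c in chunks)


context = format_context([
    Chunk(volume=1, page=12, text="First passage."),
    Chunk(volume=2, page=12, text="Second passage."),
])
```

The two chunks share a page number but stay distinguishable because the volume is part of the tag.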

Full local setup

Prerequisites: Python 3.9+, Docker, and an OpenAI API key.

Each sub-repo loads its own .env first, falling back to an umbrella .env at this project root — so the quickest path is to create one shared .env here and let all repos read it:

echo "OPENAI_API_KEY=sk-..." > .env
echo "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag" >> .env
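The fallback order can be illustrated with a stdlib-only sketch: values from the umbrella .env are read first, then the sub-repo's own .env overrides them. The repos themselves use python-dotenv; the parse_env and resolve_env helpers here are simplified, hypothetical stand-ins (no quoting rules, no variable expansion).

```python
import tempfile
from pathlib import Path


def parse_env(path):
    """Parse simple KEY=VALUE lines; blank lines and # comments are skipped."""
    values = {}
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def resolve_env(repo_dir, umbrella_dir):
    """Merge settings so the sub-repo's .env wins over the umbrella .env."""
    merged = {}
    for directory in (umbrella_dir, repo_dir):  # later entries override earlier ones
        env_file = Path(directory) / ".env"
        if env_file.is_file():
            merged.update(parse_env(env_file))
    return merged


# Demo in a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    umbrella = Path(root)
    repo = umbrella / "engine-rag-context-pipeline"
    repo.mkdir()
    (umbrella / ".env").write_text(
        "OPENAI_API_KEY=shared\n"
        "DATABASE_URL=postgresql://rag:rag@localhost:5432/rag\n"
    )
    (repo / ".env").write_text("OPENAI_API_KEY=local\n")
    settings = resolve_env(repo, umbrella)
```

In the demo the repo-level key takes precedence while DATABASE_URL falls through from the shared file.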

Then run each step from its own repo (the three Python repos each need their own venv and pip install -r requirements.txt):

# 1. Database
cd vector-db-rag-context-pipeline && docker compose up -d && cd -

# 2. Build the index (first run downloads the ~130MB embedding model)
cd indexing-rag-context-pipeline && python build_index.py && cd -

# 3. Ask questions — REPL, or the retrieval eval
cd engine-rag-context-pipeline && python ask.py            # interactive REPL
#                                  python eval/run_eval.py  # hit-rate@k / MRR (no API key)

# 4. Or serve the HTTP API (interactive docs at http://localhost:8000/docs)
cd backend-rag-context-pipeline && uvicorn api.main:app --reload
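The two retrieval metrics named in step 3 can be sketched as follows. This assumes exactly one relevant item per question and uses made-up page identifiers; the actual eval set format lives in the engine repo and may differ.

```python
def hit_rate_at_k(ranked_lists, relevant, k):
    """Fraction of questions whose relevant item appears in the top k results."""
    hits = sum(1 for ranked, rel in zip(ranked_lists, relevant) if rel in ranked[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, relevant):
    """Average of 1/rank of the relevant item; a miss contributes 0."""
    total = 0.0
    for ranked, rel in zip(ranked_lists, relevant):
        if rel in ranked:
            total += 1.0 / (ranked.index(rel) + 1)
    return total / len(ranked_lists)


# Three toy questions: relevant item at rank 2, at rank 1, and not retrieved.
ranked_lists = [["p3", "p1", "p9"], ["p2", "p7", "p4"], ["p5", "p6", "p8"]]
relevant = ["p1", "p2", "p0"]

hit_rate = hit_rate_at_k(ranked_lists, relevant, k=2)
mrr = mean_reciprocal_rank(ranked_lists, relevant)
```

Both metrics only need retrieved results and labels, which is consistent with the eval running without an OpenAI key.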

To index a different set of PDFs, drop them in a folder and point PDF_DIR in indexing-rag-context-pipeline/build_index.py at it (it globs *.pdf and derives each volume label from the filename).
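As a rough illustration of deriving a volume label from a filename, the sketch below assumes names containing "vol" or "volume" followed by a number. That naming pattern, the regex, and both helper names are assumptions; check build_index.py for the rule it really applies.

```python
import re
from pathlib import Path


def volume_label(pdf_path: Path) -> str:
    """Derive a label such as 'Volume 2' from the filename stem.

    Falls back to the bare stem when no volume number is found.
    """
    match = re.search(r"vol(?:ume)?[\s_-]*(\d+)", pdf_path.stem, flags=re.IGNORECASE)
    return f"Volume {int(match.group(1))}" if match else pdf_path.stem


def discover_pdfs(pdf_dir):
    """Map each *.pdf filename in pdf_dir to its derived volume label."""
    return {path.name: volume_label(path) for path in sorted(Path(pdf_dir).glob("*.pdf"))}
```

With a pattern like this, files whose names carry no volume number still index, but their citations would show the filename stem instead.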

To put the HTTP API behind the optional API-key gate, see the backend repo's README; leave it unset for keyless local dev.

Tests

The engine and backend repos each carry an offline unit-test suite (run with python -m pytest from the repo root) plus a GitHub Actions workflow (on push to main and PRs). The suites need no database, no OpenAI key, and no model download — the leaves take their collaborators as arguments, so the tests swap in fakes.

cd engine-rag-context-pipeline  && python -m pytest    # 12 tests (retriever / chain / load_index)
cd backend-rag-context-pipeline && python -m pytest    # 19 tests (ask+health / chat / auth)
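The fake-swapping style described above might look like the following sketch. The retrieve function, the FakeCursor class, and the column names are hypothetical; the point is only that a leaf which receives its cursor and embedder as arguments can be tested with no database and no model.

```python
class FakeCursor:
    """Stands in for a database cursor: records queries, returns canned rows."""

    def __init__(self, rows):
        self.rows = rows
        self.executed = []

    def execute(self, sql, params):
        self.executed.append((sql, params))

    def fetchall(self):
        return self.rows


def retrieve(cursor, embed, question, k):
    """Hypothetical leaf: both collaborators arrive as arguments."""
    vector = embed(question)
    cursor.execute(
        "SELECT content, volume, page FROM chunks "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (vector, k),
    )
    return cursor.fetchall()


def test_retrieve_passes_embedding_and_k():
    cursor = FakeCursor([("some text", 1, 12)])
    fake_embed = lambda question: [0.0, 1.0]  # no model download needed
    rows = retrieve(cursor, fake_embed, "What is a controlled action?", k=6)
    assert rows == [("some text", 1, 12)]
    sql, params = cursor.executed[0]
    assert "<=>" in sql
    assert params == ([0.0, 1.0], 6)
```

Run under pytest, this test touches neither Postgres nor sentence-transformers.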

Stack

| Layer | Choice | Repo |
| --- | --- | --- |
| PDF extraction | pymupdf4llm (per-page Markdown) | indexing |
| Chunking | RecursiveCharacterTextSplitter (langchain-text-splitters) | indexing |
| Embeddings | sentence-transformers + BAAI/bge-small-en-v1.5 (local) | indexing + engine |
| Vector store | Postgres + pgvector (Dockerized), via psycopg | vector-db |
| Retrieval + orchestration | LangChain LCEL (langchain-core) | engine |
| Answer generation | OpenAI (gpt-5.4-nano) via langchain-openai ChatOpenAI | engine |
| HTTP API | fastapi + uvicorn, pooled via psycopg-pool | backend |
| Env loading | python-dotenv | all |

Tuning

The knobs worth experimenting with, each documented in its repo's README:

| Constant | Repo / file | Default |
| --- | --- | --- |
| CHUNK_SIZE, CHUNK_OVERLAP | indexing-rag-context-pipeline/build_index.py | 1200 / 150 chars |
| EMBEDDING_MODEL | indexing-rag-context-pipeline/build_index.py | BAAI/bge-small-en-v1.5 |
| TOP_K, OPENAI_MODEL | engine-rag-context-pipeline/ask.py | 6 / gpt-5.4-nano |
| QUERY_PREFIX | engine-rag-context-pipeline/retriever.py | BGE instruction prefix |
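To build intuition for CHUNK_SIZE and CHUNK_OVERLAP, here is a deliberately simplified fixed-window splitter. It is not RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and sentences; this sketch only shows how the two numbers interact, and split_fixed is a hypothetical name.

```python
def split_fixed(text, chunk_size=1200, chunk_overlap=150):
    """Slide a chunk_size window over text, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining tail is already covered by the previous chunk's overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(3000))
chunks = split_fixed(sample)
```

With the defaults, each chunk starts 1050 characters after the previous one, so a 3000-character page yields three chunks and consecutive chunks share 150 characters.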

Changing EMBEDDING_MODEL requires re-running build_index.py — the rebuild drops and recreates the chunks table with the new VECTOR(dim) and records the new model name, so the engine picks it up automatically.
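One way the query side could pick up the recorded model name is sketched below: read the embedding_model values stored in chunks and insist on exactly one. The function name and the error handling are assumptions for illustration, not the engine's actual load_index logic.

```python
def resolve_query_model(stored_models):
    """Return the single embedding model name recorded in the index.

    stored_models is an iterable of embedding_model values read from chunks.
    """
    distinct = set(stored_models)
    if not distinct:
        raise LookupError("chunks table is empty; run build_index.py first")
    if len(distinct) > 1:
        raise ValueError(f"index mixes embedding models: {sorted(distinct)}")
    return distinct.pop()


model_name = resolve_query_model(["BAAI/bge-small-en-v1.5"] * 3)
```

Failing loudly on an empty or mixed index is a design choice of this sketch: vectors from different models are not comparable, so a silent fallback would degrade retrieval.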

About

RAG pipeline answering page-cited questions over a PDF report with local embeddings and vector search. FastAPI | LangChain | pgvector
