📚 Document QA System using RAG

A Streamlit web application that lets you upload a PDF document and ask natural language questions about its contents. It uses Retrieval-Augmented Generation (RAG): the app retrieves the most relevant chunks from your document and passes them to an LLM, which generates answers grounded in that context.

How It Works

The system follows a classic RAG pipeline:

1. Upload: You upload a PDF via the Streamlit UI.
2. Chunk: The document is split into overlapping text chunks (chunk size 800, overlap 150) using LangChain's RecursiveCharacterTextSplitter. This splitter measures length in characters by default, so these are character counts unless a token-based length function is configured.
3. Embed: Each chunk is converted to a vector embedding using MistralAIEmbeddings.
4. Store: Embeddings are persisted in a ChromaDB vector store, with one store per uploaded file.
5. Retrieve: For each question, the 4 most semantically similar chunks are retrieved.
6. Generate: The retrieved context is passed to ChatMistralAI (mistral-small-2506) with a prompt that instructs the model to answer only from the provided context, which reduces the risk of hallucination.
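The Chunk and Retrieve steps can be illustrated without any external services. The sketch below is not the code in app.py: it uses only the standard library, splits by character count with a plain sliding window (RecursiveCharacterTextSplitter additionally prefers natural separators such as paragraph breaks), and replaces the Mistral embeddings with a toy bag-of-words vector. The function names split_with_overlap, toy_embed, cosine, and top_k are hypothetical.

```python
import math
from collections import Counter


def split_with_overlap(text: str, chunk_size: int = 800, overlap: int = 150) -> list:
    """Slide a fixed-size window over the text, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model (assumption): word-count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(question: str, chunks: list, k: int = 4) -> list:
    """Return the k chunks most similar to the question, best first."""
    q = toy_embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]
```

With the defaults above, consecutive chunks share their last and first 150 characters, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.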

Features

Upload any PDF and build a searchable vector database with one click
Ask free-form natural language questions
Answers are constrained to the document content by the prompt, which limits reliance on the model's prior knowledge
Expandable debug panel shows exactly which chunks (with page numbers) were used to generate each answer
Per-file vector databases — switching documents rebuilds the DB cleanly
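One way to get a clean per-file rebuild is to give each upload its own directory and delete that directory before re-indexing. The following is a minimal sketch under that assumption, not the logic in app.py; reset_store_dir is a hypothetical helper, and the base directory is passed in by the caller.

```python
import shutil
from pathlib import Path


def reset_store_dir(base_dir: str, uploaded_filename: str) -> Path:
    """Return an empty directory for this file's vector store, deleting any old one."""
    # Keep only the final path component so an upload name cannot escape base_dir.
    name = Path(uploaded_filename).name
    if name in ("", ".", ".."):
        raise ValueError(f"unusable file name: {uploaded_filename!r}")
    store_dir = Path(base_dir) / name
    if store_dir.exists():
        shutil.rmtree(store_dir)  # drop the stale index before rebuilding
    store_dir.mkdir(parents=True)
    return store_dir
```

The returned path could then be handed to the vector store as its persistence directory. Deleting the whole directory avoids mixing chunks from an older version of the same file into the new index.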

Project Structure

Document-QA-System-using-RAG/
├── app.py                  # Main Streamlit application
├── main.py                 # Standalone CLI entry point
├── create_database.py      # Script to build ChromaDB from a PDF
├── document_loaders/       # Custom or extended document loader utilities
├── retrivers/              # Retriever configuration and experiments
├── Vector Store/           # Persisted ChromaDB vector store data
├── requirements.txt
└── .gitignore

Prerequisites

Python 3.9+
A Mistral AI API key

Installation

# 1. Clone the repository
git clone https://github.com/nandan-byte/Document-QA-System-using-RAG.git
cd Document-QA-System-using-RAG

# 2. (Recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate      # On Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

Configuration

Create a .env file in the project root:

MISTRAL_API_KEY=your_mistral_api_key_here

The app loads this automatically via python-dotenv.
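A missing or placeholder key usually only surfaces as an authentication error on the first API call. The sketch below shows one way to fail earlier with a clearer message. It assumes the .env file has already been loaded into the process environment (for example by python-dotenv's load_dotenv()); require_api_key is a hypothetical helper, not a function from this repository.

```python
import os


def require_api_key(name: str = "MISTRAL_API_KEY") -> str:
    """Return the key from the environment, or fail with an actionable message."""
    key = os.environ.get(name, "").strip()
    # Treat the placeholder from the example .env as "not configured".
    if not key or key == "your_mistral_api_key_here":
        raise RuntimeError(f"{name} is not set; add it to .env in the project root")
    return key
```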

Running the App

streamlit run app.py

Then open http://localhost:8501 in your browser.

Usage:

Upload a PDF using the file uploader.
Click "Create Vector Database" and wait for processing to complete.
Type a question in the text input and press Enter.
View the AI-generated answer. Expand "🔍 Retrieved Context" to see the source chunks.

Tech Stack

| Component | Library |
| --- | --- |
| UI | Streamlit |
| LLM | Mistral AI (`mistral-small-2506`) |
| Embeddings | `langchain-mistralai` MistralAI Embeddings |
| Vector Store | ChromaDB |
| PDF Loading | LangChain `PyPDFLoader` |
| Text Splitting | LangChain `RecursiveCharacterTextSplitter` |
| Orchestration | LangChain Core / Community |
| Env Config | `python-dotenv` |

Notes

The vector database is stored locally under chroma_db/<filename>/. Each uploaded file gets its own isolated store.
Re-uploading the same file and clicking "Create Vector Database" will delete and rebuild the store cleanly.
The LLM prompt instructs the model to answer only from the document. If your question falls outside the document's content, the model is told to say so rather than guess.
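A document-only prompt of the kind described in the last note can be assembled as plain text. The wording below is an illustrative assumption, not the prompt used in app.py, and build_prompt is a hypothetical helper. A prompt like this lowers the chance of answers drawn from outside the document but cannot guarantee it.

```python
def build_prompt(question: str, chunks: list) -> str:
    """Combine retrieved chunks and the question into a context-only prompt."""
    # Number the chunks so an answer can be traced back to its sources.
    context = "\n\n".join(f"[chunk {i}]\n{c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply exactly: "
        '"I could not find that in the document."\n\n'
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```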
