A Streamlit web application that lets you upload any PDF document and ask natural language questions about its contents. Powered by Retrieval-Augmented Generation (RAG) — it finds the most relevant chunks from your document and uses an LLM to generate accurate, context-grounded answers.
The system follows a classic RAG pipeline:
- Upload — You upload a PDF via the Streamlit UI.
- Chunk — The document is split into overlapping text chunks (800 tokens, 150 overlap) using LangChain's RecursiveCharacterTextSplitter.
- Embed — Each chunk is converted to a vector embedding using MistralAIEmbeddings.
- Store — Embeddings are persisted in a ChromaDB vector store, keyed per uploaded file.
- Retrieve — On each question, the top-4 most semantically similar chunks are retrieved.
- Generate — The retrieved context is passed to ChatMistralAI (mistral-small-2506) with a strict prompt that restricts the model to answering from the provided context, which reduces hallucination.
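The chunk and retrieve steps above can be sketched in plain Python. This is a simplified illustration, not the code in app.py: `split_with_overlap` uses a fixed character window rather than RecursiveCharacterTextSplitter's separator-aware splitting, and `embed` is a toy bag-of-words stand-in for MistralAIEmbeddings. All function names here are hypothetical.

```python
import math
from collections import Counter


def split_with_overlap(text, chunk_size=800, overlap=150):
    """Slide a fixed-size window over the text, stepping by chunk_size - overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k(question, chunks, k=4):
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

With a real embedding model the ranking reflects semantic similarity rather than shared words, but the shape of the retrieval step is the same: embed the question, score every stored chunk, keep the top 4.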
- Upload any PDF and build a searchable vector database with one click
- Ask free-form natural language questions
- Answers are grounded strictly in document content — no hallucinated prior knowledge
- Expandable debug panel shows exactly which chunks (with page numbers) were used to generate each answer
- Per-file vector databases — switching documents rebuilds the DB cleanly
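As a rough sketch of how grounded answers and page-level traceability can fit together, the snippet below assembles a context-only prompt from retrieved chunks tagged with their page numbers. The template wording, the `page`/`text` dictionary keys, and the function names are all assumptions for illustration; the actual prompt used by the app is not reproduced here.

```python
# Hypothetical prompt wording; the real prompt in app.py may differ.
STRICT_TEMPLATE = (
    "Answer the question using ONLY the context below.\n"
    "If the context does not contain the answer, reply exactly: "
    '"I could not find that in the document."\n\n'
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)


def format_context(chunks):
    """Join chunks into one block, labelling each with its source page."""
    return "\n\n".join(f"[page {c['page']}] {c['text'].strip()}" for c in chunks)


def build_prompt(question, chunks):
    """Fill the strict template with the retrieved context and the question."""
    return STRICT_TEMPLATE.format(context=format_context(chunks), question=question)
```

The same `format_context` output could be shown in a debug panel, so the text the model saw and the text the user inspects are identical.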
Document-QA-System-using-RAG/
├── app.py # Main Streamlit application
├── main.py # Standalone CLI entry point
├── create_database.py # Script to build ChromaDB from a PDF
├── document_loaders/ # Custom or extended document loader utilities
├── retrivers/ # Retriever configuration and experiments
├── Vector Store/ # Persisted ChromaDB vector store data
├── requirements.txt
└── .gitignore
- Python 3.9+
- A Mistral AI API key
# 1. Clone the repository
git clone https://github.com/nandan-byte/Document-QA-System-using-RAG.git
cd Document-QA-System-using-RAG
# 2. (Recommended) Create a virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# 3. Install dependencies
pip install -r requirements.txt

Create a .env file in the project root:
MISTRAL_API_KEY=your_mistral_api_key_here

The app loads this automatically via python-dotenv.
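If the key is missing, the failure usually surfaces later as an opaque API error. A minimal fail-fast check, assuming the variables from .env have already been loaded into the process environment, might look like this. The helper name `get_mistral_api_key` is hypothetical and not part of the repository.

```python
import os


def get_mistral_api_key(env=None):
    """Return the configured key, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    # Treat the placeholder value from the example .env as "not set".
    if not key or key == "your_mistral_api_key_here":
        raise RuntimeError(
            "MISTRAL_API_KEY is not set; add it to .env in the project root."
        )
    return key
```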
streamlit run app.py

Then open http://localhost:8501 in your browser.
Usage:
- Upload a PDF using the file uploader.
- Click "Create Vector Database" and wait for processing to complete.
- Type a question in the text input and press Enter.
- View the AI-generated answer. Expand "🔍 Retrieved Context" to see the source chunks.
| Component | Library |
|---|---|
| UI | Streamlit |
| LLM | Mistral AI (mistral-small-2506) |
| Embeddings | MistralAIEmbeddings (langchain-mistralai) |
| Vector Store | ChromaDB |
| PDF Loading | LangChain PyPDFLoader |
| Text Splitting | LangChain RecursiveCharacterTextSplitter |
| Orchestration | LangChain Core / Community |
| Env Config | python-dotenv |
- The vector database is stored locally under chroma_db/<filename>/. Each uploaded file gets its own isolated store.
- Re-uploading the same file and clicking "Create Vector Database" will delete and rebuild the store cleanly.
- The LLM prompt enforces strict document-only answers. If your question falls outside the document's content, the model will say so rather than guess.
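The per-file layout and the delete-and-rebuild behavior can be sketched with the standard library alone. This is an illustration under assumptions: the helper names are hypothetical, the directory is named after the upload's base name exactly as given, and the app's own handling (including how ChromaDB is reopened afterwards) may differ.

```python
import shutil
from pathlib import Path


def store_dir_for(filename, root="chroma_db"):
    """Map an uploaded file name to its own directory under root."""
    # Keep only the base name so a path-like upload name cannot escape root.
    name = Path(filename).name
    if not name:
        raise ValueError("filename must not be empty")
    return Path(root) / name


def reset_store_dir(filename, root="chroma_db"):
    """Delete any existing store for this file and recreate it empty."""
    target = store_dir_for(filename, root)
    if target.exists():
        shutil.rmtree(target)
    target.mkdir(parents=True)
    return target
```

Starting from an empty directory on every rebuild avoids mixing embeddings from an older version of the same file with the new ones.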