A context-aware Retrieval-Augmented Generation (RAG) system powered by local LLMs and semantic search.
This project implements a lightweight RAG pipeline that grounds LLM responses in relevant context retrieved from a local dataset. It combines embedding-based retrieval with local model inference, so both retrieval and generation can run on your own machine for efficiency and privacy.
- 🔍 Semantic search using SentenceTransformers
- 🧩 Intelligent chunking with overlap strategy
- 📊 Cosine similarity-based retrieval (Top-K matches)
- 🧠 Context injection into LLM prompts
- 💻 Local LLM execution via GPT4All (offline capable)
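The overlap strategy listed above can be illustrated with a minimal sketch. It assumes word-based windows; the function name `chunk_text` and the `chunk_size` / `overlap` defaults are hypothetical and are not taken from `chunks.py`, which may split text differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word windows; consecutive windows share `overlap` words.

    Hypothetical helper for illustration only, not the project's actual chunker.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("a b c d e f g h i j", chunk_size=4, overlap=2):
        print(chunk)
```

Overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk. The cost is some duplicated text in the index.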
User Input
↓
Chunking
↓
Embedding (MiniLM)
↓
Cosine Similarity
↓
Top-K Context Retrieval
↓
Prompt Injection
↓
Local LLM (GPT4All)
↓
Final Response
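The retrieval and prompt-injection steps of this flow can be sketched in plain Python. This is an illustration under stated assumptions: the vectors below are toy 3-dimensional values standing in for MiniLM embeddings, and the names `cosine_similarity`, `top_k_chunks`, `build_prompt`, and the prompt wording are hypothetical rather than the project's own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, k=3):
    """Return the k (score, chunk) pairs most similar to the query, best first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def build_prompt(question, retrieved):
    """Inject retrieved chunks into a prompt (the template is an assumption)."""
    context = "\n".join(f"- {chunk}" for _, chunk in retrieved)
    return (
        "Use the context below to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    chunks = [
        "GPT4All runs models locally.",
        "MiniLM produces sentence embeddings.",
        "Bananas are yellow.",
    ]
    # Toy vectors; real embeddings from all-MiniLM-L6-v2 have 384 dimensions.
    chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
    query_vec = [1.0, 0.1, 0.0]

    best = top_k_chunks(query_vec, chunk_vecs, chunks, k=2)
    print(build_prompt("How are models executed?", best))
```

Sorting every score is fine for a small `data.json`; for larger collections a vector index would usually replace the linear scan.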
rag-agent/
│
├── main.py # Main chat loop + LLM interaction
├── chunks.py # Chunking + embedding + retrieval logic
├── similarity.py # Semantic mapping & metadata logic
├── data.json # Preprocessed knowledge chunks
├── data_set.txt # Raw dataset
Install the dependencies: `pip install gpt4all sentence-transformers scikit-learn numpy`
Run the chat loop: `python main.py`
- MiniLM (`all-MiniLM-L6-v2`) for embeddings
- GPT4All-supported GGUF models (Mistral / LLaMA)
- Large model files (`.gguf`) are not included in the repository
- Designed for local/offline AI workflows
- Optimized for low-resource environments
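Because the model files are not shipped with the repository, it may help to see how the two libraries are typically loaded. The sketch below is an assumption about usage, not a copy of `main.py` or `chunks.py`: the function names, the `model_dir` default, and the `max_tokens` value are hypothetical, and `model_file` must be the name of a `.gguf` file you have downloaded yourself.

```python
def embed_texts(texts, model_name="all-MiniLM-L6-v2"):
    """Embed a list of strings with SentenceTransformers.

    The import is deferred so this module can be loaded before the
    dependency is installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # Unit-length vectors make the dot product equal to cosine similarity.
    return model.encode(texts, normalize_embeddings=True)


def generate_answer(prompt, model_file, model_dir=".", max_tokens=256):
    """Run a local GGUF model through GPT4All without downloading anything."""
    from gpt4all import GPT4All

    # allow_download=False keeps the call offline; the file must already exist
    # at model_dir/model_file.
    llm = GPT4All(model_file, model_path=model_dir, allow_download=False)
    with llm.chat_session():
        return llm.generate(prompt, max_tokens=max_tokens)
```

Note that `SentenceTransformer` fetches `all-MiniLM-L6-v2` from the Hugging Face Hub on first use unless it is already cached, so a fully offline setup needs that model downloaded in advance as well.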
Faizan Imran, AI Engineer | Full Stack Developer
- API deployment (FastAPI / Vercel)
- Streaming responses
- Vector database integration (FAISS / Chroma)
- Multi-document support