RAGeATM stands for "Raging Against the Machine with Retrieval-Augmented Generation." It is a small SEIS 767 Conversational AI final-project MVP that demonstrates the mechanics of a retrieval-augmented question-answering pipeline.
The final project is intentionally scoped as an explainable prototype, not a production assistant.
- Local corpus ingestion from data/raw/*.txt
- Word-window chunking with overlap
- Lightweight TF-IDF vectors with cosine-similarity retrieval
- A saved local TF-IDF index under data/index/
- Top-k retrieval with similarity scores and a minimum relevance threshold
- Prompt construction using retrieved context
- Offline retrieval-conditioned answer generation
- Optional OpenAI generation when OPENAI_API_KEY is available
- A small benchmark with in-domain, partially answerable, and out-of-domain questions
- A CLI demo suitable for a 5-6 minute final presentation
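The word-window chunking with overlap listed above can be sketched roughly as follows. This is a minimal illustration, not the code in src.chunk: the function name `chunk_words` and the `window` and `overlap` defaults are assumptions chosen for the example.

```python
def chunk_words(text: str, window: int = 120, overlap: int = 30) -> list[str]:
    """Split text into overlapping windows of whitespace-separated words.

    The defaults are illustrative placeholders, not the project's settings.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")

    words = text.split()
    step = window - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With `window=4` and `overlap=2`, each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.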
Use this phrase when describing the technical scope:
a lightweight retrieval-based RAG prototype using TF-IDF vectors and cosine similarity
- No Chroma or external vector database
- No neural embedding model
- No semantic embedding claims beyond lexical TF-IDF retrieval
- No agent tools
- No conversational memory
- No voice interface
- No Docker/evolution/self-improvement loop
- No production deployment
- No large benchmark or production corpus
If the demo runs without OPENAI_API_KEY, the answer generator is not a real LLM. It is an offline retrieval-conditioned generator used to demonstrate RAG mechanics without paid services.
flowchart LR
A["data/raw/*.txt"] --> B["src.ingest"]
B --> C["data/processed/documents.json"]
C --> D["src.chunk"]
D --> E["data/processed/chunks.json"]
E --> F["src.embed TF-IDF"]
F --> G["data/index vectors + metadata + vectorizer"]
H["User question"] --> I["src.retrieve cosine similarity"]
G --> I
I --> J["Top-k retrieved chunks + scores"]
J --> K["src.generate prompt construction"]
K --> L["Offline grounded answer or optional OpenAI answer"]
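The prompt-construction step in the flow above might look something like the sketch below. The function name `build_prompt`, the instruction wording, and the chunk shape (dicts with `source`, `text`, and `score` keys) are assumptions for illustration; the real src.generate module may differ.

```python
def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt from retrieved chunks.

    Each chunk is assumed to be a dict with 'source', 'text', and 'score'
    keys. That shape is hypothetical, not the project's actual schema.
    """
    context_lines = []
    for i, chunk in enumerate(chunks, start=1):
        header = f"[{i}] source={chunk['source']} score={chunk['score']:.3f}"
        context_lines.append(f"{header}\n{chunk['text']}")
    context = "\n\n".join(context_lines) if context_lines else "(no context retrieved)"
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

Numbering each chunk and printing its source and score keeps the prompt easy to inspect during a demo.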
data/raw/ Small educational corpus used by the demo
data/processed/ Generated documents/chunks, ignored by git
data/index/ Generated TF-IDF index, ignored by git
docs/evaluation_results.md Generated benchmark report
docs/final_report_notes.md Final-report companion notes
docs/video_script_5_6_min.md Presentation script
scripts/build_index.py Rebuilds processed data and TF-IDF index
scripts/run_demo.py Runs one end-to-end demo question
scripts/run_evaluation.py Runs benchmark and writes docs/evaluation_results.md
src/ Pipeline implementation
tests/ Focused sanity tests
From the project root:
python3 -m venv .venv
source .venv/bin/activate
python -m pip install -r requirements.txt
Optional real-LLM mode:
export OPENAI_API_KEY="your_key_here"
export RAGEATM_OPENAI_MODEL="gpt-4o-mini"
Do not commit API keys. The default demo does not need an API key.
Recommended command for the final video:
python scripts/run_demo.py --query "How does RAG reduce hallucinations?"
Out-of-domain refusal demo:
python scripts/run_demo.py --query "What is the capital of France?"
Show the constructed prompt if useful:
python scripts/run_demo.py --query "How does RAG reduce hallucinations?" --show-prompt
Force offline mode:
python scripts/run_demo.py --mode offline --query "What does chunking do in the RAG pipeline?"
Try OpenAI mode if OPENAI_API_KEY is set:
python scripts/run_demo.py --mode openai --query "How does RAG reduce hallucinations?"
If OpenAI mode fails, the project falls back to the offline retrieval-conditioned generator and prints the failure reason.
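The fallback behavior can be pictured with a small sketch like the one below. It assumes two hypothetical callables, `openai_generate` and `offline_generate`, and treats a missing OPENAI_API_KEY the same as a failed call; how the project itself wires this is not shown here.

```python
import os
from typing import Callable


def generate_with_fallback(
    prompt: str,
    mode: str,
    openai_generate: Callable[[str], str],   # hypothetical OpenAI-backed generator
    offline_generate: Callable[[str], str],  # hypothetical offline generator
) -> tuple[str, str]:
    """Return (answer, mode_used), falling back to offline on any OpenAI failure."""
    if mode == "openai":
        if not os.environ.get("OPENAI_API_KEY"):
            print("OpenAI mode unavailable: OPENAI_API_KEY is not set; using offline mode.")
        else:
            try:
                return openai_generate(prompt), "openai"
            except Exception as exc:  # broad on purpose: any failure triggers the fallback
                print(f"OpenAI mode failed ({exc}); using offline mode.")
    return offline_generate(prompt), "offline"
```

Returning the mode that was used lets the CLI state plainly whether an answer came from an LLM or from the offline generator.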
python scripts/run_evaluation.py
This rebuilds the local index, runs the small benchmark, and writes:
docs/evaluation_results.md
Current benchmark summary:
- 7 benchmark questions
- 5 in-domain questions
- 1 partially answerable limitation question
- 1 out-of-domain question
- Current useful retrieval decisions: 7/7 on the small educational benchmark
That 7/7 result should be described carefully. It means the small sanity-check benchmark works, not that the system is generally accurate.
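One plausible way to count a "useful retrieval decision" is sketched below. The scoring rule is an assumption, not taken from scripts/run_evaluation.py: out-of-domain questions count as useful when the system refuses, and every other category counts as useful when context is retrieved. The category labels and record shape are also hypothetical.

```python
def is_useful_decision(category: str, answered: bool) -> bool:
    """Assumed rule: refuse out-of-domain questions, answer everything else."""
    if category == "out_of_domain":
        return not answered
    return answered


def summarize(results: list[dict]) -> str:
    """Format the count of useful decisions as 'useful/total'."""
    useful = sum(is_useful_decision(r["category"], r["answered"]) for r in results)
    return f"{useful}/{len(results)}"


if __name__ == "__main__":
    # Toy records for illustration only; these are not the project's benchmark results.
    toy = [
        {"category": "in_domain", "answered": True},
        {"category": "out_of_domain", "answered": False},
        {"category": "in_domain", "answered": False},
    ]
    print(summarize(toy))
```

A metric like this only checks whether the system chose to answer or refuse appropriately. It says nothing about answer quality, which is one reason the result should be described carefully.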
python -m pytest
The tests cover chunking behavior and basic retrieval threshold behavior.
The local corpus is intentionally small and educational. It lives in data/raw/ and includes notes on:
- RAG concepts
- RAGeATM project design
- chunking and retrieval pipeline mechanics
- evaluation methodology
- data-engineering troubleshooting
- conversational-AI limitations
This is not a production dataset or a large curated knowledge base.
Retrieval uses TfidfVectorizer from scikit-learn and cosine similarity. This is lexical retrieval: it scores overlap in weighted terms and does not capture synonyms or paraphrases. It suits an explainable MVP because the vectorization and scoring are easy to inspect and defend.
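To show what TF-IDF scoring with cosine similarity computes, here is a from-scratch sketch using only the standard library. It is not the project's code and not scikit-learn's implementation. It uses smoothed IDF, ln((1 + n) / (1 + df)) + 1, and L2 normalization, which is how I understand the documented TfidfVectorizer defaults, but tokenization is simplified and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumerics (simplified tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fit_idf(docs: list[str]) -> dict[str, float]:
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_vector(text: str, idf: dict[str, float]) -> dict[str, float]:
    """L2-normalized TF-IDF weights; terms unseen at fit time are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Score every chunk against the query and return the top_k (score, chunk) pairs."""
    idf = fit_idf(chunks)
    q = tfidf_vector(query, idf)
    scored = [(cosine(q, tfidf_vector(c, idf)), c) for c in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

A query that shares no vocabulary with the corpus scores 0.0 against every chunk, which is the lexical limitation in concrete form: "split" will not match "splits" unless stemming is added.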
Generation has two modes:
- Default: offline retrieval-conditioned answer generator. It selects supporting sentences from relevant retrieved chunks and cites local sources.
- Optional: OpenAI generation through environment variables. This is not required for the project to run.
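The default offline mode can be approximated by a sketch like the following, which ranks sentences by word overlap with the question and appends source citations. The overlap heuristic, the function name `offline_answer`, and the chunk shape (dicts with `source` and `text` keys) are assumptions for illustration; the project's actual selection logic may differ.

```python
import re


def offline_answer(question: str, chunks: list[dict], max_sentences: int = 2) -> str:
    """Pick the sentences sharing the most words with the question and cite sources.

    Chunks are assumed to be dicts with 'source' and 'text' keys (hypothetical schema).
    """
    q_terms = set(re.findall(r"[a-z0-9]+", question.lower()))
    candidates = []
    for chunk in chunks:
        # Naive sentence split on whitespace following ., !, or ?
        for sentence in re.split(r"(?<=[.!?])\s+", chunk["text"].strip()):
            terms = set(re.findall(r"[a-z0-9]+", sentence.lower()))
            overlap = len(q_terms & terms)
            if overlap:
                candidates.append((overlap, sentence, chunk["source"]))
    if not candidates:
        return "I could not find supporting text in the local corpus."
    best = sorted(candidates, key=lambda c: c[0], reverse=True)[:max_sentences]
    body = " ".join(sentence for _, sentence, _ in best)
    sources = ", ".join(sorted({source for _, _, source in best}))
    return f"{body} (Sources: {sources})"
```

Because the output is stitched from retrieved sentences, every claim in the answer can be traced back to a file in the corpus. The cost is that the generator cannot paraphrase or synthesize the way an LLM can.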
The system refuses to answer when the best retrieved chunk scores below the minimum similarity threshold.
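The refusal rule amounts to a guard on the top score, roughly as in this sketch. The name `select_context` and the `min_score=0.1` default are placeholders, not the project's configured values, and dropping weaker chunks from the top-k is an assumption.

```python
from typing import Optional


def select_context(
    scored_chunks: list[tuple[float, str]],
    min_score: float = 0.1,  # placeholder threshold, not the project's setting
    top_k: int = 3,
) -> Optional[list[tuple[float, str]]]:
    """Return the top-k chunks at or above min_score, or None to signal a refusal."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)
    if not ranked or ranked[0][0] < min_score:
        return None  # best match is too weak: refuse instead of guessing
    # Assumption: chunks below the threshold are dropped even if they are in the top-k.
    return [pair for pair in ranked[:top_k] if pair[0] >= min_score]
```

A caller would treat `None` as the signal to print a refusal message and skip generation entirely.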
- Small corpus
- Lexical retrieval only
- No Chroma
- No neural embeddings
- No persistent chat memory
- No agent tools
- Offline generator is not a full LLM
- Evaluation is a small demo benchmark, not rigorous large-scale measurement
- Replace TF-IDF with neural embeddings
- Add Chroma or another vector store
- Expand the corpus with real course/project documents
- Add a small Gradio or Streamlit UI
- Add conversation history and follow-up question handling
- Add stronger evaluation with held-out queries and human ratings
- Lewis et al., "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," 2020, https://arxiv.org/abs/2005.11401
- scikit-learn TfidfVectorizer, https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- scikit-learn cosine_similarity, https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html
For the video, be direct:
- Say this is a rebuilt, honest MVP.
- Say it demonstrates RAG mechanics with local TF-IDF retrieval.
- Show one successful in-domain question.
- Show one out-of-domain refusal.
- Do not claim Chroma, neural embeddings, memory, tools, or production readiness.
Use docs/video_script_5_6_min.md for the recording plan.