RAG Evaluation Pipeline

A Retrieval-Augmented Generation (RAG) evaluation framework that tests LLM responses against a knowledge base using DeepEval metrics and logs results to Langfuse.

What It Does

Loads a document (company policy) and chunks it into a FAISS vector store
Uses llama3.2 via Ollama to generate answers from retrieved context
Evaluates each answer with four DeepEval metrics, scored by a configurable judge model
Logs all results (scores, latency, pass/fail) to Langfuse for observability
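
The chunking step in the first item can be illustrated with a minimal sketch. It is not the repository's splitter: the function name chunk_text, the character-based windows, and the default sizes are assumptions chosen for illustration only.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative sizes)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text,
        # so no redundant tail fragment is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlapping windows keep a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval; each chunk would then be embedded and added to the vector store.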

Evaluation Metrics

Metric	What It Checks
AnswerRelevancy	Did the LLM answer the actual question?
Faithfulness	Did the LLM stick to the retrieved context without hallucinating?
ContextualPrecision	Did the retriever rank the most useful chunks first?
ContextualRecall	Did the retriever fetch the chunk containing the answer?
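
As a rough sketch of how the four metrics above could be applied to a single answer with DeepEval: the helper name evaluate_answer, the 0.5 threshold, and the returned dictionary shape are assumptions, not the repository's code. In DeepEval the metric classes carry a Metric suffix (for example AnswerRelevancyMetric).

```python
def evaluate_answer(question, actual, expected, contexts, threshold=0.5, judge=None):
    """Score one answer with the four metrics (hypothetical helper).

    judge is passed to each metric as its judge model; None leaves
    DeepEval's default judge in place.
    """
    # Imported lazily so this module loads even without DeepEval installed.
    from deepeval.metrics import (
        AnswerRelevancyMetric,
        ContextualPrecisionMetric,
        ContextualRecallMetric,
        FaithfulnessMetric,
    )
    from deepeval.test_case import LLMTestCase

    test_case = LLMTestCase(
        input=question,              # the user question
        actual_output=actual,        # what the LLM answered
        expected_output=expected,    # the reference answer
        retrieval_context=contexts,  # chunks returned by the retriever
    )

    results = {}
    for metric_cls in (
        AnswerRelevancyMetric,
        FaithfulnessMetric,
        ContextualPrecisionMetric,
        ContextualRecallMetric,
    ):
        metric = metric_cls(threshold=threshold, model=judge)
        metric.measure(test_case)  # calls the judge model
        results[metric_cls.__name__] = {
            "score": metric.score,
            "passed": metric.is_successful(),
        }
    return results
```

Each call to measure invokes the judge model, so running this requires a reachable judge backend and, for API-based judges, valid credentials.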

Tech Stack

LLM — Ollama (llama3.2)
Embeddings — HuggingFace (all-MiniLM-L6-v2)
Vector Store — FAISS
Evaluation — DeepEval
Judge Model — Groq (llama-3.3-70b) or Ollama (gemma4:26b)
Observability — Langfuse

Setup

Clone the repo
Install dependencies

   pip install -r requirements.txt

Copy .env.example to .env and add your keys

   cp .env.example .env

Pull the llama3.2 model and make sure Ollama is running locally

   ollama pull llama3.2

Run the evaluation

   python app2.py
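
Before running the evaluation, it can help to confirm that the keys from .env are actually set. The sketch below is an assumption-laden illustration: the variable names GROQ_API_KEY, LANGFUSE_PUBLIC_KEY, and LANGFUSE_SECRET_KEY are the names conventionally read by the Groq and Langfuse clients, but .env.example in the repository is the authoritative list, and the helper missing_keys is hypothetical.

```python
import os

# Assumed key names; check .env.example for the ones this project uses.
REQUIRED_KEYS = ("GROQ_API_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY")


def missing_keys(env=None, required=REQUIRED_KEYS):
    """Return the required keys that are unset or empty in env."""
    env = os.environ if env is None else env
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set.")
```

When the judge backend is set to Ollama, the Groq key is presumably not needed, so the required tuple can be trimmed accordingly.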

Configuration

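Choose the judge model backend by setting JUDGE_BACKEND in evaluation/deepeval_eval.py. Keep exactly one assignment active; if both are left uncommented, the second silently overrides the first.

JUDGE_BACKEND = "groq"      # fast, API-based
# JUDGE_BACKEND = "ollama"  # local, private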
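
One way such a switch could be resolved into a concrete judge model is sketched below. The mapping JUDGE_MODELS and the helper resolve_judge are hypothetical; the model names are copied from the Tech Stack list, and the exact identifiers each provider expects may differ.

```python
# Model names as listed under Tech Stack; provider-side identifiers may differ.
JUDGE_MODELS = {
    "groq": "llama-3.3-70b",
    "ollama": "gemma4:26b",
}


def resolve_judge(backend: str) -> dict:
    """Map a JUDGE_BACKEND value to a backend/model pair (hypothetical helper)."""
    key = backend.strip().lower()
    if key not in JUDGE_MODELS:
        valid = ", ".join(sorted(JUDGE_MODELS))
        raise ValueError(f"Unknown judge backend {backend!r}; expected one of: {valid}")
    return {"backend": key, "model": JUDGE_MODELS[key]}
```

Failing loudly on an unknown value avoids silently falling back to a default judge, which would make scores from different runs hard to compare.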

Output

Question : How many annual leave days do employees get?
Expected : 25 days
Actual   : Employees get 25 days of annual leave.
Latency  : 1.14 seconds
Score    : 1.0
Result   : PASS

EVALUATION SUMMARY
Passed: 4/4
Success Rate: 100.0%
Average DeepEval Score: 1.0
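
The summary figures above can be derived from per-question results roughly as follows. This is a sketch under assumptions: CaseResult, summarize, and format_summary are hypothetical names, and how the pipeline combines the four metric scores into one score per question is not specified here, so a single score field is assumed.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    question: str
    score: float   # assumed: one combined score per question
    passed: bool


def summarize(results: list[CaseResult]) -> dict:
    """Aggregate pass count, success rate (%), and mean score."""
    total = len(results)
    if total == 0:
        return {"passed": 0, "total": 0, "success_rate": 0.0, "average_score": 0.0}
    passed = sum(1 for r in results if r.passed)
    return {
        "passed": passed,
        "total": total,
        "success_rate": round(100.0 * passed / total, 1),
        "average_score": round(sum(r.score for r in results) / total, 2),
    }


def format_summary(summary: dict) -> str:
    """Render the aggregate in the same layout as the sample output."""
    return "\n".join([
        "EVALUATION SUMMARY",
        f"Passed: {summary['passed']}/{summary['total']}",
        f"Success Rate: {summary['success_rate']}%",
        f"Average DeepEval Score: {summary['average_score']}",
    ])
```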
