nipunkhanderia/rag-eval-project

RAG Evaluation Pipeline

A Retrieval-Augmented Generation (RAG) evaluation framework that tests LLM responses against a knowledge base using DeepEval metrics and logs results to Langfuse.

What It Does

  • Loads a document (company policy), splits it into chunks, and indexes the chunk embeddings in a FAISS vector store
  • Uses llama3.2 via Ollama to generate answers from retrieved context
  • Evaluates answers using 4 DeepEval metrics with a configurable judge model
  • Logs all results (scores, latency, pass/fail) to Langfuse for observability
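
The retrieve-then-generate flow in the steps above can be sketched with the standard library alone. This is an illustration, not the project's code: word-count vectors stand in for the all-MiniLM-L6-v2 embeddings, a sorted list stands in for FAISS, and the function names (`chunk_text`, `retrieve`, `build_prompt`) and the sample policy text are made up for the example.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 9) -> list[str]:
    """Split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question, best first."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the prompt that would be sent to the generator model."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Made-up sample policy, only for demonstration.
policy = (
    "Employees get 25 days of annual leave per year. "
    "Remote work is allowed two days per week. "
    "Expense claims must be filed within 30 days."
)
question = "How many annual leave days do employees get?"
print(build_prompt(question, retrieve(question, chunk_text(policy), k=1)))
```

In the real pipeline the prompt would go to llama3.2 via Ollama, and the retrieved chunks would be kept alongside the answer so the evaluation step can score them.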

RAG Pipeline

Evaluation Metrics

Metric              | What It Checks
AnswerRelevancy     | Did the LLM answer the actual question?
Faithfulness        | Did the LLM stick to the retrieved context without hallucinating?
ContextualPrecision | Did the retriever rank the most useful chunks first?
ContextualRecall    | Did the retriever fetch the chunk containing the answer?
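
To make the two retriever metrics concrete, here is a minimal sketch of the arithmetic behind them once per-chunk (or per-statement) verdicts exist. It assumes the rank-weighted precision formula as I understand it from the DeepEval docs; in DeepEval the verdicts themselves come from the judge model, while here they are passed in by hand. The function names are hypothetical.

```python
def rank_weighted_precision(relevant: list[bool]) -> float:
    """Average of precision@k over the ranks k that hold a relevant chunk.

    `relevant[i]` is True when the chunk at rank i + 1 was judged useful.
    Relevant chunks ranked early score higher than the same chunks ranked late.
    """
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k at this relevant rank
    return total / hits if hits else 0.0


def recall_fraction(supported: list[bool]) -> float:
    """Share of expected-answer statements that the retrieved context supports."""
    return sum(supported) / len(supported) if supported else 0.0


# Same two relevant chunks, different ranking order.
print(rank_weighted_precision([True, True, False]))
print(rank_weighted_precision([False, True, True]))
```

The first call scores higher than the second because the useful chunks sit at the top of the ranking, which is what ContextualPrecision rewards.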

Tech Stack

  • LLM — Ollama (llama3.2)
  • Embeddings — HuggingFace (all-MiniLM-L6-v2)
  • Vector Store — FAISS
  • Evaluation — DeepEval
  • Judge Model — Groq (llama-3.3-70b) or Ollama (gemma4:26b)
  • Observability — Langfuse

Setup

  1. Clone the repo
  2. Install dependencies
   pip install -r requirements.txt
  3. Copy .env.example to .env and add your keys
   cp .env.example .env
  4. Run Ollama locally with llama3.2
   ollama pull llama3.2
  5. Run the evaluation
   python app2.py

Configuration

Switch the judge model backend in evaluation/deepeval_eval.py by setting JUDGE_BACKEND to one of the two values (leave only one line active):

JUDGE_BACKEND = "groq"      # fast, API-based
# JUDGE_BACKEND = "ollama"  # local, private
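
If editing the file for every switch gets tedious, one option is to read the backend from the environment and validate it up front. The sketch below is hypothetical and not how the project does it: the `JUDGE_BACKEND` environment variable, `JUDGE_MODELS`, and `resolve_judge` are assumptions for illustration, and the model names are the ones listed under Tech Stack.

```python
import os
from typing import Optional

# Backend -> judge model name, taken from the Tech Stack list.
JUDGE_MODELS = {
    "groq": "llama-3.3-70b",  # fast, API-based
    "ollama": "gemma4:26b",   # local, private
}


def resolve_judge(backend: Optional[str] = None) -> tuple[str, str]:
    """Return (backend, model) for an explicit value or the environment.

    Falls back to the JUDGE_BACKEND environment variable (an assumed name),
    then to "groq". Raises ValueError for anything else so a typo fails
    early instead of silently picking a default.
    """
    name = (backend or os.environ.get("JUDGE_BACKEND", "groq")).strip().lower()
    if name not in JUDGE_MODELS:
        raise ValueError(f"Unknown judge backend {name!r}; expected one of {sorted(JUDGE_MODELS)}")
    return name, JUDGE_MODELS[name]


backend, model = resolve_judge()
print(f"Judge backend: {backend} ({model})")
```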

Output

Question : How many annual leave days do employees get?
Expected : 25 days
Actual   : Employees get 25 days of annual leave.
Latency  : 1.14 seconds
Score    : 1.0
Result   : PASS

EVALUATION SUMMARY
Passed: 4/4
Success Rate: 100.0%
Average DeepEval Score: 1.0
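
The summary figures are simple aggregates over the per-question results. A minimal sketch of that arithmetic, assuming each result carries a score and a pass flag (`EvalResult` and `summarize` are hypothetical names, not the project's):

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question: str
    score: float   # metric score in [0, 1]
    passed: bool   # whether the score met the metric threshold


def summarize(results: list[EvalResult]) -> str:
    """Format the pass count, success rate, and mean score."""
    total = len(results)
    passed = sum(r.passed for r in results)
    rate = 100.0 * passed / total if total else 0.0
    average = sum(r.score for r in results) / total if total else 0.0
    return "\n".join([
        "EVALUATION SUMMARY",
        f"Passed: {passed}/{total}",
        f"Success Rate: {rate:.1f}%",
        f"Average DeepEval Score: {round(average, 2)}",
    ])


results = [EvalResult(f"question {i}", 1.0, True) for i in range(1, 5)]
print(summarize(results))
```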
