A reusable evaluation framework that benchmarks multiple LLMs across four RAGAS metrics on a domain-specific Q&A dataset — no OpenAI billing required (uses Groq free tier).
This framework simulates a RAG (Retrieval-Augmented Generation) pipeline where each model receives a question + a retrieved context passage and must generate a faithful answer. The answers are then scored using four RAGAS metrics, and the results are compared across models with automated charts.
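As a rough illustration of the generation step, the sketch below assembles a question and its retrieved context into a single grounded prompt. The helper name `build_rag_prompt` and the instruction wording are assumptions made for this example; they are not taken from evaluate.py.

```python
def build_rag_prompt(question: str, context: str) -> str:
    """Combine a question and a retrieved passage into one grounded prompt.

    Hypothetical helper: the instruction wording is an assumption,
    not the prompt that evaluate.py actually sends.
    """
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
        "Answer:"
    )


if __name__ == "__main__":
    print(build_rag_prompt(
        "What does the Sharpe Ratio measure?",
        "The Sharpe Ratio measures excess return per unit of volatility.",
    ))
```

Restricting the model to the supplied context is what makes a faithfulness score meaningful: any claim that is not in the passage counts against the answer.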
Models benchmarked
| Model | Provider | Size |
|---|---|---|
| Llama 3.1 | Groq (free) | 8B |
| Mistral 7B | Groq (free) | 7B |
| Gemma 2 | Groq (free) | 9B |
Metrics (RAGAS)
| Metric | What It Measures |
|---|---|
| Faithfulness | Are all claims in the answer supported by the retrieved context? |
| Answer Relevancy | Does the answer directly address the question? |
| Context Precision | Are the retrieved passages relevant to the question? |
| Context Recall | Does the retrieved context cover what's needed to answer? |
| Model | Faithfulness ↑ | Answer Relevancy | Context Precision | Context Recall | Avg Latency |
|---|---|---|---|---|---|
| Llama 3.1 | 0.872 | 0.840 | 0.780 | 0.832 | 1.31s |
| Mistral 7B | 0.813 | 0.792 | 0.731 | 0.797 | 1.58s |
| Gemma 2 | 0.643 | 0.708 | 0.748 | 0.674 | 1.09s |
Key finding: There is a 23-point gap in faithfulness between the top model (Llama 3.1 at 0.872) and the bottom (Gemma 2 at 0.643). Llama 3.1 leads on all four metrics, and its average latency of 1.31s sits between Gemma 2 (1.09s) and Mistral 7B (1.58s).
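The headline numbers can be re-derived from the results table with a few lines of Python. This is a minimal sketch: the scores are copied from the table, while the dictionary layout and the helper names `leader` and `gap_in_points` are assumptions for illustration.

```python
# Scores copied from the results table; key names are illustrative.
RESULTS = {
    "Llama 3.1": {"faithfulness": 0.872, "answer_relevancy": 0.840,
                  "context_precision": 0.780, "context_recall": 0.832},
    "Mistral 7B": {"faithfulness": 0.813, "answer_relevancy": 0.792,
                   "context_precision": 0.731, "context_recall": 0.797},
    "Gemma 2": {"faithfulness": 0.643, "answer_relevancy": 0.708,
                "context_precision": 0.748, "context_recall": 0.674},
}


def leader(metric: str) -> str:
    """Return the model with the highest score for one metric."""
    return max(RESULTS, key=lambda model: RESULTS[model][metric])


def gap_in_points(metric: str) -> float:
    """Best-minus-worst score for a metric, in points on a 0-100 scale."""
    scores = [row[metric] for row in RESULTS.values()]
    return round((max(scores) - min(scores)) * 100, 1)


if __name__ == "__main__":
    print("faithfulness gap:", gap_in_points("faithfulness"), "points")
    for metric in RESULTS["Llama 3.1"]:
        print(metric, "->", leader(metric))
```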
llm-rag-eval/
├── evaluate.py                 # Main benchmark runner
├── visualize.py                # Chart generator
├── generate_mock_results.py    # Demo mode (no API key needed)
├── requirements.txt
├── data/
│   └── qa_dataset.json         # 20-sample finance Q&A dataset
└── results/                    # Generated at runtime
    ├── summary.csv
    ├── llama_results.csv
    ├── mistral_results.csv
    ├── gemma_results.csv
    └── *.png                   # Charts
git clone https://github.com/<your-username>/llm-rag-eval.git
cd llm-rag-eval
pip install -r requirements.txt

Sign up at console.groq.com (no credit card required for the free tier), then export your API key:

export GROQ_API_KEY=gsk_your_key_here

# Full benchmark (all 3 models × 20 samples)
python evaluate.py
# Quick smoke-test (5 samples only)
python evaluate.py --samples 5
# Single model
python evaluate.py --model llama

Generate the charts:

python visualize.py

To preview all charts without any API key:

python generate_mock_results.py # creates realistic mock results
python visualize.py # generates all 4 charts

Question + Context
│
▼
[Model generates answer via Groq]
│
▼
RAGAS evaluates {question, context, answer, reference}
│
├─ Faithfulness (LLM-based)
├─ Answer Relevancy (LLM-based)
├─ Context Precision (LLM-based)
└─ Context Recall (LLM-based)
│
▼
Results saved to results/*.csv
Charts saved to results/*.png
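The final step of the flow above, turning per-sample scores into one summary row per model, might look roughly like the sketch below. The column names in `METRICS` and the helpers `summarize` and `write_summary_csv` are assumptions for illustration, and the demo rows are placeholder values, not benchmark output.

```python
import csv
import io
from statistics import mean

# Assumed column names; the real CSV headers in results/ may differ.
METRICS = ["faithfulness", "answer_relevancy",
           "context_precision", "context_recall"]


def summarize(model_name, rows):
    """Average each metric over the per-sample rows of one model."""
    summary = {"model": model_name}
    for metric in METRICS:
        summary[metric] = round(mean(float(r[metric]) for r in rows), 3)
    return summary


def write_summary_csv(summaries, fh):
    """Write one row per model to an open text file handle."""
    writer = csv.DictWriter(fh, fieldnames=["model", *METRICS])
    writer.writeheader()
    writer.writerows(summaries)


if __name__ == "__main__":
    # Placeholder scores for illustration only.
    demo_rows = [
        {"faithfulness": 1.0, "answer_relevancy": 0.5,
         "context_precision": 0.5, "context_recall": 0.0},
        {"faithfulness": 0.5, "answer_relevancy": 0.5,
         "context_precision": 0.5, "context_recall": 1.0},
    ]
    buffer = io.StringIO()
    write_summary_csv([summarize("demo-model", demo_rows)], buffer)
    print(buffer.getvalue())
```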
RAGAS requires an LLM to act as the evaluator. This project uses llama-3.1-8b-instant on Groq (free) as the judge — no OpenAI API key or billing needed.
The dataset (data/qa_dataset.json) contains 20 finance Q&A samples covering:
- Portfolio theory (CAPM, Sharpe Ratio, Alpha, VaR)
- Fixed income (duration, yield curve, CDS, callable bonds)
- Corporate finance (P/E ratio, ROE, EBITDA, D/E ratio)
- Derivatives (Black-Scholes, FRA, options)
- Macroeconomics (QE, inflation, EMH)
Each sample includes a question, a retrieved context passage, and a human-written reference_answer.
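Before running a benchmark on a new or edited dataset, it can help to check that every sample carries the three fields named above. The following is a minimal sketch using only the standard library; `validate_samples` and `load_and_validate` are hypothetical helpers, not functions from this repository.

```python
import json

REQUIRED_KEYS = ("question", "context", "reference_answer")


def validate_samples(samples):
    """Return a list of problems found; an empty list means the data is valid."""
    if not isinstance(samples, list):
        return ["top-level JSON value must be a list of samples"]
    problems = []
    for index, sample in enumerate(samples):
        if not isinstance(sample, dict):
            problems.append(f"sample {index}: expected an object")
            continue
        for key in REQUIRED_KEYS:
            value = sample.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"sample {index}: missing or empty '{key}'")
    return problems


def load_and_validate(path):
    """Load a dataset file and raise ValueError if any sample is malformed."""
    with open(path, encoding="utf-8") as fh:
        samples = json.load(fh)
    problems = validate_samples(samples)
    if problems:
        raise ValueError("; ".join(problems))
    return samples
```

For example, `load_and_validate("data/qa_dataset.json")` would return the parsed samples, or raise with a message listing each offending sample index.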
Add a new model:
# In evaluate.py → MODELS dict
MODELS["qwen"] = "qwen-qwq-32b" # also available on Groq

Add a new metric:
# In evaluate.py → metrics list inside run_model_evaluation()
from ragas.metrics import AnswerCorrectness
metrics.append(AnswerCorrectness(llm=judge_llm))

Use your own dataset:
Replace data/qa_dataset.json with a JSON file following the same schema:
[
  {
    "question": "...",
    "context": "...",
    "reference_answer": "..."
  }
]

| Tool | Role |
|---|---|
| RAGAS | RAG evaluation metrics |
| Groq | Free LLM inference API |
| HuggingFace Datasets | Dataset format for RAGAS |
| Pandas | Results processing |
| Matplotlib | Charts |
MIT — free to use, modify, and distribute.



