moumen-momi/rag-eval

HotpotQA RAG Evaluation

A Retrieval-Augmented Generation (RAG) system for evaluating question-answering performance on the HotpotQA dataset, built on the Agno framework with ChromaDB as the vector database.

Features

  • Dual Search Modes: Supports both hybrid search (vector + keyword) and vector-only search
  • Automatic Setup: Handles dependency installation and environment configuration automatically
  • Agent-as-Judge Evaluation: Uses LLM-based semantic evaluation to improve answer scoring
  • Parallel Processing: Efficient batch processing with progress monitoring
  • Structured Logging: Emits structured log output via structlog

Quick Start

Option 1: Direct Execution (Recommended)

Run directly from the repository without cloning:

uv run --project https://github.kcl.ac.uk/k23069561/NLP-CW.git@RAG python main.py --first 5

The system will automatically:

  • Install dependencies
  • Set up the virtual environment
  • Configure environment variables
  • Download the dataset
  • Run the evaluation

Note: If the above syntax doesn't work with your uv version, clone the repository first (see Option 2).

Option 2: Clone and Run

  1. Clone the repository:
git clone -b RAG https://github.kcl.ac.uk/k23069561/NLP-CW.git
cd NLP-CW
  2. Run the main script:
python main.py --first 5

If the project has not been initialized, the script detects this and runs the setup process automatically.

Prerequisites

  • Python 3.12 or higher
  • uv package manager (installed automatically via pip if missing)
  • Mistral API key (prompted during setup)

Setup

To run setup manually, use the setup script, which handles all initialization:

./setup.sh

This will:

  • Install dependencies via uv sync
  • Create .env file from .env.example
  • Prompt for Mistral API key
  • Activate and verify the virtual environment

Usage

Command-Line Options

python main.py [OPTIONS]

Options:

  • --first N: Process first N questions (e.g., --first 5)
  • --last N: Process last N questions (e.g., --last 10)
  • --range START END: Process questions in range START-END, 0-indexed and inclusive (e.g., --range 10 20)
  • No flags: Process all questions in the dataset

Examples:

# Process first 5 questions
python main.py --first 5

# Process last 10 questions
python main.py --last 10

# Process questions 100 through 200 (0-indexed, inclusive)
python main.py --range 100 200

# Process all questions
python main.py
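
The flag semantics above can be illustrated with a small sketch. It is not taken from main.py: the function names build_parser and select_indices are hypothetical, and treating the three flags as mutually exclusive is an assumption. It only shows how --first, --last, and an inclusive 0-indexed --range could map onto question indices.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Assumption: only one selection flag may be given at a time.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--first", type=int, metavar="N")
    group.add_argument("--last", type=int, metavar="N")
    group.add_argument("--range", type=int, nargs=2, metavar=("START", "END"))
    return parser


def select_indices(args: argparse.Namespace, total: int) -> list[int]:
    """Return the 0-based question indices selected by the flags."""
    if args.first is not None:
        return list(range(min(args.first, total)))
    if args.last is not None:
        return list(range(max(total - args.last, 0), total))
    if args.range is not None:
        start, end = args.range
        # END is inclusive, so extend the upper bound by one.
        return list(range(max(start, 0), min(end + 1, total)))
    # No flags: process every question.
    return list(range(total))


if __name__ == "__main__":
    cli_args = build_parser().parse_args(["--range", "10", "12"])
    print(select_indices(cli_args, total=100))
```

Under this reading, --range 100 200 selects 101 questions, because both endpoints are included.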

Output

The evaluation pipeline generates the following files in the tmp/ directory:

  • predictions_hybrid.json: Predictions for hybrid search
  • predictions_vector.json: Predictions for vector search
  • gold.json: Gold standard answers and supporting facts
  • eval_results.json: Consolidated evaluation metrics for both search types

The eval_results.json file contains:

{
  "hybrid": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 0
  },
  "vector": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 2
  }
}
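
To see what the judge pass changed, the initial and final blocks can be compared. The following is a minimal sketch, assuming only the top-level layout shown above; the helper names load_results and metric_deltas are hypothetical, and the metric keys are read from the file rather than assumed. The numbers in the usage example are placeholders, not real results.

```python
import json
from pathlib import Path


def load_results(path: str = "tmp/eval_results.json") -> dict:
    """Read the consolidated results file written by the pipeline."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def metric_deltas(results: dict) -> dict:
    """For each search mode, return final minus initial for every shared numeric metric."""
    deltas = {}
    for mode, block in results.items():
        initial = block.get("initial", {})
        final = block.get("final", {})
        deltas[mode] = {
            name: final[name] - initial[name]
            for name in initial
            if name in final
            and isinstance(initial[name], (int, float))
            and isinstance(final[name], (int, float))
        }
    return deltas


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {
        "vector": {
            "initial": {"em": 0.5},
            "final": {"em": 0.75},
            "judge_improvements": 2,
        }
    }
    print(metric_deltas(sample))
```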

Evaluation Metrics

The system reports standard HotpotQA metrics:

  • EM: Exact Match score
  • F1: F1 score for answer matching
  • Joint EM/F1: Combined answer and supporting facts accuracy
  • Supporting Facts (sp_em, sp_f1): Accuracy of retrieved supporting documents
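
For readers unfamiliar with the answer metrics, here is a simplified sketch of EM and token-level F1 in the style used by SQuAD and HotpotQA evaluation scripts. It is not a copy of eval.py: the official HotpotQA script also special-cases yes/no answers, which this sketch omits, and the function names are chosen for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when the normalized strings are identical."""
    return normalize_answer(prediction) == normalize_answer(gold)


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The Eiffel Tower!", "eiffel tower"))
    print(f1_score("Barack Obama", "Obama"))
```

A prediction such as "Barack Obama" against the gold answer "Obama" fails EM but still earns partial F1 credit, which is the kind of near miss the Agent-as-Judge step is meant to re-examine.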

Architecture

  • Vector Database: ChromaDB with persistent storage
  • Embeddings: Mistral embeddings with batch processing
  • LLM: Mistral Large for question answering and evaluation
  • Search: Hybrid search using Reciprocal Rank Fusion (RRF) or vector-only similarity
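
Reciprocal Rank Fusion combines several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below illustrates the general technique only; it is not the code path used by Agno or ChromaDB, and the constant k = 60 is the commonly cited default rather than a value confirmed for this project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
    keyword_hits = ["b", "d", "a"]  # hypothetical keyword-search ranking
    print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

Documents that appear near the top of both lists rise to the front, while a document found by only one retriever still keeps a nonzero score.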

Project Structure

rag/
├── main.py                 # Main evaluation pipeline
├── eval.py                 # HotpotQA evaluation script
├── logging_config.py       # Structured logging configuration
├── setup.sh                # Initialization script
├── prompt/
│   ├── agent_instructions.yaml
│   └── judge_criteria.yaml
├── tmp/                    # Generated files and data
│   ├── chromadb/          # Vector database storage
│   └── *.json             # Predictions and results
└── pyproject.toml         # Project dependencies

Development

Code Quality

The project uses pre-commit hooks for code quality:

  • Ruff for linting and formatting
  • ShellCheck for shell script validation
  • Security scanning with detect-secrets

Install hooks:

pre-commit install

Environment Variables

Create a .env file with:

MISTRAL_API_KEY=your_api_key_here

Get your API key from: https://console.mistral.ai/
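
A quick way to check that the key is visible before starting a long run is sketched below. This is an assumption-laden illustration using only the standard library: the project itself may load .env differently (for example through a dotenv package), and the helper names load_env_file and get_mistral_api_key are hypothetical.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_api_key(path: str = ".env") -> str:
    """Prefer the process environment, then fall back to the .env file."""
    key = os.environ.get("MISTRAL_API_KEY") or load_env_file(path).get("MISTRAL_API_KEY")
    if not key or key == "your_api_key_here":
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or export it")
    return key
```

Calling get_mistral_api_key() raises an error if the key is missing or still set to the placeholder value from the example file.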
