HotpotQA RAG Evaluation

A Retrieval-Augmented Generation (RAG) system for evaluating question-answering performance on the HotpotQA dataset using the Agno framework with ChromaDB vector database.

Features

Dual Search Modes: Supports both hybrid search (vector + keyword) and vector-only search
Automatic Setup: Handles dependency installation and environment configuration automatically
Agent-as-Judge Evaluation: Uses LLM-based semantic evaluation to improve answer scoring
Parallel Processing: Efficient batch processing with progress monitoring
Structured Logging: Structured log output via structlog

Quick Start

Option 1: Direct Execution (Recommended)

Run directly from the repository without cloning:

uv run --project https://github.kcl.ac.uk/k23069561/NLP-CW.git@RAG python main.py --first 5

The system will automatically:

Install dependencies
Set up the virtual environment
Configure environment variables
Download the dataset
Run the evaluation

Note: If the above syntax doesn't work with your uv version, clone the repository first (see Option 2).

Option 2: Clone and Run

Clone the repository:

git clone -b RAG https://github.kcl.ac.uk/k23069561/NLP-CW.git
cd NLP-CW

Run the main script:

python main.py --first 5

The script detects whether the project has been initialized and, if not, runs the setup process automatically.

Prerequisites

Python 3.12 or higher
uv package manager (installed automatically via pip if missing)
Mistral API key (prompted during setup)

Setup

To run setup manually, use the setup script, which handles all initialization:

./setup.sh

This will:

Install dependencies via uv sync
Create .env file from .env.example
Prompt for Mistral API key
Activate and verify the virtual environment

Usage

Command-Line Options

python main.py [OPTIONS]

Options:

--first N: Process first N questions (e.g., --first 5)
--last N: Process last N questions (e.g., --last 10)
--range START END: Process questions in range START-END, 0-indexed and inclusive (e.g., --range 10 20)
No flags: Process all questions in the dataset

Examples:

# Process first 5 questions
python main.py --first 5

# Process last 10 questions
python main.py --last 10

# Process questions 100 through 200 (0-indexed, inclusive)
python main.py --range 100 200

# Process all questions
python main.py
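
The selection flags above could be parsed roughly as follows. This is a minimal sketch, not the project's actual code: the helper names build_parser and select_questions are hypothetical, and treating the three flags as mutually exclusive is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Assumption: only one selector may be given at a time.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--first", type=int, metavar="N")
    group.add_argument("--last", type=int, metavar="N")
    group.add_argument("--range", type=int, nargs=2, metavar=("START", "END"))
    return parser


def select_questions(questions: list, args: argparse.Namespace) -> list:
    """Return the subset of questions chosen by the parsed flags."""
    if args.first is not None:
        return questions[: args.first]
    if args.last is not None:
        # Guard against -0, which would otherwise slice the whole list.
        return questions[-args.last :] if args.last > 0 else []
    if args.range is not None:
        start, end = args.range
        return questions[start : end + 1]  # END is inclusive
    return questions  # no flags: process everything
```

With this reading of the flags, --range 100 200 selects 101 questions, because both endpoints are included.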

Output

The evaluation pipeline generates the following files in the tmp/ directory:

predictions_hybrid.json: Predictions for hybrid search
predictions_vector.json: Predictions for vector search
gold.json: Gold standard answers and supporting facts
eval_results.json: Consolidated evaluation metrics for both search types

The eval_results.json file contains:

{
  "hybrid": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 0
  },
  "vector": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 2
  }
}
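
To see what the judge changed, the initial and final blocks can be compared per search mode. The sketch below assumes each metrics object is a flat mapping from metric name to a number (for example em and f1); the exact keys are not shown above, and judge_deltas is a hypothetical helper.

```python
import json
from pathlib import Path


def judge_deltas(results: dict) -> dict:
    """For each search mode, return final minus initial for every shared numeric metric."""
    deltas = {}
    for mode, block in results.items():
        initial, final = block["initial"], block["final"]
        deltas[mode] = {
            name: final[name] - initial[name]
            for name in initial
            if name in final and isinstance(initial[name], (int, float))
        }
    return deltas


if __name__ == "__main__":
    path = Path("tmp/eval_results.json")
    if path.exists():
        results = json.loads(path.read_text())
        for mode, delta in judge_deltas(results).items():
            print(mode, delta, "judge_improvements:", results[mode]["judge_improvements"])
```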

Evaluation Metrics

The system reports standard HotpotQA metrics:

EM: Exact Match score
F1: F1 score for answer matching
Joint EM/F1: Combined answer and supporting facts accuracy
Supporting Facts (sp_em, sp_f1): Exact Match and F1 for the predicted supporting facts
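
For orientation, answer-level EM and F1 are conventionally computed on normalized strings. The sketch below follows the usual SQuAD/HotpotQA-style normalization (lowercase, strip punctuation, drop articles, collapse whitespace). It is a simplified illustration rather than the contents of eval.py, and it omits the supporting-fact and joint metrics as well as any special handling of yes/no answers.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a prediction of "Eiffel Tower in Paris" against the gold answer "Eiffel Tower" scores 0 on EM but 2/3 on F1, which is why both are reported.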

Architecture

Vector Database: ChromaDB with persistent storage
Embeddings: Mistral embeddings with batch processing
LLM: Mistral Large for question answering and evaluation
Search: Hybrid search using Reciprocal Rank Fusion (RRF) or vector-only similarity
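
Reciprocal Rank Fusion merges several ranked lists by giving each document a score of 1 / (k + rank) per list and summing. The sketch below illustrates the idea on plain lists of document IDs; it is not the project's or ChromaDB's implementation, the function name is hypothetical, and k = 60 is a commonly used default assumed here rather than a value taken from this repository.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents near the top of any list contribute more.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["d1", "d2", "d3"]   # hypothetical vector-search ranking
keyword_hits = ["d3", "d1", "d4"]  # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear in both lists (d1 and d3 here) rise above those found by only one retriever, which is the motivation for hybrid search over vector-only similarity.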

Project Structure

rag/
├── main.py                 # Main evaluation pipeline
├── eval.py                 # HotpotQA evaluation script
├── logging_config.py       # Structured logging configuration
├── setup.sh                # Initialization script
├── prompt/
│   ├── agent_instructions.yaml
│   └── judge_criteria.yaml
├── tmp/                    # Generated files and data
│   ├── chromadb/          # Vector database storage
│   └── *.json             # Predictions and results
└── pyproject.toml         # Project dependencies

Development

Code Quality

The project uses pre-commit hooks for code quality:

Ruff for linting and formatting
ShellCheck for shell script validation
Security scanning with detect-secrets

Install hooks:

pre-commit install

Environment Variables

Create a .env file with:

MISTRAL_API_KEY=your_api_key_here

Get your API key from: https://console.mistral.ai/
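
A script can fail early with a clear message when the key is missing or still set to the placeholder. The sketch below reads from the process environment only; how the project loads .env is not described here, so get_mistral_api_key is a hypothetical helper and assumes the variable has already been exported or loaded.

```python
import os


def get_mistral_api_key(env=None) -> str:
    """Return the Mistral API key, or raise if it is unset or a placeholder."""
    env = os.environ if env is None else env
    key = env.get("MISTRAL_API_KEY", "").strip()
    if not key or key == "your_api_key_here":
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env or export it.")
    return key
```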
