A Retrieval-Augmented Generation (RAG) system for evaluating question-answering performance on the HotpotQA dataset, built on the Agno framework with a ChromaDB vector database.
- Dual Search Modes: Supports both hybrid search (vector + keyword) and vector-only search
- Automatic Setup: Handles dependency installation and environment configuration automatically
- Agent-as-Judge Evaluation: Uses LLM-based semantic evaluation to improve answer scoring
- Parallel Processing: Efficient batch processing with progress monitoring
- Structured Logging: Structured log output via structlog
Run directly from the repository without cloning:
uv run --project https://github.kcl.ac.uk/k23069561/NLP-CW.git@RAG python main.py --first 5
The system will automatically:
- Install dependencies
- Set up the virtual environment
- Configure environment variables
- Download the dataset
- Run the evaluation
Note: If the above syntax doesn't work with your uv version, clone the repository first (see Option 2).
- Clone the repository:
git clone -b RAG https://github.kcl.ac.uk/k23069561/NLP-CW.git
cd NLP-CW
- Run the main script:
python main.py --first 5
The script automatically detects if the project is not initialized and runs the setup process.
- Python 3.12 or higher
- uv package manager (installed automatically via pip if missing)
- Mistral API key (prompted during setup)
If running manually, the setup script handles all initialization:
./setup.sh
This will:
- Install dependencies via uv sync
- Create .env file from .env.example
- Prompt for Mistral API key
- Activate and verify the virtual environment
python main.py [OPTIONS]
Options:
- --first N: Process first N questions (e.g., --first 5)
- --last N: Process last N questions (e.g., --last 10)
- --range START END: Process questions in range START-END, 0-indexed and inclusive (e.g., --range 10 20)
- No flags: Process all questions in the dataset
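The selection semantics of these flags can be sketched as follows. This is a minimal illustration rather than the repository's actual code: the names `build_parser` and `select_questions` are hypothetical, and treating the three flags as mutually exclusive is an assumption the README does not state.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="main.py")
    # Assumption: only one selection flag may be given at a time.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--first", type=int, metavar="N")
    group.add_argument("--last", type=int, metavar="N")
    group.add_argument("--range", type=int, nargs=2, metavar=("START", "END"))
    return parser


def select_questions(questions: list, args: argparse.Namespace) -> list:
    """Return the subset of questions chosen by the parsed flags."""
    if args.first is not None:
        return questions[: args.first]
    if args.last is not None:
        # Guard against --last 0, since questions[-0:] would return everything.
        return questions[-args.last:] if args.last > 0 else []
    if args.range is not None:
        start, end = args.range
        # The range is 0-indexed and inclusive, so the slice ends at end + 1.
        return questions[start : end + 1]
    # No flags: process the whole dataset.
    return questions
```

Because the range is inclusive, `--range 10 20` selects 11 questions, not 10.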
Examples:
# Process first 5 questions
python main.py --first 5
# Process last 10 questions
python main.py --last 10
# Process questions 100-200
python main.py --range 100 200
# Process all questions
python main.py
The evaluation pipeline generates the following files in the tmp/ directory:
- predictions_hybrid.json: Predictions for hybrid search
- predictions_vector.json: Predictions for vector search
- gold.json: Gold standard answers and supporting facts
- eval_results.json: Consolidated evaluation metrics for both search types
The eval_results.json file contains:
{
  "hybrid": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 0
  },
  "vector": {
    "initial": { ... metrics ... },
    "final": { ... metrics ... },
    "judge_improvements": 2
  }
}
The system reports standard HotpotQA metrics:
- EM: Exact Match score
- F1: F1 score for answer matching
- Joint EM/F1: Combined answer and supporting facts accuracy
- Supporting Facts (sp_em, sp_f1): Accuracy of retrieved supporting documents
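For readers unfamiliar with the answer-level metrics, the sketch below shows how EM and token-level F1 are conventionally computed for a single prediction. It follows the SQuAD-style normalization (lowercasing, stripping punctuation and articles) that HotpotQA evaluation is based on. It is an illustration under that assumption, not a copy of the repository's eval.py, which may handle special cases such as yes/no answers differently.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction of "Paris France" against the gold answer "Paris" scores 0 on EM but 2/3 on F1, which is why the two metrics are reported side by side.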
- Vector Database: ChromaDB with persistent storage
- Embeddings: Mistral embeddings with batch processing
- LLM: Mistral Large for question answering and evaluation
- Search: Hybrid search using Reciprocal Rank Fusion (RRF) or vector-only similarity
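As background on the hybrid mode, Reciprocal Rank Fusion merges several ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic, self-contained version of that formula. The function name is hypothetical, and k = 60 is the commonly cited default, not a value confirmed for this project or for its ChromaDB/Agno integration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. The value of k is an assumption here.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative inputs: one ranking from vector search, one from keyword search.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Documents that appear in both lists (here `d1` and `d3`) rise above those found by only one retriever, which is the main reason to prefer fusion over either ranking alone.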
rag/
├── main.py # Main evaluation pipeline
├── eval.py # HotpotQA evaluation script
├── logging_config.py # Structured logging configuration
├── setup.sh # Initialization script
├── prompt/
│ ├── agent_instructions.yaml
│ └── judge_criteria.yaml
├── tmp/ # Generated files and data
│ ├── chromadb/ # Vector database storage
│ └── *.json # Predictions and results
└── pyproject.toml # Project dependencies
The project uses pre-commit hooks for code quality:
- Ruff for linting and formatting
- ShellCheck for shell script validation
- Security scanning with detect-secrets
Install hooks:
pre-commit install
Create a .env file with:
MISTRAL_API_KEY=your_api_key_here
Get your API key from: https://console.mistral.ai/
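To check the key before a long evaluation run, something like the following sketch can help. It uses only the standard library and hypothetical helper names (`parse_env_text`, `get_mistral_api_key`). How the project itself loads .env is not stated here, so treat this as an assumption-laden illustration rather than its actual loading logic.

```python
import os
from pathlib import Path


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_mistral_api_key(env_path: str = ".env") -> str:
    """Return the key from the environment or the .env file, or raise."""
    key = os.environ.get("MISTRAL_API_KEY", "")
    path = Path(env_path)
    if not key and path.exists():
        key = parse_env_text(path.read_text()).get("MISTRAL_API_KEY", "")
    # Reject the placeholder value from the template as well as an empty key.
    if not key or key == "your_api_key_here":
        raise RuntimeError("MISTRAL_API_KEY is not set; add it to .env")
    return key
```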