
Yassa777/EvalKit


EvalKit

An open-source evaluation and optimization system for LLM-powered features, with a focus on retrieval-augmented generation (RAG).

Features

  • Track and test metrics for LLM-powered features
  • Support for multiple vector stores (FAISS and pgvector)
  • Evaluation of interactions against criteria such as relevance, coherence, and completeness
  • Real-time monitoring and metrics collection (Prometheus and Grafana)
  • Streamlit dashboard
  • Docker support for easy deployment

Prerequisites

  • Python 3.9+
  • Docker and Docker Compose
  • OpenAI API key
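The OpenAI key and the vector store choice are presumably supplied through the .env file created during Quick Start. Below is a minimal pre-flight check you could run before starting the services. It assumes the key is stored as `OPENAI_API_KEY` (a guess; check `.env.example` for the real variable name) and that `VECTOR_STORE_TYPE` accepts the two values listed under Vector Store Options. The helper names are hypothetical and not part of EvalKit.

```python
from pathlib import Path
from typing import Dict, List

VALID_VECTOR_STORES = {"faiss", "pgvector"}  # the two backends listed in this README


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip("\"'")  # drop surrounding quotes
    return env


def check_env(env: Dict[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means the settings look usable."""
    problems: List[str] = []
    # ASSUMPTION: hypothetical variable name for the OpenAI key.
    if not env.get("OPENAI_API_KEY"):
        problems.append("OPENAI_API_KEY is missing or empty")
    store = env.get("VECTOR_STORE_TYPE")
    if store is not None and store not in VALID_VECTOR_STORES:
        problems.append(f"VECTOR_STORE_TYPE must be 'faiss' or 'pgvector', got {store!r}")
    return problems


def main(path: str = ".env") -> int:
    """Print problems found in the .env file; return a shell-style exit code."""
    problems = check_env(parse_env(Path(path).read_text()))
    for problem in problems:
        print(f"error: {problem}")
    return 1 if problems else 0
```

Calling `main()` from the repository root returns 0 when the file passes both checks. It only validates format, not whether the key is accepted by OpenAI.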

Quick Start

  1. Clone the repository:
```bash
git clone https://github.com/Yassa777/EvalKit.git
cd EvalKit
```
  2. Create a .env file:
```bash
cp .env.example .env
# Edit .env with your settings
```
  3. Start the services:
```bash
docker-compose up -d
```
  4. Access the services:
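Because `docker-compose up -d` returns before the containers have finished starting, it can help to wait until the API accepts connections before sending requests. This sketch assumes the API listens on localhost:8000, the address used in the API Usage examples; adjust host and port for other services. `wait_for_port` is a hypothetical helper, and it only checks that a TCP listener is present, not that any particular HTTP endpoint is healthy.

```python
import socket
import time


def wait_for_port(host: str = "localhost", port: int = 8000,
                  timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once (host, port) accepts a TCP connection, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Opening and immediately closing a connection is enough to prove
            # something is listening; no HTTP request is sent.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `if not wait_for_port(): raise SystemExit("API did not start on port 8000")` at the top of a script that runs the API examples.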

Vector Store Options

EvalKit supports two vector store implementations:

FAISS

  • In-memory vector store
  • Fast similarity search
  • Good for development and testing
  • Configure with VECTOR_STORE_TYPE=faiss
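An in-memory store keeps every vector in RAM and scores each query against them, which is also why its contents disappear when the process exits. The pure-Python sketch below shows the simplest exact (brute-force) form of that search using cosine similarity. It is an illustration, not EvalKit's code: `FlatVectorStore` is a hypothetical class. FAISS performs the same exact search much faster with `faiss.IndexFlatIP` over vectors normalized by `faiss.normalize_L2`, and it also offers approximate indexes for larger collections.

```python
import math
from typing import List, Sequence, Tuple


def _normalize(vector: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


class FlatVectorStore:
    """Exact in-memory search: every query is compared with every stored vector."""

    def __init__(self) -> None:
        self._ids: List[str] = []
        self._vectors: List[List[float]] = []

    def _check_dim(self, vector: Sequence[float]) -> None:
        if self._vectors and len(vector) != len(self._vectors[0]):
            raise ValueError("dimension mismatch")

    def add(self, doc_id: str, vector: Sequence[float]) -> None:
        self._check_dim(vector)
        # Storing unit vectors turns the dot product into cosine similarity.
        self._vectors.append(_normalize(vector))
        self._ids.append(doc_id)

    def search(self, query: Sequence[float], k: int = 3) -> List[Tuple[str, float]]:
        self._check_dim(query)
        q = _normalize(query)
        scored = [(doc_id, sum(a * b for a, b in zip(q, v)))
                  for doc_id, v in zip(self._ids, self._vectors)]
        scored.sort(key=lambda pair: pair[1], reverse=True)  # most similar first
        return scored[:k]
```

Search cost grows linearly with the number of stored vectors, which is fine for development-sized collections.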

pgvector

  • PostgreSQL-based vector store
  • Persistent storage
  • Production-ready
  • Configure with VECTOR_STORE_TYPE=pgvector
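With pgvector, embeddings live in a PostgreSQL column of type `vector`, and nearest-neighbour search is an ordinary SQL query ordered by a distance operator (`<=>` is cosine distance, `<->` is Euclidean distance). The sketch below builds such a query for a Python driver such as psycopg. The `documents` table, its columns, the 3-dimensional vectors, and the helper names are all hypothetical; EvalKit's actual schema may differ.

```python
import re
from typing import Sequence, Tuple

# Hypothetical schema, for illustration only.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    content text NOT NULL,
    embedding vector(3) NOT NULL
);
"""

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Render an embedding in pgvector's text input format, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def nearest_neighbors_query(table: str, embedding: Sequence[float],
                            k: int = 5) -> Tuple[str, Tuple[str, int]]:
    """Build a k-NN query ordered by cosine distance, plus its bound parameters."""
    # Table names cannot be bound as parameters, so validate them instead.
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    if k < 1:
        raise ValueError("k must be at least 1")
    sql = f"SELECT id, content FROM {table} ORDER BY embedding <=> %s::vector LIMIT %s"
    return sql, (to_vector_literal(embedding), k)
```

With an open psycopg cursor, `cur.execute(sql, params)` followed by `cur.fetchall()` would return the `k` closest rows. Because the data lives in PostgreSQL, it survives restarts, unlike the in-memory FAISS option.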

Development

  1. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # or `venv\Scripts\activate` on Windows
```
  2. Install dependencies:
```bash
pip install -e .
```
  3. Initialize the database:
```bash
alembic upgrade head
```
  4. Start the development server:
```bash
uvicorn evalkit.api.main:app --reload
```

API Usage

Record an Interaction

```python
import requests

response = requests.post(
    "http://localhost:8000/interactions",
    json={
        "query": "What is the capital of France?",
        "response": "The capital of France is Paris.",
        "metadata": {
            "model": "gpt-4",
            "temperature": 0.7
        }
    },
    timeout=10,
)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
```
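The single request above extends naturally to a batch, for example when replaying a small test set of queries. This is a sketch under stated assumptions: `build_interaction`, `record_interactions`, and the 10-second timeout are hypothetical and not part of EvalKit, and the payload shape is copied from the example. The `post` parameter lets you inject a stub instead of `requests.post`.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple

INTERACTIONS_URL = "http://localhost:8000/interactions"  # endpoint from the example above


def build_interaction(query: str, response: str, metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Assemble a JSON body with the same shape as the single-interaction example."""
    return {"query": query, "response": response, "metadata": dict(metadata)}


def record_interactions(
    pairs: Iterable[Tuple[str, str]],
    metadata: Dict[str, Any],
    post: Optional[Callable[..., Any]] = None,
    url: str = INTERACTIONS_URL,
) -> List[Any]:
    """POST one interaction per (query, response) pair; stop at the first HTTP error."""
    if post is None:
        import requests  # imported lazily so the helper can be exercised with a stub
        post = requests.post
    results = []
    for query, response in pairs:
        resp = post(url, json=build_interaction(query, response, metadata), timeout=10)
        resp.raise_for_status()
        results.append(resp)
    return results
```

For example: `record_interactions([("What is 2 + 2?", "4")], {"model": "gpt-4", "temperature": 0.7})`.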

Run an Evaluation

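```python
import requests

response = requests.post(
    "http://localhost:8000/evaluations",
    json={
        "interaction_id": 1,
        "criteria": {
            "relevance": 0.9,
            "coherence": 0.8,
            "completeness": 0.7
        }
    },
    timeout=10,
)
response.raise_for_status()  # raise an exception on 4xx/5xx responses
```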
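The evaluation example hard-codes `interaction_id` as 1. In practice you would evaluate the interaction you just recorded. The sketch below chains the two calls, assuming (not confirmed by this README) that the `/interactions` response body contains the new record's id under an `"id"` key. `record_and_evaluate` is a hypothetical helper; a stub session can be passed in place of `requests.Session`.

```python
from typing import Any, Dict, Optional

BASE_URL = "http://localhost:8000"  # address used in the examples above


def record_and_evaluate(query: str, response: str, criteria: Dict[str, float],
                        metadata: Optional[Dict[str, Any]] = None,
                        session: Any = None, base_url: str = BASE_URL) -> Any:
    """Record one interaction, then request an evaluation of it; return the evaluation JSON."""
    owns_session = session is None
    if owns_session:
        import requests  # lazy import so a stub session can be injected for testing
        session = requests.Session()
    try:
        created = session.post(
            f"{base_url}/interactions",
            json={"query": query, "response": response, "metadata": metadata or {}},
            timeout=10,
        )
        created.raise_for_status()
        # ASSUMPTION: the API returns the stored interaction with its id under "id".
        interaction_id = created.json()["id"]
        evaluation = session.post(
            f"{base_url}/evaluations",
            json={"interaction_id": interaction_id, "criteria": criteria},
            timeout=10,
        )
        evaluation.raise_for_status()
        return evaluation.json()
    finally:
        if owns_session:
            session.close()  # only close sessions this function created
```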

Monitoring

EvalKit integrates with Prometheus and Grafana for monitoring:

  • Track interaction counts
  • Monitor response times
  • Analyze evaluation scores
  • Set up alerts for performance issues
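Response-time panels and alerts built on Prometheus histograms usually rely on PromQL's `histogram_quantile()`, for example `histogram_quantile(0.95, sum by (le) (rate(<metric>_bucket[5m])))`, where `<metric>` stands for whatever latency histogram is exported (its real name is not given here). To make that number less opaque, here is a simplified pure-Python version of the interpolation Prometheus performs over cumulative bucket counts. It is a sketch, not Prometheus's source: it assumes non-negative bucket bounds, as latencies are, and the bucket counts in the example are made-up illustration values.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket, mirroring
    Prometheus `_bucket{le="..."}` series.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations yet
    rank = q * total  # number of observations at or below the quantile
    lower_bound, lower_count = 0.0, 0.0  # the lowest bucket is assumed to start at 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lands in the +Inf bucket: return the highest finite bound.
                return lower_bound
            if count == lower_count:
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound  # not reached: the +Inf bucket always holds `total`
```

With 100 requests split as 50 under 0.1 s, 90 under 0.5 s, and all 100 under 1.0 s, the 95th percentile is interpolated halfway through the 0.5 to 1.0 s bucket, giving 0.75 s. This also shows why coarse buckets make percentile alerts imprecise.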

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

MIT License. See the LICENSE file for details.

About

Unit testing, observability, and tuning engine for any team deploying LLM features.
