iprincegautam/raweval-eval-core

LLM eval pipelines fail silently. An output degrades, passes QA, gets used in production. Nobody notices until downstream quality drops.

This library is the infrastructure we built to fix that. It runs in production at chat.raweval.com, where it supported onboarding 1,500 evaluators in 7 days, powers a 9-annotator IAA workbench, and cut LLM API cost by roughly 45% via multi-model routing.

What's in here

packages/iaa — Inter-annotator agreement workbench: Cohen's κ, Fleiss' κ, Krippendorff's α, and auto-rerouting when agreement falls below threshold.

from packages.iaa import CohenKappa, KrippendorffAlpha, IAARouter, MetricType

router = IAARouter()
result = router.evaluate_batch("task-1", [[0, 1], [0, 0], [0, 1]])
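
For readers unfamiliar with the metrics, here is a minimal standalone sketch of Cohen's κ for two raters, using only the standard library. It is not the implementation in packages/iaa; the function name `cohen_kappa` and the sample labels are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items (illustrative helper)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items where the two labels match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary labels from two annotators on eight items.
rater_a = [1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 0, 1, 0, 1, 1, 1]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))  # 0.467
```

Here the raters agree on 6 of 8 items (0.75), while chance agreement from their marginals is 34/64, so κ = (0.75 - 0.53125) / (1 - 0.53125) ≈ 0.467.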

packages/judge — LLM-as-a-judge harness with per-workflow scoring and regression flags against a rolling baseline.

from packages.judge import EvalHarness, WorkflowScorer

harness = EvalHarness(workflow="qa-v1")
scorer = WorkflowScorer("qa-v1")
scorer.record(harness.evaluate(output="...", reference="..."))
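
The idea of flagging regressions against a rolling baseline can be sketched as follows. This is an assumption-laden illustration, not the logic inside `WorkflowScorer`: the class name `RollingBaseline`, the window size, the tolerance, and the minimum-history rule are all hypothetical.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Flags a score that falls below the mean of recent scores (hypothetical helper)."""

    def __init__(self, window=20, tolerance=0.05, min_history=5):
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically
        self.tolerance = tolerance          # allowed dip below the baseline mean
        self.min_history = min_history      # no flags until this many scores exist

    def record(self, score):
        """Return True if `score` regresses against the baseline, then store it."""
        regressed = (
            len(self.scores) >= self.min_history
            and score < mean(self.scores) - self.tolerance
        )
        self.scores.append(score)
        return regressed


baseline = RollingBaseline(window=10, tolerance=0.05, min_history=5)
flags = [baseline.record(s) for s in [0.80, 0.82, 0.81, 0.79, 0.80, 0.81, 0.62]]
print(flags)  # [False, False, False, False, False, False, True]
```

One design choice to note: this sketch stores the regressed score in the window too, so a sustained drop gradually becomes the new baseline. Excluding flagged scores would keep the baseline fixed at the pre-regression level instead.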

packages/drift — Output drift detection and a regression pipeline that ties judge scores to drift signals.

from packages.drift import DriftDetector, RegressionPipeline

detector = DriftDetector("qa-v1")
signal = detector.add_score(0.62)
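
The README does not say which statistic `DriftDetector` uses. As one possible approach, a one-sided CUSUM accumulates shortfalls below a target score and raises a signal once the running sum crosses a threshold. The class name `CusumDrift` and the `target`, `slack`, and `threshold` values below are hypothetical.

```python
class CusumDrift:
    """One-sided CUSUM for downward drift in a score stream (illustrative only)."""

    def __init__(self, target, slack=0.02, threshold=0.12):
        self.target = target        # expected score level when nothing has drifted
        self.slack = slack          # shortfalls smaller than this are ignored
        self.threshold = threshold  # cumulative shortfall that triggers a signal
        self.cum = 0.0

    def update(self, score):
        """Add one score; return True while the cumulative shortfall exceeds the threshold."""
        shortfall = self.target - score - self.slack
        # Clamp at zero so scores at or above target reset the accumulator.
        self.cum = max(0.0, self.cum + shortfall)
        return self.cum > self.threshold


detector = CusumDrift(target=0.80)
signals = [detector.update(s) for s in [0.81, 0.79, 0.80, 0.74, 0.73, 0.72, 0.70]]
print(signals)  # [False, False, False, False, False, True, True]
```

Unlike a single-score threshold, this fires on a run of modest drops (0.74, 0.73, 0.72) that would each look acceptable in isolation.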

packages/archipelago — Multi-model cost routing across OpenAI, Anthropic, Google, and DeepSeek.

from packages.archipelago import ArchipelagoRouter

router = ArchipelagoRouter(budget_per_1k_tokens=0.002)
model = router.select_model("moderate")
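
Budget-constrained routing can be sketched as "pick the cheapest model that is capable enough and within budget". The catalog below is entirely hypothetical: the model names, tiers, and prices are placeholders, not real provider pricing, and this is not how `ArchipelagoRouter` is necessarily implemented.

```python
from dataclasses import dataclass

# Hypothetical complexity tiers, ordered from least to most demanding.
TIERS = {"simple": 0, "moderate": 1, "complex": 2}


@dataclass(frozen=True)
class ModelOption:
    name: str
    tier: str                  # highest complexity tier this model can handle
    cost_per_1k_tokens: float  # placeholder price, not a real quote


# Placeholder catalog for illustration only.
CATALOG = [
    ModelOption("small-model", "simple", 0.0005),
    ModelOption("mid-model", "moderate", 0.0015),
    ModelOption("large-model", "complex", 0.0100),
]


def select_model(complexity, budget_per_1k_tokens, catalog=CATALOG):
    """Return the cheapest model that meets the tier and fits the budget."""
    needed = TIERS[complexity]
    candidates = [
        m for m in catalog
        if TIERS[m.tier] >= needed and m.cost_per_1k_tokens <= budget_per_1k_tokens
    ]
    if not candidates:
        raise LookupError(f"no model handles {complexity!r} within budget")
    return min(candidates, key=lambda m: m.cost_per_1k_tokens)


print(select_model("moderate", 0.002).name)  # mid-model
```

Raising when nothing fits keeps budget violations explicit. A production router might instead fall back to the cheapest capable model and log the overrun.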

The eval pipeline

  Input batch
       |
       v
  Pre-QC heuristics  (format, length, policy checks)
       |
       v
  LLM judge           (packages/judge)
       |
       v
  IAA scoring         (packages/iaa — Cohen / Fleiss / Krippendorff)
       |
       +---- pass ---->  Gold set / training mix
       |
       v
  Reroute             (ADDITIONAL_RATER or EXPERT_QUEUE)
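
The final branch of the diagram can be sketched as a small routing function. The route names ADDITIONAL_RATER and EXPERT_QUEUE come from the diagram above; the GOLD_SET label, the two thresholds, and the rule for choosing between the reroute targets are assumptions for illustration.

```python
from enum import Enum


class Route(Enum):
    GOLD_SET = "GOLD_SET"                  # hypothetical label for the pass branch
    ADDITIONAL_RATER = "ADDITIONAL_RATER"
    EXPERT_QUEUE = "EXPERT_QUEUE"


def route_batch(agreement, pass_threshold=0.7, expert_threshold=0.4):
    """Map an agreement score to a destination (thresholds are assumed values)."""
    if agreement >= pass_threshold:
        return Route.GOLD_SET
    # Assumed rule: borderline agreement gets one more rater,
    # very low agreement is escalated to an expert.
    if agreement >= expert_threshold:
        return Route.ADDITIONAL_RATER
    return Route.EXPERT_QUEUE


for score in (0.82, 0.55, 0.10):
    print(score, route_batch(score).value)
```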

Quick start

git clone https://github.com/loxerxxxx/raweval-eval-core.git
cd raweval-eval-core
pip install -r requirements.txt
cp .env.example .env
python examples/iaa_scoring.py

Set API keys in .env only if you call live judge or routing endpoints. The IAA example runs without keys.

Production numbers

At launch, 1,500 evaluators were onboarded in 7 days. A nine-annotator IAA workbench runs in production on every batch close. Archipelago model selection cut LLM API cost on routed traffic by roughly 45%. Eval rubrics are aligned to Stanford Phantom, Cornell RHyME, and RoadscapesQA (arXiv:2602.12877). Annotation throughput peaked at around 215K VQA pairs per month on the heaviest pipeline.

Running examples

python examples/iaa_scoring.py
python examples/eval_pipeline.py
python examples/cost_comparison.py

Docs

IAA evaluation methodology — how we pick metrics, set thresholds, and route low-agreement tasks.

License

MIT
