GitHub - iprincegautam/raweval-eval-core: LLM evaluation infrastructure — IAA workbench, LLM-as-a-judge harness, output drift detection, multi-model cost routing. Powers chat.raweval.com.

LLM eval pipelines fail silently. An output degrades, passes QA, gets used in production. Nobody notices until downstream quality drops.

This library is the infrastructure we built to fix that. It runs in production at chat.raweval.com, where it onboarded 1,500 evaluators in 7 days, backs a 9-annotator IAA workbench, and cut LLM API costs by roughly 45% via multi-model routing.

What's in here

packages/iaa — Inter-annotator agreement workbench: Cohen's κ, Fleiss' κ, Krippendorff's α, and auto-rerouting when agreement falls below threshold.

from packages.iaa import CohenKappa, KrippendorffAlpha, IAARouter, MetricType

router = IAARouter()
result = router.evaluate_batch("task-1", [[0, 1], [0, 0], [0, 1]])
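For readers unfamiliar with the metric, here is a minimal standalone sketch of Cohen's κ for two raters, plus a threshold check of the kind that could trigger a reroute. It is not the implementation inside packages/iaa; the function names `cohen_kappa` and `needs_reroute` and the 0.6 threshold are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


def needs_reroute(kappa, threshold=0.6):
    """Hypothetical rule: reroute when agreement falls below the threshold."""
    return kappa < threshold


kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0])
print(kappa, needs_reroute(kappa))
```

The raters agree on three of four items, but half of that agreement is expected by chance, so κ ends up well below the raw agreement rate.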

packages/judge — LLM-as-a-judge harness with per-workflow scoring and regression flags against a rolling baseline.

from packages.judge import EvalHarness, WorkflowScorer

harness = EvalHarness(workflow="qa-v1")
scorer = WorkflowScorer("qa-v1")
scorer.record(harness.evaluate(output="...", reference="..."))
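The idea of flagging a regression against a rolling baseline can be sketched in a few lines. This is an illustration only, not the logic inside `WorkflowScorer`; the class name `RollingBaseline`, the window size, and the tolerance are all assumptions.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Flags a score that falls below the mean of the last `window` scores."""

    def __init__(self, window=20, tolerance=0.05):
        self.scores = deque(maxlen=window)  # oldest scores drop off automatically
        self.tolerance = tolerance          # assumed slack before a drop counts

    def record(self, score):
        """Store the score and return True if it regresses against the baseline."""
        # Compare against earlier scores only, so a bad score cannot mask itself.
        regressed = bool(self.scores) and score < mean(self.scores) - self.tolerance
        self.scores.append(score)
        return regressed


baseline = RollingBaseline(window=3)
for judge_score in (0.80, 0.82, 0.60):
    print(judge_score, baseline.record(judge_score))
```

Comparing each new score with the window before it is appended keeps the check cheap and avoids a separate calibration pass, at the cost of a baseline that slowly follows a sustained decline.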

packages/drift — Output drift detection and a regression pipeline that ties judge scores to drift signals.

from packages.drift import DriftDetector, RegressionPipeline

detector = DriftDetector("qa-v1")
signal = detector.add_score(0.62)
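One simple way to detect drift in a stream of scores is to compare a recent window against a fixed reference window. The sketch below does that with a z-score on the window means. It is not the `DriftDetector` shipped in packages/drift; the class name `WindowDriftDetector`, the window sizes, and the threshold of 2.0 are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class WindowDriftDetector:
    """Compares the mean of recent scores with a fixed reference window."""

    def __init__(self, reference_size=5, recent_size=3, z_threshold=2.0):
        self.reference = []                      # first scores seen, kept fixed
        self.reference_size = reference_size
        self.recent = deque(maxlen=recent_size)  # sliding window of later scores
        self.z_threshold = z_threshold

    def add_score(self, score):
        """Return True once the recent window has drifted from the reference."""
        if len(self.reference) < self.reference_size:
            self.reference.append(score)  # still collecting the reference window
            return False
        self.recent.append(score)
        if len(self.recent) < self.recent.maxlen:
            return False                  # not enough recent scores to judge
        ref_std = pstdev(self.reference) or 1e-9  # guard against zero spread
        z = abs(mean(self.recent) - mean(self.reference)) / ref_std
        return z > self.z_threshold


detector = WindowDriftDetector()
stream = [0.80, 0.82, 0.78, 0.81, 0.79, 0.62, 0.60, 0.61]
print([detector.add_score(s) for s in stream])
```

A fixed reference window makes the signal easy to interpret, but it has to be reset by hand after an intentional change to the workflow.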

packages/archipelago — Multi-model cost routing across OpenAI, Anthropic, Google, and DeepSeek.

from packages.archipelago import ArchipelagoRouter

router = ArchipelagoRouter(budget_per_1k_tokens=0.002)
model = router.select_model("moderate")
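Budget-aware model selection can be sketched as picking the cheapest model that is capable enough for the task. The code below is not `ArchipelagoRouter`: the catalog entries, prices, and the "simple" and "complex" tiers are hypothetical placeholders, and only the "moderate" label and the `budget_per_1k_tokens` idea come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # hypothetical price in USD, not a real quote
    max_complexity: int        # highest complexity tier the model is trusted with


# Assumed tiers; the README only shows "moderate".
COMPLEXITY = {"simple": 0, "moderate": 1, "complex": 2}

# Placeholder catalog with made-up names and prices.
CATALOG = [
    ModelOption("small-model", 0.0005, 0),
    ModelOption("medium-model", 0.0015, 1),
    ModelOption("large-model", 0.0100, 2),
]


def select_model(complexity, budget_per_1k_tokens, catalog=CATALOG):
    """Return the cheapest capable model, preferring ones within budget."""
    tier = COMPLEXITY[complexity]
    capable = [m for m in catalog if m.max_complexity >= tier]
    affordable = [m for m in capable if m.cost_per_1k_tokens <= budget_per_1k_tokens]
    # If nothing capable fits the budget, fall back to the cheapest capable model.
    pool = affordable or capable
    return min(pool, key=lambda m: m.cost_per_1k_tokens)


print(select_model("moderate", budget_per_1k_tokens=0.002).name)
```

Falling back to the cheapest capable model when the budget is too tight favors quality over cost; a stricter router could raise an error instead.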

The eval pipeline

  Input batch
       |
       v
  Pre-QC heuristics  (format, length, policy checks)
       |
       v
  LLM judge           (packages/judge)
       |
       v
  IAA scoring         (packages/iaa — Cohen / Fleiss / Krippendorff)
       |
       +---- pass ---->  Gold set / training mix
       |
       v
  Reroute             (ADDITIONAL_RATER or EXPERT_QUEUE)
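The first and last stages of the diagram can be sketched as two small functions: a pre-QC gate and a routing decision based on the agreement score. The route names ADDITIONAL_RATER and EXPERT_QUEUE come from the diagram; GOLD_SET, the thresholds, the length limit, and the rule for choosing between the two reroute targets are assumptions.

```python
from enum import Enum


class Route(Enum):
    GOLD_SET = "GOLD_SET"                  # assumed name for the pass branch
    ADDITIONAL_RATER = "ADDITIONAL_RATER"
    EXPERT_QUEUE = "EXPERT_QUEUE"


def pre_qc(output, max_chars=2000):
    """Cheap format and length heuristics applied before any model call."""
    text = output.strip()
    return bool(text) and len(text) <= max_chars


def route(agreement, pass_threshold=0.6, expert_threshold=0.2):
    """Map an agreement score to a destination (thresholds are assumptions)."""
    if agreement >= pass_threshold:
        return Route.GOLD_SET
    if agreement >= expert_threshold:
        # Moderate disagreement: one more rating may settle it.
        return Route.ADDITIONAL_RATER
    # Very low agreement: escalate to an expert.
    return Route.EXPERT_QUEUE


print(pre_qc("The answer is 42."), route(0.8), route(0.4), route(0.1))
```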

Quick start

git clone https://github.com/loxerxxxx/raweval-eval-core.git
cd raweval-eval-core
pip install -r requirements.txt
cp .env.example .env
python examples/iaa_scoring.py

Set API keys in .env only if you call live judge or routing endpoints. The IAA example runs without keys.

Production numbers

1,500 evaluators onboarded in 7 days at launch.
Nine-annotator IAA workbench runs in production on every batch close.
Roughly 45% LLM API cost reduction on routed traffic via Archipelago model selection.
Eval rubrics aligned to Stanford Phantom, Cornell RHyME, and RoadscapesQA (arXiv:2602.12877).
Annotation throughput peaked at around 215K VQA pairs per month on the heaviest pipeline.

Running examples

python examples/iaa_scoring.py
python examples/eval_pipeline.py
python examples/cost_comparison.py

Docs

IAA evaluation methodology — how we pick metrics, set thresholds, and route low-agreement tasks.
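On metric choice: Cohen's κ is defined for exactly two raters, so a panel with more annotators per item typically calls for Fleiss' κ or Krippendorff's α. A minimal sketch of Fleiss' κ follows, assuming every item is labelled by the same number of raters. It is not taken from packages/iaa or the methodology doc, and the function name `fleiss_kappa` is an assumption.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` holds one list of labels per item."""
    if not ratings:
        raise ValueError("need at least one item")
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(item) for item in ratings]
    # Per-item agreement: share of rater pairs that chose the same label.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall share of each label.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # a single label was used everywhere
    return (p_bar - p_e) / (1 - p_e)


print(fleiss_kappa([[0, 0, 0], [1, 1, 1], [0, 0, 1], [0, 1, 1]]))
```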

License

MIT

