LLM eval pipelines fail silently. An output degrades, passes QA, gets used in production. Nobody notices until downstream quality drops.
This library is the infrastructure we built to fix that. It runs in production at chat.raweval.com — 1,500 evaluators onboarded in 7 days, 9-annotator IAA workbench, 45% LLM API cost reduction via multi-model routing.
packages/iaa — Inter-annotator agreement workbench: Cohen's κ, Fleiss' κ, Krippendorff's α, and auto-rerouting when agreement falls below threshold.
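For readers new to the metric, here is a minimal plain-Python sketch of Cohen's κ for two raters. It is not the library's `CohenKappa` implementation; the function name `cohen_kappa` and the single-label edge-case handling are assumptions made for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items (illustrative)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)

    # Observed agreement: share of items where both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, computed from each rater's marginal label counts.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)

    if p_e == 1.0:
        # Assumption: both raters used one identical label everywhere,
        # so treat the degenerate case as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Two raters, four items, one disagreement.
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Raw agreement in this example is 75%, but half of that is expected by chance, which is why κ is the quantity compared against a threshold.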
from packages.iaa import CohenKappa, KrippendorffAlpha, IAARouter, MetricType
router = IAARouter()
result = router.evaluate_batch("task-1", [[0, 1], [0, 0], [0, 1]])

packages/judge — LLM-as-a-judge harness with per-workflow scoring and regression flags against a rolling baseline.
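One way a regression flag against a rolling baseline could work is sketched below. This is an assumption about the general technique, not the behaviour of `WorkflowScorer`; the class name `RollingBaseline`, the window size, and the tolerance are all hypothetical.

```python
from collections import deque
from statistics import mean


class RollingBaseline:
    """Flags a score that falls below the mean of recent scores (illustrative)."""

    def __init__(self, window=20, tolerance=0.05):
        # Hypothetical defaults: keep the last `window` scores and allow
        # a drop of `tolerance` before flagging a regression.
        self.scores = deque(maxlen=window)
        self.tolerance = tolerance

    def record(self, score):
        """Return True if `score` regresses against the baseline, then store it."""
        regressed = bool(self.scores) and score < mean(self.scores) - self.tolerance
        self.scores.append(score)
        return regressed


baseline = RollingBaseline(window=3, tolerance=0.05)
flags = [baseline.record(s) for s in (0.80, 0.82, 0.60)]
```

The first score can never be flagged because there is no baseline yet; the third is flagged because it sits well below the mean of the two before it.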
from packages.judge import EvalHarness, WorkflowScorer
harness = EvalHarness(workflow="qa-v1")
scorer = WorkflowScorer("qa-v1")
scorer.record(harness.evaluate(output="...", reference="..."))

packages/drift — Output drift detection and a regression pipeline that ties judge scores to drift signals.
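The README does not say which drift test `DriftDetector` uses. As a hedged illustration of the idea, the sketch below applies a one-sided CUSUM to a stream of judge scores, so that a sustained drop accumulates into an alarm while isolated noise does not. The class name `DownwardCusum` and every parameter value are assumptions.

```python
class DownwardCusum:
    """One-sided CUSUM that alarms on a sustained drop in scores (illustrative)."""

    def __init__(self, target, slack=0.02, threshold=0.15):
        self.target = target        # expected score level (assumed known)
        self.slack = slack          # shortfall tolerated per observation
        self.threshold = threshold  # accumulated shortfall that triggers an alarm
        self.cumulative = 0.0

    def add_score(self, score):
        """Accumulate the shortfall below target; return True once it exceeds the threshold."""
        shortfall = self.target - score - self.slack
        # The running sum never goes below zero, so good scores reset the evidence.
        self.cumulative = max(0.0, self.cumulative + shortfall)
        return self.cumulative > self.threshold


detector = DownwardCusum(target=0.80)
signals = [detector.add_score(s) for s in (0.80, 0.79, 0.81, 0.70, 0.70)]
```

Here the first 0.70 only raises the running sum; the second pushes it past the threshold, which is the point at which a drift signal would be emitted.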
from packages.drift import DriftDetector, RegressionPipeline
detector = DriftDetector("qa-v1")
signal = detector.add_score(0.62)

packages/archipelago — Multi-model cost routing across OpenAI, Anthropic, Google, and DeepSeek.
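Cost routing of this kind usually reduces to picking the cheapest model that is capable enough for the task. The sketch below shows that selection rule under stated assumptions: the model names, prices, and the `simple` and `complex` tier labels are placeholders, not real provider data and not Archipelago's catalog (only `moderate` appears in this README).

```python
from dataclasses import dataclass

# Hypothetical complexity tiers; only "moderate" is taken from the README.
TIERS = {"simple": 1, "moderate": 2, "complex": 3}


@dataclass(frozen=True)
class ModelOption:
    name: str
    cost_per_1k_tokens: float  # placeholder price, not a real quote
    max_tier: int              # highest complexity tier the model handles


CATALOG = [
    ModelOption("small-model", 0.0005, 1),
    ModelOption("medium-model", 0.0015, 2),
    ModelOption("large-model", 0.0100, 3),
]


def select_model(complexity, budget_per_1k_tokens, catalog=CATALOG):
    """Pick the cheapest capable model, preferring ones within budget."""
    required = TIERS[complexity]
    capable = [m for m in catalog if m.max_tier >= required]
    affordable = [m for m in capable if m.cost_per_1k_tokens <= budget_per_1k_tokens]
    # Assumption: if nothing capable fits the budget, quality wins and the
    # cheapest capable model is used anyway.
    pool = affordable or capable
    return min(pool, key=lambda m: m.cost_per_1k_tokens)


choice = select_model("moderate", budget_per_1k_tokens=0.002)
```

The fallback branch is a design choice: a stricter router could raise an error instead of exceeding the budget.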
from packages.archipelago import ArchipelagoRouter
router = ArchipelagoRouter(budget_per_1k_tokens=0.002)
model = router.select_model("moderate")

Input batch
|
v
Pre-QC heuristics (format, length, policy checks)
|
v
LLM judge (packages/judge)
|
v
IAA scoring (packages/iaa — Cohen / Fleiss / Krippendorff)
|
+---- pass ----> Gold set / training mix
|
v
Reroute (ADDITIONAL_RATER or EXPERT_QUEUE)
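The last two steps of the diagram amount to a three-way decision on the agreement score. A minimal sketch follows, assuming two hypothetical thresholds; the README names the `ADDITIONAL_RATER` and `EXPERT_QUEUE` outcomes but does not state how the library chooses between them, so that rule is a guess.

```python
from enum import Enum


class Route(Enum):
    GOLD_SET = "gold_set"
    ADDITIONAL_RATER = "additional_rater"
    EXPERT_QUEUE = "expert_queue"


def route_task(agreement, pass_threshold=0.6, expert_threshold=0.2):
    """Map an agreement score to a pipeline route (thresholds are assumptions)."""
    if agreement >= pass_threshold:
        # Agreement is high enough: the batch goes to the gold set / training mix.
        return Route.GOLD_SET
    if agreement >= expert_threshold:
        # Moderate disagreement: one more rating may settle it.
        return Route.ADDITIONAL_RATER
    # Very low agreement: assume the task needs expert review.
    return Route.EXPERT_QUEUE


routes = [route_task(score) for score in (0.85, 0.45, 0.05)]
```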
git clone https://github.com/loxerxxxx/raweval-eval-core.git
cd raweval-eval-core
pip install -r requirements.txt
cp .env.example .env
python examples/iaa_scoring.py

Set API keys in .env only if you call live judge or routing endpoints. The IAA example runs without keys.
1,500 evaluators onboarded in 7 days at launch. Nine-annotator IAA workbench in production on every batch close. Roughly 45% LLM API cost reduction on routed traffic via Archipelago model selection. Eval rubrics aligned to Stanford Phantom, Cornell RHyME, and RoadscapesQA (arXiv:2602.12877). Annotation throughput peaked around 215K VQA pairs per month on the heaviest pipeline.
python examples/iaa_scoring.py
python examples/eval_pipeline.py
python examples/cost_comparison.py

IAA evaluation methodology — how we pick metrics, set thresholds, and route low-agreement tasks.
MIT