Measure and compare LLM, RAG, and agent workloads across cost, latency, quality, and reliability.
pangolin-eval is an open-source toolkit for AI product teams, startup CTOs, and senior engineers who need to understand which model is good enough before inference cost becomes painful. It helps teams run structured prompt or workflow evaluations, estimate cost, track latency, and generate decision-ready reports.
AI products are no longer judged only by answer quality. In production, the useful question is:
Which model gives enough quality for this task at an acceptable cost, latency, and reliability?
This project starts with local, file-based workflows:
- compare models on the same prompts
- estimate token and provider cost
- measure latency
- score weighted quality checks
- generate Markdown, JSON, and optional static HTML reports
- apply budget, quality, latency, and reliability gates
- summarize attribution by model, prompt, feature, workflow, environment, and prompt version
- evaluate synthetic RAG tasks for context efficiency and answer coverage
- generate agent/workflow TraceCards from local trace events
- produce auditable recommendations and OTel-style exports
Run the bundled mock comparison. It does not require API keys.
cd pangolin-eval
PYTHONPATH=src python -m pangolin_eval.cli run \
--config examples/simple_model_compare/config.json \
--out reports/simple_model_compare \
--html

For sensitive runs, omit response text from saved artifacts while keeping metrics and quality scores:
PYTHONPATH=src python -m pangolin_eval.cli run \
--config examples/simple_model_compare/config.json \
--out reports/simple_model_compare_private \
--content-mode metadata-only

Open the generated report:
cat reports/simple_model_compare/report.md

Validate a config without running providers:
PYTHONPATH=src python -m pangolin_eval.cli validate \
--config examples/simple_model_compare/config.json

Run the synthetic RAG evaluation:
PYTHONPATH=src python -m pangolin_eval.cli rag \
--config examples/rag_eval/config.json \
--out reports/rag_eval \
--content-mode metadata-only

Generate agent/workflow TraceCards from local trace events:
PYTHONPATH=src python -m pangolin_eval.cli trace \
--input examples/agent_trace/trace_events.json \
--out reports/agent_trace

Export supported artifacts as OTel-style spans:
PYTHONPATH=src python -m pangolin_eval.cli export-otel \
--input reports/simple_model_compare/report.json \
--out reports/simple_model_compare/otel.json

Install locally for CLI usage:
python -m pip install -e .
pangolin-eval run \
--config examples/simple_model_compare/config.json \
--out reports/simple_model_compare

The report ranks models by a simple efficiency score that combines quality, cost, and latency.
| Model | Runs | Success rate | Avg quality | Avg latency ms | Estimated cost USD | Recommendation |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
| fast-cheap | 2 | 1.00 | 1.00 | 180 | 0.00006015 | Best quality candidate |
| balanced | 2 | 1.00 | 1.00 | 320 | 0.00022060 | Best quality candidate |
| strong-expensive | 2 | 1.00 | 1.00 | 540 | 0.00127250 | Usable with latency review |
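
The exact efficiency formula is not spelled out in this README. As a rough illustration only, the sketch below assumes a hypothetical score that scales quality by how close a model is to the cheapest and fastest candidate. The function name, the weights, and the dictionary keys are all assumptions made for this example, not the project's implementation.

```python
# Hypothetical efficiency score: quality scaled by relative cost and latency.
# This is an illustrative assumption, not pangolin-eval's actual formula.

def efficiency_scores(rows, cost_weight=0.5, latency_weight=0.5):
    """Return {model: score}; higher is better, 1.0 is the best possible."""
    min_cost = min(r["cost_usd"] for r in rows)
    min_latency = min(r["latency_ms"] for r in rows)
    scores = {}
    for r in rows:
        # Ratios are 1.0 for the cheapest/fastest model and shrink toward 0.
        cost_ratio = min_cost / r["cost_usd"] if r["cost_usd"] > 0 else 1.0
        latency_ratio = min_latency / r["latency_ms"] if r["latency_ms"] > 0 else 1.0
        blend = cost_weight * cost_ratio + latency_weight * latency_ratio
        scores[r["model"]] = r["quality"] * blend
    return scores


# Values copied from the sample table above.
rows = [
    {"model": "fast-cheap", "quality": 1.0, "latency_ms": 180, "cost_usd": 0.00006015},
    {"model": "balanced", "quality": 1.0, "latency_ms": 320, "cost_usd": 0.00022060},
    {"model": "strong-expensive", "quality": 1.0, "latency_ms": 540, "cost_usd": 0.00127250},
]
scores = efficiency_scores(rows)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With equal quality across all three models, any score of this shape ranks them purely by cost and latency, which is why the cheapest, fastest model comes first in this sketch.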
The current local open-source version includes:
- Python 3.9+ CLI and library
- config-driven model and prompt comparison
- mock provider for reproducible public demos
- OpenAI-compatible provider adapter
- latency, token, and cost tracking
- keyword, contains, regex, and exact-match quality evaluators
- configurable token counters for estimated usage fallback
- Markdown, JSON, and optional static HTML reporting
- versioned report schema
- metadata-only report mode for privacy-conscious runs
- budget, quality, latency, and reliability gates
- failure-tolerant runs with success/error status fields
- attribution and pricing provenance summaries
- synthetic RAG evaluation CLI and report with context diagnostics
- local agent/workflow TraceCards with loop and waste diagnostics
- auditable recommendations-lite
- OTel-style export for reports and TraceCards
- OpenAI-compatible, LiteLLM, Ollama, and vLLM gateway examples
- Docker Compose no-key demos
Latest release: v0.2.3.
The planned local open-source scope is implemented for the current release track: weighted evaluator plugins, configurable token counters, additional gateway examples, RAG and agent diagnostics, and optional static HTML reports are available. Future work can focus on polish, more adapters, and hosted or team workflows without weakening the local CLI/library foundation.
JSON reports declare a schema version and content mode:
- `schema_version`: currently `pangolin-eval.report.v4`
- `content_mode`: `full` or `metadata_only`
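
A consumer of the JSON report may want to check these two fields before reading anything else. The sketch below is a minimal example that assumes both fields sit at the top level of the report object; the helper names are hypothetical, and docs/REPORT_SCHEMA.md remains the authority on the actual layout.

```python
import json
from pathlib import Path

# Values taken from this README; their top-level placement is an assumption.
SUPPORTED_SCHEMA = "pangolin-eval.report.v4"
CONTENT_MODES = {"full", "metadata_only"}


def check_report_header(report):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    schema = report.get("schema_version")
    if schema != SUPPORTED_SCHEMA:
        problems.append(f"unexpected schema_version: {schema!r}")
    mode = report.get("content_mode")
    if mode not in CONTENT_MODES:
        problems.append(f"unexpected content_mode: {mode!r}")
    return problems


def load_report(path):
    """Read a report file such as reports/simple_model_compare/report.json."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

A caller could run `check_report_header(load_report("reports/simple_model_compare/report.json"))` and refuse to process the file if the list is not empty.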
See docs/REPORT_SCHEMA.md, docs/EXTENSIONS.md, docs/COMPATIBILITY.md, docs/MIGRATIONS.md, docs/RELEASE.md, and the files under schemas/.
Use SUPPORT.md for the best path to ask questions, report bugs, or suggest integrations. See docs/LAUNCH.md for launch notes, suggested demo flows, and copy that explains the project clearly.
Run the no-key Docker Compose demos:

docker compose run --rm demo
docker compose run --rm rag-demo

This repo is the public CLI/library foundation. Commercial extensions may build on top of this engine for UI, run history, team workflows, recommendations, alerts, and executive reports.
The open-source project should remain useful on its own: local reports, transparent metrics, reproducible examples, and CI-friendly gates stay public.
See examples/simple_model_compare/config.json.
Top-level config can include:
- `run_name`: report title
- `description`: report description
- `pricing_catalog`: optional relative path to a pricing catalog
- `gates`: optional cost, quality, latency, and reliability thresholds
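
The `gates` thresholds can be read as pass/fail checks over a run's aggregate metrics. The sketch below shows one way such a check might work. The gate names match the gate example in this README, while the summary field names (`total_cost_usd`, `avg_latency_ms`, `avg_quality`, `success_rate`) and the function itself are assumptions made for illustration.

```python
import operator

# Gate name -> (summary field, comparison that must hold to pass).
# The summary field names are hypothetical.
GATE_RULES = {
    "max_total_cost_usd": ("total_cost_usd", operator.le),
    "max_avg_latency_ms": ("avg_latency_ms", operator.le),
    "min_avg_quality": ("avg_quality", operator.ge),
    "min_success_rate": ("success_rate", operator.ge),
}


def failed_gates(summary, gates):
    """Return a description of every gate the summary does not satisfy."""
    failures = []
    for name, threshold in gates.items():
        if name not in GATE_RULES:
            raise ValueError(f"unknown gate: {name}")
        field, passes = GATE_RULES[name]
        if not passes(summary[field], threshold):
            failures.append(f"{name}: got {summary[field]}, threshold {threshold}")
    return failures


gates = {"max_total_cost_usd": 0.01, "max_avg_latency_ms": 500,
         "min_avg_quality": 0.75, "min_success_rate": 1.0}
summary = {"total_cost_usd": 0.0013, "avg_latency_ms": 540,
           "avg_quality": 1.0, "success_rate": 1.0}
failures = failed_gates(summary, gates)
print(failures)
```

In a CI job, a non-empty failure list would typically translate into a non-zero exit code so the pipeline stops.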
Model entries include:
- `id`: display name used in reports
- `provider`: `mock` or `openai_compatible`
- `api_model`: provider model id, if different from `id`
- `input_price_per_1m`: input token cost estimate
- `output_price_per_1m`: output token cost estimate
- `model_group`: optional grouping for attribution
- `max_retries`: retry attempts after provider failures
- `token_counter`: estimated usage fallback, one of `char_4`, `whitespace`, or `openai_chat`
- `pricing_source`, `pricing_source_url`, `pricing_updated_at`: pricing provenance
- capability metadata such as `context_window_tokens`, `supports_tools`, `supports_json_mode`, and `latency_band`
- `allow_unsafe_api_key_env`: opt-in escape hatch for custom API key environment variable names or non-default provider hosts; by default, OpenAI-compatible configs only allow known safe API key/host pairings or the `PANGOLIN_EVAL_` prefix, and `base_url` must use HTTPS unless it points to loopback
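
The two price fields combine with token counts in the usual per-million way. The sketch below illustrates that arithmetic along with a fallback token estimate. The interpretation of `char_4` as roughly one token per four characters and of `whitespace` as a word count is inferred from the names and is an assumption, as are the function names.

```python
import math


def estimate_tokens(text, counter="char_4"):
    """Rough token estimate for when the provider reports no usage.

    Assumed semantics: char_4 is about one token per 4 characters,
    whitespace is one token per whitespace-separated word.
    """
    if counter == "char_4":
        return math.ceil(len(text) / 4)
    if counter == "whitespace":
        return len(text.split())
    raise ValueError(f"counter not covered by this sketch: {counter}")


def estimate_cost_usd(input_tokens, output_tokens,
                      input_price_per_1m, output_price_per_1m):
    """Cost in the pricing currency, with prices quoted per 1M tokens."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


# 1,000 input and 500 output tokens at 0.15 / 0.6 per 1M tokens
# comes to roughly 0.00045.
cost = estimate_cost_usd(1000, 500, 0.15, 0.6)
print(f"{cost:.8f}")
```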
Prompt entries include:
- `id`: prompt case identifier
- `messages`: chat-style messages
- `expected_keywords`: optional simple quality check
- `evaluators`: optional weighted checks using `keyword`, `contains`, `regex`, or `exact`
- attribution fields such as `feature`, `workflow`, `environment`, `prompt_version`, and `customer_user_hash`
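
One plausible way to turn weighted checks into a single quality score is a weighted pass fraction. The sketch below assumes that approach. The default weight of 1, the per-type matching rules (in particular treating `keyword` as a case-insensitive substring match), and the function name are assumptions for illustration and may differ from the project's evaluators.

```python
import re


def score_response(response, evaluators):
    """Return passed weight / total weight, in the range 0.0 to 1.0."""
    total = 0.0
    passed = 0.0
    for ev in evaluators:
        kind, value = ev["type"], ev["value"]
        weight = float(ev.get("weight", 1))  # assumed default weight
        if kind == "contains":
            ok = value in response
        elif kind == "keyword":
            ok = value.lower() in response.lower()  # assumed semantics
        elif kind == "regex":
            ok = re.search(value, response) is not None
        elif kind == "exact":
            ok = response.strip() == value
        else:
            raise ValueError(f"unknown evaluator type: {kind}")
        total += weight
        if ok:
            passed += weight
    return passed / total if total else 0.0


evaluators = [
    {"type": "contains", "value": "latency"},
    {"type": "regex", "value": "cost|price|spend", "weight": 2},
]
partial = score_response("Lower latency is the goal.", evaluators)
full = score_response("Balance latency against cost.", evaluators)
print(partial, full)
```

Under these assumptions, a response that satisfies only the `contains` check earns one third of the available weight, because the regex check counts double.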
Evaluator example:
{
"evaluators": [
{"type": "contains", "value": "latency"},
{"type": "regex", "value": "cost|price|spend", "weight": 2}
]
}

Regex evaluators are intentionally constrained: patterns must be at most 256 characters, and nested quantifiers are rejected to avoid excessive runtime on untrusted responses.
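
To make the constraint concrete, the sketch below shows one simple way to apply both rules before compiling a pattern. The 256-character limit comes from the text above; the nested-quantifier detector is a deliberately crude heuristic written for this example (it only catches a quantified group that directly contains `+` or `*`), and is not the project's actual check.

```python
import re

MAX_PATTERN_LENGTH = 256  # limit stated in this README

# Heuristic only: a parenthesised group containing + or * that is itself
# followed by +, *, or {. Catches shapes like (a+)+ and (\d*)*.
_NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def validate_pattern(pattern):
    """Compile the pattern, or raise ValueError if it breaks a constraint."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("pattern longer than 256 characters")
    if _NESTED_QUANTIFIER.search(pattern):
        raise ValueError("nested quantifiers are not allowed")
    return re.compile(pattern)


compiled = validate_pattern("cost|price|spend")
print(bool(compiled.search("total spend per run")))
```

Patterns such as `(a+)+` are the classic source of catastrophic backtracking in backtracking regex engines, which is why rejecting them up front is cheaper than trying to time-box the match.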
Gate examples:
{
"gates": {
"max_total_cost_usd": 0.01,
"max_avg_latency_ms": 500,
"min_avg_quality": 0.75,
"min_success_rate": 1.0
}
}

To use a real provider, configure a model with `provider: "openai_compatible"` and set an API key environment variable.
{
"id": "gpt-4o-mini",
"provider": "openai_compatible",
"api_model": "gpt-4o-mini",
"base_url": "https://api.openai.com/v1",
"api_key_env": "OPENAI_API_KEY",
"input_price_per_1m": 0.15,
"output_price_per_1m": 0.6
}

Then run:
export OPENAI_API_KEY="..."
pangolin-eval run \
--config path/to/config.json \
--out reports/live \
--content-mode metadata-only

`full` content mode stores model response text in saved report artifacts. Use `metadata-only` for real evaluations unless the prompts and responses are safe to store and share.
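
Conceptually, metadata-only mode amounts to dropping text fields from each saved record while leaving metrics intact. The sketch below illustrates that idea only. The record layout and the `response_text` field name are hypothetical and do not describe the actual report schema.

```python
def to_metadata_only(record, text_fields=("response_text",)):
    """Return a copy of the record without the named text fields.

    The input record is left unchanged; field names are hypothetical.
    """
    return {key: value for key, value in record.items()
            if key not in text_fields}


record = {
    "model": "fast-cheap",
    "latency_ms": 180,
    "quality": 1.0,
    "response_text": "text that should not be stored",
}
redacted = to_metadata_only(record)
print(sorted(redacted))
```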
Additional templates:
- examples/ci_gate/config.json
- examples/openai_compatible/config.json
- examples/litellm_gateway/config.json
- examples/ollama_openai_compatible/config.json
- examples/vllm_openai_compatible/config.json

Intended users:
- AI product teams comparing model tradeoffs
- startup CTOs controlling AI product cost
- senior AI engineers building production LLM systems
- ML platform engineers supporting LLMOps workflows

Run the tests:

PYTHONPATH=src python -m unittest discover -s tests

License: MIT