pangolin-eval

Measure and compare LLM, RAG, and agent workloads across cost, latency, quality, and reliability.

pangolin-eval is an open-source toolkit for AI product teams, startup CTOs, and senior engineers who need to understand which model is good enough before inference cost becomes painful. It helps teams run structured prompt or workflow evaluations, estimate cost, track latency, and generate decision-ready reports.

Why This Exists

AI products are no longer judged only by answer quality. In production, the useful question is:

Which model gives enough quality for this task at an acceptable cost, latency, and reliability?

This project starts with local, file-based workflows:

compare models on the same prompts
estimate token and provider cost
measure latency
score weighted quality checks
generate Markdown, JSON, and optional static HTML reports
apply budget, quality, latency, and reliability gates
summarize attribution by model, prompt, feature, workflow, environment, and prompt version
evaluate synthetic RAG tasks for context efficiency and answer coverage
generate agent/workflow TraceCards from local trace events
produce auditable recommendations and OTel-style exports

Quickstart

Run the bundled mock comparison. It does not require API keys.

cd pangolin-eval
PYTHONPATH=src python -m pangolin_eval.cli run \
  --config examples/simple_model_compare/config.json \
  --out reports/simple_model_compare \
  --html

For sensitive runs, omit response text from saved artifacts while keeping metrics and quality scores:

PYTHONPATH=src python -m pangolin_eval.cli run \
  --config examples/simple_model_compare/config.json \
  --out reports/simple_model_compare_private \
  --content-mode metadata-only

View the generated report:

cat reports/simple_model_compare/report.md

Validate a config without running providers:

PYTHONPATH=src python -m pangolin_eval.cli validate \
  --config examples/simple_model_compare/config.json

Run the synthetic RAG evaluation:

PYTHONPATH=src python -m pangolin_eval.cli rag \
  --config examples/rag_eval/config.json \
  --out reports/rag_eval \
  --content-mode metadata-only

Generate agent/workflow TraceCards from local trace events:

PYTHONPATH=src python -m pangolin_eval.cli trace \
  --input examples/agent_trace/trace_events.json \
  --out reports/agent_trace

Export supported artifacts as OTel-style spans:

PYTHONPATH=src python -m pangolin_eval.cli export-otel \
  --input reports/simple_model_compare/report.json \
  --out reports/simple_model_compare/otel.json

Install locally for CLI usage:

python -m pip install -e .
pangolin-eval run \
  --config examples/simple_model_compare/config.json \
  --out reports/simple_model_compare

Example Output

The report ranks models by a simple efficiency score that combines quality, cost, and latency.

| Model | Runs | Success rate | Avg quality | Avg latency ms | Estimated cost USD | Recommendation |
| --- | ---: | ---: | ---: | ---: | ---: | --- |
| fast-cheap | 2 | 1.00 | 1.00 | 180 | 0.00006015 | Best quality candidate |
| balanced | 2 | 1.00 | 1.00 | 320 | 0.00022060 | Best quality candidate |
| strong-expensive | 2 | 1.00 | 1.00 | 540 | 0.00127250 | Usable with latency review |
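
The exact efficiency formula is not spelled out in this README. As a rough illustration only, the sketch below ranks the three rows above with a hypothetical score: quality divided by one plus the model's cost and latency relative to the worst value in the run. The `efficiency` and `rank` helpers and the row keys are assumptions made for this example, not the project's API.

```python
# Hypothetical efficiency score; the formula pangolin-eval actually uses may differ.
rows = [
    {"model": "fast-cheap", "quality": 1.00, "latency_ms": 180, "cost_usd": 0.00006015},
    {"model": "balanced", "quality": 1.00, "latency_ms": 320, "cost_usd": 0.00022060},
    {"model": "strong-expensive", "quality": 1.00, "latency_ms": 540, "cost_usd": 0.00127250},
]


def efficiency(row, max_cost, max_latency):
    """Quality divided by (1 + relative cost + relative latency)."""
    cost_penalty = row["cost_usd"] / max_cost if max_cost else 0.0
    latency_penalty = row["latency_ms"] / max_latency if max_latency else 0.0
    return row["quality"] / (1.0 + cost_penalty + latency_penalty)


def rank(rows):
    """Sort rows from most to least efficient under the assumed score."""
    max_cost = max(r["cost_usd"] for r in rows)
    max_latency = max(r["latency_ms"] for r in rows)
    return sorted(rows, key=lambda r: efficiency(r, max_cost, max_latency), reverse=True)


ranked_ids = [r["model"] for r in rank(rows)]
print(ranked_ids)
```

With equal quality across all three models, any score of this shape is driven entirely by cost and latency, which is why the cheapest and fastest model sorts first.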

Current Scope

The current local open-source version includes:

Python 3.9+ CLI and library
config-driven model and prompt comparison
mock provider for reproducible public demos
OpenAI-compatible provider adapter
latency, token, and cost tracking
keyword, contains, regex, and exact-match quality evaluators
configurable token counters for estimated usage fallback
Markdown, JSON, and optional static HTML reporting
versioned report schema
metadata-only report mode for privacy-conscious runs
budget, quality, latency, and reliability gates
failure-tolerant runs with success/error status fields
attribution and pricing provenance summaries
synthetic RAG evaluation CLI and report with context diagnostics
local agent/workflow TraceCards with loop and waste diagnostics
auditable recommendations-lite
OTel-style export for reports and TraceCards
OpenAI-compatible, LiteLLM, Ollama, and vLLM gateway examples
Docker Compose no-key demos

Project Status

Latest release: v0.2.3.

The planned local open-source scope is implemented for the current release track: weighted evaluator plugins, configurable token counters, additional gateway examples, RAG and agent diagnostics, and optional static HTML reports are available. Future work can focus on polish, more adapters, and hosted or team workflows without weakening the local CLI/library foundation.

Report Contract

JSON reports declare a schema version and content mode:

schema_version: currently pangolin-eval.report.v4
content_mode: full or metadata_only (the CLI flag spells the latter metadata-only)
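
A downstream consumer may want to verify these two fields before trusting the rest of a report. The sketch below assumes that schema_version and content_mode are top-level keys of report.json; the helper names `check_report_header` and `load_and_check` are made up for this example.

```python
import json

EXPECTED_SCHEMA = "pangolin-eval.report.v4"
KNOWN_CONTENT_MODES = {"full", "metadata_only"}


def check_report_header(report):
    """Return a list of problems found in the report's contract fields."""
    problems = []
    schema = report.get("schema_version")
    if schema != EXPECTED_SCHEMA:
        problems.append(f"unexpected schema_version: {schema!r}")
    mode = report.get("content_mode")
    if mode not in KNOWN_CONTENT_MODES:
        problems.append(f"unexpected content_mode: {mode!r}")
    return problems


def load_and_check(path):
    """Read a JSON report from disk and check its header fields."""
    with open(path, encoding="utf-8") as handle:
        return check_report_header(json.load(handle))
```

For example, `load_and_check("reports/simple_model_compare/report.json")` would return an empty list if the report matches the contract described above.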

See docs/REPORT_SCHEMA.md, docs/EXTENSIONS.md, docs/COMPATIBILITY.md, docs/MIGRATIONS.md, docs/RELEASE.md, and the files under schemas/.

Support And Feedback

See SUPPORT.md for the best way to ask questions, report bugs, or suggest integrations. See docs/LAUNCH.md for launch notes, suggested demo flows, and copy that explains the project clearly.

Local Demo With Docker

docker compose run --rm demo
docker compose run --rm rag-demo

Open-Core Direction

This repo is the public CLI/library foundation. Commercial extensions may build on top of this engine for UI, run history, team workflows, recommendations, alerts, and executive reports.

The open-source project should remain useful on its own: local reports, transparent metrics, reproducible examples, and CI-friendly gates stay public.

Configuration

See examples/simple_model_compare/config.json.

Top-level config can include:

run_name: report title
description: report description
pricing_catalog: optional relative path to a pricing catalog
gates: optional cost, quality, latency, and reliability thresholds

Model entries include:

id: display name used in reports
provider: mock or openai_compatible
api_model: provider model id, if different from id
input_price_per_1m: input token cost estimate
output_price_per_1m: output token cost estimate
model_group: optional grouping for attribution
max_retries: retry attempts after provider failures
token_counter: estimated usage fallback, one of char_4, whitespace, or openai_chat
pricing_source, pricing_source_url, pricing_updated_at: pricing provenance
capability metadata such as context_window_tokens, supports_tools, supports_json_mode, and latency_band
allow_unsafe_api_key_env: opt-in escape hatch for custom API key environment variable names or non-default provider hosts. By default, OpenAI-compatible configs only allow known safe API key/host pairings or variable names with the PANGOLIN_EVAL_ prefix, and base_url must use HTTPS unless it points to loopback
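
To make the pricing and token_counter fields concrete, here is a minimal sketch of how a per-call cost estimate could be derived from them. It assumes that char_4 means roughly one token per four characters and that whitespace means one token per whitespace-separated word; both readings are guesses from the names. The openai_chat counter is left out because it would need a real tokenizer. The function names are hypothetical.

```python
def count_tokens(text, counter="char_4"):
    """Rough token estimate used only when the provider reports no usage."""
    if counter == "char_4":
        # Assumed meaning: about one token per four characters, rounded up.
        return (len(text) + 3) // 4
    if counter == "whitespace":
        # Assumed meaning: one token per whitespace-separated word.
        return len(text.split())
    raise ValueError(f"counter not covered by this sketch: {counter}")


def estimate_cost_usd(input_tokens, output_tokens, input_price_per_1m, output_price_per_1m):
    """Prices are USD per one million tokens, so divide the weighted sum by 1e6."""
    weighted = input_tokens * input_price_per_1m + output_tokens * output_price_per_1m
    return weighted / 1_000_000


# 1,200 input tokens and 300 output tokens at 0.15 and 0.6 USD per million tokens.
cost = estimate_cost_usd(1200, 300, 0.15, 0.6)
print(f"{cost:.8f}")
```

At these prices the call costs about 0.00036 USD: 1,200 × 0.15 plus 300 × 0.6 gives 360, divided by one million.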

Prompt entries include:

id: prompt case identifier
messages: chat-style messages
expected_keywords: optional simple quality check
evaluators: optional weighted checks using keyword, contains, regex, or exact
attribution fields such as feature, workflow, environment, prompt_version, and customer_user_hash

Evaluator example:

{
  "evaluators": [
    {"type": "contains", "value": "latency"},
    {"type": "regex", "value": "cost|price|spend", "weight": 2}
  ]
}

Regex evaluators are intentionally constrained: patterns must be at most 256 characters and nested quantifiers are rejected to avoid excessive runtime on untrusted responses.
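
The following sketch shows one way weighted checks and the regex constraints could fit together. It is not the project's implementation. It assumes a default weight of 1 when weight is omitted and case-insensitive matching, and it uses a crude heuristic for nested quantifiers. The keyword evaluator is omitted because its exact semantics are not described here.

```python
import re

MAX_PATTERN_LENGTH = 256
# Crude heuristic for nested quantifiers such as (a+)+ or (a*)*; the real check may differ.
NESTED_QUANTIFIER = re.compile(r"\([^()]*[+*][^()]*\)[+*{]")


def check_pattern(pattern):
    """Reject patterns that are too long or look like they nest quantifiers."""
    if len(pattern) > MAX_PATTERN_LENGTH:
        raise ValueError("pattern longer than 256 characters")
    if NESTED_QUANTIFIER.search(pattern):
        raise ValueError("nested quantifier rejected")
    return pattern


def evaluator_passes(evaluator, response):
    """Apply a single contains, regex, or exact check to a response string."""
    kind, value = evaluator["type"], evaluator["value"]
    if kind == "contains":
        return value.lower() in response.lower()
    if kind == "regex":
        return re.search(check_pattern(value), response, re.IGNORECASE) is not None
    if kind == "exact":
        return response.strip() == value.strip()
    raise ValueError(f"evaluator type not covered by this sketch: {kind}")


def weighted_quality(evaluators, response):
    """Share of total weight earned by passing checks, between 0.0 and 1.0."""
    total = sum(e.get("weight", 1) for e in evaluators)
    if total == 0:
        return 0.0
    earned = sum(e.get("weight", 1) for e in evaluators if evaluator_passes(e, response))
    return earned / total


evaluators = [
    {"type": "contains", "value": "latency"},
    {"type": "regex", "value": "cost|price|spend", "weight": 2},
]
print(weighted_quality(evaluators, "Latency is fine but the price is high."))
```

Under these assumptions, a response that mentions latency but none of cost, price, or spend earns one third of the available weight.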

Gate examples:

{
  "gates": {
    "max_total_cost_usd": 0.01,
    "max_avg_latency_ms": 500,
    "min_avg_quality": 0.75,
    "min_success_rate": 1.0
  }
}
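
Conceptually, each gate compares one aggregate metric against a threshold. The sketch below illustrates that idea with the four gate names shown above. The summary keys (total_cost_usd, avg_latency_ms, avg_quality, success_rate) and the `check_gates` helper are hypothetical, and whether pangolin-eval applies gates per model or per run is not specified here.

```python
# Maps each gate name to the (assumed) summary metric it limits and the limit direction.
GATE_RULES = {
    "max_total_cost_usd": ("total_cost_usd", "max"),
    "max_avg_latency_ms": ("avg_latency_ms", "max"),
    "min_avg_quality": ("avg_quality", "min"),
    "min_success_rate": ("success_rate", "min"),
}


def check_gates(summary, gates):
    """Return human-readable failures; an empty list means every gate passed."""
    failures = []
    for gate_name, limit in gates.items():
        metric, kind = GATE_RULES[gate_name]
        actual = summary[metric]
        passed = actual <= limit if kind == "max" else actual >= limit
        if not passed:
            failures.append(f"{gate_name}: {metric}={actual} violates limit {limit}")
    return failures


gates = {
    "max_total_cost_usd": 0.01,
    "max_avg_latency_ms": 500,
    "min_avg_quality": 0.75,
    "min_success_rate": 1.0,
}
# Values taken from the strong-expensive row of the example report.
summary = {
    "total_cost_usd": 0.00127250,
    "avg_latency_ms": 540,
    "avg_quality": 1.0,
    "success_rate": 1.0,
}
failures = check_gates(summary, gates)
print(failures)
```

Here only the latency gate fails, since 540 ms exceeds the 500 ms limit. A CI job could turn a non-empty failure list into a non-zero exit code.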

OpenAI-Compatible Providers

To use a real provider, configure a model with provider: "openai_compatible" and set an API key environment variable.

{
  "id": "gpt-4o-mini",
  "provider": "openai_compatible",
  "api_model": "gpt-4o-mini",
  "base_url": "https://api.openai.com/v1",
  "api_key_env": "OPENAI_API_KEY",
  "input_price_per_1m": 0.15,
  "output_price_per_1m": 0.6
}

Then run:

export OPENAI_API_KEY="..."
pangolin-eval run \
  --config path/to/config.json \
  --out reports/live \
  --content-mode metadata-only

In full content mode, model response text is stored in saved report artifacts. Use --content-mode metadata-only for real evaluations unless the prompts and responses are safe to store and share.
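
Before pointing a config at a live or self-hosted endpoint, it can help to sanity-check base_url against the transport rule stated under Configuration: HTTPS, unless the host is loopback. The sketch below is an illustration using only the standard library, not the project's validator. Treating the name localhost as loopback is an assumption, and the helper names are hypothetical.

```python
import ipaddress
from urllib.parse import urlparse


def is_loopback_host(host):
    """True for localhost (assumed) and for loopback IP literals such as 127.0.0.1 or ::1."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal, so it is an ordinary hostname.
        return False


def base_url_allowed(base_url):
    """Allow HTTPS anywhere and plain HTTP only when the host is loopback."""
    parsed = urlparse(base_url)
    host = parsed.hostname or ""
    if parsed.scheme == "https":
        return bool(host)
    if parsed.scheme == "http":
        return is_loopback_host(host)
    return False


for url in ("https://api.openai.com/v1", "http://127.0.0.1:11434/v1", "http://example.com/v1"):
    print(url, base_url_allowed(url))
```

This covers only the scheme and host rule. The API key/host pairing rule and the PANGOLIN_EVAL_ prefix rule would need separate checks.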

Additional templates: the bundled gateway examples cover OpenAI-compatible, LiteLLM, Ollama, and vLLM setups.

Intended Users

AI product teams comparing model tradeoffs
startup CTOs controlling AI product cost
senior AI engineers building production LLM systems
ML platform engineers supporting LLMOps workflows

Development

PYTHONPATH=src python -m unittest discover -s tests

License

MIT
