Refinery

Open source error analysis for AI agents – analyze failures, generate ranked fix hypotheses, and improve agent reliability through disciplined iteration.

Andrew Ng observed that "the single biggest predictor of how rapidly a team makes progress building an AI agent lay in their ability to drive a disciplined process for evals and error analysis." Yet most teams shortcut this process, attempting quick fixes rather than identifying root causes. Refinery makes rigorous error analysis practical: bring any trace from any agent system, and get a structured Hypothesis Pack with evidence, ranked fixes, confidence scores, and reproducibility metadata.

Quick Start

1. Analyze a local trace (no integrations needed)

pip install refinery-cli
export OPENAI_API_KEY=sk-your-key

# Basic analysis
refinery chat --trace-file examples/demo_trace.json --out hypothesis.yaml

# With prompts and evals
refinery chat \
  --trace-file trace.json \
  --prompts './prompts/*.txt' \
  --evals './tests/*.py' \
  --expected-behavior "Agent should call tools before answering" \
  --out hypothesis.yaml

Supported trace formats:

LangSmith exports
Custom JSON (auto-detected)

2. Fetch live traces via LangSmith integration (optional)

export OPENAI_API_KEY=sk-your-key
export LANGSMITH_API_KEY=ls-your-key

refinery chat --trace-id abc123 --project my-project --out hypothesis.yaml

You'll receive a validated Hypothesis Pack containing findings, ranked hypotheses with diffs, confidence scores, and reproducibility metadata.
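For scripted runs, the same invocations can be assembled from Python. The sketch below only builds the argument list from the flags shown above. The helper name build_refinery_command is hypothetical and not part of Refinery, and pairing --project only with --trace-id is an assumption drawn from the examples.

```python
import shlex
from typing import List, Optional


def build_refinery_command(
    out: str,
    trace_file: Optional[str] = None,
    trace_id: Optional[str] = None,
    project: Optional[str] = None,
    expected_behavior: Optional[str] = None,
) -> List[str]:
    """Assemble a `refinery chat` argument list (hypothetical helper)."""
    # Exactly one trace source is expected: a local file or a live trace ID.
    if (trace_file is None) == (trace_id is None):
        raise ValueError("pass exactly one of trace_file or trace_id")
    cmd = ["refinery", "chat"]
    if trace_file is not None:
        cmd += ["--trace-file", trace_file]
    else:
        cmd += ["--trace-id", trace_id]
        # Assumption: --project is only meaningful for live LangSmith fetches.
        if project is not None:
            cmd += ["--project", project]
    if expected_behavior is not None:
        cmd += ["--expected-behavior", expected_behavior]
    cmd += ["--out", out]
    return cmd


local_cmd = build_refinery_command("hypothesis.yaml", trace_file="examples/demo_trace.json")
live_cmd = build_refinery_command("hypothesis.yaml", trace_id="abc123", project="my-project")
print(shlex.join(local_cmd))
print(shlex.join(live_cmd))
```

Either list can be handed to subprocess.run(cmd, check=True) once the CLI is installed and the API keys are exported.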

Why Refinery?

The Problem: Reliable AI agents require disciplined error analysis

Building reliable AI agents isn't just about stacking the latest techniques. As Andrew Ng puts it: "To improve your agentic AI system, don't just stack up the latest buzzy techniques that just went viral on social media. Instead, use error analysis to figure out where it's falling short, and focus on that."

The challenge: error analysis is time-consuming, and existing tools lock you into specific platforms.

Refinery's Solution: Open source, platform-agnostic error analysis

AgentKit offers error analysis, but only for agents built on its own framework.

Refinery is open source and works with ANY agent system:

✅ Not locked to any agent framework
✅ Not locked to any LLM provider (OpenAI today, Claude/Gemini coming)
✅ Not locked to any observability platform (LangSmith, Langfuse, OpenTelemetry, custom)
✅ Bring your own traces, prompts, and evals

What Refinery does today: Analyzes failures and generates ranked hypotheses for potential fixes with confidence scores and evidence.

What Refinery will do in the future: Provide an experiment lab environment to test statistically whether proposed fixes actually improve your agent system.

What you get

Evidence-backed findings – Canonicalized failure patterns with root cause analysis
Ranked fix hypotheses – Proposed changes with diffs, confidence scores, and risk assessment
Before/after examples – Concrete demonstrations of current vs expected behavior
Reproducibility metadata – Model params, seeds, and methodology for validation
Portable Hypothesis Pack – Schema-validated YAML/JSON interchange format
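To make the shape of a pack concrete, here is a rough sketch of how findings and ranked hypotheses might be modeled. Every class and field name below is an illustrative assumption, not the actual v1 schema; the authoritative definition lives in the schema file and Pydantic model that ship with Refinery.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """One proposed fix (field names are assumptions, not the real schema)."""
    id: str
    description: str
    confidence: float  # assumed to be a score in [0, 1]
    risk: str          # assumed free-form label such as "low" or "high"


@dataclass
class HypothesisPackSketch:
    """Illustrative container for findings plus fix hypotheses."""
    findings: List[str] = field(default_factory=list)
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def ranked(self) -> List[Hypothesis]:
        # Highest-confidence hypothesis first.
        return sorted(self.hypotheses, key=lambda h: h.confidence, reverse=True)


pack = HypothesisPackSketch(
    findings=["Agent answered without calling any tool"],
    hypotheses=[
        Hypothesis("hyp-002", "Add a retry on tool timeout", 0.4, "low"),
        Hypothesis("hyp-001", "Require a tool call in the system prompt", 0.8, "medium"),
    ],
)
top = pack.ranked()[0]
print(top.id, top.confidence)
```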

Feature Highlights

Multi-source trace analysis:

Local files: OpenTelemetry, Langfuse, LangSmith exports, custom JSON
LangSmith API: Live trace fetching with --trace-id (optional)

Flexible input:

Bring your own prompts: --prompts flag with glob patterns
Bring your own evals: --evals flag for test files
Custom requirements: --expected-behavior for specific validation criteria
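The examples quote the patterns passed to --prompts and --evals, which suggests the CLI expands them itself rather than relying on the shell. How Refinery does this internally is not documented here; the sketch below shows one plausible approach using Python's standard glob module, with expand_patterns as a hypothetical helper name.

```python
import glob
import os
import tempfile


def expand_patterns(patterns):
    """Expand glob patterns into a sorted, de-duplicated list of file paths."""
    matches = set()
    for pattern in patterns:
        # Keep regular files only; a pattern may also match directories.
        matches.update(p for p in glob.glob(pattern) if os.path.isfile(p))
    return sorted(matches)


# Demonstrate against a throwaway directory so the example is self-contained.
with tempfile.TemporaryDirectory() as root:
    for name in ("system.txt", "planner.txt", "notes.md"):
        with open(os.path.join(root, name), "w") as handle:
            handle.write("placeholder")
    found = expand_patterns([os.path.join(root, "*.txt")])
    names = [os.path.basename(p) for p in found]
    print(names)
```

Only the two .txt files match, so a pattern such as './prompts/*.txt' would leave Markdown notes in the same folder out of the analysis.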

Developer experience:

Structured exports: --out PATH --format yaml|json
CLI alias: refinery analyze mirrors refinery chat
Mocked providers for testing – pytest runs without API keys
Minimal CI pipeline (ruff, black, pytest) runs without secrets
Hypothesis Pack schema validation (Pydantic)
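Running tests without API keys generally means swapping the real LLM client for a stand-in. The following is a minimal sketch of that pattern, not Refinery's actual test code: Provider, FakeProvider, and summarize_failure are hypothetical names chosen for illustration.

```python
from typing import List, Protocol


class Provider(Protocol):
    """Anything that can turn a prompt into a completion (assumed interface)."""

    def complete(self, prompt: str) -> str: ...


class FakeProvider:
    """Test double that returns a canned reply and records the prompts it saw."""

    def __init__(self, canned: str) -> None:
        self.canned = canned
        self.prompts: List[str] = []

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.canned


def summarize_failure(provider: Provider, trace_text: str) -> str:
    """Hypothetical analysis step that depends only on the Provider interface."""
    prompt = f"Summarize the failure in this trace:\n{trace_text}"
    return provider.complete(prompt).strip()


def test_summarize_failure_uses_provider() -> None:
    fake = FakeProvider("  tool call skipped  ")
    assert summarize_failure(fake, "step 1: answered directly") == "tool call skipped"
    assert len(fake.prompts) == 1


test_summarize_failure_uses_provider()
```

Because the analysis function takes the provider as an argument, pytest can exercise it with no network access and no secrets in CI.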

Usage

# Analyze local trace with prompts and evals
refinery chat \
  --trace-file trace.json \
  --prompts './prompts/*.md' \
  --evals './tests/*.py' \
  --expected-behavior "Agent must call check_policy tool before answering" \
  --out hypothesis.yaml

# Analyze exported LangSmith trace
refinery chat --trace-file langsmith_export.json --out hypothesis.yaml

# Fetch live from LangSmith
refinery analyze --trace-id abc123 --project my-project --out hypothesis.json --format json

The CLI validates output against the schema before writing files. Hypothesis IDs are normalized (hyp-001, …) for easy tracking.
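As an illustration of the ID scheme, the sketch below assigns hyp-001, hyp-002, and so on in rank order. Ordering by a confidence field, and the dictionary layout itself, are assumptions made for the example; the actual normalization logic may differ.

```python
from typing import Dict, List


def normalize_ids(hypotheses: List[Dict]) -> List[Dict]:
    """Return copies of the hypotheses with sequential, zero-padded IDs."""
    # Assumption: rank by descending confidence; sorted() is stable, so ties
    # keep their original relative order.
    ranked = sorted(hypotheses, key=lambda h: h["confidence"], reverse=True)
    return [{**h, "id": f"hyp-{i:03d}"} for i, h in enumerate(ranked, start=1)]


raw = [
    {"id": "H2", "confidence": 0.35, "description": "Tighten tool schema"},
    {"id": "fix_a", "confidence": 0.80, "description": "Require tool call first"},
]
normalized = normalize_ids(raw)
print([(h["id"], h["description"]) for h in normalized])
```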

Hypothesis Pack schema

Refinery ships a versioned schema with Pydantic validation to prevent regressions.

from refinery.schemas.hypothesis_pack_v1 import HypothesisPack

pack = HypothesisPack.from_path("hypothesis.yaml")
print(pack.findings[0].description)

Schema file: schema/hypothesis-pack-v1.yaml
Python model: refinery/schemas/hypothesis_pack_v1.py
Golden tests: tests/test_hypothesis_pack_schema.py

The schema follows semantic versioning: v1.x releases add only backwards-compatible fields, while v2.x signals breaking changes. The schema version is independent of the package version; check metadata.schema_version in a pack to see which schema it uses.
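A consumer of Hypothesis Packs can use that field to guard against breaking changes. The sketch below works on a plain dictionary and assumes the version is stored as a dotted string such as "1.2.0", optionally prefixed with "v"; the exact format and the helper names are assumptions.

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.2.0' or 'v1.2' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def is_compatible(pack_version: str, supported_major: int = 1) -> bool:
    """Accept any pack whose major version matches the one this reader supports."""
    return parse_version(pack_version)[0] == supported_major


# Stand-in for a loaded pack; only the metadata.schema_version path comes from the docs.
pack = {"metadata": {"schema_version": "1.2.0"}}
version = pack["metadata"]["schema_version"]
if is_compatible(version):
    print(f"schema {version} is readable by a v1 consumer")
else:
    print(f"schema {version} needs a newer reader")
```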

Supported Integrations

Trace Sources (Current)

Local files - OpenTelemetry, Langfuse, LangSmith exports, custom JSON
LangSmith API - Live trace fetching (requires LANGSMITH_API_KEY)

Trace Sources (Planned - Open Source)

Langfuse API - Direct integration for live trace fetching
OpenTelemetry Collector - Direct OTLP ingestion
Custom APIs - Plugin system for your observability platform

LLM Providers (Current)

OpenAI - GPT-5 with Responses API for structured analysis

LLM Providers (Planned - Open Source)

Anthropic Claude - Extended context and tool use
Google Gemini - Multi-modal analysis capabilities

Refinery is designed to be integration-agnostic. We started with LangSmith and OpenAI, but the architecture supports any trace format and LLM provider. Direct integrations will be added as open source contributions.

Contributing

We welcome contributions! Refinery is built to be integration-agnostic, so adding new trace sources and LLM providers is encouraged.

High-value contributions:

Direct integrations: Langfuse API, OpenTelemetry collectors, other observability platforms
LLM provider support: Anthropic Claude, Google Gemini, local models
Schema enhancements: New hypothesis types, validation rules
Documentation: Tutorials, integration guides, examples

See CONTRIBUTING.md (coming Week 1) for style, testing, and provider guidelines. We label starter issues as good first issue; CI runs without secrets.

License

MIT-licensed open source software.

Future lab environment: Refinery plans to provide an experiment orchestration lab to validate proposed fixes. It will be a separate proprietary product, not part of the open source CLI.

See LICENSE for details.
