Can AI models diagnose why experiments fail — the way human researchers do?
EpiDebug is an evaluation benchmark that tests AI models on their ability to perform epistemic debugging: identifying why a scientific experiment or engineering process produced unexpected results, tracing the causal chain from root cause to observed symptoms, and proposing interventions to confirm the diagnosis.
Unlike benchmarks that test factual recall (MMLU) or code generation (SWE-bench), EpiDebug tests scientific intuition — the ability to reason about why things go wrong in the real world.
Each test case presents the AI with exactly what a human researcher would see:
| Layer | What It Contains |
|---|---|
| Objective | What the experiment aimed to achieve |
| Protocol | Step-by-step instructions that were followed |
| Telemetry | Raw data: sensor readings, gel images, spectra, yields |
| Contextual Clues | Environmental details (some relevant, some red herrings) |
The AI must then provide:
- Root Cause (30% weight) — Identify the single variable that failed
- Causal Chain (40% weight) — Trace the mechanism step-by-step
- Intervention (30% weight) — Propose a counterfactual experiment to confirm
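
The input layers and the three-part answer described above can be pictured as two small records. This is a minimal sketch using only the standard library; `CaseView`, `Diagnosis`, and `render_prompt` are hypothetical names for illustration and are not taken from `epidebug/schema.py`, which may be organised differently.

```python
from dataclasses import dataclass, field


@dataclass
class CaseView:
    """The four layers shown to the model (hypothetical structure)."""
    objective: str
    protocol: list[str]
    telemetry: dict[str, list[str]]  # source name -> raw observations
    contextual_clues: list[str]      # may include red herrings


@dataclass
class Diagnosis:
    """The three parts the model is asked to return (hypothetical structure)."""
    root_cause: str
    causal_chain: list[str] = field(default_factory=list)
    intervention: str = ""


def render_prompt(case: CaseView) -> str:
    """Flatten the four layers into a single text prompt."""
    lines = ["Objective:", case.objective, "", "Protocol:"]
    lines += [f"{i}. {step}" for i, step in enumerate(case.protocol, start=1)]
    lines += ["", "Telemetry:"]
    for source, observations in case.telemetry.items():
        lines += [f"- {source}: {obs}" for obs in observations]
    lines += ["", "Contextual clues:"]
    lines += [f"- {clue}" for clue in case.contextual_clues]
    return "\n".join(lines)
```

Keeping the ground truth out of `CaseView` makes it harder to leak the answer into the prompt by accident.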
Each case is assigned one of four failure categories:

| Category | Example |
|---|---|
| Reagent/Material Flaw | Buffer pH drifted due to CO₂ absorption |
| Instrumentation Artifact | Nanodrop pedestal contamination gave false readings |
| Protocol/Human Loophole | Forgot to vortex glycerol stock before inoculating |
| Flawed Hypothesis | Assumed SN2 mechanism when E2 dominated |
Cases span the following domains:
- Molecular biology (protein purification, PCR, gel electrophoresis)
- Chemistry (organic synthesis, Grignard reactions, catalysis)
- Physics and engineering (signal processing, materials science, FEA)
- Biotechnology (cell culture, flow cytometry, assays)
- Clinical research (sample quality, biomarkers, pre-analytical artifacts)
- Software systems (cloud observability, distributed traces, production incidents)
- Manufacturing (CNC process control, equipment/tool wear, metrology)
- Civil engineering (structural inspection, concrete defect detection, field validation)
- Aerospace, energy, environmental science, materials science, and robotics seed cases
The current starter set contains 30 validated YAML cases across these domains.
# Clone the repository
git clone https://github.com/your-org/epidebug.git
cd epidebug
# Install with pip
pip install -e ".[all]"
# Or install minimal (no LLM dependencies)
pip install -e .

python scripts/validate_cases.py --all -v

# Text mode: present all info, ask for diagnosis
python scripts/run_benchmark.py --model gpt-4o --all --mode text
# Agent mode: model uses tools to investigate
python scripts/run_benchmark.py --model gpt-4o --all --mode agent
# Single case (great for development)
python scripts/run_benchmark.py --model gpt-4o --case RF-001 --dry-run
# Run a specific category
python scripts/run_benchmark.py --model claude-sonnet --category reagent_material_flaw
# Compare models
python scripts/run_benchmark.py --model gpt-4o --all -o results/gpt4o.json
python scripts/run_benchmark.py --model claude-sonnet --all -o results/claude.json

python scripts/generate_report.py --input results/ --output report.html

Reference baselines are deterministic sanity checks, not LLM results. They help verify that the cases, scorer, result schema, and report generator are wired before running paid/provider-backed evaluations.
python scripts/run_reference_models.py --all
python scripts/generate_report.py --input results/reference --output results/reference_report.html

epidebug/
├── epidebug/ # Core Python package
│ ├── schema.py # Pydantic data models
│ ├── scoring.py # Multi-dimensional scoring engine
│ ├── tools/ # Mock scientific instruments & databases
│ └── utils/ # Trajectory logging, helpers
├── test_cases/ # Benchmark test cases (YAML)
│ ├── reagent_flaw/ # Category 1
│ ├── instrumentation/ # Category 2
│ ├── protocol_loophole/ # Category 3
│ └── flawed_hypothesis/ # Category 4
├── mock_data/ # Pre-recorded instrument data
├── scripts/ # CLI tools
│ ├── run_benchmark.py # Main runner
│ ├── validate_cases.py # Schema validator
│ └── generate_report.py # HTML report generator
├── results/ # Output directory
├── docs/ # Documentation
└── tests/ # Unit tests
See docs/test_case_authoring.md for the full guide. For source discovery and curation strategy, see docs/data_sources.md.
Each test case is a YAML file following this schema:
id: "RF-001"
title: "Buffer pH Drift in Protein Purification"
domain: "molecular_biology"
subdomain: "protein_purification"
failure_category: "reagent_material_flaw"
difficulty: "medium"
objective: |
  Purify recombinant His-tagged protein from E. coli...
protocol:
  - step: 1
    action: "Grow cells to OD600 = 0.6"
    details: "250 mL LB + ampicillin, 37°C, 180 rpm"
telemetry:
  sds_page:
    description: "SDS-PAGE gel analysis"
    observations:
      - "Target band faint in elution, heavy in flow-through"
contextual_clues:
  - "Buffer was prepared 3 weeks ago at room temperature"
ground_truth:
  root_cause: "Buffer pH drifted from 8.0 to 6.5..."
  causal_chain:
    - "Tris buffer left at room temperature for 3 weeks"
    - "CO2 dissolved into buffer, forming carbonic acid"
    - "pH dropped from 8.0 to ~6.5"
    - "His-tag protonated, lost affinity for Ni-NTA"
  intervention: "Re-prepare fresh buffer, verify pH, rerun purification"

| Component | Weight | Method |
|---|---|---|
| Root Cause ID | 30% | Keyword match + semantic similarity |
| Causal Chain | 40% | LLM-as-judge with structured rubric |
| Intervention | 30% | LLM-as-judge with structured rubric |
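
The weights in the table combine into a single number as a weighted sum. The sketch below assumes each component has already been scored on a 0 to 1 scale by its own method; the names `WEIGHTS` and `composite_score` are illustrative and not necessarily what `epidebug/scoring.py` uses.

```python
# Component weights as listed in the scoring table.
WEIGHTS = {"root_cause": 0.30, "causal_chain": 0.40, "intervention": 0.30}


def composite_score(component_scores: dict[str, float]) -> float:
    """Combine per-component scores (each assumed to be in [0, 1])."""
    missing = WEIGHTS.keys() - component_scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        score = component_scores[name]
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"{name} score {score} is outside [0, 1]")
    # Weighted sum; the weights add up to 1, so the result stays in [0, 1].
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)


# Example: correct root cause, half-correct chain, no usable intervention.
print(composite_score({"root_cause": 1.0, "causal_chain": 0.5, "intervention": 0.0}))
```

Because the causal chain carries the largest weight, a model that names the right root cause without explaining the mechanism still loses a substantial share of the score.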
- Tool Efficiency: Productive calls / total calls
- Budget Compliance: Stayed within max_tool_calls?
- Anti-Pattern Detection: Confirmation bias, premature commitment
- Calibration: Confidence vs. actual accuracy
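
As a rough illustration of the first, second, and fourth metrics above, the sketch below assumes they are computed from a logged list of tool calls in an agent-mode run. The `ToolCall` record, the `productive` flag (how a call gets labelled productive is not specified here), and the absolute-difference form of the calibration gap are all assumptions made for this example, not the benchmark's actual definitions.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One logged tool invocation (hypothetical record)."""
    name: str
    productive: bool  # assumption: labelled by some upstream step


def tool_efficiency(calls: list[ToolCall]) -> float:
    """Productive calls divided by total calls; 0.0 when no calls were made."""
    if not calls:
        return 0.0
    return sum(call.productive for call in calls) / len(calls)


def within_budget(calls: list[ToolCall], max_tool_calls: int) -> bool:
    """True if the run stayed within its max_tool_calls limit."""
    return len(calls) <= max_tool_calls


def calibration_gap(confidence: float, score: float) -> float:
    """Absolute gap between stated confidence and achieved score (both in [0, 1])."""
    return abs(confidence - score)
```

Anti-pattern detection is left out because it depends on how trajectories are annotated, which this sketch does not attempt to guess.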
Apache 2.0
We welcome contributions! The easiest way to help:
- Write test cases — See the authoring guide
- Validate cases — Expert review of existing cases
- Run evaluations — Test new models and share results
@misc{epidebug2025,
title={EpiDebug: An Epistemic Debugging Benchmark for AI Scientific Reasoning},
year={2025},
url={https://github.com/your-org/epidebug}
}