🔬 EpiDebug — Epistemic Debugging Benchmark

Can AI models diagnose why experiments fail — the way human researchers do?

EpiDebug is an evaluation benchmark that tests AI models on their ability to perform epistemic debugging: identifying why a scientific experiment or engineering process produced unexpected results, tracing the causal chain from root cause to observed symptoms, and proposing interventions to confirm the diagnosis.

🎯 What Makes This Different

Unlike benchmarks that test factual recall (MMLU) or code generation (SWE-bench), EpiDebug tests scientific intuition — the ability to reason about why things go wrong in the real world.

Each test case presents the AI with exactly what a human researcher would see:

Layer	What It Contains
Objective	What the experiment aimed to achieve
Protocol	Step-by-step instructions that were followed
Telemetry	Raw data: sensor readings, gel images, spectra, yields
Contextual Clues	Environmental details (some relevant, some red herrings)

The AI must then provide:

Root Cause (30% weight) — Identify the single variable that failed
Causal Chain (40% weight) — Trace the mechanism step-by-step
Intervention (30% weight) — Propose a counterfactual experiment to confirm
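
The three weights above combine into a single score. The following is a minimal sketch, assuming each component has already been graded on a 0.0 to 1.0 scale; the name `combine_scores` and the dictionary keys are illustrative and are not taken from the EpiDebug package:

```python
from __future__ import annotations

# Component weights as listed above (they sum to 1.0).
WEIGHTS = {"root_cause": 0.30, "causal_chain": 0.40, "intervention": 0.30}


def combine_scores(components: dict[str, float]) -> float:
    """Return the weighted sum of per-component scores (each in [0, 1])."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} score must be within [0, 1]")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


# Hypothetical grades: correct root cause, partial chain, weak intervention.
example = combine_scores(
    {"root_cause": 1.0, "causal_chain": 0.5, "intervention": 0.25}
)
print(round(example, 3))
```

Because the causal chain carries the largest weight, a model that names the right root cause but cannot explain the mechanism still loses a substantial share of the score.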

🧪 Four Categories of Failure

Category	Example
Reagent/Material Flaw	Buffer pH drifted due to CO₂ absorption
Instrumentation Artifact	Nanodrop pedestal contamination gave false readings
Protocol/Human Loophole	Forgot to vortex glycerol stock before inoculating
Flawed Hypothesis	Assumed SN2 mechanism when E2 dominated

📊 Domains Covered

Molecular biology (protein purification, PCR, gel electrophoresis)
Chemistry (organic synthesis, Grignard reactions, catalysis)
Physics and engineering (signal processing, materials science, FEA)
Biotechnology (cell culture, flow cytometry, assays)
Clinical research (sample quality, biomarkers, pre-analytical artifacts)
Software systems (cloud observability, distributed traces, production incidents)
Manufacturing (CNC process control, equipment/tool wear, metrology)
Civil engineering (structural inspection, concrete defect detection, field validation)
Seed cases in aerospace, energy, environmental science, and robotics

The current starter set contains 30 validated YAML cases across these domains.

🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/your-org/epidebug.git
cd epidebug

# Install with pip
pip install -e ".[all]"

# Or install minimal (no LLM dependencies)
pip install -e .

Validate Test Cases

python scripts/validate_cases.py --all -v

Run the Benchmark

# Text mode: present all info, ask for diagnosis
python scripts/run_benchmark.py --model gpt-4o --all --mode text

# Agent mode: model uses tools to investigate
python scripts/run_benchmark.py --model gpt-4o --all --mode agent

# Single case (great for development)
python scripts/run_benchmark.py --model gpt-4o --case RF-001 --dry-run

# Run a specific category
python scripts/run_benchmark.py --model claude-sonnet --category reagent_material_flaw

# Compare models
python scripts/run_benchmark.py --model gpt-4o --all -o results/gpt4o.json
python scripts/run_benchmark.py --model claude-sonnet --all -o results/claude.json

Generate Report

python scripts/generate_report.py --input results/ --output report.html

Run Reference Baselines

Reference baselines are deterministic sanity checks, not LLM results. They help verify that the cases, scorer, result schema, and report generator are wired together correctly before you run paid, provider-backed evaluations.

python scripts/run_reference_models.py --all
python scripts/generate_report.py --input results/reference --output results/reference_report.html

📁 Project Structure

epidebug/
├── epidebug/                    # Core Python package
│   ├── schema.py               # Pydantic data models
│   ├── scoring.py              # Multi-dimensional scoring engine
│   ├── tools/                  # Mock scientific instruments & databases
│   └── utils/                  # Trajectory logging, helpers
├── test_cases/                  # Benchmark test cases (YAML)
│   ├── reagent_flaw/           # Category 1
│   ├── instrumentation/        # Category 2
│   ├── protocol_loophole/      # Category 3
│   └── flawed_hypothesis/      # Category 4
├── mock_data/                   # Pre-recorded instrument data
├── scripts/                     # CLI tools
│   ├── run_benchmark.py        # Main runner
│   ├── run_reference_models.py # Deterministic reference baselines
│   ├── validate_cases.py       # Schema validator
│   └── generate_report.py      # HTML report generator
├── results/                     # Output directory
├── docs/                        # Documentation
└── tests/                       # Unit tests

📝 Writing Test Cases

See docs/test_case_authoring.md for the full guide. For source discovery and curation strategy, see docs/data_sources.md.

Each test case is a YAML file following this schema:

id: "RF-001"
title: "Buffer pH Drift in Protein Purification"
domain: "molecular_biology"
subdomain: "protein_purification"
failure_category: "reagent_material_flaw"
difficulty: "medium"

objective: |
  Purify recombinant His-tagged protein from E. coli...

protocol:
  - step: 1
    action: "Grow cells to OD600 = 0.6"
    details: "250 mL LB + ampicillin, 37°C, 180 rpm"

telemetry:
  sds_page:
    description: "SDS-PAGE gel analysis"
    observations:
      - "Target band faint in elution, heavy in flow-through"

contextual_clues:
  - "Buffer was prepared 3 weeks ago at room temperature"

ground_truth:
  root_cause: "Buffer pH drifted from 8.0 to 6.5..."
  causal_chain:
    - "Tris buffer left at room temperature for 3 weeks"
    - "CO2 dissolved into buffer, forming carbonic acid"
    - "pH dropped from 8.0 to ~6.5"
    - "His-tag protonated, lost affinity for Ni-NTA"
  intervention: "Re-prepare fresh buffer, verify pH, rerun purification"
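
To illustrate how a parsed case maps onto typed objects, here is a minimal sketch. The project structure lists Pydantic models in `epidebug/schema.py`; this example swaps in standard-library dataclasses so it runs without dependencies. The class names, the `parse_case` helper, and the choice of required fields are assumptions made for illustration, and the input is the kind of dictionary a YAML loader would return for the case above:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GroundTruth:
    root_cause: str
    causal_chain: list[str]
    intervention: str


@dataclass
class CaseRecord:
    id: str
    title: str
    domain: str
    failure_category: str
    difficulty: str
    ground_truth: GroundTruth


# Assumption: these top-level keys are treated as mandatory.
REQUIRED = ("id", "title", "domain", "failure_category", "difficulty", "ground_truth")


def parse_case(raw: dict) -> CaseRecord:
    """Check required keys and build a typed record from a parsed YAML dict."""
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    truth = GroundTruth(**raw["ground_truth"])
    if not truth.causal_chain:
        raise ValueError("causal_chain needs at least one step")
    fields = {key: raw[key] for key in REQUIRED if key != "ground_truth"}
    return CaseRecord(ground_truth=truth, **fields)


case = parse_case({
    "id": "RF-001",
    "title": "Buffer pH Drift in Protein Purification",
    "domain": "molecular_biology",
    "failure_category": "reagent_material_flaw",
    "difficulty": "medium",
    "ground_truth": {
        "root_cause": "Buffer pH drifted from 8.0 to 6.5...",
        "causal_chain": [
            "Tris buffer left at room temperature for 3 weeks",
            "CO2 dissolved into buffer, forming carbonic acid",
            "pH dropped from 8.0 to ~6.5",
            "His-tag protonated, lost affinity for Ni-NTA",
        ],
        "intervention": "Re-prepare fresh buffer, verify pH, rerun purification",
    },
})
print(case.id, len(case.ground_truth.causal_chain))
```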

🏆 Scoring

Multi-Dimensional Score (0.0 – 1.0)

Component	Weight	Method
Root Cause ID	30%	Keyword match + semantic similarity
Causal Chain	40%	LLM-as-judge with structured rubric
Intervention	30%	LLM-as-judge with structured rubric
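
The "keyword match + semantic similarity" method for Root Cause ID can be sketched roughly as follows. This is not the code in `epidebug/scoring.py`: token-overlap (Jaccard) similarity stands in for a real semantic similarity measure such as embedding distance, and the 50/50 blend, the function names, and the example strings are all assumptions:

```python
from __future__ import annotations

import re


def tokens(text: str) -> set[str]:
    """Lowercase a string and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_recall(answer: str, keywords: list[str]) -> float:
    """Fraction of expected keywords whose tokens all appear in the answer."""
    if not keywords:
        return 0.0
    answer_tokens = tokens(answer)
    hits = sum(1 for kw in keywords if tokens(kw) <= answer_tokens)
    return hits / len(keywords)


def jaccard(answer: str, reference: str) -> float:
    """Token-overlap similarity; a lexical stand-in for semantic similarity."""
    a, b = tokens(answer), tokens(reference)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def root_cause_score(
    answer: str, reference: str, keywords: list[str], keyword_weight: float = 0.5
) -> float:
    """Blend keyword recall with similarity (blend ratio is an assumption)."""
    return (
        keyword_weight * keyword_recall(answer, keywords)
        + (1.0 - keyword_weight) * jaccard(answer, reference)
    )


# Hypothetical model answer and reference, for illustration only.
answer = "The Tris buffer pH drifted low after absorbing CO2"
reference = "Buffer pH drifted because it absorbed CO2"
score = root_cause_score(answer, reference, ["buffer", "pH", "CO2"])
print(round(score, 2))
```

A purely lexical measure penalizes correct paraphrases, which is one reason a semantic component is paired with the keyword check.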

Trajectory Metrics (reported separately)

Tool Efficiency: Productive calls / total calls
Budget Compliance: Whether the model stayed within max_tool_calls
Anti-Pattern Detection: Flags confirmation bias and premature commitment
Calibration: Stated confidence vs. actual accuracy
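
The first two trajectory metrics are simple ratios and checks, sketched below. How EpiDebug decides that a call is "productive" is not described here, so the sketch assumes each call already carries that label. The calibration gap shown is one possible formulation rather than the benchmark's definition, and the tool names are hypothetical:

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    productive: bool  # assumption: labeled upstream as productive or not


def tool_efficiency(calls: list[ToolCall]) -> float:
    """Productive calls / total calls; 0.0 when no calls were made."""
    if not calls:
        return 0.0
    return sum(call.productive for call in calls) / len(calls)


def within_budget(calls: list[ToolCall], max_tool_calls: int) -> bool:
    """True if the run used no more than max_tool_calls tool calls."""
    return len(calls) <= max_tool_calls


def calibration_gap(confidence: float, score: float) -> float:
    """Absolute gap between stated confidence and achieved score."""
    return abs(confidence - score)


# Hypothetical agent-mode trajectory.
calls = [
    ToolCall("measure_ph", productive=True),
    ToolCall("run_gel", productive=True),
    ToolCall("check_freezer_log", productive=False),
    ToolCall("measure_ph", productive=False),  # repeated call, nothing new
]
print(tool_efficiency(calls), within_budget(calls, max_tool_calls=5))
```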

📄 License

Apache 2.0

🤝 Contributing

We welcome contributions! The easiest ways to help:

Write test cases — See the authoring guide (docs/test_case_authoring.md)
Validate cases — Expert review of existing cases
Run evaluations — Test new models and share results

📚 Citation

@misc{epidebug2025,
  title={EpiDebug: An Epistemic Debugging Benchmark for AI Scientific Reasoning},
  year={2025},
  url={https://github.com/your-org/epidebug}
}

