Punyajain-cmd/Epistemic-Debugging-Benchmark

🔬 EpiDebug — Epistemic Debugging Benchmark

Can AI models diagnose why experiments fail — the way human researchers do?

EpiDebug is an evaluation benchmark that tests AI models on their ability to perform epistemic debugging: identifying why a scientific experiment or engineering process produced unexpected results, tracing the causal chain from root cause to observed symptoms, and proposing interventions to confirm the diagnosis.


🎯 What Makes This Different

Unlike benchmarks that test factual recall (MMLU) or code generation (SWE-bench), EpiDebug tests scientific intuition — the ability to reason about why things go wrong in the real world.

Each test case presents the AI with exactly what a human researcher would see:

Layer            | What It Contains
-----------------|---------------------------------------------------------
Objective        | What the experiment aimed to achieve
Protocol         | Step-by-step instructions that were followed
Telemetry        | Raw data: sensor readings, gel images, spectra, yields
Contextual Clues | Environmental details (some relevant, some red herrings)

The AI must then provide:

  1. Root Cause (30% weight) — Identify the single variable that failed
  2. Causal Chain (40% weight) — Trace the mechanism step-by-step
  3. Intervention (30% weight) — Propose a counterfactual experiment to confirm

🧪 Four Categories of Failure

Category                 | Example
-------------------------|----------------------------------------------------
Reagent/Material Flaw    | Buffer pH drifted due to CO₂ absorption
Instrumentation Artifact | Nanodrop pedestal contamination gave false readings
Protocol/Human Loophole  | Forgot to vortex glycerol stock before inoculating
Flawed Hypothesis        | Assumed SN2 mechanism when E2 dominated

📊 Domains Covered

  • Molecular biology (protein purification, PCR, gel electrophoresis)
  • Chemistry (organic synthesis, Grignard reactions, catalysis)
  • Physics and engineering (signal processing, materials science, FEA)
  • Biotechnology (cell culture, flow cytometry, assays)
  • Clinical research (sample quality, biomarkers, pre-analytical artifacts)
  • Software systems (cloud observability, distributed traces, production incidents)
  • Manufacturing (CNC process control, equipment/tool wear, metrology)
  • Civil engineering (structural inspection, concrete defect detection, field validation)
  • Aerospace, energy, environmental science, materials science, and robotics seed cases

The current starter set contains 30 validated YAML cases across these domains.


🚀 Quick Start

Installation

# Clone the repository
git clone https://github.com/your-org/epidebug.git
cd epidebug

# Install with pip
pip install -e ".[all]"

# Or install minimal (no LLM dependencies)
pip install -e .

Validate Test Cases

python scripts/validate_cases.py --all -v

Run the Benchmark

# Text mode: present all info, ask for diagnosis
python scripts/run_benchmark.py --model gpt-4o --all --mode text

# Agent mode: model uses tools to investigate
python scripts/run_benchmark.py --model gpt-4o --all --mode agent

# Single case (great for development)
python scripts/run_benchmark.py --model gpt-4o --case RF-001 --dry-run

# Run a specific category
python scripts/run_benchmark.py --model claude-sonnet --category reagent_material_flaw

# Compare models
python scripts/run_benchmark.py --model gpt-4o --all -o results/gpt4o.json
python scripts/run_benchmark.py --model claude-sonnet --all -o results/claude.json

Generate Report

python scripts/generate_report.py --input results/ --output report.html

Run Reference Baselines

Reference baselines are deterministic sanity checks, not LLM results. They verify that the cases, scorer, result schema, and report generator work together end to end before you run paid, provider-backed evaluations.

python scripts/run_reference_models.py --all
python scripts/generate_report.py --input results/reference --output results/reference_report.html

📁 Project Structure

epidebug/
├── epidebug/                    # Core Python package
│   ├── schema.py               # Pydantic data models
│   ├── scoring.py              # Multi-dimensional scoring engine
│   ├── tools/                  # Mock scientific instruments & databases
│   └── utils/                  # Trajectory logging, helpers
├── test_cases/                  # Benchmark test cases (YAML)
│   ├── reagent_flaw/           # Category 1
│   ├── instrumentation/        # Category 2
│   ├── protocol_loophole/      # Category 3
│   └── flawed_hypothesis/      # Category 4
├── mock_data/                   # Pre-recorded instrument data
├── scripts/                     # CLI tools
│   ├── run_benchmark.py        # Main runner
│   ├── validate_cases.py       # Schema validator
│   └── generate_report.py      # HTML report generator
├── results/                     # Output directory
├── docs/                        # Documentation
└── tests/                       # Unit tests

📝 Writing Test Cases

See docs/test_case_authoring.md for the full guide. For source discovery and curation strategy, see docs/data_sources.md.

Each test case is a YAML file following this schema:

id: "RF-001"
title: "Buffer pH Drift in Protein Purification"
domain: "molecular_biology"
subdomain: "protein_purification"
failure_category: "reagent_material_flaw"
difficulty: "medium"

objective: |
  Purify recombinant His-tagged protein from E. coli...

protocol:
  - step: 1
    action: "Grow cells to OD600 = 0.6"
    details: "250 mL LB + ampicillin, 37°C, 180 rpm"

telemetry:
  sds_page:
    description: "SDS-PAGE gel analysis"
    observations:
      - "Target band faint in elution, heavy in flow-through"

contextual_clues:
  - "Buffer was prepared 3 weeks ago at room temperature"

ground_truth:
  root_cause: "Buffer pH drifted from 8.0 to 6.5..."
  causal_chain:
    - "Tris buffer left at room temperature for 3 weeks"
    - "CO2 dissolved into buffer, forming carbonic acid"
    - "pH dropped from 8.0 to ~6.5"
    - "His-tag protonated, lost affinity for Ni-NTA"
  intervention: "Re-prepare fresh buffer, verify pH, rerun purification"
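As a rough illustration of what a structural check on a case might look like, the sketch below inspects an already-parsed case (a plain Python dict, as a YAML loader would produce) for the fields shown in the example above. It is not the project's validator: the function name `find_problems` is hypothetical, and treating every field in the example as required is an assumption. The actual rules live in `epidebug/schema.py` and `scripts/validate_cases.py`.

```python
# Hypothetical structural check; field names are copied from the example case above.
# Assumption: every field shown in the example is required.
REQUIRED_TOP_LEVEL = (
    "id", "title", "domain", "subdomain", "failure_category", "difficulty",
    "objective", "protocol", "telemetry", "contextual_clues", "ground_truth",
)
REQUIRED_GROUND_TRUTH = ("root_cause", "causal_chain", "intervention")


def find_problems(case: dict) -> list[str]:
    """Return human-readable problems; an empty list means the case looks complete."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in case]

    truth = case.get("ground_truth")
    if truth is None:
        return problems  # already reported as a missing top-level field
    if not isinstance(truth, dict):
        problems.append("ground_truth must be a mapping")
        return problems

    problems += [
        f"missing ground_truth field: {key}"
        for key in REQUIRED_GROUND_TRUTH
        if key not in truth
    ]
    # The causal chain is a list of steps, so an empty or scalar value is suspicious.
    if "causal_chain" in truth:
        chain = truth["causal_chain"]
        if not isinstance(chain, list) or not chain:
            problems.append("ground_truth.causal_chain must be a non-empty list")
    return problems
```

Returning a list of problems rather than raising on the first one lets an author fix a new case in a single pass.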

🏆 Scoring

Multi-Dimensional Score (0.0 – 1.0)

Component     | Weight | Method
--------------|--------|-------------------------------------
Root Cause ID | 30%    | Keyword match + semantic similarity
Causal Chain  | 40%    | LLM-as-judge with structured rubric
Intervention  | 30%    | LLM-as-judge with structured rubric
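To make the weighting concrete, here is a minimal sketch of how three per-component scores could be combined into one overall score. The weights come from the table above. The function name `composite_score` and the component keys are illustrative assumptions, not the interface of `epidebug/scoring.py`, and the sketch assumes each component has already been scored in the range 0.0 to 1.0.

```python
# Weights from the scoring table; the dictionary keys are illustrative names.
WEIGHTS = {"root_cause": 0.30, "causal_chain": 0.40, "intervention": 0.30}


def composite_score(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each expected in [0.0, 1.0]."""
    missing = sorted(WEIGHTS.keys() - components.keys())
    if missing:
        raise ValueError(f"missing components: {missing}")
    for name in WEIGHTS:
        if not 0.0 <= components[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(weight * components[name] for name, weight in WEIGHTS.items())


# A correct root cause, a half-right causal chain, and no usable intervention:
# 0.30 * 1.0 + 0.40 * 0.5 + 0.30 * 0.0 = 0.5
score = composite_score({"root_cause": 1.0, "causal_chain": 0.5, "intervention": 0.0})
```

Because the causal chain carries the largest weight, a model that names the right root cause but cannot explain the mechanism tops out at 0.6 even with a perfect intervention.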

Trajectory Metrics (reported separately)

  • Tool Efficiency: Productive calls / total calls
  • Budget Compliance: Stayed within max_tool_calls?
  • Anti-Pattern Detection: Confirmation bias, premature commitment
  • Calibration: Confidence vs. actual accuracy
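The first two trajectory metrics and a simple calibration measure can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: the `ToolCall` record, the `productive` flag (the README does not say how a call is judged productive), and the absolute-gap definition of calibration are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation in an agent-mode run."""
    name: str
    productive: bool  # assumption: some upstream labeling marks useful calls


def tool_efficiency(calls: list[ToolCall]) -> float:
    """Productive calls divided by total calls; 0.0 when no calls were made."""
    if not calls:
        return 0.0
    return sum(call.productive for call in calls) / len(calls)


def within_budget(calls: list[ToolCall], max_tool_calls: int) -> bool:
    """True when the run used no more than max_tool_calls tool calls."""
    return len(calls) <= max_tool_calls


def calibration_gap(confidence: float, score: float) -> float:
    """Absolute gap between stated confidence and achieved score (0.0 is best)."""
    return abs(confidence - score)


# Example run: four tool calls, three of which moved the diagnosis forward.
run = [
    ToolCall("measure_ph", productive=True),
    ToolCall("read_gel", productive=True),
    ToolCall("measure_ph", productive=False),  # repeated check, nothing new learned
    ToolCall("lookup_reagent", productive=True),
]
efficiency = tool_efficiency(run)  # 3 / 4 = 0.75
```

Keeping these metrics separate from the composite score, as the section describes, lets a report distinguish a model that reaches the right diagnosis efficiently from one that gets there by exhausting its tool budget.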

📄 License

Apache 2.0

🤝 Contributing

We welcome contributions! The easiest way to help:

  1. Write test cases — See the authoring guide
  2. Validate cases — Expert review of existing cases
  3. Run evaluations — Test new models and share results

📚 Citation

@misc{epidebug2025,
  title={EpiDebug: An Epistemic Debugging Benchmark for AI Scientific Reasoning},
  year={2025},
  url={https://github.com/your-org/epidebug}
}

About

Building a new benchmark for AI models, to make them more useful for AI4Science. Currently, there are no benchmarks that test models on metacognitive abilities. The core question is whether frontier LLMs can pinpoint exactly why an experiment or an engineering process failed, using human-like intuition.
