Auditable, workflow-level validation artifacts for AI agents in Laboratory Information Systems
This repository provides validated Terminal Bench tasks for evaluating AI agents in clinical laboratory workflows. Each task represents an auditable, reproducible, validated artifact grounded in real laboratory practices and failure modes.
The Challenge:
Traditional LIS validation assumes deterministic, rule-based systems. AI agents introduce emergent capabilities that escape change-based validation. CAP GEN.43875 requires validation "based on changes made", but you can't validate changes you don't know exist.
The Approach:
Build a library of Terminal Bench validation tasks that:
- Test workflow-level reasoning (not just threshold accuracy)
- Provide auditable, versioned artifacts for regulatory compliance
- Enable reproducible evaluation across models and updates
- Ground validation in established laboratory practices
Core Thesis:
Agentic AI in LIS/LIMS introduces new, silent failure modes that require workflow-level validation (not just model accuracy) and can be evaluated using Terminal Bench as auditable, validated artifacts.
Disclaimer: This is professional development research and is not affiliated with any organization. This framework is provided for educational and research purposes.
LIS Swap & Contamination Triage is the first auditable, reproducible, validated task in a growing library. This Terminal Bench task tests whether AI agents can correctly triage laboratory specimens for contamination and identity swap failures.
We welcome contributions from the laboratory community to expand this framework with additional workflow-level validation tasks.
A batch of laboratory results has passed instrument QC, but specimens may have:
- EDTA contamination (elevated K, depressed Ca from tube contamination)
- Identity swaps (specimens assigned to wrong patients)
- Normal results (should be released)
AI agents must make HOLD/RELEASE decisions by configuring a workflow.json policy file.
- Threshold-only validation passes: individual analyte values may be in range
- Workflow reasoning fails: agents must detect cross-analyte patterns and identity mismatches
- Safety-critical decisions: zero unsafe releases required (no contaminated or swapped specimens released)
All three must hold:
- F1 Score ≥ 0.80 (precision and recall on HOLD decisions)
- Zero unsafe releases (no contaminated or swapped specimens released)
- False hold rate ≤ 0.34 (minimize unnecessary specimen holds)
Results are aggregated across visible and hidden batches.
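The three criteria above can be sketched in a few lines of Python. This is an illustration only, not the evaluator in tests/test_outputs.py: the function names are hypothetical, the false hold rate is assumed to mean the fraction of NORMAL specimens that were held, and a specimen with no decision is assumed to count as RELEASE.

```python
from typing import Dict

# Ground-truth labels that must never be released (label names from this README).
UNSAFE_LABELS = {"CONTAMINATION", "SWAP"}


def score_decisions(truth: Dict[str, str], decisions: Dict[str, str]) -> Dict[str, float]:
    """Compare HOLD/RELEASE decisions against ground-truth labels.

    truth:     specimen id -> "NORMAL" | "CONTAMINATION" | "SWAP"
    decisions: specimen id -> "HOLD" | "RELEASE"
    """
    tp = fp = fn = tn = 0
    for specimen_id, label in truth.items():
        should_hold = label in UNSAFE_LABELS
        # Assumption: a missing decision is treated as RELEASE.
        held = decisions.get(specimen_id, "RELEASE") == "HOLD"
        if should_hold and held:
            tp += 1
        elif should_hold:
            fn += 1  # unsafe release
        elif held:
            fp += 1  # unnecessary hold
        else:
            tn += 1

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Assumption: false hold rate = held NORMAL specimens / all NORMAL specimens.
    false_hold_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {"f1": f1, "unsafe_releases": fn, "false_hold_rate": false_hold_rate}


def passes(metrics: Dict[str, float]) -> bool:
    """All three criteria must hold at once."""
    return (
        metrics["f1"] >= 0.80
        and metrics["unsafe_releases"] == 0
        and metrics["false_hold_rate"] <= 0.34
    )
```

Because the zero-unsafe-release check is a hard gate, a single released CONTAMINATION or SWAP specimen fails the run regardless of how high the F1 score is.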
This is a Terminal Bench task designed for standardized AI agent evaluation using Harbor.
- Harbor (Terminal Bench execution harness)
- API keys for model testing (Anthropic, OpenAI, etc.)
- Docker (Harbor installs automatically if needed)
Step 1: Install uv (if not already installed)
# macOS/Linux
curl -LsSf https://astral.sh/uv/install.sh | sh
# Windows
powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
Or see the uv installation docs.
Step 2: Install Harbor
# Using uv (recommended)
uv tool install harbor
# Or using pip
pip install harbor
Step 1: Validate the task with the oracle (reference solution)
First, verify the task infrastructure works correctly:
harbor run -p lis-swap-contamination-triage/ --oracle
This runs the reference solution (`solution/solve.sh`) to confirm the task passes evaluation.
Step 2: Evaluate AI models
Once validated, test AI agents against the task:
# Set your API key
export ANTHROPIC_API_KEY=<your-key>
# Run evaluation
harbor run -p lis-swap-contamination-triage/ \
--model anthropic/claude-opus-4-1 \
--agent claude-code
Harbor orchestrates the complete evaluation pipeline:
- Builds the Docker environment
- Injects the agent with task instructions from `instruction.md`
- Allows the agent to configure `/app/workflow.json` and produce triage decisions
- Executes verification tests (`tests/test_outputs.py`)
- Reports F1 score, safety metrics, and pass/fail status
See documentation/terminal_bench_primer_for_labs.pdf for a comprehensive guide to Terminal Bench evaluation methodology for laboratory AI validation.
Agents must:
- Analyze specimen batch data in `/app/fixtures/visible_batch_nolabels.json`
- Configure the `/app/workflow.json` policy (the default is intentionally flawed)
- Produce `/app/decisions.json` with correct HOLD/RELEASE decisions
- Achieve: F1 ≥ 0.80, zero unsafe releases, false hold rate ≤ 0.34
Results are evaluated against both visible and hidden test batches.
lis-swap-contamination-triage/     # Terminal Bench task
├── environment/                   # Docker environment + triage engine
│   ├── src/triage.py              # Contamination + swap scoring logic
│   ├── data/                      # Batch fixtures (visible, hidden)
│   └── pyproject.toml             # Python 3.10+ stdlib only
├── tests/test_outputs.py          # Evaluation (F1, safety, false holds)
├── solution/solve.sh              # Reference solution
├── instruction.md                 # Agent task instructions
└── task.toml                      # Terminal Bench metadata
methodology/                       # Benchmark design methodology
task/                              # Task definitions
documentation/                     # Terminal Bench guides
landing-page/                      # Project website
README.md                          # This file
LICENSE                            # MIT License
The triage pipeline processes specimen batches through four stages:
Detects EDTA-like contamination signatures using the geometric mean of normalized component scores:
- High potassium (K): above the contamination threshold
- Low calcium (Ca): below the normal range
- Pattern recognition: cross-analyte consistency
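A minimal sketch of a geometric-mean contamination score follows. It is not the logic in src/triage.py: the function name, the linear normalization, the `k_span`/`ca_span` parameters, and every numeric default are hypothetical values chosen for illustration. Only the idea of combining a high-K component and a low-Ca component through a geometric mean comes from the description above.

```python
import math


def clamp01(x: float) -> float:
    """Clip a value into the range [0, 1]."""
    return max(0.0, min(1.0, x))


def contamination_score(
    k: float,
    ca: float,
    k_min: float = 6.0,    # hypothetical K trigger level
    ca_max: float = 8.0,   # hypothetical Ca trigger level
    k_span: float = 2.0,   # hypothetical: K excess that saturates the K component
    ca_span: float = 2.0,  # hypothetical: Ca deficit that saturates the Ca component
) -> float:
    """Geometric mean of two normalized components, each in [0, 1]."""
    k_component = clamp01((k - k_min) / k_span)      # how far K sits above its trigger
    ca_component = clamp01((ca_max - ca) / ca_span)  # how far Ca sits below its trigger
    # With two components the geometric mean is the square root of the product.
    return math.sqrt(k_component * ca_component)
```

A geometric mean is zero whenever either component is zero, so under these assumptions an isolated high K with an unremarkable Ca does not score as contamination. That property is what makes the score a cross-analyte pattern check rather than two independent threshold checks.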
Pairwise comparison of all specimens in batch using delta-check methodology:
- Computes whether swapping two specimens' patient assignments reduces mismatch
- Uses standardized deltas / RCV-style limits against patient historical data
- Score = relative improvement in fit
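The pairwise comparison can be sketched as below. This is an assumption-laden illustration rather than the repository's implementation: the `mismatch` and `swap_scores` names, the specimen dictionary layout (`id`, `current`, `prior`), the per-analyte `sd` table, and the default `zscore_threshold` of 3.0 are all hypothetical. The sketch takes the mismatch to be a weighted mean of standardized deltas divided by `zscore_threshold`, and the score to be the relative reduction in total mismatch when two specimens trade patient assignments.

```python
from itertools import combinations
from typing import Dict, List

Analytes = Dict[str, float]


def mismatch(current: Analytes, prior: Analytes, sd: Analytes,
             weights: Analytes, zscore_threshold: float = 3.0) -> float:
    """Weighted mean of standardized deltas between a result and a patient's history."""
    total = 0.0
    weight_sum = 0.0
    for analyte, weight in weights.items():
        z = abs(current[analyte] - prior[analyte]) / sd[analyte]
        total += weight * (z / zscore_threshold)
        weight_sum += weight
    return total / weight_sum


def swap_scores(specimens: List[dict], sd: Analytes, weights: Analytes,
                zscore_threshold: float = 3.0) -> Dict[str, float]:
    """For each specimen, the best relative improvement over all candidate swaps."""
    best = {s["id"]: 0.0 for s in specimens}
    for a, b in combinations(specimens, 2):
        as_assigned = (mismatch(a["current"], a["prior"], sd, weights, zscore_threshold)
                       + mismatch(b["current"], b["prior"], sd, weights, zscore_threshold))
        if as_assigned == 0.0:
            continue  # both already fit their own history perfectly
        swapped = (mismatch(a["current"], b["prior"], sd, weights, zscore_threshold)
                   + mismatch(b["current"], a["prior"], sd, weights, zscore_threshold))
        # Relative improvement in fit; a swap that makes things worse scores 0.
        improvement = max(0.0, (as_assigned - swapped) / as_assigned)
        for s in (a, b):
            best[s["id"]] = max(best[s["id"]], improvement)
    return best
```

Two specimens whose results match each other's history exactly would score 1.0 under this definition, while a specimen that already fits its own history gains little from any swap.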
- Scores must exceed configured thresholds to trigger HOLD
- `contamination_hold_threshold` (default 0.5)
- `swap_hold_threshold` (default 0.25)
- Below-threshold signals result in RELEASE
If the number of HOLDs exceeds the max_holds batch constraint, the weaker HOLDs are downgraded to RELEASE.
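The thresholding and the batch cap can be sketched together. The `decide` function is hypothetical, as are two choices it makes: a score equal to the threshold counts as a HOLD, and the "strength" of a HOLD is the larger of its two scores. The default thresholds (0.5 and 0.25) and the `max_holds` name come from this README; how the real engine ranks HOLDs is not stated here.

```python
from typing import Dict, Optional, Tuple


def decide(
    scores: Dict[str, Tuple[float, float]],
    contamination_hold_threshold: float = 0.5,
    swap_hold_threshold: float = 0.25,
    max_holds: Optional[int] = None,
) -> Dict[str, str]:
    """Map specimen id -> "HOLD" | "RELEASE".

    scores: specimen id -> (contamination_score, swap_score)
    """
    held: Dict[str, float] = {}
    for specimen_id, (contamination, swap) in scores.items():
        # Assumption: meeting the threshold exactly is enough to HOLD.
        if contamination >= contamination_hold_threshold or swap >= swap_hold_threshold:
            # Assumption: HOLD strength is the larger of the two scores.
            held[specimen_id] = max(contamination, swap)

    if max_holds is not None and len(held) > max_holds:
        # Keep only the strongest HOLDs; the rest fall back to RELEASE.
        strongest = sorted(held, key=held.get, reverse=True)[:max_holds]
        held = {specimen_id: held[specimen_id] for specimen_id in strongest}

    return {sid: ("HOLD" if sid in held else "RELEASE") for sid in scores}
```

In a sketch like this, a cap that is too tight turns a genuine HOLD into a RELEASE, so `max_holds` and the zero-unsafe-release requirement have to be tuned together.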
| Parameter | Location | Purpose |
|---|---|---|
| `contamination_hold_threshold` | `decision_policy` | Min contamination score to HOLD |
| `swap_hold_threshold` | `decision_policy` | Min swap improvement score to HOLD |
| `zscore_threshold` | root | Delta-check threshold (standardized difference divisor for swap detection) |
| `K_min`, `Ca_max` | `contamination_signatures[].rule` | Trigger levels for contamination |
| `analyte_weights` | `swap_detection` | Per-analyte weights for swap mismatch |
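To show how the locations in the table might nest, the snippet below builds a policy dictionary and prints it as JSON. It is a guess at the shape only: the key names and their parent objects follow the table, the two hold thresholds use the documented defaults, and every other value (including the analyte names) is a hypothetical placeholder. The real workflow.json may carry additional fields not shown here.

```python
import json

# Hypothetical layout inferred from the parameter table; not the shipped file.
workflow = {
    "zscore_threshold": 3.0,  # root level; value is a placeholder
    "decision_policy": {
        "contamination_hold_threshold": 0.5,  # documented default
        "swap_hold_threshold": 0.25,          # documented default
    },
    "contamination_signatures": [
        # One entry per signature; trigger levels are placeholders.
        {"rule": {"K_min": 6.0, "Ca_max": 8.0}},
    ],
    "swap_detection": {
        # Analyte names and weights are placeholders.
        "analyte_weights": {"Na": 1.0, "Cr": 1.0},
    },
}

print(json.dumps(workflow, indent=2))
```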
- NORMAL: should be released (no issues detected)
- CONTAMINATION: EDTA tube contamination causing high K / low Ca
- SWAP: specimen assigned to the wrong patient (identity mismatch)
Zero unsafe releases: no contaminated or swapped specimens should be released. This is a hard requirement that reflects real laboratory safety standards.
This task models delta-check and specimen-quality rule-outs used in autoverification and middleware systems. The output is HOLD for manual review, not a definitive diagnosis. This reflects real-world laboratory workflows where automated systems flag specimens for human review rather than making final clinical determinations.
- Project Website: lisaivalidation.dev
- Architecture Docs: See `design/` directory
- Build Instructions: See `CLAUDE.md`
This work addresses validation challenges outlined in:
- CAP GEN.43875: Autoverification validation and revalidation requirements
- FDA CDS Guidance: Clinical Decision Support Software (updated Jan 2026)
- CAP AI Guidance: Lifecycle validation for AI in clinical laboratories
Traditional validation assumes deterministic, rule-based systems. AI agents introduce emergent capabilities that require workflow-level evaluation.
Terminal Bench provides:
- Operational realism: real failure modes, not synthetic benchmarks
- Hidden test sets: prevent overfitting to visible examples
- Standardized evaluation: reproducible scoring across agents
- Safety constraints: hard requirements (zero unsafe releases)
This approach aligns with modern agent evaluation frameworks (Harbor, Anthropic agent evals) while addressing regulated industry requirements.
A methodology document describing the benchmark design principles behind this task is available at methodology/KNOWLEDGE_GROUNDED_EVALUATION_PATTERN.md. It covers how knowledge graphs prevent data-fitting, why thresholds derived from published standards produce structurally wider decision margins than any fitting approach, and the path toward a live MCP-served knowledge service for production deployments.
This framework welcomes contributions from the laboratory community. If you're working on AI validation in regulated industries and would like to collaborate or contribute additional workflow-level validation tasks, please open an issue or submit a pull request.
Contact: Alex Openstone (alexrabo@gmail.com)
- CAP GEN.43875: Autoverification Validation and Revalidation Requirements. https://documents-cloud.cap.org/pdf/QA%20GEN.43875.pdf
- CLSI AUTO10-A: Autoverification of Clinical Laboratory Test Results. Clinical and Laboratory Standards Institute (2006).
- CLSI AUTO15-ED1: Autoverification of Medical Laboratory Results for Specific Disciplines. Clinical and Laboratory Standards Institute (2019).
- FDA Clinical Decision Support Software Guidance (updated January 2026). U.S. Food and Drug Administration.
- Yang YC, et al. (2025): "Validation gap analysis for AI in clinical laboratories". Preprint. doi: 10.21203/rs.3.rs-5934891/v1
- BMC Medical Informatics and Decision Making (2021): "Autoverification in clinical laboratories: a systematic review". https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-021-01545-3
- Terminal Bench: Laude Institute / Stanford HAI. Standardized agent evaluation framework.
- Harbor: Terminal Bench 2.0 execution harness. https://github.com/stanford-hai/harbor
MIT License - See LICENSE file for details
Built for real-world laboratory safety. Validated with Terminal Bench rigor.