Formal-verification meta-layer for clinical AI. Every prospective clinical recommendation becomes a machine-checkable certificate: an SMT theorem proven against named guidelines, a Monte Carlo trajectory over patient priors, an entropy-ranked information-gain budget, and a signed FHIR transaction bundle the auditor can verify offline.
The thesis: LLMs are pinned out of every clinical decision. Z3 decides safety, JAX decides trajectory, entropy decides value-of-information. The LLM is invoked exactly once — to draft adversarial counter-arguments that are then re-checked by the same theorem prover.
| Surface | URL | What it serves |
|---|---|---|
| MCP server | https://lumen-mcp-ag707.fly.dev/mcp |
Five engine tools (verify, simulate, voi, adversary, compose) over MCP |
| A2A orchestrator | https://lumen-orchestrator-ag707.fly.dev/ |
Agent-to-Agent endpoint (/.well-known/agent-card.json) for BYO clinical agents |
Both are deployed on Fly.io. Local quick-start in DEMO.md runs the full pipeline against five committed hero patients in < 60 seconds.
Margaret, 72F: CKD stage 3b (eGFR 38), T2DM (HbA1c 58), HTN, on ramipril + bendroflumethiazide. A junior doctor asks the LLM to prescribe ibuprofen 400 mg TID for an osteoarthritic knee.
- The consult lands in the A2A orchestrator, which fans out into the MCP engine tools.
- VERIFY returns
PROVEN_UNSAFE. Z3 cites NICE NG203 §1.4.7 — the "triple whammy" (NSAID + ACE-I + diuretic in CKD) AKI risk. The proof is an SMT certificate, not an opinion. - SIMULATE projects 30-day AKI risk: 12% [8%, 18%] with the recommendation vs 4% [2%, 7%] without. JAX Monte Carlo, 10 000 samples, workspace-cohort priors where N ≥ 30 and named meta-analysis priors otherwise — priors are labelled on the brief.
- ADVERSARY (the only LLM call site) drafts paracetamol 1 g QID. The same theorem suite re-verifies it:
PROVEN_SAFE. - COMPOSE emits a FHIR R4 transaction bundle:
Composition+Provenance(with the SMT proof attachment and trajectory seeds) +AuditEvent. Each entry is ed25519-signed; entries chain by SHA-256 forward hash.
End-to-end ≈ 16 seconds. The full bundle ships in lumen/fhir-bundles/patient-margaret-henderson.json with the chart notes that drive it under lumen/fhir-bundles/documents/margaret-henderson/.
flowchart TB
A[BYO Clinical Agent] -->|A2A consult| B[Lumen Orchestrator<br/>FastAPI · /.well-known/agent-card]
B -->|self-orchestrate| C[Lumen MCP<br/>FastMCP · workers=1]
C --> V[VERIFY<br/>Z3 SMT · Guideline Garden]
C --> S[SIMULATE<br/>JAX Monte Carlo · DoWhy · EconML]
C --> I[VOI<br/>Sobol + entropy<br/>over verifier]
C --> X[ADVERSARY<br/>RAG + LLM<br/>then re-VERIFY]
V --> O[COMPOSE]
S --> O
I --> O
X --> O
O --> R[Signed FHIR Transaction Bundle<br/>Composition + Provenance + AuditEvent<br/>ed25519 · SHA-256 chain]
Engine modules:
lumen/lumen_core/garden/— DSL → SMT compiler. Each named guideline is authored in a Python DSL file and compiles to a.smt2theorem file that Z3 re-checks against the V100 test corpus.lumen/lumen_core/cohort/— Polars-indexed patient cohort with PSM (propensity-score matching) for sub-population priors.lumen/lumen_core/pkpd/— JAX PK/PD models for trajectory simulation.lumen/lumen_mcp/— FastMCP server exposing the five engine tools.lumen/lumen_orchestrator/— FastAPI A2A endpoint with self-fan-out into the MCP.shared/— vendored FHIR client (R4 transactions), ed25519 audit chain, LLM client, types.
Full module-by-module walkthrough: ARCHITECTURE.md.
Five named clinical guidelines compile to SMT theorems. Each row is the source PDF the DSL was authored from, the compiled .smt2 artefact Z3 checks against, and a theorem count.
| Guideline | Compiled artefact | Theorems |
|---|---|---|
| NICE NG28 (Type 2 diabetes management) | nice_ng28.smt2 |
~24 |
| NICE NG203 (Chronic kidney disease) | nice_ng203.smt2 |
~19 |
| AGS Beers Criteria 2023 | beers_2023.smt2 |
~89 |
| STOPP/START v2 (potentially inappropriate prescribing in older adults) | stopp_start.smt2 |
~114 |
| KDIGO AKI 2023 | kdigo_aki.smt2 |
~14 |
≈260 theorems across the five-guideline corpus. The DSL is designed so each new guideline is additive — see lumen/lumen_core/garden/source/ for the authoring style.
These are enforced by CI, not aspirational:
- LLMs do not decide safety, trajectory, or VOI. Z3, JAX, and entropy decide. The LLM is invoked only in
shared/llm/adversary_prompt.py(drafting counter-arguments). The CI invarianttests/test_ci_invariants.py::test_no_hardcoded_llm_model_stringsgreps the repo and fails the build if any LLM call site appears outsideshared/llm/. - No model strings hardcoded.
claude-3,gpt-4,o1-,haiku,gemini-are forbidden outsideshared/llm/. One config controls the whole pin. uvicorn --workers 1is mandatory. FastMCP's in-process session cache requires a single worker.tests/test_ci_invariants.py::test_workers_value_is_always_onerejects any other count.- Provenance is enforced at schema construction. A
LumenBriefwithout a cited theorem cannot serialise. AProvenanceresource without a signed proof attachment cannot validate. - Forward-hash audit chain. ed25519 per entry, SHA-256 forward chain in canonical JSON.
Chain.verify()raises on tamper by design. The audit logger swallows exceptions (a failing logger must not crash a clinical tool call), but the chain primitive itself is a verification artefact.
Verification : 79/100 (79.0%) on V100 corpus, 0 false-clean,
mean 3.4 ms, p99 17.4 ms
RCT replication : 12/12 within published 95% CI (HR-prior calibration)
VOI calibration : 50 cases · 14% pass rate
Adversarial : 2/7 clinician alternatives surfaced (llm_stub_fixed)
Equity (E20) : 14/20 fairness assertions held
─────────────────────────────────────────
5 independent corpora · 189 cases · deterministic re-run:
python -m lumen.evals.runner --frozen v1
Numbers are re-runnable offline from committed artefacts — no live LLM call is needed for the verify path. The frozen Eval v1.0 headline resolves to: V100 79/100 with safety_precision=1.0, R12 12/12, VOI50 7/50, A7 2/7 in llm_stub_fixed mode, and E20 14/20.
| # | Patient | Scenario | Expected verdict |
|---|---|---|---|
| 1 | Margaret Henderson | CKD 3b + T2DM + HTN, NSAID request | PROVEN_UNSAFE (NICE NG203 §1.4.7) |
| 2 | James Okonkwo | Healthy, acute MCL strain | PROVEN_SAFE |
| 3 | Doris Williams | RA on methotrexate, NSAID request, stale LFT | UNDETERMINED + VOI |
| 4 | Patricia Quinn | Multi-spec HF/CKD/T2DM conflict | Multi-spec resolution |
| 5 | Frances Doyle | Frail polypharmacy, no labs in 18 months | UNDETERMINED — data_insufficient |
All five live under lumen/fhir-bundles/ with dated chart notes under lumen/fhir-bundles/documents/. CI enforces FHIR R4 transaction shape, UUIDv4 fullUrls, and intra-bundle reference resolution (tests/test_fhir_bundles_urn_uuid.py, tests/test_fhir_bundles_resolve.py).
git clone https://github.com/AbhinavGupta707/Lumen
cd Lumen
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'
pytest # full suite
uvicorn lumen.lumen_mcp.main:app --workers 1 # MCP at :8000
uvicorn lumen.lumen_orchestrator.main:app --port 8080 --workers 1 # A2A at :8080Hands-on demo against the five hero patients: DEMO.md.
- A real DSL-to-SMT compiler. The Garden is not a rules engine; it is a Python DSL whose AST compiles to SMT-LIB2 and is discharged by Z3. ≈260 theorems, machine-checkable, attached to every verdict.
- Counterfactual simulation with causal-inference identification. JAX PK/PD over DoWhy + EconML, with propensity-score matching on workspace cohorts and named meta-analysis fallback. Priors are labelled on output, not hidden.
- Information-theoretic VOI ranking. Sobol-distributed sampling × entropy of the verifier posterior. Tells you which lab to order next to maximally collapse uncertainty, not "everything that might be relevant".
- Adversarial validation in-loop. The LLM is asked to attack the recommendation, not defend it. Whatever it produces is re-verified by the same theorem prover — the LLM cannot ship anything past Z3.
- Cryptographically verifiable audit. Every tool call produces an ed25519-signed FHIR entry, chained SHA-256, externally verifiable with the public key alone.
Lumen ships alongside Vigil — a workspace-scale adverse-event surveillance system. Same author, shared infrastructure (shared/fhir/, shared/audit/, shared/llm/, shared/types/ are vendored in both repos). Different scope: Lumen verifies prospective recommendations per patient; Vigil detects retrospective harm signals across a workspace.
MIT. See LICENSE. This is a research and engineering demonstration — not a medical device. It has not been evaluated by any regulatory authority and must not be used to make clinical decisions for real patients.
Clinical guideline sources (NICE, AGS, STOPP/START, KDIGO) are referenced under fair-use educational citation. Compiled SMT artefacts are derivative works; redistribution of the source PDFs is not implied.