Skip to content

AbhinavGupta707/Lumen

Repository files navigation

Lumen

Formal-verification meta-layer for clinical AI. Every prospective clinical recommendation becomes a machine-checkable certificate: an SMT theorem proven against named guidelines, a Monte Carlo trajectory over patient priors, an entropy-ranked information-gain budget, and a signed FHIR transaction bundle the auditor can verify offline.

The thesis: LLMs are pinned out of every clinical decision. Z3 decides safety, JAX decides trajectory, entropy decides value-of-information. The LLM is invoked exactly once — to draft adversarial counter-arguments that are then re-checked by the same theorem prover.

CI Python 3.11+ License: MIT


Live demo

Surface URL What it serves
MCP server https://lumen-mcp-ag707.fly.dev/mcp Five engine tools (verify, simulate, voi, adversary, compose) over MCP
A2A orchestrator https://lumen-orchestrator-ag707.fly.dev/ Agent-to-Agent endpoint (/.well-known/agent-card.json) for BYO clinical agents

Both are deployed on Fly.io. Local quick-start in DEMO.md runs the full pipeline against five committed hero patients in < 60 seconds.


60-second walkthrough — Margaret Henderson

Margaret, 72F: CKD stage 3b (eGFR 38), T2DM (HbA1c 58), HTN, on ramipril + bendroflumethiazide. A junior doctor asks the LLM to prescribe ibuprofen 400 mg TID for an osteoarthritic knee.

  1. The consult lands in the A2A orchestrator, which fans out into the MCP engine tools.
  2. VERIFY returns PROVEN_UNSAFE. Z3 cites NICE NG203 §1.4.7 — the "triple whammy" (NSAID + ACE-I + diuretic in CKD) AKI risk. The proof is an SMT certificate, not an opinion.
  3. SIMULATE projects 30-day AKI risk: 12% [8%, 18%] with the recommendation vs 4% [2%, 7%] without. JAX Monte Carlo, 10 000 samples, workspace-cohort priors where N ≥ 30 and named meta-analysis priors otherwise — priors are labelled on the brief.
  4. ADVERSARY (the only LLM call site) drafts paracetamol 1 g QID. The same theorem suite re-verifies it: PROVEN_SAFE.
  5. COMPOSE emits a FHIR R4 transaction bundle: Composition + Provenance (with the SMT proof attachment and trajectory seeds) + AuditEvent. Each entry is ed25519-signed; entries chain by SHA-256 forward hash.

End-to-end ≈ 16 seconds. The full bundle ships in lumen/fhir-bundles/patient-margaret-henderson.json with the chart notes that drive it under lumen/fhir-bundles/documents/margaret-henderson/.


Architecture

flowchart TB
    A[BYO Clinical Agent] -->|A2A consult| B[Lumen Orchestrator<br/>FastAPI · /.well-known/agent-card]
    B -->|self-orchestrate| C[Lumen MCP<br/>FastMCP · workers=1]

    C --> V[VERIFY<br/>Z3 SMT · Guideline Garden]
    C --> S[SIMULATE<br/>JAX Monte Carlo · DoWhy · EconML]
    C --> I[VOI<br/>Sobol + entropy<br/>over verifier]
    C --> X[ADVERSARY<br/>RAG + LLM<br/>then re-VERIFY]

    V --> O[COMPOSE]
    S --> O
    I --> O
    X --> O

    O --> R[Signed FHIR Transaction Bundle<br/>Composition + Provenance + AuditEvent<br/>ed25519 · SHA-256 chain]
Loading

Engine modules:

  • lumen/lumen_core/garden/ — DSL → SMT compiler. Each named guideline is authored in a Python DSL file and compiles to a .smt2 theorem file that Z3 re-checks against the V100 test corpus.
  • lumen/lumen_core/cohort/ — Polars-indexed patient cohort with PSM (propensity-score matching) for sub-population priors.
  • lumen/lumen_core/pkpd/ — JAX PK/PD models for trajectory simulation.
  • lumen/lumen_mcp/ — FastMCP server exposing the five engine tools.
  • lumen/lumen_orchestrator/ — FastAPI A2A endpoint with self-fan-out into the MCP.
  • shared/ — vendored FHIR client (R4 transactions), ed25519 audit chain, LLM client, types.

Full module-by-module walkthrough: ARCHITECTURE.md.


The Guideline Garden

Five named clinical guidelines compile to SMT theorems. Each row is the source PDF the DSL was authored from, the compiled .smt2 artefact Z3 checks against, and a theorem count.

Guideline Compiled artefact Theorems
NICE NG28 (Type 2 diabetes management) nice_ng28.smt2 ~24
NICE NG203 (Chronic kidney disease) nice_ng203.smt2 ~19
AGS Beers Criteria 2023 beers_2023.smt2 ~89
STOPP/START v2 (potentially inappropriate prescribing in older adults) stopp_start.smt2 ~114
KDIGO AKI 2023 kdigo_aki.smt2 ~14

≈260 theorems across the five-guideline corpus. The DSL is designed so each new guideline is additive — see lumen/lumen_core/garden/source/ for the authoring style.


Determinism guarantees

These are enforced by CI, not aspirational:

  • LLMs do not decide safety, trajectory, or VOI. Z3, JAX, and entropy decide. The LLM is invoked only in shared/llm/adversary_prompt.py (drafting counter-arguments). The CI invariant tests/test_ci_invariants.py::test_no_hardcoded_llm_model_strings greps the repo and fails the build if any LLM call site appears outside shared/llm/.
  • No model strings hardcoded. claude-3, gpt-4, o1-, haiku, gemini- are forbidden outside shared/llm/. One config controls the whole pin.
  • uvicorn --workers 1 is mandatory. FastMCP's in-process session cache requires a single worker. tests/test_ci_invariants.py::test_workers_value_is_always_one rejects any other count.
  • Provenance is enforced at schema construction. A LumenBrief without a cited theorem cannot serialise. A Provenance resource without a signed proof attachment cannot validate.
  • Forward-hash audit chain. ed25519 per entry, SHA-256 forward chain in canonical JSON. Chain.verify() raises on tamper by design. The audit logger swallows exceptions (a failing logger must not crash a clinical tool call), but the chain primitive itself is a verification artefact.

Evaluation — v1.0

Verification     :  79/100 (79.0%) on V100 corpus, 0 false-clean,
                     mean 3.4 ms, p99 17.4 ms
RCT replication  :  12/12 within published 95% CI (HR-prior calibration)
VOI calibration  :  50 cases · 14% pass rate
Adversarial      :   2/7 clinician alternatives surfaced (llm_stub_fixed)
Equity (E20)     :  14/20 fairness assertions held
─────────────────────────────────────────
5 independent corpora · 189 cases · deterministic re-run:
  python -m lumen.evals.runner --frozen v1

Numbers are re-runnable offline from committed artefacts — no live LLM call is needed for the verify path. The frozen Eval v1.0 headline resolves to: V100 79/100 with safety_precision=1.0, R12 12/12, VOI50 7/50, A7 2/7 in llm_stub_fixed mode, and E20 14/20.


Hero patient roster

# Patient Scenario Expected verdict
1 Margaret Henderson CKD 3b + T2DM + HTN, NSAID request PROVEN_UNSAFE (NICE NG203 §1.4.7)
2 James Okonkwo Healthy, acute MCL strain PROVEN_SAFE
3 Doris Williams RA on methotrexate, NSAID request, stale LFT UNDETERMINED + VOI
4 Patricia Quinn Multi-spec HF/CKD/T2DM conflict Multi-spec resolution
5 Frances Doyle Frail polypharmacy, no labs in 18 months UNDETERMINEDdata_insufficient

All five live under lumen/fhir-bundles/ with dated chart notes under lumen/fhir-bundles/documents/. CI enforces FHIR R4 transaction shape, UUIDv4 fullUrls, and intra-bundle reference resolution (tests/test_fhir_bundles_urn_uuid.py, tests/test_fhir_bundles_resolve.py).


Quick start

git clone https://github.com/AbhinavGupta707/Lumen
cd Lumen
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'
pytest                                                       # full suite
uvicorn lumen.lumen_mcp.main:app --workers 1                 # MCP at :8000
uvicorn lumen.lumen_orchestrator.main:app --port 8080 --workers 1  # A2A at :8080

Hands-on demo against the five hero patients: DEMO.md.


What makes this engineering-substantial

  1. A real DSL-to-SMT compiler. The Garden is not a rules engine; it is a Python DSL whose AST compiles to SMT-LIB2 and is discharged by Z3. ≈260 theorems, machine-checkable, attached to every verdict.
  2. Counterfactual simulation with causal-inference identification. JAX PK/PD over DoWhy + EconML, with propensity-score matching on workspace cohorts and named meta-analysis fallback. Priors are labelled on output, not hidden.
  3. Information-theoretic VOI ranking. Sobol-distributed sampling × entropy of the verifier posterior. Tells you which lab to order next to maximally collapse uncertainty, not "everything that might be relevant".
  4. Adversarial validation in-loop. The LLM is asked to attack the recommendation, not defend it. Whatever it produces is re-verified by the same theorem prover — the LLM cannot ship anything past Z3.
  5. Cryptographically verifiable audit. Every tool call produces an ed25519-signed FHIR entry, chained SHA-256, externally verifiable with the public key alone.

Companion product

Lumen ships alongside Vigil — a workspace-scale adverse-event surveillance system. Same author, shared infrastructure (shared/fhir/, shared/audit/, shared/llm/, shared/types/ are vendored in both repos). Different scope: Lumen verifies prospective recommendations per patient; Vigil detects retrospective harm signals across a workspace.


License & clinical disclaimer

MIT. See LICENSE. This is a research and engineering demonstration — not a medical device. It has not been evaluated by any regulatory authority and must not be used to make clinical decisions for real patients.

Clinical guideline sources (NICE, AGS, STOPP/START, KDIGO) are referenced under fair-use educational citation. Compiled SMT artefacts are derivative works; redistribution of the source PDFs is not implied.

About

Formal-verification meta-layer for clinical AI — Z3 proves safety against named guidelines, JAX simulates trajectory, entropy ranks value-of-information, LLM-attacker re-verified by the same theorem prover.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages