April 2026
Authors: Darya Yarparvar (Technical Lead), Olga Shapovalova, Kseniia Warren, Abdullah Sheikh, Eshwar Meduri, Joe Flanagan
- Overview
- Problem Statement
- Solution
- Architecture
- Key Features
- Project Structure
- Prerequisites
- Getting Started
- Configuration
- Pipeline Stages
- Evaluation
- Future Directions
This project introduces an Agentic Retrieval-Augmented Generation (RAG) pipeline as an AI co-pilot for clinical analysts at NICE (National Institute for Health and Care Excellence). It automates the generation of SNOMED CT clinical codelists, transforming a laborious manual process into an efficient, auditable, and reproducible workflow — while keeping humans firmly in the loop.
NICE requires highly accurate, consistent, and auditable clinical codelists — typically expressed as SNOMED CT codes — to identify patient populations for healthcare research and quality indicators. The current process is:
- Manual and time-consuming — analysts must individually review thousands of candidate codes
- Inconsistent — different analysts may produce different codelists for the same research question
- Difficult to audit — decisions are rarely documented with clinical justifications
- Prone to obsolescence — medical terminologies evolve, but codelists are rarely updated systematically
These inefficiencies hinder timely and robust healthcare research across the NHS.
We built an Agentic RAG pipeline that acts as an AI co-pilot for clinical analysts. The system is:
- Fully local — runs on-premise with no data leaving the organisation
- Open-weight — uses open-source models (Phi-4 via Ollama, Nomic Embed Text)
- Audit-ready — every recommendation includes a clinical justification and confidence flag
- Human-in-the-loop — analysts review AI suggestions and approve/reject with full transparency
Figure 1. An overview of the Agentic RAG pipeline. The pipeline comprises four sequential stages: a query router; a ReAct agent loop; hybrid retrieval with graph expansion; and LLM-based code classification.
The pipeline comprises four sequential stages:
| Stage | Component | Description |
|---|---|---|
| 1 | Query Router | Classifies the research question by clinical entity type (diagnosis, medication, procedure, laboratory test, composite) |
| 2 | ReAct Agent Loop | Iteratively retrieves and refines candidate codes using a reasoning-and-acting loop (up to 50–200 steps depending on query complexity) |
| 3 | Hybrid Retrieval | Combines sparse TF-IDF retrieval and dense semantic embeddings, fused via Reciprocal Rank Fusion (RRF), with SNOMED CT graph expansion |
| 4 | LLM Classification | Phi-4 reviews each candidate code and assigns a confidence flag: include, uncertain, or flag_for_review |
- Hybrid retrieval — TF-IDF (n-gram range 1–5) + Nomic Embed Text v1.5 dense embeddings fused via RRF for high-recall candidate selection
- SNOMED CT graph expansion — traverses 16 relationship types (e.g.
is_a,finding_site,causative_agent) to surface related codes not matched by text alone - ReAct-style agentic loop — iterative reasoning that can issue multiple retrieval calls to refine coverage before committing to a final codelist
- Confidence-flagged output — every recommended code is tagged
include/uncertain/flag_for_reviewwith a plain-English clinical justification - Deterministic by default — temperature set to 0 and seed fixed at 42 for reproducible results
- Release-aware — automatically detects and tracks SNOMED CT release versions for longitudinal auditability
.
├── Clinical_Codelist_Generation_Pipeline.ipynb # Main pipeline notebook
├── config.yaml # All configurable parameters
├── requirements.txt # Pinned Python dependencies
├── .gitignore
└── README.md
# Directories created at runtime (not tracked in git):
├── data/
│ ├── raw/ # Source files: PCD Reference Set, SNOMED CT RF2 releases
│ └── processed/ # Chunked knowledge base and embeddings
└── results/ # Generated codelists and evaluation outputs
- GPU with CUDA 12.8 support (recommended: 16 GB+ VRAM for Phi-4 inference)
- 32 GB+ RAM for SNOMED CT processing
- Python 3.10+
- Ollama installed and running locally
- Ollama model pulled:
ollama pull phi4 - Google Colab (recommended) or a local Jupyter environment
The following NHS data files are required but are not included in this repository due to licensing restrictions:
| File | Description |
|---|---|
20250912_PCD_Refset_Content |
NHS Primary Care Domain Reference Set (September 2025) |
| SNOMED CT RF2 Descriptions | SNOMED CT Monolith RF2 PRODUCTION (September 2025) |
| SNOMED CT RF2 Relationships | SNOMED CT relationship hierarchy |
| Code usage statistics | NHS England SNOMED CT usage data (2024–25) |
git clone https://github.com/dyarparvar/Accelerating-Clinical-Codelist-Generation-for-NICE-using-AI.git
cd Accelerating-Clinical-Codelist-Generation-for-NICE-using-AIpip install -r requirements.txtCopy the required NHS data files into data/raw/ and update the paths in config.yaml.
ollama serve &
ollama pull phi4Open Clinical_Codelist_Generation_Pipeline.ipynb in Jupyter or Google Colab and run all cells sequentially. The notebook is fully self-contained with embedded markdown documentation at each stage.
All pipeline parameters are controlled via config.yaml:
randomness_strategy:
seed: 42 # Fixed seed for reproducibility
data:
source: 20250912_PCD_Refset_Content
chunking:
max_chars: 28672 # Context window chunk size
codes_per_subchunk: 50
retrieval:
embedding:
model: nomic-ai/nomic-embed-text-v1.5
device: cuda
retrieving:
sparse_min_score: 0.05
dense_min_score: 0.6
rrf_k: 60 # Reciprocal rank fusion constant
graph_expansion:
enabled: true
expansion_top_k: 10
llm:
model: phi4
temperature: 0 # Deterministic output
num_ctx: 16384
max_codes_to_llm: 200
agent:
model: phi4
max_steps:
simple: 50
moderate: 100
complex: 200Key parameters to adjust for your use case:
data.source— point to your PCD Reference Set versionretrieval.retrieving.dense_min_score— lower to increase recall, raise to improve precisionagent.max_steps— increase for broader queries; decrease to reduce runtimellm.max_codes_to_llm— caps the number of candidates sent to the LLM per iteration
- Loads the PCD Reference Set and enriches it with SNOMED CT taxonomy
- Detects the active SNOMED CT release version
- Builds semantic chunks with relationship expansion and generates dense embeddings using Nomic Embed Text v1.5
- Phi-4 classifies the free-text research question into one of five entity types:
diagnosis,laboratory_test,medication,procedure, orcomposite - Entity type determines which SNOMED CT relationship axes are prioritised during graph expansion
- Sparse retrieval: TF-IDF with n-grams (1–5) against code descriptions and synonyms
- Dense retrieval: Cosine similarity search over Nomic embeddings
- Graph expansion: SNOMED CT hierarchy traversal across 16 relationship types
- Results fused via Reciprocal Rank Fusion (RRF)
- Agent iteratively issues retrieval calls, inspects results, and decides whether to continue refining or commit to a final list
- Reasoning traces are logged for auditability
- Loop terminates when the agent is satisfied or
max_stepsis reached
- Phi-4 reviews each candidate code with its description, synonyms, and retrieved context
- Assigns one of three confidence flags:
include,uncertain,flag_for_review - Provides a short plain-English clinical justification for each decision
- Analyst reviews the AI-recommended codelist with confidence flags as guidance
- Approves or rejects each code; overrides are logged
- Final codelist is committed to git for a full audit trail
The pipeline was benchmarked against 14 research questions drawn from NHS clinical practice, compared against gold-standard codelists from OpenCodelists.
| Metric | All codes | Include-only tier |
|---|---|---|
| Macro F1 | 10.4% | 12.9% |
| Precision (median) | ~65% | — |
| Recall (median) | ~8% | — |
The system is deliberately precision-biased: it will not hallucinate codes outside the retrieved candidate pool, making it suitable as a first-pass filter rather than a stand-alone generator.
| Research Question | Precision |
|---|---|
| Liver Cirrhosis | 69.6% |
| Radiotherapy | 98.2% |
| Failure Mode | Share |
|---|---|
| Retrieval failures (relevant codes not retrieved) | 41.9% |
| Knowledge base gaps (codes absent from PCD/SNOMED snapshot) | 31.6% |
| LLM rejections (retrieved but incorrectly excluded) | 26.5% |
- GraphRAG — Replace flat vector retrieval with a Neo4j knowledge graph to enable multi-hop SNOMED CT reasoning and richer relationship traversal
- Demographic cluster injection — Incorporate prevalence and demographic metadata to improve coverage for rare conditions
- Active learning loop — Feed analyst override decisions back into the retrieval model to improve recall over time
- Regulatory validation — Formal clinical review against NHS England and MHRA requirements before production deployment
This work was carried out to support NICE's mission to provide evidence-based guidance for the NHS. The pipeline relies on:
- SNOMED CT — maintained by SNOMED International
- NHS Primary Care Domain Reference Set — NHS England
- OpenCodelists — used as gold-standard evaluation reference
- Phi-4 — Microsoft Research
- Nomic Embed Text v1.5 — Nomic AI