Semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks.
Phecoder maps clinical phenotypes (Phecodes) to diagnosis codes (ICD-9/10) using pretrained text embedding models. It ranks candidate ICD codes by semantic similarity, supports multi-model ensembles for improved recall, and includes an interactive review widget for curation and export to OHDSI ATLAS.
pip install phecoderGPU support: Install PyTorch with CUDA before installing phecoder.
Interactive review widget:
pip install 'phecoder[review]'Developer install (requires Poetry):
git clone https://github.com/DiseaseNeuroGenomics/phecoder.git
poetry installimport os
from phecoder import Phecoder
os.environ["HF_HOME"] = "./hf-home" # optional: set model cache location
pc = Phecoder(
phecodes=["Suicidal ideation", "Depression", "Anxiety"],
output_dir="./results",
)
pc.run()
pc.build_ensemble()
results = pc.load_results("ensemble-zsum")Phecoder ships with a bundled reference ICD file and loads it automatically. You can supply your own DataFrame instead — it must contain icd_code and icd_string columns. Any additional columns (e.g. concept_id) are preserved through the pipeline.
from phecoder import load_icd_df
icd_df = load_icd_df() # bundled ICD-9/10 referenceTip: For best results, use the ICD descriptions from your own EHR/biobank dataset rather than the reference file. Semantic matching is most accurate when the strings match what your data actually contains.
The phecodes argument accepts a string, a list of strings, or a DataFrame:
# String or list of strings — phecode IDs are assigned automatically
phecodes = "Eating disorders"
phecodes = ["Eating disorders", "Type 2 diabetes", "Hypertension"]
# DataFrame — use this to specify your own phecode IDs
import pandas as pd
phecodes = pd.DataFrame({
"phecode": ["250.2", "401.1"],
"phecode_string": ["Type 2 diabetes","Hypertension"],
})| Preset | Description |
|---|---|
"preset:best_single" |
Best single model from our evaluation |
"preset:best_ensemble" |
Best set of models for ensemble (default) |
You can also pass any SentenceTransformers model ID directly:
# Single model
models = "sentence-transformers/all-MiniLM-L6-v2" # fast, ~80 MB
models = "FremyCompany/BioLORD-2023" # clinical text, ~440 MB
# Multiple models (for ensemble)
models = [
"sentence-transformers/all-MiniLM-L6-v2",
"FremyCompany/BioLORD-2023",
"NeuML/pubmedbert-base-embeddings",
]pc = Phecoder(
phecodes = phecodes,
output_dir = "./results",
icd_df = icd_df, # optional; uses bundled data if omitted
models = models, # optional; defaults to preset:best_ensemble
icd_cache_dir = "./icd_cache", # cache ICD embeddings for reuse across runs
device = None, # "cuda" / "cpu" / None (auto-detect)
dtype = "float16", # embedding storage dtype
st_search_kwargs = {"top_k": 100}, # number of ICD codes to retrieve per phenotype
)icd_cache_dir is useful when running multiple projects against the same ICD corpus — embeddings are computed once and reused.
# Models download automatically on first run
pc.run()
# Or pre-download explicitly (useful before batch jobs)
pc.download_models()
pc.run()Results are written to output_dir and cached — re-running with the same inputs is a no-op unless overwrite=True.
# Default ensemble (reciprocal rank fusion, recommended)
pc.build_ensemble()
# Custom ensemble
pc.build_ensemble(method="rrf", method_kwargs={"k": 60}, name="ens:rrf60")# All results (individual models + ensembles)
df = pc.load_results()
# Specific model or ensemble by name
df = pc.load_results("ensemble-zsum")
df = pc.load_results(models=["ens:rrf60"], include_ensembles=True)Results are returned as a long-format DataFrame with columns: model, phecode, phecode_string, icd_code, icd_string, score, rank.
After running the pipeline, interactively curate the top-K retrieved codes per phecode in a Jupyter notebook:
session = pc.review(top_k=30, score_threshold=0.5)
session # renders a tabbed checkbox widget — one tab per phecodeSave your selections in any format:
session.save("picks.parquet") # flat table with provenance
session.save("picks.json") # nested JSON, grouped by phecode
session.save("picks.csv")ATLAS concept sets use OMOP concept_ids rather than raw ICD codes. Add a concept_id column to icd_df before initializing Phecoder and it will flow through to the export automatically:
icd_df = load_icd_df()
icd_df = icd_df.merge(my_omop_concept_map, on="icd_code", how="left")
pc = Phecoder(phecodes=..., icd_df=icd_df, output_dir="./out")
pc.run()
pc.build_ensemble()
session = pc.review(top_k=30)
pc.export_atlas(session, "./atlas_concept_sets") # one JSON per phecode
pc.export_atlas(session, "./atlas_concept_sets.json") # single bundleImport the resulting JSON in ATLAS via Concept Sets → Import. The exporter sets includeDescendants=true and includeMapped=true by default. OMOP concept IDs for ICD vocabularies are available from OHDSI Athena (select ICD9CM, ICD10, ICD10CM, SNOMED).
If you use phecoder in research, please cite:
Phecoder: semantic retrieval for auditing and expanding ICD-based phenotypes in EHR biobanks.
Jamie J. R. Bennett, Simone Tomasi, Sonali Gupta, VA Million Veteran Program, Georgios Voloudakis, Panos Roussos, David Burstein (2026).
doi: 10.64898/2026.01.08.26343725
Support: open a GitHub Issue or email jamie.bennett@mssm.edu.
