llmsyn-repro

Synthetic EHR generation and evaluation pipeline for a portfolio/research setting. The project compares three LLM-generated synthetic variants against a processed real-data extract using utility and fidelity metrics.

Working Directory Assumption

All commands below are intended to be run from the repository root, for example:

/Users/cocoxoxo/llmsyn-repro

Pipeline Overview

Generate synthetic cohorts with an LLM (three variants: full, prior, base).
Align synthetic outputs to a common processed schema.
Build utility task features (mortality and respiratory tasks).
Evaluate utility with RandomForest (ACC, AUROC).
Evaluate fidelity with TVD (categorical features), KS (numeric features), and MMD.

Synthetic Variants

full: prior-guided with explicit frequency distributions (demographic rates from prior.json).
prior: prior-guided with vocabulary constraints only (allowed category values, no frequency target).
base: no prior knowledge; the model relies on general clinical reasoning alone.

This gives three meaningfully different points on the prior-guidance spectrum.
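
To make the difference between the three variants concrete, here is a minimal sketch of how the prior-related guidance could differ per variant. Everything in it is hypothetical: the function name guidance_text, the layout of the prior dictionary, and the example frequencies are made up for illustration and are not taken from prior.json or from the generation script.

```python
# Hypothetical prior layout with made-up numbers; the real structure of
# data/priors/prior.json may differ.
EXAMPLE_PRIOR = {
    "INSURANCE": {"Medicare": 0.5, "Private": 0.3, "Medicaid": 0.2},
}


def guidance_text(variant, prior):
    """Return the prior-related guidance a prompt could carry for one variant."""
    if variant == "base":
        return ""  # base: no prior knowledge is passed to the model
    lines = []
    for feature, freqs in prior.items():
        if variant == "full":
            # full: allowed values plus explicit frequency targets
            parts = ", ".join(f"{value} ({p:.0%})" for value, p in freqs.items())
        elif variant == "prior":
            # prior: allowed values only, no frequency target
            parts = ", ".join(freqs)
        else:
            raise ValueError(f"unknown variant: {variant}")
        lines.append(f"{feature}: {parts}")
    return "\n".join(lines)


if __name__ == "__main__":
    for variant in ("full", "prior", "base"):
        print(f"--- {variant} ---")
        print(guidance_text(variant, EXAMPLE_PRIOR))
```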

Required Inputs

Already in repo:
- data/priors/top100_icd9.csv
- data/priors/prior.json
Required for generation:
- data/input/D_ICD_PROCEDURES.csv (MIMIC procedure dictionary)
Optional, needed only to rebuild priors:
- data/input/ADMISSIONS.csv
- data/input/PATIENTS.csv
- data/input/DIAGNOSES_ICD.csv
- data/input/D_ICD_DIAGNOSES.csv
Optional, needed only for real-data alignment and evaluation:
- data/input/real_mimic.csv (user-prepared flat MIMIC extract)
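
A quick way to see which of these files are present before running anything is a small preflight check. The sketch below is not part of the repository; the function name missing_inputs and the group labels are assumptions, while the paths are copied from the list above.

```python
from pathlib import Path

# Paths copied from the "Required Inputs" list; the group names are illustrative.
INPUT_GROUPS = {
    "in_repo": ["data/priors/top100_icd9.csv", "data/priors/prior.json"],
    "generation": ["data/input/D_ICD_PROCEDURES.csv"],
    "rebuild_priors": [
        "data/input/ADMISSIONS.csv",
        "data/input/PATIENTS.csv",
        "data/input/DIAGNOSES_ICD.csv",
        "data/input/D_ICD_DIAGNOSES.csv",
    ],
    "real_data_eval": ["data/input/real_mimic.csv"],
}


def missing_inputs(root="."):
    """Return {group: [missing relative paths]} for every group with gaps."""
    root = Path(root)
    report = {}
    for group, paths in INPUT_GROUPS.items():
        missing = [p for p in paths if not (root / p).is_file()]
        if missing:
            report[group] = missing
    return report


if __name__ == "__main__":
    # Run from the repository root; prints nothing when all files exist.
    for group, paths in missing_inputs().items():
        print(f"[{group}] missing: {', '.join(paths)}")
```

Only the "generation" group blocks the generation step; the other groups matter only for rebuilding priors or for real-data alignment and evaluation.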

Setup

conda env create -f environment.yml
conda activate llmsyn
export OPENAI_API_KEY=your_key_here

The generation script currently uses the OpenAI Responses API with the model gpt-5.1, so network access, access to that model, and a valid API key are required.
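
If you want to confirm the key is visible to Python before spending time on a run, a fail-fast check along these lines may help. This is a sketch only; require_api_key is a hypothetical helper, not a function from the generation script.

```python
import os


def require_api_key(env=os.environ):
    """Return the key, or raise a clear error if it is unset or blank."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running generation."
        )
    return key
```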

Generation

python src/generation/generate_variants.py

Expected output files:

outputs/synthetic/syn_full.csv
outputs/synthetic/syn_prior.csv
outputs/synthetic/syn_base.csv

Alignment

python src/preprocessing/align_input_data.py

Expected output files:

data/processed/syn_full.csv
data/processed/syn_prior.csv
data/processed/syn_base.csv
data/processed/real_mimic.csv (only if data/input/real_mimic.csv is present)

Utility Evaluation

Build feature arrays first:

python src/evaluation/build_features_mortality.py --input data/processed/real_mimic.csv --out-prefix outputs/eval/real_mortality
python src/evaluation/build_features_mortality.py --input data/processed/syn_full.csv --out-prefix outputs/eval/syn_full_mortality
python src/evaluation/build_features_mortality.py --input data/processed/syn_prior.csv --out-prefix outputs/eval/syn_prior_mortality
python src/evaluation/build_features_mortality.py --input data/processed/syn_base.csv --out-prefix outputs/eval/syn_base_mortality

python src/evaluation/build_features_respiratory.py --input data/processed/real_mimic.csv --out-prefix outputs/eval/real_resp
python src/evaluation/build_features_respiratory.py --input data/processed/syn_full.csv --out-prefix outputs/eval/syn_full_resp
python src/evaluation/build_features_respiratory.py --input data/processed/syn_prior.csv --out-prefix outputs/eval/syn_prior_resp
python src/evaluation/build_features_respiratory.py --input data/processed/syn_base.csv --out-prefix outputs/eval/syn_base_resp
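
The eight commands above follow one pattern (two tasks, four datasets), so they can also be generated rather than typed. The sketch below only builds and prints the command lines; the helper name feature_commands is an assumption, and the script paths, flags, and prefixes are copied from the commands above.

```python
import shlex

# Dataset names and script paths come from the commands above; the
# (script, suffix) pairs mirror the --out-prefix values used there.
DATASETS = ["real_mimic", "syn_full", "syn_prior", "syn_base"]
TASKS = [
    ("src/evaluation/build_features_mortality.py", "mortality"),
    ("src/evaluation/build_features_respiratory.py", "resp"),
]


def feature_commands():
    """Build one argv list per (task, dataset) pair, in README order."""
    commands = []
    for script, suffix in TASKS:
        for name in DATASETS:
            # real_mimic maps to the "real" prefix; synthetic names are kept.
            prefix = "real" if name == "real_mimic" else name
            commands.append([
                "python", script,
                "--input", f"data/processed/{name}.csv",
                "--out-prefix", f"outputs/eval/{prefix}_{suffix}",
            ])
    return commands


if __name__ == "__main__":
    for cmd in feature_commands():
        print(shlex.join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) to execute it and stop at the first failure.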

Then run utility metrics:

python src/evaluation/eval_mortality.py
python src/evaluation/eval_utility_respiratory.py
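
For readers unfamiliar with the two utility metrics, the sketch below shows what ACC and AUROC measure, given true binary labels and a classifier's outputs. It uses only the standard library and is an illustration of the definitions; the evaluation scripts themselves may compute these with a library routine, and the function names here are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of hard predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auroc(y_true, scores):
    """AUROC via the rank-sum (Mann-Whitney U) identity, ties get average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # extend j over the group of tied scores starting at position i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs both classes present")
    rank_sum_pos = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]
    scores = [0.1, 0.4, 0.35, 0.8]           # predicted probability of class 1
    y_pred = [int(s >= 0.5) for s in scores]  # threshold at 0.5
    print("ACC  ", accuracy(y_true, y_pred))
    print("AUROC", auroc(y_true, scores))
```

ACC depends on the chosen decision threshold, while AUROC is computed from the ranking of the scores, which is why the two can disagree on the same model.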

Fidelity Evaluation

Fidelity metrics compare synthetic distributions against the real dataset:

Categorical features (LANGUAGE, RELIGION, MARITAL_STATUS, ETHNICITY, INSURANCE, HOSPITAL_EXPIRE_FLAG, MAIN_DIAGNOSIS): Total Variation Distance (TVD). TVD ∈ [0, 1]; lower is better. KS is not used for nominal variables because its result depends on the arbitrary integer encoding order.
Numeric/count features (AGE, OTHER_ICD_COUNT, PROCEDURE_COUNT): Kolmogorov–Smirnov statistic.
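
The two statistics can be sketched in a few lines of standard-library Python. This is an illustration of the definitions, not the code in fidelity_ks.py; the function names are assumptions, and the p-value column of the output is not covered here.

```python
from collections import Counter


def total_variation_distance(real, synth):
    """TVD between two samples of a categorical feature: 0.5 * sum |p - q|."""
    p, q = Counter(real), Counter(synth)
    n_p, n_q = len(real), len(synth)
    # Counter returns 0 for categories that appear in only one sample.
    return 0.5 * sum(abs(p[c] / n_p - q[c] / n_q) for c in set(p) | set(q))


def ks_statistic(real, synth):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # step past every copy of x in both samples before comparing CDFs
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


if __name__ == "__main__":
    print(total_variation_distance(["a", "a", "b", "b"], ["a", "a", "a", "b"]))
    print(ks_statistic([1, 2, 3, 4], [3, 4, 5, 6]))
```

Because TVD compares only per-category proportions, relabeling the categories leaves it unchanged, whereas ks_statistic applied to integer codes can change when the code order changes.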

Output schema for fidelity_ks.csv: dataset, feature, metric_type (TVD|KS), stat, pvalue

python src/evaluation/fidelity_ks.py \
   --real-path data/processed/real_mimic.csv \
   --dataset syn_full=data/processed/syn_full.csv \
   --dataset syn_base=data/processed/syn_base.csv \
   --dataset syn_prior=data/processed/syn_prior.csv

python src/evaluation/fidelity_mmd.py \
   --real-path data/processed/real_mimic.csv \
   --dataset syn_full=data/processed/syn_full.csv \
   --dataset syn_base=data/processed/syn_base.csv \
   --dataset syn_prior=data/processed/syn_prior.csv \
   --sample-size 5000

Outputs

Synthetic CSVs: outputs/synthetic/*.csv
Processed datasets: data/processed/*.csv
Utility outputs: outputs/eval/utility_*.csv
Fidelity outputs: outputs/eval/fidelity_ks.csv, outputs/eval/fidelity_mmd.csv
Summary: outputs/eval/summary_overview.md

Note: A plotting script is not currently included in this repository. Existing evaluation tables and summary files can be reviewed directly.

Prototype Notes and Limitations

This is a portfolio/research pipeline, not a clinical production system.
Synthetic quality is evaluated only through the provided utility/fidelity tasks.
Realism and privacy are not formally guaranteed by this repository alone.
Generation depends on external API/model availability and stable model behavior.

Troubleshooting

Missing data/input/D_ICD_PROCEDURES.csv: generation step will fail until this file is provided.
Missing OPENAI_API_KEY: generation step will fail by design.
If matplotlib cache permission warnings appear, run export MPLCONFIGDIR=/tmp/matplotlib to point the cache at a writable directory.
