This repository uses a knowledge graph (KG) to discover directional relationships between complications after elective lumbar fusion. It has three aims: show that the KG reveals sources, bridges, and sinks; validate those findings using traditional statistics; and demonstrate the efficacy of KGs as a discovery mechanism for outcomes medicine. Post-operative complications have historically been studied in silos (one outcome per analysis, one regression per outcome), which obscures directional, network-level relationships between complications. This repository positions the knowledge graph itself as the discovery instrument: cohort → directed weighted complication network → community and centrality decomposition → role classification (source / bridge / sink) → confirmatory traditional statistics.
If you use this code or derived data, please cite both the software archive and the accompanying paper.
Software (this repository):
Salman S. (2026). Knowledge Graphs as a Discovery Mechanism for Post-Operative Complications. Zenodo. https://doi.org/10.5281/zenodo.20349068
The DOI above is the concept DOI and always resolves to the latest
release. To cite a specific version, use the version DOI from the
Zenodo record (e.g.
10.5281/zenodo.20349069 for v0.1.0).
Paper:
Salman S. Knowledge Graphs as a Discovery Mechanism for Post-Operative Complications: Application to Lumbar Spinal Fusion. (Manuscript under review; full citation will be added upon acceptance.)
A machine-readable citation block is provided in
CITATION.cff.
knowledge-graph-postop-complications/
├── README.md # this file
├── LICENSE # MIT
├── CITATION.cff # Zenodo-readable citation metadata
├── .gitignore # excludes data, credentials, caches
├── requirements.txt # Python dependencies
├── config.example.py # template with env-var placeholders
├── docs/
│ ├── SCHEMA.md # graph schema in markdown + mermaid
│ └── REPRODUCE.md # step-by-step reproduction guide
├── kg_construction/ # KG ETL — sanitized
│ ├── schema_100pct.py
│ ├── source_adapter.py
│ ├── loader_100pct.py
│ ├── verify_100pct.py
│ ├── queries_100pct.py
│ └── discovery_queries_100pct_v2.py
├── analysis/ # figure / table / statistics builders
│ ├── build_figure_1.py
│ ├── build_figure_2.py
│ ├── build_figure_3.py
│ ├── build_figure_4.py
│ ├── build_table_1.py
│ ├── build_table_2.py
│ ├── build_table_3.py
│ ├── build_table_4.py
│ ├── build_table_5.py
│ ├── build_table_6.py
│ ├── build_results_docx.py
│ ├── verify_v5.py
│ └── build_presentation.py
└── data/ # PHI-free aggregated snapshots
  ├── centrality_table.csv
  ├── layer_a_nodes.csv
  ├── layer_a_edges.csv
  ├── community_assignments.csv
  ├── community_stability.csv
  ├── community_vs_organsystem.csv
  ├── aki_paths.csv
  ├── aki_3event_chains.csv
  ├── cascade_modifiers.csv
  ├── cohort_summary.json
  └── COMPLICATION_RANKS.md
Three reproduction paths are supported, in increasing order of
infrastructure required. Detailed step-by-step instructions live in
docs/REPRODUCE.md; the summaries below indicate
which path to choose.
The data/ directory contains point-in-time, aggregated, PHI-free
snapshots of the network-topology artifacts the paper relies on
(centrality table, directed edge list, community assignments, AKI
ego-subgraph paths, cohort summary). A subset of the figure builders
and the presentation deck regenerate from data/ alone; full
regeneration of the figures and tables that depend on the upstream
anchor-regression and component-validation pipeline requires Path 2.
git clone https://github.com/samersalman/knowledge-graph-postop-complications.git
cd knowledge-graph-postop-complications
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python analysis/build_figure_1.py # KG schematic
python analysis/build_figure_2.py # Network topology
python analysis/build_presentation.py # Reframed deck
Outputs land next to the scripts (figures as PNG, deck as PPTX). This
subset is sufficient to verify that the reported network topology,
community structure, and role classification can be regenerated from
the published aggregates. The remaining figures (3 and 4), the six
tables, and build_results_docx.py depend on intermediate CSVs
(anchor LR/Fisher tables, all-components validation summaries, the
demographic crosstab) that are produced by the upstream V2 pipeline
and are not redistributed in this repository. See
docs/REPRODUCE.md §1 for the full inventory
of what is and is not regenerable from the snapshot, and Recipe 2 for
how to regenerate the missing artifacts from an EHR-backed graph.
This path is intended for collaborators who want to apply the same three-layer methodology (graph → communities → centrality) to a different cohort or institutional data source. It requires:
- Neo4j 5.x (Desktop or Aura). The APOC plugin is required; the Graph Data Science (GDS) plugin is required for community detection and centrality.
- Python 3.10 or newer and the libraries in requirements.txt.
- A cohort source: either a DuckDB file produced from your EHR extract, a PostgreSQL connection, or a flat CSV with the columns documented in docs/SCHEMA.md.
- Credentials and paths supplied via environment variables; see config.example.py and copy it to config.py (which is gitignored) before running.
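To illustrate the environment-variable approach, here is a minimal sketch of how a config module could read its settings. It is not the contents of config.example.py. `NEO4J_PASSWORD` and `DUCKDB_PATH` are the variables named in the workflow in this README; `NEO4J_URI`, `NEO4J_USER`, their defaults, and the `Settings` / `load_settings` names are assumptions made for this example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Connection settings for the graph build (hypothetical container)."""
    neo4j_uri: str
    neo4j_user: str
    neo4j_password: str
    duckdb_path: str


def load_settings(env=os.environ) -> Settings:
    """Read settings from a mapping of environment variables.

    Fails early with a clear message if a required variable is missing,
    so that credentials never need to be hard-coded in the repository.
    """
    required = ("NEO4J_PASSWORD", "DUCKDB_PATH")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    return Settings(
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),  # assumed default
        neo4j_user=env.get("NEO4J_USER", "neo4j"),                # assumed default
        neo4j_password=env["NEO4J_PASSWORD"],
        duckdb_path=env["DUCKDB_PATH"],
    )
```

Passing the environment as a parameter keeps the function easy to exercise with a plain dictionary, without touching the real process environment.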
High-level workflow:
# 1. Copy and edit the config template.
cp config.example.py config.py
export NEO4J_PASSWORD="..."
export DUCKDB_PATH="/path/to/your_cohort.duckdb"
# 2. Create uniqueness constraints.
python kg_construction/schema_100pct.py --target desktop_100pct
# 3. Run the twelve-phase ETL pipeline.
python kg_construction/loader_100pct.py --target desktop_100pct --source duckdb
# 4. Verify cardinality and parity against the cohort.
python kg_construction/verify_100pct.py --target desktop_100pct
# 5. Optional: re-run the discovery and parity queries.
python kg_construction/queries_100pct.py --target desktop_100pct
python kg_construction/discovery_queries_100pct_v2.py --target desktop_100pct
The full load takes approximately 25 to 30 minutes wall time on an
8 GB heap / 4 GB pagecache configuration. See
docs/REPRODUCE.md §2 for the data-model
contract (which input columns are required, which are optional, and
which default to NULL if absent).
If your goal is to re-evaluate the confirmatory statistics (Fisher's
exact test on every directed pair with Bonferroni correction;
covariate-adjusted multivariable logistic regression on validated
pairs), the analysis can be run independently of the graph build once
the upstream V2 intermediates are available alongside data/. The
validator consumes the snapshot CSVs plus the anchor-regression and
all-components-validation outputs of the upstream pipeline (Recipe 2);
these intermediates are not redistributed in this repository.
pip install -r requirements.txt
python analysis/verify_v5.py
verify_v5.py reproduces the reported number of Bonferroni-cleared
directed pairs and the role-classification thresholds against those
intermediates. Expected runtime is under one minute once the inputs
are in place. See docs/REPRODUCE.md §3 for the
list of intermediate values the script prints and how to compare them
against the manuscript's reported numbers.
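For readers unfamiliar with the confirmatory step, the following is a minimal standard-library sketch of a two-sided Fisher's exact test on a 2x2 table, followed by a Bonferroni cut-off. It is not the code in verify_v5.py. The table layout, the function names, and the toy p-values are assumptions for illustration; the actual pipeline may build its contingency tables differently or rely on a statistics library.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout for a directed pair A -> B (hypothetical):
      a = A then B, b = A without B, c = B without prior A, d = neither.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1))
        if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


def bonferroni_cleared(p_values: dict, n_tests: int, alpha: float = 0.05) -> dict:
    """Keep only the pairs whose p-value is below alpha / n_tests."""
    threshold = alpha / n_tests
    return {pair: p for pair, p in p_values.items() if p < threshold}


# Toy usage with made-up p-values; these are not results from the paper.
toy = {"A->B": 0.001, "B->C": 0.03}
print(bonferroni_cleared(toy, n_tests=len(toy)))  # {'A->B': 0.001}
```

Here `n_tests` should be the number of directed pairs actually tested, so the threshold tightens as the number of candidate pairs grows.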
The underlying patient-level electronic health record data used to
construct the graph were drawn from a federated, de-identified EHR
research network and are subject to restricted access. Raw
patient-level rows, individual identifiers, dates of service, and the
full diagnosis history of any single patient are not redistributable
through this repository and are not contained in data/.
Figures and tables that depend on patient-level regression intermediates (anchor LR/Fisher tables, the all-components validation summaries, the demographic crosstab) require re-running the upstream V2 pipeline via Path 2. What this repository does contain is the set of aggregated, PHI-free derivatives that underpin the network-topology figures and the role-classification analysis, i.e., the subset of artifacts regenerable via Path 1:
- Node-level summaries (38 complication categories with computed centrality and role assignments).
- Edge-level summaries (1,194 directed weighted complication-pair edges with counts and Bonferroni-significance flags).
- Cohort-level aggregates (community assignments, gamma-sweep stability, organ-system crosstabs, cohort demographics).
- AKI case-study aggregates (ego-graph paths and three-event chains with no patient-level identifiers).
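As a rough illustration of how source, bridge, and sink roles can be read off a directed weighted edge list such as the one summarized above, the sketch below compares each node's outgoing and incoming edge weight. This is a simplified stand-in, not the repository's classification rule. The `ratio` threshold, the `(source, target, weight)` tuple layout, and the node names are assumptions for this example; the paper's Methods define the thresholds and centrality measures actually used.

```python
from collections import defaultdict


def classify_roles(edges, ratio: float = 2.0) -> dict:
    """Assign a role to each node from directed weighted edges.

    edges: iterable of (source, target, weight) tuples (assumed layout).
    A node is a "source" if its outgoing weight is at least `ratio` times
    its incoming weight, a "sink" in the opposite case, and a "bridge"
    otherwise. The ratio rule is illustrative only.
    """
    out_strength = defaultdict(float)
    in_strength = defaultdict(float)
    for src, dst, weight in edges:
        out_strength[src] += weight
        in_strength[dst] += weight

    roles = {}
    for node in set(out_strength) | set(in_strength):
        out_w, in_w = out_strength[node], in_strength[node]
        if out_w >= ratio * in_w:
            roles[node] = "source"
        elif in_w >= ratio * out_w:
            roles[node] = "sink"
        else:
            roles[node] = "bridge"
    return roles


# Toy edge list with placeholder node names, not data from the repository.
toy_edges = [("A", "B", 10), ("B", "C", 8), ("A", "C", 5)]
print(sorted(classify_roles(toy_edges).items()))
# [('A', 'source'), ('B', 'bridge'), ('C', 'sink')]
```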
Researchers with their own access to a comparable federated EHR network
can rebuild the full graph using Path 2 above, applying the cohort
definition and inclusion logic described in the paper's Methods and in
docs/REPRODUCE.md.
Requests for collaboration, data-use agreements, or methodological clarification can be sent to the contact below.
Samer Salman — samer.salman2021@gmail.com
For questions about the methodology, the schema, or the validation strategy, please open a GitHub issue on this repository (preferred, so the discussion is publicly archived) or reach out by email.
For data-access or collaboration requests that touch the underlying EHR network, please use email so that institutional and regulatory context can be exchanged off the public tracker.
This project is licensed under the MIT License. See
LICENSE for the full text.
In short: you may use, modify, and redistribute this code in source or binary form, for academic or commercial purposes, provided you retain the copyright notice and license text. The software is provided "as is," without warranty of any kind.
The MIT license applies only to the source code and the aggregated
derivative data in data/. It does not confer access to the
underlying EHR network or to any patient-level data — see the Data
availability section above.