samersalman/knowledge-graph-postop-complications

Knowledge Graphs as a Discovery Mechanism for Post-Operative Complications

DOI License: MIT

This repository uses a knowledge graph (KG) to discover directional relationships between complications after elective lumbar fusion. It shows that the KG reveals sources, bridges, and sinks, validates those findings with traditional statistics, and demonstrates the efficacy of KGs as a discovery mechanism for outcomes medicine. Post-operative complications have historically been studied in silos (one outcome per analysis, one regression per outcome), which obscures directional, network-level relationships between complications. This repository positions the knowledge graph itself as the discovery instrument: cohort → directed weighted complication network → community and centrality decomposition → role classification (source / bridge / sink) → confirmatory traditional statistics.
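To make the source / bridge / sink vocabulary concrete, the following is a minimal illustrative sketch, not the role-classification procedure used in the paper. It assumes a role can be read off the balance of weighted out-degree versus weighted in-degree; the function name `classify_roles`, the `ratio` cutoff, and the toy node names are all hypothetical.

```python
from collections import defaultdict


def classify_roles(edges, ratio=2.0):
    """Label each node 'source', 'sink', or 'bridge' from weighted degree.

    edges: iterable of (upstream, downstream, weight) tuples.
    ratio: hypothetical cutoff. A node is a source when its outgoing weight
           is at least `ratio` times its incoming weight, a sink in the
           mirrored case, and a bridge otherwise.
    """
    out_w = defaultdict(float)
    in_w = defaultdict(float)
    for upstream, downstream, weight in edges:
        out_w[upstream] += weight
        in_w[downstream] += weight

    roles = {}
    for node in set(out_w) | set(in_w):
        o, i = out_w[node], in_w[node]
        if o > 0 and o >= ratio * i:
            roles[node] = "source"
        elif i > 0 and i >= ratio * o:
            roles[node] = "sink"
        else:
            roles[node] = "bridge"
    return roles


# Toy directed weighted network (placeholder node names, not study data).
toy_edges = [("A", "B", 10), ("B", "C", 8), ("A", "C", 2), ("C", "B", 1)]
roles = classify_roles(toy_edges)
```

In this toy network A only emits weight (source), C mostly receives it (sink), and B both receives and passes on comparable weight (bridge). The published analysis derives roles from its centrality decomposition, so the thresholds here should not be compared against the manuscript.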


Citation

If you use this code or derived data, please cite both the software archive and the accompanying paper.

Software (this repository):

Salman S. (2026). Knowledge Graphs as a Discovery Mechanism for Post-Operative Complications. Zenodo. https://doi.org/10.5281/zenodo.20349068

The DOI above is the concept DOI and always resolves to the latest release. To cite a specific version, use the version DOI from the Zenodo record (e.g. 10.5281/zenodo.20349069 for v0.1.0).

Paper:

Salman S. Knowledge Graphs as a Discovery Mechanism for Post-Operative Complications: Application to Lumbar Spinal Fusion. (Manuscript under review; full citation will be added upon acceptance.)

A machine-readable citation block is provided in CITATION.cff.


Repository structure

knowledge-graph-postop-complications/
├── README.md                       # this file
├── LICENSE                         # MIT
├── CITATION.cff                    # Zenodo-readable citation metadata
├── .gitignore                      # excludes data, credentials, caches
├── requirements.txt                # Python dependencies
├── config.example.py               # template with env-var placeholders
├── docs/
│   ├── SCHEMA.md                   # graph schema in markdown + mermaid
│   └── REPRODUCE.md                # step-by-step reproduction guide
├── kg_construction/                # KG ETL — sanitized
│   ├── schema_100pct.py
│   ├── source_adapter.py
│   ├── loader_100pct.py
│   ├── verify_100pct.py
│   ├── queries_100pct.py
│   └── discovery_queries_100pct_v2.py
├── analysis/                       # figure / table / statistics builders
│   ├── build_figure_1.py
│   ├── build_figure_2.py
│   ├── build_figure_3.py
│   ├── build_figure_4.py
│   ├── build_table_1.py
│   ├── build_table_2.py
│   ├── build_table_3.py
│   ├── build_table_4.py
│   ├── build_table_5.py
│   ├── build_table_6.py
│   ├── build_results_docx.py
│   ├── verify_v5.py
│   └── build_presentation.py
└── data/                           # PHI-free aggregated snapshots
    ├── centrality_table.csv
    ├── layer_a_nodes.csv
    ├── layer_a_edges.csv
    ├── community_assignments.csv
    ├── community_stability.csv
    ├── community_vs_organsystem.csv
    ├── aki_paths.csv
    ├── aki_3event_chains.csv
    ├── cascade_modifiers.csv
    ├── cohort_summary.json
    └── COMPLICATION_RANKS.md

Reproduction

Three reproduction paths are supported, in increasing order of infrastructure required. Detailed step-by-step instructions live in docs/REPRODUCE.md; the summaries below indicate which path to choose.

Path 1 — Regenerate the snapshot-backed figures from data/ (no database required)

The data/ directory contains point-in-time, aggregated, PHI-free snapshots of the network-topology artifacts the paper relies on (centrality table, directed edge list, community assignments, AKI ego-subgraph paths, cohort summary). A subset of the figure builders and the presentation deck regenerate from data/ alone; full regeneration of the figures and tables that depend on the upstream anchor-regression and component-validation pipeline requires Path 2.

git clone https://github.com/samersalman/knowledge-graph-postop-complications.git
cd knowledge-graph-postop-complications
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

python analysis/build_figure_1.py          # KG schematic
python analysis/build_figure_2.py          # Network topology
python analysis/build_presentation.py      # Reframed deck

Outputs land next to the scripts (figures as PNG, deck as PPTX). This subset is sufficient to verify that the reported network topology, community structure, and role classification can be regenerated from the published aggregates. The remaining figures (3 and 4), the six tables, and build_results_docx.py depend on intermediate CSVs (anchor LR/Fisher tables, all-components validation summaries, the demographic crosstab) that are produced by the upstream V2 pipeline and are not redistributed in this repository. See docs/REPRODUCE.md §1 for the full inventory of what is and is not regenerable from the snapshot, and Recipe 2 for how to regenerate the missing artifacts from an EHR-backed graph.
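For readers who want to inspect the published aggregates directly rather than through the figure builders, a small sketch of reading a directed edge list into an adjacency mapping follows. It uses only the standard library. The column names `source`, `target`, and `weight` are assumptions made for illustration; check the actual header of data/layer_a_edges.csv and pass the real names.

```python
import csv
import io
from collections import defaultdict


def load_directed_edges(handle, src_col="source", dst_col="target",
                        weight_col="weight"):
    """Read a directed weighted edge list into {upstream: {downstream: weight}}.

    The default column names are hypothetical placeholders.
    """
    adjacency = defaultdict(dict)
    for row in csv.DictReader(handle):
        adjacency[row[src_col]][row[dst_col]] = float(row[weight_col])
    return dict(adjacency)


# In-memory stand-in for a snapshot file (placeholder values, not study data).
sample = io.StringIO("source,target,weight\nA,B,10\nA,C,2\nB,C,8\n")
graph = load_directed_edges(sample)
```

Against the real snapshot, the same function would be called inside `with open("data/layer_a_edges.csv", newline="") as fh:` once the column names have been confirmed.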

Path 2 — Rebuild the knowledge graph from your own EHR data

This path is intended for collaborators who want to apply the same three-layer methodology (graph → communities → centrality) to a different cohort or institutional data source. It requires:

  • Neo4j 5.x (Desktop or Aura). The APOC plugin is required; the Graph Data Science (GDS) plugin is required for community detection and centrality.
  • Python 3.10 or newer and the libraries in requirements.txt.
  • A cohort source — either a DuckDB file produced from your EHR extract, a PostgreSQL connection, or a flat CSV with the columns documented in docs/SCHEMA.md.
  • Credentials and paths supplied via environment variables; see config.example.py and copy it to config.py (which is gitignored) before running.
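The environment-variable approach in the last requirement can be sketched as below. This is not the contents of config.example.py, only an illustration of the pattern. `NEO4J_PASSWORD` and `DUCKDB_PATH` appear in this README; `NEO4J_URI`, `NEO4J_USER`, their defaults, and the function name `load_settings` are assumptions.

```python
import os


def load_settings(env=os.environ):
    """Collect connection settings from environment variables.

    Fails early when the password is missing so that credentials never
    need to be written into a tracked file.
    """
    password = env.get("NEO4J_PASSWORD")
    if not password:
        raise RuntimeError("NEO4J_PASSWORD is not set")
    return {
        # NEO4J_URI and NEO4J_USER are hypothetical variable names.
        "neo4j_uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "neo4j_user": env.get("NEO4J_USER", "neo4j"),
        "neo4j_password": password,
        "duckdb_path": env.get("DUCKDB_PATH"),
    }


# Demonstration with an explicit mapping instead of the real environment.
settings = load_settings({"NEO4J_PASSWORD": "example",
                          "DUCKDB_PATH": "/path/to/your_cohort.duckdb"})
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real process environment.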

High-level workflow:

# 1. Copy and edit the config template.
cp config.example.py config.py
export NEO4J_PASSWORD="..."
export DUCKDB_PATH="/path/to/your_cohort.duckdb"

# 2. Create uniqueness constraints.
python kg_construction/schema_100pct.py --target desktop_100pct

# 3. Run the twelve-phase ETL pipeline.
python kg_construction/loader_100pct.py --target desktop_100pct --source duckdb

# 4. Verify cardinality and parity against the cohort.
python kg_construction/verify_100pct.py --target desktop_100pct

# 5. Optional: re-run the discovery and parity queries.
python kg_construction/queries_100pct.py --target desktop_100pct
python kg_construction/discovery_queries_100pct_v2.py --target desktop_100pct

The full load takes approximately 25 to 30 minutes wall time on an 8 GB heap / 4 GB pagecache configuration. See docs/REPRODUCE.md §2 for the data-model contract (which input columns are required, which are optional, and which default to NULL if absent).
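As an illustration of what step 2 of the workflow sets up, the sketch below builds Neo4j 5 uniqueness-constraint statements as plain strings. The labels and properties (`Patient.patient_id`, `Complication.code`) and the helper name are hypothetical; the actual schema is documented in docs/SCHEMA.md and created by kg_construction/schema_100pct.py.

```python
def uniqueness_constraint(label, prop):
    """Return a Neo4j 5 Cypher statement enforcing uniqueness of label.prop.

    IF NOT EXISTS makes the statement safe to re-run.
    """
    name = f"{label.lower()}_{prop}_unique"
    return (
        f"CREATE CONSTRAINT {name} IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )


# Hypothetical node labels and key properties, for illustration only.
statements = [
    uniqueness_constraint("Patient", "patient_id"),
    uniqueness_constraint("Complication", "code"),
]
```

Creating such constraints before the load lets the ETL merge on key properties without producing duplicate nodes.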

Path 3 — Rerun only the statistical validation layer

If your goal is to re-evaluate the confirmatory statistics (Fisher's exact test on every directed pair with Bonferroni correction; covariate-adjusted multivariable logistic regression on validated pairs), the analysis can be run independently of the graph build once the upstream V2 intermediates are available alongside data/. The validator consumes the snapshot CSVs plus the anchor-regression and all-components-validation outputs of the upstream pipeline (Recipe 2); these intermediates are not redistributed in this repository.

pip install -r requirements.txt
python analysis/verify_v5.py

verify_v5.py reproduces the reported number of Bonferroni-cleared directed pairs and the role-classification thresholds against those intermediates. Expected runtime is under one minute once the inputs are in place. See docs/REPRODUCE.md §3 for the list of intermediate values the script prints and how to compare them against the manuscript's reported numbers.
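The first confirmatory step, Fisher's exact test on each directed pair followed by Bonferroni correction, can be sketched with the standard library alone. This is not the code in verify_v5.py. It assumes a two-sided test, a 2x2 table per directed pair, and a Bonferroni denominator equal to the number of pairs tested; the function names and the toy counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    p = sum(prob(k) for k in range(lo, hi + 1)
            if prob(k) <= p_obs * (1 + 1e-9))
    return min(p, 1.0)


def bonferroni_cleared(p_values, alpha=0.05):
    """Flag pairs whose p-value clears alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return {pair: p <= threshold for pair, p in p_values.items()}


# Toy counts for two directed pairs (placeholder values, not study data).
p_values = {
    ("A", "B"): fisher_exact_two_sided(10, 0, 0, 10),
    ("B", "A"): fisher_exact_two_sided(3, 1, 1, 3),
}
cleared = bonferroni_cleared(p_values)
```

With two tests the per-pair threshold is 0.025, so only the strongly associated A to B pair clears it. With hundreds or thousands of directed pairs, the threshold shrinks proportionally.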


Data availability

The underlying patient-level electronic health record data used to construct the graph were drawn from a federated, de-identified EHR research network and are subject to restricted access. Raw patient-level rows, individual identifiers, dates of service, and the full diagnosis history of any single patient are not redistributable through this repository and are not contained in data/.

What this repository does contain is the set of aggregated, PHI-free derivatives that underpin the network-topology figures and the role-classification analysis, i.e., the subset of artifacts regenerable via Path 1. Figures and tables that depend on patient-level regression intermediates (anchor LR/Fisher tables, the all-components validation summaries, the demographic crosstab) are not covered by these derivatives and require re-running the upstream V2 pipeline via Path 2. The redistributed derivatives are:

  • Node-level summaries (38 complication categories with computed centrality and role assignments).
  • Edge-level summaries (1,194 directed weighted complication-pair edges with counts and Bonferroni-significance flags).
  • Cohort-level aggregates (community assignments, gamma-sweep stability, organ-system crosstabs, cohort demographics).
  • AKI case-study aggregates (ego-graph paths and three-event chains with no patient-level identifiers).

Researchers with their own access to a comparable federated EHR network can rebuild the full graph using Path 2 above, applying the cohort definition and inclusion logic described in the paper's Methods and in docs/REPRODUCE.md.

Requests for collaboration, data-use agreements, or methodological clarification can be sent to the contact below.


Contact

Samer Salman (samer.salman2021@gmail.com)

For questions about the methodology, the schema, or the validation strategy, please open a GitHub issue on this repository (preferred, so the discussion is publicly archived) or reach out by email.

For data-access or collaboration requests that touch the underlying EHR network, please use email so that institutional and regulatory context can be exchanged off the public tracker.


License

This project is licensed under the MIT License. See LICENSE for the full text.

In short: you may use, modify, and redistribute this code in source or binary form, for academic or commercial purposes, provided you retain the copyright notice and license text. The software is provided "as is," without warranty of any kind.

The MIT license applies only to the source code and the aggregated derivative data in data/. It does not confer access to the underlying EHR network or to any patient-level data — see the Data availability section above.

About

Knowledge-graph-based discovery of post-operative complication cascades after lumbar spinal fusion. Accompanies the KG-COMP manuscript.
