Define extraction schema: node types, edge types, and evidence weighting model #1

@innacampo

Context & Motivation

Before MedGemma touches a single paper, we need a fixed ontology that tells it what to extract and a weighting model that tells downstream consumers how much to trust it. Without the schema, outputs will be inconsistent across papers and impossible to merge. Without evidence weighting, a case report and a 10,000-participant RCT would carry identical authority in the graph — making clinical reasoning over it unreliable.

This issue is on the critical path: every extraction, validation, and query issue depends on the schema contract defined here.


1. Node Types

Each node type must declare: id format, required properties, optional properties, and canonical enum values where applicable.

| Node Type | Description | Properties (required unless marked optional) | Notes |
|---|---|---|---|
| Paper | The source PubMed article; anchor for all provenance and weighting | pmid, title, pub_year, study_design, evidence_weight (computed), journal (optional), sample_size (optional), citation_count (optional) | First-class node, not metadata. Every extracted triple must trace back here. Required/optional split follows the Paper entry in §5. |
| HormonalPhase | Reproductive staging category | label, straw_stage, description | Enumerated values mapped to STRAW+10 (see §3) |
| Symptom | Clinical symptom reported or studied | label, mesh_id (optional), category (vasomotor / cognitive / mood / musculoskeletal / sleep / urogenital) | Normalize to MeSH or SNOMED-CT where possible |
| Biomarker | Measurable biological indicator | label, unit, specimen_type (serum, saliva, CSF, etc.) | e.g., FSH, E2, AMH, SHBG, CRP, BDNF |
| BrainRegion | Neuroanatomical structure | label, hemisphere (L/R/bilateral/NA), atlas_id (optional) | Prefer NeuroNames or AAL atlas identifiers |
| CognitiveFunction | Cognitive domain or specific function | label, domain (memory, executive, attention, processing_speed, verbal_fluency, spatial) | Map to standard neuropsych domains |
| Intervention | Any treatment, therapy, or exposure | label, type (pharmacological / surgical / lifestyle / supplement / psychotherapy), route (oral, transdermal, etc., if applicable) | Include dosage info as optional properties when extractable |
| StudyPopulation | Cohort characteristics | age_range, mean_age, n, ethnicity, inclusion_criteria_summary, menopausal_status | One node per distinct cohort within a paper |
| GeneOrVariant | Genetic factor mentioned in association | symbol, rsid (optional), gene_id (optional) | Captures pharmacogenomic and risk-factor genetics |
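
The required/optional/enum declaration per node type lends itself to a mechanical check. The following is a minimal sketch, assuming declarations shaped like the `node_types` entries in §5; the `NODE_TYPES` subset and the `validate_node` helper are hypothetical names used only for illustration.

```python
# Hypothetical subset of node-type declarations, shaped like "node_types" in §5.
NODE_TYPES = {
    "Symptom": {
        "required": ["label", "category"],
        "optional": ["mesh_id"],
        "enums": {
            "category": ["vasomotor", "cognitive", "mood",
                         "musculoskeletal", "sleep", "urogenital"],
        },
    },
}


def validate_node(node_type: str, props: dict) -> list:
    """Return a list of human-readable problems; an empty list means valid."""
    spec = NODE_TYPES.get(node_type)
    if spec is None:
        return [f"unknown node type: {node_type}"]
    problems = []
    # Every required property must be present.
    for name in spec["required"]:
        if name not in props:
            problems.append(f"missing required property: {name}")
    # Anything not declared as required or optional is rejected.
    allowed = set(spec["required"]) | set(spec.get("optional", []))
    for name in props:
        if name not in allowed:
            problems.append(f"unexpected property: {name}")
    # Enumerated properties must use one of the canonical values.
    for name, values in spec.get("enums", {}).items():
        if name in props and props[name] not in values:
            problems.append(f"invalid value for {name}: {props[name]!r}")
    return problems
```

Rejecting undeclared properties is a design choice made here for strictness; a more permissive extractor might only warn on them.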

2. Edge Types

Every edge must declare: source type → target type, directionality, required edge properties, and an example triple.

| Edge Type | Direction | Required Edge Properties | Example Triple |
|---|---|---|---|
| MODULATES | HormonalPhase → Biomarker | direction (increases / decreases / fluctuates / unclear), magnitude (if reported) | (Late Perimenopause) -[MODULATES {direction: "decreases"}]→ (Estradiol) |
| ASSOCIATED_WITH | Symptom ↔ BrainRegion | correlation_direction (+/-/unclear), imaging_modality (fMRI, PET, sMRI, etc.) | (Brain Fog) -[ASSOCIATED_WITH {correlation_direction: "-", imaging_modality: "fMRI"}]→ (Prefrontal Cortex) |
| PRESENTS_WITH | HormonalPhase → Symptom | prevalence (if reported), severity_scale (if reported) | (Early Postmenopause) -[PRESENTS_WITH {prevalence: "60-80%"}]→ (Hot Flashes) |
| PREDICTS | Biomarker → CognitiveFunction | association_direction (+/-), p_value (optional), effect_size (optional) | (E2) -[PREDICTS {association_direction: "+", p_value: 0.003}]→ (Verbal Memory) |
| TESTED_IN | Intervention → StudyPopulation | outcome_measure, result_summary (improved / no_effect / worsened) | (Transdermal E2 HRT) -[TESTED_IN {result_summary: "improved"}]→ (Peri Women 45-55) |
| AFFECTS | Intervention → Symptom | effect (alleviates / worsens / no_effect), effect_size (optional) | (SSRI) -[AFFECTS {effect: "alleviates"}]→ (Hot Flashes) |
| INFLUENCES | Biomarker → BrainRegion | mechanism (neuroprotective / neuroinflammatory / neurotrophic / unclear) | (BDNF) -[INFLUENCES {mechanism: "neurotrophic"}]→ (Hippocampus) |
| INTERACTS_WITH | GeneOrVariant → Intervention | interaction_type (efficacy_modifier / risk_modifier / metabolic) | (CYP2D6 poor metabolizer) -[INTERACTS_WITH {interaction_type: "efficacy_modifier"}]→ (Tamoxifen) |
| EXTRACTED_FROM | any node or edge → Paper | extraction_method (manual / MedGemma_v*), confidence_score [0–1], text_span (source sentence) | (triple: E2 PREDICTS Verbal Memory) -[EXTRACTED_FROM {confidence_score: 0.92}]→ (PMID:33456789) |

Note: REPORTED_IN has been renamed to EXTRACTED_FROM to distinguish raw paper content from extraction provenance. Every node and every edge in the graph must carry at least one EXTRACTED_FROM link.
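
To make the endpoint and provenance rules concrete, here is a rough sketch of a triple record that checks its edge type against the declared source and target types and insists on at least one EXTRACTED_FROM link. The `Triple` and `Provenance` classes and the `EDGE_ENDPOINTS` subset are hypothetical; the issue does not prescribe an in-memory representation.

```python
from dataclasses import dataclass, field

# Hypothetical subset of the edge table: edge type -> (source type, target type).
EDGE_ENDPOINTS = {
    "MODULATES": ("HormonalPhase", "Biomarker"),
    "PREDICTS": ("Biomarker", "CognitiveFunction"),
    "AFFECTS": ("Intervention", "Symptom"),
}


@dataclass
class Provenance:
    """One EXTRACTED_FROM link back to a Paper node."""
    pmid: str
    extraction_method: str
    confidence_score: float
    text_span: str


@dataclass
class Triple:
    source_type: str
    source_label: str
    edge_type: str
    target_type: str
    target_label: str
    properties: dict = field(default_factory=dict)
    provenance: list = field(default_factory=list)

    def problems(self) -> list:
        """Return schema violations for this triple; empty list means OK."""
        issues = []
        expected = EDGE_ENDPOINTS.get(self.edge_type)
        if expected is None:
            issues.append(f"unknown edge type: {self.edge_type}")
        elif (self.source_type, self.target_type) != expected:
            issues.append(
                f"{self.edge_type} must connect {expected[0]} -> {expected[1]}")
        if not self.provenance:
            issues.append("no EXTRACTED_FROM link")
        for link in self.provenance:
            if not 0.0 <= link.confidence_score <= 1.0:
                issues.append(
                    f"confidence_score out of range for PMID {link.pmid}")
        return issues
```

For instance, the E2 PREDICTS Verbal Memory example from the table, with one `Provenance` entry at confidence 0.92, yields an empty problem list, while the same triple without provenance is flagged.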


3. HormonalPhase Enumeration (STRAW+10 Mapping)

| Label (Graph Enum) | STRAW+10 Stage | FSH Characteristic | Description |
|---|---|---|---|
| REPRODUCTIVE_LATE | -3b / -3a | Variable | Subtle fertility decline, regular cycles |
| PERIMENOPAUSE_EARLY | -2 | Elevated | Cycle length variability ≥7 days |
| PERIMENOPAUSE_LATE | -1 | >25 IU/L | Amenorrhea ≥60 days, interval of skipped cycles |
| MENOPAUSE | 0 | (not specified) | Final menstrual period (retrospective, 12 mo amenorrhea) |
| POSTMENOPAUSE_EARLY | +1a / +1b / +1c | Stabilizing high | 0–6 years post-FMP |
| POSTMENOPAUSE_LATE | +2 | Stable high | >6 years post-FMP |
| SURGICAL_MENOPAUSE | N/A (map to +1a equivalent) | Variable | Bilateral oophorectomy ± hysterectomy |
| CHEMOTHERAPY_INDUCED | N/A | Variable | Iatrogenic ovarian failure |
| PREMATURE_OVARIAN_INSUFFICIENCY | N/A | Elevated | Spontaneous menopause <40 years |

Schema must store both label and straw_stage so queries can work at either level of granularity.
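
Storing both fields allows lookups in either direction. A small sketch of that idea, using only a subset of the rows above; the `STRAW_STAGES` mapping and `labels_for_stage` helper are hypothetical, and representing N/A as an empty tuple is an assumption rather than something this issue specifies.

```python
# Hypothetical subset of the enumeration: graph label -> STRAW+10 stage(s).
# An empty tuple stands for "N/A" (assumption made for this sketch).
STRAW_STAGES = {
    "PERIMENOPAUSE_EARLY": ("-2",),
    "MENOPAUSE": ("0",),
    "POSTMENOPAUSE_EARLY": ("+1a", "+1b", "+1c"),
    "POSTMENOPAUSE_LATE": ("+2",),
    "SURGICAL_MENOPAUSE": (),
}


def labels_for_stage(stage: str) -> list:
    """Fine-grained query: which graph labels cover a given STRAW+10 stage?"""
    return [label for label, stages in STRAW_STAGES.items() if stage in stages]


def stages_for_label(label: str) -> tuple:
    """Coarse-grained query: which STRAW+10 stages does a graph label span?"""
    return STRAW_STAGES[label]
```

A query for stage `+1b` resolves to `POSTMENOPAUSE_EARLY`, while a query by label returns all three sub-stages, which is the two-level granularity the requirement describes.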


4. Paper Evidence Weighting Model

Each Paper node must carry a computed evidence_weight (float, 0.0–1.0) derived from the following components:

4a. Component Scores

| Component | Weight | Scoring Rule |
|---|---|---|
| Study Design | 0.35 | meta_analysis/systematic_review: 1.0 · rct: 0.9 · prospective_cohort: 0.7 · cross_sectional: 0.5 · case_control: 0.4 · case_report/case_series: 0.2 · narrative_review/editorial: 0.15 · animal/in_vitro: 0.1 |
| Sample Size | 0.20 | log-scaled: min(1.0, log10(n) / log10(10000)); i.e., n=10→0.25, n=100→0.5, n=1000→0.75, n≥10000→1.0. For meta-analyses, use total pooled N. |
| Recency | 0.10 | max(0, 1 - (current_year - pub_year) / 30); papers 30 or more years old score 0; landmark papers can be manually overridden. |
| Journal Impact Proxy | 0.10 | Normalized citation rate: min(1.0, citations_per_year / field_median_cpy). Bootstrap field median from initial corpus. |
| Replication Signal | 0.15 | Number of other papers in the graph whose extracted triples corroborate ≥1 triple from this paper. Normalized 0–1. (Computed post-ingestion, default 0 on first pass.) |
| MedGemma Extraction Confidence | 0.10 | Mean confidence_score across all EXTRACTED_FROM edges pointing to this paper. |
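
The closed-form scoring rules above translate directly into small functions. This is a sketch under stated assumptions: the function names are hypothetical, a missing or non-positive sample size is assumed to score 0, a non-positive field median is assumed to score 0, and recency is additionally capped at 1.0 in case a publication year lies in the future.

```python
import math
from typing import Optional

# Study-design scores copied from the scoring rule above. Keys follow that
# rule; the §5 enum spells the last category as a single "animal_in_vitro".
STUDY_DESIGN_SCORE = {
    "meta_analysis": 1.0, "systematic_review": 1.0, "rct": 0.9,
    "prospective_cohort": 0.7, "cross_sectional": 0.5, "case_control": 0.4,
    "case_report": 0.2, "case_series": 0.2,
    "narrative_review": 0.15, "editorial": 0.15,
    "animal": 0.1, "in_vitro": 0.1,
}


def sample_size_score(n: Optional[int]) -> float:
    """Log-scaled sample size: n=10 -> 0.25, n=100 -> 0.5, n>=10000 -> 1.0."""
    if n is None or n < 1:
        return 0.0  # assumption: unknown or non-positive n scores 0
    return min(1.0, math.log10(n) / math.log10(10000))


def recency_score(pub_year: int, current_year: int) -> float:
    """Linear decay over 30 years, clamped to [0, 1]."""
    age = current_year - pub_year
    return min(1.0, max(0.0, 1 - age / 30))


def citation_rate_score(citations_per_year: float,
                        field_median_cpy: float) -> float:
    """Citation rate relative to the field median, capped at 1.0."""
    if field_median_cpy <= 0:
        return 0.0  # assumption: no usable median yet
    return min(1.0, citations_per_year / field_median_cpy)
```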

4b. Composite Formula

evidence_weight = Σ (component_weight_i × component_score_i)
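
As a worked illustration of the weighted sum, the sketch below uses the component weights from §4a; the dictionary keys and the `evidence_weight` function name are hypothetical labels chosen for this example, and the input scores describe an invented paper, not real data.

```python
# Component weights from §4a (they sum to 1.0); key names are illustrative.
COMPONENT_WEIGHTS = {
    "study_design": 0.35,
    "sample_size": 0.20,
    "recency": 0.10,
    "journal_impact_proxy": 0.10,
    "replication_signal": 0.15,
    "extraction_confidence": 0.10,
}


def evidence_weight(scores: dict) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    missing = set(COMPONENT_WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(w * scores[name] for name, w in COMPONENT_WEIGHTS.items())


# Invented example: an RCT with n=1000, published 15 years ago, cited at half
# the field median, not yet replicated, mean extraction confidence 0.9.
example_scores = {
    "study_design": 0.9,
    "sample_size": 0.75,
    "recency": 0.5,
    "journal_impact_proxy": 0.5,
    "replication_signal": 0.0,
    "extraction_confidence": 0.9,
}
example_weight = evidence_weight(example_scores)
```

Here `example_weight` comes to about 0.655 (0.315 + 0.15 + 0.05 + 0.05 + 0 + 0.09). Because the replication signal defaults to 0 on the first pass, no paper can exceed 0.85 until replication is computed.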

4c. How Weight Propagates

  • Every EXTRACTED_FROM edge carries the source paper's evidence_weight.
  • When a triple (e.g., E2 -[PREDICTS]→ Verbal Memory) is extracted from multiple papers, the triple's aggregate confidence is:
    triple_confidence = 1 - Π (1 - evidence_weight_i × confidence_score_i)
    
    (noisy-OR: more independent high-quality sources → higher aggregate confidence, with diminishing returns.)
  • Query APIs and visualization layers should expose both per-paper weight and aggregate triple confidence.
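
The noisy-OR aggregation can be sketched in a few lines. The `triple_confidence` function name and the (evidence_weight, confidence_score) pair representation are assumptions made for this example.

```python
def triple_confidence(sources) -> float:
    """Noisy-OR over (evidence_weight, confidence_score) pairs, one per paper.

    Each source independently "supports" the triple with probability
    evidence_weight * confidence_score; the result is the probability that
    at least one source does.
    """
    product = 1.0
    for weight, confidence in sources:
        if not (0.0 <= weight <= 1.0 and 0.0 <= confidence <= 1.0):
            raise ValueError("weights and confidences must lie in [0, 1]")
        product *= 1.0 - weight * confidence
    return 1.0 - product
```

With two sources at (0.8, 0.9) and (0.5, 0.8), the per-source terms are 0.72 and 0.40, and the aggregate is 1 - 0.28 × 0.60 = 0.832: higher than either source alone but less than their sum, which is the diminishing-returns behavior described above. A triple with no sources gets 0.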

4d. Override Mechanism

Some landmark papers (e.g., SWAN cohort publications, Kronos Early Estrogen Prevention Study) may deserve manual weight overrides. Schema must support an optional weight_override field on the Paper node with a justification string.
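
One possible reading of the override, sketched below, is that a present `weight_override` replaces the computed value outright and is rejected without a justification. That replacement semantics is an assumption; the issue only requires that the fields exist. The `effective_weight` helper is hypothetical.

```python
from typing import Optional


def effective_weight(computed_weight: float,
                     weight_override: Optional[float] = None,
                     weight_override_justification: Optional[str] = None) -> float:
    """Return the weight downstream consumers should use for a Paper node."""
    if weight_override is None:
        return computed_weight
    # An override without a written reason is treated as invalid.
    if not weight_override_justification or not weight_override_justification.strip():
        raise ValueError("weight_override requires a justification string")
    if not 0.0 <= weight_override <= 1.0:
        raise ValueError("weight_override must lie in [0.0, 1.0]")
    return weight_override
```

Keeping the computed `evidence_weight` stored alongside the override, rather than overwriting it, would let reviewers audit how far a manual value departs from the rubric.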


5. Schema File Format

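Deliver as schema/menopause_kg_schema.json conforming to the top-level structure below. The // and /* */ comments only mark elided sections in this outline; the delivered file must be comment-free so that it parses as valid JSON: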

{
  "schema_version": "0.1.0",
  "semver_note": "MAJOR.MINOR.PATCH — bump MINOR for new node/edge types, PATCH for property additions",
  "node_types": {
    "Paper": {
      "required": ["pmid", "title", "pub_year", "study_design", "evidence_weight"],
      "optional": ["journal", "sample_size", "citation_count", "weight_override", "weight_override_justification"],
      "enums": { "study_design": ["meta_analysis", "systematic_review", "rct", "prospective_cohort", "cross_sectional", "case_control", "case_report", "narrative_review", "editorial", "animal_in_vitro"] }
    },
    // ... remaining node types per §1
  },
  "edge_types": {
    "MODULATES": {
      "source": "HormonalPhase",
      "target": "Biomarker",
      "directed": true,
      "required_properties": ["direction"],
      "optional_properties": ["magnitude"],
      "example": { /* ... */ }
    },
    // ... remaining edge types per §2
  },
  "evidence_weighting": {
    "components": { /* per §4a */ },
    "formula": "weighted_sum",
    "aggregation": "noisy_or"
  },
  "hormonal_phase_enum": { /* per §3 */ }
}
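
A first-pass structural check on the delivered file could look like the sketch below. It uses only the standard library and is not the meta-validation planned for tests/test_schema.py; the `check_schema_text` helper and its messages are hypothetical.

```python
import json

# Top-level keys shown in the structure above (semver_note is treated as optional).
REQUIRED_TOP_LEVEL = ["schema_version", "node_types", "edge_types",
                      "evidence_weighting", "hormonal_phase_enum"]


def check_schema_text(text: str) -> list:
    """Return a list of problems found in the schema file contents."""
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as exc:
        # Also triggered by // or /* */ comments, which JSON does not allow.
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(schema, dict):
        return ["top level must be a JSON object"]
    problems = [f"missing top-level key: {key}"
                for key in REQUIRED_TOP_LEVEL if key not in schema]
    # schema_version must look like MAJOR.MINOR.PATCH with numeric parts.
    if "schema_version" in schema:
        parts = str(schema["schema_version"]).split(".")
        if len(parts) != 3 or not all(p.isdigit() for p in parts):
            problems.append("schema_version is not MAJOR.MINOR.PATCH")
    # Every node type must at least declare its required properties.
    node_types = schema.get("node_types", {})
    if isinstance(node_types, dict):
        for name, spec in node_types.items():
            if not isinstance(spec, dict) or "required" not in spec:
                problems.append(f"node type {name} lacks a 'required' list")
    return problems
```

In a test, this would typically be called on the file contents read from schema/menopause_kg_schema.json and asserted to return an empty list.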

Acceptance Criteria

  • schema/menopause_kg_schema.json exists, is valid JSON, and passes the JSON Schema meta-validation in tests/test_schema.py
  • All 9 node types documented with required properties, optional properties, and enum values where applicable
  • All 9 edge types documented with directionality, required/optional edge properties, and one example triple each
  • HormonalPhase values enumerated and mapped to STRAW+10 stages (including non-STRAW categories: surgical, chemo-induced, POI)
  • Evidence weighting model fully specified: component scores, composite formula, propagation rule, and override mechanism
  • Paper node carries evidence_weight as a required computed field with clear scoring rubric for each component
  • Schema is versioned (0.1.0, semver) and every extraction record will reference schema_version
  • A CHANGELOG.md stub is created in schema/ to track future schema evolution
  • At least 2 team members review and approve the schema before merge (to prevent premature lock-in)

Open Questions (resolve before or during review)

  1. Granularity of StudyPopulation: Should ethnicity be a free-text summary or a controlled vocabulary (e.g., NIH categories)? Leaning toward controlled vocab + other_text escape hatch.
  2. Temporal edges: Some relationships are time-dependent (e.g., symptom severity changes across phases). Do we model this as edge properties or as separate TemporalObservation nodes? Propose deferring to v0.2 unless the team feels strongly.
  3. Negative results: Papers that find no association are as important as positive findings. Null results are only partly expressible today: AFFECTS and TESTED_IN accept no_effect and ASSOCIATED_WITH accepts unclear, but PREDICTS allows only +/-. Should we add an explicit CONTRADICTS edge type for direct conflict tracking? Propose including it in v0.1 to capture this from the start.
