Define extraction schema: node types, edge types, and evidence weighting model

### Context & Motivation

Before MedGemma touches a single paper, we need a fixed ontology that tells it **what** to extract and a weighting model that tells downstream consumers **how much to trust it**. Without the schema, outputs will be inconsistent across papers and impossible to merge. Without evidence weighting, a case report and a 10,000-participant RCT would carry identical authority in the graph — making clinical reasoning over it unreliable.

This issue is on the critical path: every extraction, validation, and query issue depends on the schema contract defined here.

---

### 1. Node Types

Each node type must declare: `id` format, required properties, optional properties, and canonical enum values where applicable.

| Node Type | Description | Required Properties | Notes |
|---|---|---|---|
| **Paper** | The source PubMed article; anchor for all provenance and weighting | `pmid`, `title`, `pub_year`, `journal`, `study_design`, `sample_size`, `citation_count`, `evidence_weight` (computed) | First-class node, not metadata. Every extracted triple must trace back here. |
| **HormonalPhase** | Reproductive staging category | `label`, `straw_stage`, `description` | Enumerated values mapped to STRAW+10 (see §3) |
| **Symptom** | Clinical symptom reported or studied | `label`, `mesh_id` (optional), `category` (vasomotor / cognitive / mood / musculoskeletal / sleep / urogenital) | Normalize to MeSH or SNOMED-CT where possible |
| **Biomarker** | Measurable biological indicator | `label`, `unit`, `specimen_type` (serum, saliva, CSF, etc.) | e.g., FSH, E2, AMH, SHBG, CRP, BDNF |
| **BrainRegion** | Neuroanatomical structure | `label`, `hemisphere` (L/R/bilateral/NA), `atlas_id` (optional) | Prefer NeuroNames or AAL atlas identifiers |
| **CognitiveFunction** | Cognitive domain or specific function | `label`, `domain` (memory, executive, attention, processing_speed, verbal_fluency, spatial) | Map to standard neuropsych domains |
| **Intervention** | Any treatment, therapy, or exposure | `label`, `type` (pharmacological / surgical / lifestyle / supplement / psychotherapy), `route` (oral, transdermal, etc., if applicable) | Include dosage info as optional properties when extractable |
| **StudyPopulation** | Cohort characteristics | `age_range`, `mean_age`, `n`, `ethnicity`, `inclusion_criteria_summary`, `menopausal_status` | One node per distinct cohort within a paper |
| **GeneOrVariant** | Genetic factor mentioned in association | `symbol`, `rsid` (optional), `gene_id` (optional) | Captures pharmacogenomic and risk-factor genetics |

---

### 2. Edge Types

Every edge must declare: source type → target type, directionality, required edge properties, and an example triple.

| Edge Type | Direction | Required Edge Properties | Example Triple |
|---|---|---|---|
| **MODULATES** | HormonalPhase → Biomarker | `direction` (increases / decreases / fluctuates / unclear), `magnitude` (if reported) | `(Late Perimenopause) -[MODULATES {direction: "decreases"}]→ (Estradiol)` |
| **ASSOCIATED_WITH** | Symptom ↔ BrainRegion | `correlation_direction` (+/-/unclear), `imaging_modality` (fMRI, PET, sMRI, etc.) | `(Brain Fog) -[ASSOCIATED_WITH {correlation_direction: "-", imaging_modality: "fMRI"}]→ (Prefrontal Cortex)` |
| **PRESENTS_WITH** | HormonalPhase → Symptom | `prevalence` (if reported), `severity_scale` (if reported) | `(Early Postmenopause) -[PRESENTS_WITH {prevalence: "60-80%"}]→ (Hot Flashes)` |
| **PREDICTS** | Biomarker → CognitiveFunction | `association_direction` (+/-), `p_value` (optional), `effect_size` (optional) | `(E2) -[PREDICTS {association_direction: "+", p_value: 0.003}]→ (Verbal Memory)` |
| **TESTED_IN** | Intervention → StudyPopulation | `outcome_measure`, `result_summary` (improved / no_effect / worsened) | `(Transdermal E2 HRT) -[TESTED_IN {result_summary: "improved"}]→ (Peri Women 45-55)` |
| **AFFECTS** | Intervention → Symptom | `effect` (alleviates / worsens / no_effect), `effect_size` (optional) | `(SSRI) -[AFFECTS {effect: "alleviates"}]→ (Hot Flashes)` |
| **INFLUENCES** | Biomarker → BrainRegion | `mechanism` (neuroprotective / neuroinflammatory / neurotrophic / unclear) | `(BDNF) -[INFLUENCES {mechanism: "neurotrophic"}]→ (Hippocampus)` |
| **INTERACTS_WITH** | GeneOrVariant → Intervention | `interaction_type` (efficacy_modifier / risk_modifier / metabolic) | `(CYP2D6 poor metabolizer) -[INTERACTS_WITH {interaction_type: "efficacy_modifier"}]→ (Tamoxifen)` |
| **EXTRACTED_FROM** | *any node or edge* → Paper | `extraction_method` (manual / MedGemma_v*), `confidence_score` [0–1], `text_span` (source sentence) | `(triple: E2 PREDICTS Verbal Memory) -[EXTRACTED_FROM {confidence_score: 0.92}]→ (PMID:33456789)` |

> **Note:** `REPORTED_IN` has been renamed to `EXTRACTED_FROM` to distinguish raw paper content from extraction provenance. Every extracted node and edge in the graph must carry at least one `EXTRACTED_FROM` link.
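
The provenance rule can be checked mechanically. The sketch below is illustrative only: the record shape (`id`, `extracted_from`) and the function name `provenance_errors` are assumptions, not part of the schema; the required link properties are the ones listed for `EXTRACTED_FROM` in the table above.

```python
from __future__ import annotations

# Properties the EXTRACTED_FROM row lists as required, plus the target PMID.
REQUIRED_PROVENANCE_KEYS = {"pmid", "extraction_method", "confidence_score", "text_span"}


def provenance_errors(record: dict) -> list[str]:
    """Return human-readable problems with a record's EXTRACTED_FROM links.

    `record` is a hypothetical in-memory form of an extracted node or edge:
    {"id": str, "extracted_from": [ {link properties}, ... ]}.
    """
    errors = []
    links = record.get("extracted_from", [])
    if not links:
        errors.append(f"{record['id']}: no EXTRACTED_FROM link")
    for i, link in enumerate(links):
        missing = sorted(REQUIRED_PROVENANCE_KEYS - link.keys())
        if missing:
            errors.append(f"{record['id']}: link {i} missing {missing}")
        score = link.get("confidence_score")
        if score is not None and not 0.0 <= score <= 1.0:
            errors.append(f"{record['id']}: link {i} confidence_score outside [0, 1]")
    return errors


# Example triple from the table, with a placeholder text span.
triple = {
    "id": "E2-PREDICTS-VerbalMemory",
    "extracted_from": [
        {
            "pmid": "33456789",
            "extraction_method": "manual",
            "confidence_score": 0.92,
            "text_span": "<source sentence>",
        }
    ],
}
orphan = {"id": "BDNF-INFLUENCES-Hippocampus", "extracted_from": []}

print(provenance_errors(triple))  # []
print(provenance_errors(orphan))
```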

---

### 3. HormonalPhase Enumeration (STRAW+10 Mapping)

| Label (Graph Enum) | STRAW+10 Stage | FSH Characteristic | Description |
|---|---|---|---|
| `REPRODUCTIVE_LATE` | -3b / -3a | Variable | Subtle fertility decline, regular cycles |
| `PERIMENOPAUSE_EARLY` | -2 | Elevated | Cycle length variability ≥7 days |
| `PERIMENOPAUSE_LATE` | -1 | >25 IU/L | Amenorrhea ≥60 days, interval of skipped cycles |
| `MENOPAUSE` | 0 | — | Final menstrual period (retrospective, 12 mo amenorrhea) |
| `POSTMENOPAUSE_EARLY` | +1a / +1b / +1c | Stabilizing high | 0–6 years post-FMP |
| `POSTMENOPAUSE_LATE` | +2 | Stable high | >6 years post-FMP |
| `SURGICAL_MENOPAUSE` | N/A (map to +1a equivalent) | Variable | Bilateral oophorectomy ± hysterectomy |
| `CHEMOTHERAPY_INDUCED` | N/A | Variable | Iatrogenic ovarian failure |
| `PREMATURE_OVARIAN_INSUFFICIENCY` | N/A | Elevated | Spontaneous menopause <40 years |

Schema must store both `label` and `straw_stage` so queries can work at either level of granularity.
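
A minimal sketch of what querying at either granularity could look like, using a subset of the rows above. The class and helper names are hypothetical, and storing `straw_stage` as a tuple (so that `+1a`, `+1b`, and `+1c` all resolve to the same label) is an assumption rather than a schema decision.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class HormonalPhase:
    label: str
    straw_stage: Tuple[str, ...]  # empty tuple for non-STRAW categories
    description: str


# Subset of the enumeration table, for illustration only.
PHASES = (
    HormonalPhase("PERIMENOPAUSE_EARLY", ("-2",), "Cycle length variability >=7 days"),
    HormonalPhase("MENOPAUSE", ("0",), "Final menstrual period"),
    HormonalPhase("POSTMENOPAUSE_EARLY", ("+1a", "+1b", "+1c"), "0-6 years post-FMP"),
    HormonalPhase("POSTMENOPAUSE_LATE", ("+2",), ">6 years post-FMP"),
    HormonalPhase("SURGICAL_MENOPAUSE", (), "Bilateral oophorectomy +/- hysterectomy"),
)


def by_label(label: str) -> Optional[HormonalPhase]:
    """Coarse lookup by graph enum label."""
    return next((p for p in PHASES if p.label == label), None)


def by_straw_stage(stage: str) -> Optional[HormonalPhase]:
    """Fine-grained lookup by a single STRAW+10 stage code."""
    return next((p for p in PHASES if stage in p.straw_stage), None)


print(by_straw_stage("+1b").label)                 # POSTMENOPAUSE_EARLY
print(by_label("SURGICAL_MENOPAUSE").straw_stage)  # ()
```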

---

### 4. Paper Evidence Weighting Model

Each `Paper` node must carry a computed **`evidence_weight`** (float, 0.0–1.0) derived from the following components:

#### 4a. Component Scores

| Component | Weight | Scoring Rule |
|---|---|---|
| **Study Design** | 0.35 | `meta_analysis/systematic_review`: 1.0 · `rct`: 0.9 · `prospective_cohort`: 0.7 · `cross_sectional`: 0.5 · `case_control`: 0.4 · `case_report/case_series`: 0.2 · `narrative_review/editorial`: 0.15 · `animal/in_vitro`: 0.1 |
| **Sample Size** | 0.20 | log-scaled: `min(1.0, log10(n) / log10(10000))` — i.e., n=10→0.25, n=100→0.5, n=1000→0.75, n≥10000→1.0. For meta-analyses, use total pooled N. |
| **Recency** | 0.10 | `max(0, 1 - (current_year - pub_year) / 30)`; papers 30 or more years old score 0. Landmark papers can be manually overridden (see §4d). |
| **Journal Impact Proxy** | 0.10 | Normalized citation rate: `min(1.0, citations_per_year / field_median_cpy)`. Bootstrap field median from initial corpus. |
| **Replication Signal** | 0.15 | Number of other papers in the graph whose extracted triples corroborate ≥1 triple from this paper. Normalized 0–1. (Computed post-ingestion, default 0 on first pass.) |
| **MedGemma Extraction Confidence** | 0.10 | Mean `confidence_score` across all `EXTRACTED_FROM` edges pointing to this paper. |
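
The closed-form scoring rules in the table could be implemented roughly as follows. This is a sketch, not the agreed implementation: the function names are hypothetical, and three behaviors are assumptions the table does not specify (scoring `n < 1` as 0, capping recency at 1.0 for papers dated in the future, and rejecting a non-positive field median). Replication and extraction confidence are omitted because they depend on graph state.

```python
import math

# Study-design scores from the table; keys follow the §5 enum,
# with "case_series" added because the table scores it alongside case reports.
STUDY_DESIGN_SCORES = {
    "meta_analysis": 1.0,
    "systematic_review": 1.0,
    "rct": 0.9,
    "prospective_cohort": 0.7,
    "cross_sectional": 0.5,
    "case_control": 0.4,
    "case_report": 0.2,
    "case_series": 0.2,
    "narrative_review": 0.15,
    "editorial": 0.15,
    "animal_in_vitro": 0.1,
}


def sample_size_score(n: int) -> float:
    """min(1.0, log10(n) / log10(10000)). Assumption: n < 1 scores 0."""
    if n < 1:
        return 0.0
    return min(1.0, math.log10(n) / math.log10(10000))


def recency_score(pub_year: int, current_year: int) -> float:
    """max(0, 1 - age / 30). Assumption: capped at 1.0 for future-dated papers."""
    return min(1.0, max(0.0, 1.0 - (current_year - pub_year) / 30))


def journal_impact_score(citations_per_year: float, field_median_cpy: float) -> float:
    """min(1.0, citations_per_year / field_median_cpy)."""
    if field_median_cpy <= 0:
        raise ValueError("field_median_cpy must be positive")
    return min(1.0, citations_per_year / field_median_cpy)


print(sample_size_score(100))     # 0.5
print(recency_score(2010, 2025))  # 0.5
```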

#### 4b. Composite Formula

```
evidence_weight = Σ (component_weight_i × component_score_i)
```
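
As a worked instance of the weighted sum, here is a short sketch using the weights from §4a. The component key names and the function name are illustrative assumptions; the example scores describe a hypothetical RCT with n=1000, published 15 years ago, on first-pass ingestion (replication signal still 0).

```python
from __future__ import annotations

# Component weights from §4a; they sum to 1.0, so the result stays in [0, 1].
COMPONENT_WEIGHTS = {
    "study_design": 0.35,
    "sample_size": 0.20,
    "recency": 0.10,
    "journal_impact": 0.10,
    "replication": 0.15,
    "extraction_confidence": 0.10,
}


def evidence_weight(scores: dict[str, float]) -> float:
    """Weighted sum of component scores, each expected in [0, 1]."""
    missing = sorted(COMPONENT_WEIGHTS.keys() - scores.keys())
    if missing:
        raise ValueError(f"missing component scores: {missing}")
    for name in COMPONENT_WEIGHTS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1]")
    return sum(COMPONENT_WEIGHTS[name] * scores[name] for name in COMPONENT_WEIGHTS)


example = {
    "study_design": 0.9,           # rct
    "sample_size": 0.75,           # n = 1000
    "recency": 0.5,                # 15 years old
    "journal_impact": 1.0,         # at or above the field median
    "replication": 0.0,            # default on first pass
    "extraction_confidence": 0.9,
}
# 0.315 + 0.15 + 0.05 + 0.10 + 0.0 + 0.09
print(round(evidence_weight(example), 3))  # 0.705
```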

#### 4c. How Weight Propagates

- Every `EXTRACTED_FROM` edge carries the source paper's `evidence_weight`.
- When a triple (e.g., `E2 -[PREDICTS]→ Verbal Memory`) is extracted from **multiple** papers, the triple's **aggregate confidence** is:
  ```
  triple_confidence = 1 - Π (1 - evidence_weight_i × confidence_score_i)
  ```
  (noisy-OR: more independent high-quality sources → higher aggregate confidence, with diminishing returns.)
- Query APIs and visualization layers should expose both per-paper weight and aggregate triple confidence.
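
The noisy-OR aggregation above can be sketched in a few lines. The function name and the `(evidence_weight, confidence_score)` pair representation are assumptions for illustration; treating a triple with no sources as confidence 0 is also an assumption.

```python
from typing import Iterable, Tuple


def triple_confidence(sources: Iterable[Tuple[float, float]]) -> float:
    """Noisy-OR over (evidence_weight, confidence_score) pairs.

    Each source independently "fails to support" the triple with
    probability 1 - weight * confidence; the aggregate is the
    probability that at least one source supports it.
    """
    miss = 1.0
    for weight, confidence in sources:
        miss *= 1.0 - weight * confidence
    return 1.0 - miss


one_paper = [(0.7, 0.9)]
two_papers = [(0.7, 0.9), (0.5, 0.8)]

print(round(triple_confidence(one_paper), 3))   # 0.63
print(round(triple_confidence(two_papers), 3))  # 0.778
```

Adding the second, weaker paper raises the aggregate from 0.63 to 0.778 rather than to the sum of the two contributions, which is the diminishing-returns behavior described above.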

#### 4d. Override Mechanism

Some landmark papers (e.g., SWAN cohort publications, Kronos Early Estrogen Prevention Study) may deserve manual weight overrides. The schema must support an optional `weight_override` field on the `Paper` node, accompanied by a `weight_override_justification` string explaining the override.
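
One way consumers might resolve the weight to use is sketched below. The function name is hypothetical, the field names follow the `Paper` definition in §5, and rejecting an override that lacks a justification is an assumption about how strictly the rule should be enforced.

```python
def effective_weight(paper: dict) -> float:
    """Return the manual override if present and justified, else the computed weight."""
    override = paper.get("weight_override")
    if override is None:
        return paper["evidence_weight"]
    if not 0.0 <= override <= 1.0:
        raise ValueError("weight_override must be in [0, 1]")
    justification = paper.get("weight_override_justification") or ""
    if not justification.strip():
        raise ValueError("weight_override requires weight_override_justification")
    return override


computed_only = {"pmid": "33456789", "evidence_weight": 0.705}
overridden = {
    "pmid": "33456789",
    "evidence_weight": 0.705,
    "weight_override": 0.95,
    "weight_override_justification": "Landmark cohort publication",
}

print(effective_weight(computed_only))  # 0.705
print(effective_weight(overridden))     # 0.95
```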

---

### 5. Schema File Format

Deliver as **`schema/menopause_kg_schema.json`** conforming to this top-level structure:

```jsonc
{
  "schema_version": "0.1.0",
  "semver_note": "MAJOR.MINOR.PATCH — bump MINOR for new node/edge types, PATCH for property additions",
  "node_types": {
    "Paper": {
      "required": ["pmid", "title", "pub_year", "study_design", "evidence_weight"],
      "optional": ["journal", "sample_size", "citation_count", "weight_override", "weight_override_justification"],
      "enums": { "study_design": ["meta_analysis", "systematic_review", "rct", "prospective_cohort", "cross_sectional", "case_control", "case_report", "case_series", "narrative_review", "editorial", "animal_in_vitro"] }
    },
    // ... remaining node types per §1
  },
  "edge_types": {
    "MODULATES": {
      "source": "HormonalPhase",
      "target": "Biomarker",
      "directed": true,
      "required_properties": ["direction"],
      "optional_properties": ["magnitude"],
      "example": { /* ... */ }
    },
    // ... remaining edge types per §2
  },
  "evidence_weighting": {
    "components": { /* per §4a */ },
    "formula": "weighted_sum",
    "aggregation": "noisy_or"
  },
  "hormonal_phase_enum": { /* per §3 */ }
}
```
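
To illustrate how an extraction pipeline might consume this structure, the sketch below validates one edge against the `edge_types` section. The inline `SCHEMA` is a trimmed copy of the structure above; the edge record shape (`type`, `source_type`, `target_type`, `properties`) and the function name are assumptions, not part of the deliverable.

```python
from __future__ import annotations

# Trimmed in-memory copy of the schema structure shown above.
SCHEMA = {
    "schema_version": "0.1.0",
    "edge_types": {
        "MODULATES": {
            "source": "HormonalPhase",
            "target": "Biomarker",
            "directed": True,
            "required_properties": ["direction"],
            "optional_properties": ["magnitude"],
        }
    },
}


def validate_edge(schema: dict, edge: dict) -> list[str]:
    """Return a list of violations; an empty list means the edge conforms."""
    spec = schema["edge_types"].get(edge["type"])
    if spec is None:
        return [f"unknown edge type: {edge['type']}"]
    errors = []
    if edge["source_type"] != spec["source"]:
        errors.append(f"source must be {spec['source']}, got {edge['source_type']}")
    if edge["target_type"] != spec["target"]:
        errors.append(f"target must be {spec['target']}, got {edge['target_type']}")
    props = edge.get("properties", {})
    for name in spec["required_properties"]:
        if name not in props:
            errors.append(f"missing required property: {name}")
    allowed = set(spec["required_properties"]) | set(spec["optional_properties"])
    for name in sorted(set(props) - allowed):
        errors.append(f"unexpected property: {name}")
    return errors


good = {
    "type": "MODULATES",
    "source_type": "HormonalPhase",
    "target_type": "Biomarker",
    "properties": {"direction": "decreases"},
}
bad = {
    "type": "MODULATES",
    "source_type": "Symptom",
    "target_type": "Biomarker",
    "properties": {"strength": "high"},
}

print(validate_edge(SCHEMA, good))  # []
print(validate_edge(SCHEMA, bad))
```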

---

### Acceptance Criteria

- [ ] `schema/menopause_kg_schema.json` exists, is valid JSON, and passes the JSON Schema meta-validation in `tests/test_schema.py`
- [ ] All 9 node types documented with required properties, optional properties, and enum values where applicable
- [ ] All 9 edge types documented with directionality, required/optional edge properties, and one example triple each
- [ ] `HormonalPhase` values enumerated and mapped to STRAW+10 stages (including non-STRAW categories: surgical, chemo-induced, POI)
- [ ] Evidence weighting model fully specified: component scores, composite formula, propagation rule, and override mechanism
- [ ] `Paper` node carries `evidence_weight` as a required computed field with clear scoring rubric for each component
- [ ] Schema is versioned (`0.1.0`, semver) and every extraction record will reference `schema_version`
- [ ] A `CHANGELOG.md` stub is created in `schema/` to track future schema evolution
- [ ] At least 2 team members review and approve the schema before merge (to prevent premature lock-in)

---

### Open Questions (resolve before or during review)

1. **Granularity of `StudyPopulation`**: Should ethnicity be a free-text summary or a controlled vocabulary (e.g., NIH categories)? Leaning toward controlled vocab + `other_text` escape hatch.
2. **Temporal edges**: Some relationships are time-dependent (e.g., symptom severity changes across phases). Do we model this as edge properties or as separate `TemporalObservation` nodes? Propose deferring to v0.2 unless the team feels strongly.
3. **Negative results**: Papers that find *no* association are as important as positive findings. The `AFFECTS` / `TESTED_IN` edges support a `no_effect` value and `ASSOCIATED_WITH` supports `unclear`, but `PREDICTS` currently allows only `+` / `-`, and none of these records a direct conflict between papers. Should we add an explicit `CONTRADICTS` edge type for direct conflict tracking? Propose including it in v0.1 to capture this from the start.
