Skip to content

Latest commit

 

History

History
353 lines (246 loc) · 20.2 KB

File metadata and controls

353 lines (246 loc) · 20.2 KB

Sub-Graph Matching Search: Formal Specification

Query Graphs, Matching Criteria, and Ultrametric Ranking on Nested Semantic Trees

Version: 0.3 Date: 2026-05-22 Status: Formal specification Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions) Grounded in: Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374), Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742)


1. Introduction

This document provides the formal specification for the sub-graph matching search architecture — the computational core of the Nested Semantic Graph (NSG) project. Where 0.2.md established the formal definitions (nodes, edges, ultrametric distance, tree-alignment lattice), this document specifies how those definitions are operationalized for information retrieval.

The search architecture described here is the engineering complement to:

  • Few Become One $\S$V.C — the conceptual proposal for sub-graph matching search
  • Q-PNA $\S$2-3 — the neural architecture providing the encoding/representation layer

All definitions in 0.2.md are prerequisite reading.


2. The Search Problem

2.1 Formal Statement

Definition 1 (Sub-Graph Search Problem). Given:

  • A query graph $G_q$, expressed as a nested semantic tree
  • A document corpus $\mathcal{D} = {G_d^{(1)}, G_d^{(2)}, \ldots, G_d^{(N)}}$, where each $G_d^{(i)}$ is a nested semantic tree representing the meaning of document $i$
  • A matching criterion $\mathcal{M}$ specifying how $G_q$ relates to $G_d^{(i)}$
  • A ranking function $\mathcal{R}: \mathcal{D} \to \mathbb{R}^N$ assigning a relevance score to each document

Find: The ordered list $\mathcal{R}(\mathcal{D}, G_q) = (i_1, i_2, \ldots, i_k)$ of document indices sorted by decreasing relevance to the query, where $k \leq N$ is the number of matches satisfying $\mathcal{M}$.

2.2 Scope and Assumptions

Assumption Justification
All graphs are nested semantic trees (NSTs) Per 0.2.md Definition 10 — the tree constraint is fundamental. Non-tree semantic relationships are represented via secondary indexing.
Matching is subtree-based The query $G_q$ is typically smaller than document graphs $G_d^{(i)}$ — we match the query as a subtree of the document
Node labels are semantic categories from $\mathcal{C}$ Per 0.2.md Definition 3 — categories are language-neutral
Edge roles are preserved in matching Agent/patient distinctions matter — structural equivalence requires edge-label preservation
The corpus is offline-indexed Document graphs are pre-computed and stored; queries are processed online

3. Matching Criteria: Three Tiers

3.1 Type I: Exact Subtree Isomorphism

Definition 2 (Exact Match). A query graph $G_q = (V_q, E_q, r_q)$ exactly matches a document graph $G_d = (V_d, E_d, r_d)$ if there exists an injective mapping $\phi: V_q \to V_d$ such that:

  1. Label preservation: $\lambda(\phi(v)) = \lambda(v)$ for all $v \in V_q$
  2. Edge preservation: For every edge $a \to b$ in $E_q$, there is an edge $\phi(a) \to \phi(b)$ in $E_d$
  3. Root preservation: $\phi(r_q)$ is in the subtree of $r_d$ (i.e., the query root maps to a node that is part of the document's event structure)
  4. Structural equivalence: The induced subgraph on $\phi(V_q)$ is isomorphic to $G_q$

Notation: We write $G_q \sqsubseteq G_d$ when an exact match exists.

Interpretation: The proposition encoded by the query is fully contained within the proposition encoded by the document. All conceptual primitives are present with the correct scope relationships.

Example. Query: "dog bites man" → graph with BITE(root), dog(agent), man(patient). Document: "The large brown dog viciously bit the old man in the park yesterday" → graph with BITE(root), dog(agent)+modifiers, man(patient)+modifiers, park(locative), yesterday(tense). The query exactly matches the document: BITE→BITE, dog→dog, man→man, all edges preserved.

3.2 Type II: Partial Match with Distance

Definition 3 (Partial Match). A query graph $G_q$ partially matches a document graph $G_d$ if there exists an injective mapping $\phi: V_q' \to V_d$ for some $V_q' \subseteq V_q$ satisfying Type I criteria. The match quality is:

$$Q(G_q, G_d) = \frac{|\phi(V_q')|}{|V_q|}$$

where $|\cdot|$ denotes the number of nodes.

Definition 4 (Match Distance). For a partial match with mapping $\phi$, the match distance is:

$$D_{\text{match}}(G_q, G_d) = \min_{\phi \in \Phi} D_T(\text{lca}_{\mathcal{L}}(\phi(G_q'), \text{closest_match}_{\mathcal{L}}))$$

where $\Phi$ is the set of all valid partial match mappings and $\mathcal{L}$ is the tree-alignment lattice (0.2.md Definition 16).

Interpretation: Partial matches are ranked by two criteria: (1) what fraction of the query was matched (coverage), and (2) how much the matching subtree structurally diverges from the query (distance). A match covering 80% of query nodes with distance 0 ranks above a match covering 100% of query nodes with distance 5.

3.3 Type III: Approximate Match via Tree Edit Distance

Definition 5 (Approximate Match). When no Type I or Type II match exists, an approximate match is found by computing the minimum tree edit distance between $G_q$ and subtrees of $G_d$, where allowed edit operations are:

  1. Relabel: Change a node's label to match the query — cost $c_{\text{relabel}} = 1$
  2. Delete: Remove a node from the query — cost $c_{\text{delete}} = 2$
  3. Insert: Add a node to the document subtree to match the query — cost $c_{\text{insert}} = 2$
  4. Re-parent: Move a node to a different parent — cost $c_{\text{reparent}} = 3$

Definition 6 (Edit Distance Score). The edit distance score is:

$$S_{\text{edit}}(G_q, G_d) = \min_{\text{edit sequence}} \sum_{e \in \text{edits}} c_{\text{type}(e)}$$

where lower scores indicate better matches (0 = exact match).

Interpretation: Edit distance provides a fallback when structural differences are too large for partial matching. It allows the system to retrieve documents that are "close in meaning" even when the tree structures differ — e.g., a query about "dog chasing cat" might match a document about "cat fleeing dog" through relabel operations.


4. Ranking Algorithm

4.1 The Ranking Function

Definition 7 (Ranking Function). For a query $G_q$ and document corpus $\mathcal{D}$, the ranking function produces:

$$\mathcal{R}(G_q, \mathcal{D}) = \text{sort_descending}\left( \left{ (i, s_i) : s_i = \text{score}(G_q, G_d^{(i)}) \right} \right)$$

where the score function is:

$$\text{score}(G_q, G_d) = \begin{cases} 1.0 - \alpha \cdot D_{\text{match}}(G_q, G_d) & \text{if Type I/II match} \ 1.0 - \beta \cdot S_{\text{edit}}(G_q, G_d) & \text{if Type III match} \ 0 & \text{if no match} \end{cases}$$

with normalization constants $\alpha, \beta$ chosen so that scores fall in $[0, 1]$.

Definition 8 (Score Normalization).

$$\alpha = \frac{1}{H_{\text{max}}}, \quad \beta = \frac{1}{c_{\text{reparent}} \cdot |V_q|}$$

where $H_{\text{max}}$ is the maximum tree height in the corpus.

4.2 Ultrametric Properties of the Ranking

Theorem 1 (Nested Result Clusters). The ranking produced by $\mathcal{R}$ forms nested, ultrametric clusters of results. For any three documents $d_1, d_2, d_3$ with scores $s_1 \leq s_2 \leq s_3$, the score gaps satisfy:

$$\max(s_3 - s_2, s_2 - s_1) \geq s_3 - s_1$$

In particular, there exist natural "cut points" where the score gap jumps, forming discrete result clusters rather than a continuous ranking spectrum.

Proof. The ultrametric property of the tree-alignment lattice $\mathcal{L}$ (0.2.md Theorem 1) guarantees that $D_{\text{match}}$ satisfies the strong triangle condition. Since scores are a linear transformation of distances, the score gaps inherit the ultrametric clustering property. $\square$

Corollary 1 (Natural Cut Points). The search system can identify "natural result page boundaries" by detecting where $\Delta s_i = s_i - s_{i+1}$ exceeds a threshold $\tau$. Results above the cut are in one ultrametric ball (high relevance); results below are in a different ball (low relevance).

4.3 Algorithm: Rank-and-Cluster

Algorithm RANK-AND-CLUSTER(G_q, D, k):
  Input:  Query graph G_q, document corpus D, result limit k
  Output: Ranked list of k document indices with cluster boundaries

  matches = []
  for each G_d in D:
    m = FIND-BEST-MATCH(G_q, G_d)      # Type I, II, or III
    if m is not None:
      matches.append((d.index, m.score, m.match_type, m.distance))

  sort matches by score descending

  # Detect cluster boundaries (natural cut points)
  clusters = []
  for i from 0 to min(k, len(matches))-1:
    if i > 0 and (matches[i-1].score - matches[i].score) > tau:
      mark_cluster_boundary(i)

  return matches[0:k], clusters

4.4 Time Complexity

Operation Complexity Notes
Subtree isomorphism (Type I) $\mathcal{O}( V_q
Partial match (Type II) $\mathcal{O}(2^{ V_q
Approximate match (Type III) $\mathcal{O}( V_q
Ranking (all documents) $\mathcal{O}(N \cdot M)$ $M$ = per-document match cost above
Offline index construction $\mathcal{O}(\sum_{d} E_d)$ $E_d$ = effort to parse document $d$ into NST

5. Connection to Q-PNA Encoding

5.1 The Full Pipeline

The sub-graph matching search ($0.3, this document) is the retrieval layer in a two-layer architecture with Q-PNA providing the encoding layer:

┌──────────────────────────────────────────────────────────────┐
│  QUERY INTERFACE                                              │
│  Surface text (any language) or visual query builder          │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  Q-PNA ENCODING LAYER (§3.2, DOI: 10.5281/zenodo.20287742)   │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Text → morpheme segmentation → semantic prime assignment │ │
│  │ → prime product P(t) = Π p_i^{f_i} → valuation vector   │ │
│  │ → Bruhat-Tits tree T_p leaf activation                  │ │
│  └─────────────────────────────────────────────────────────┘ │
│  Output: Nested Semantic Tree G (nodes, edges, heights, LCA)  │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  NSG INDEXING LAYER (this architecture)                       │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ G → NST structure → indexed in graph database           │ │
│  │  • Nodes: category, label, height                        │ │
│  │  • Edges: (parent, child, role) with scope semantics     │ │
│  │  • LCA precomputed for O(1) distance queries             │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  NSG MATCHING LAYER (this architecture)                       │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ G_q (query) → subtree isomorphism search over corpus     │ │
│  │ → FIND-BEST-MATCH → score computation → RANK-AND-CLUSTER │ │
│  └─────────────────────────────────────────────────────────┘ │
│  Output: Ranked document list with cluster boundaries         │
└──────────────────────────────────────────────────────────────┘

5.2 Interface Contract

Layer Input Output
Q-PNA Encoder Raw text (any language) Nested Semantic Tree $G$ with node labels, heights, LCA preprocessed
NSG Indexer $G$ from Q-PNA Indexed graph database entry
NSG Matcher $G_q$ (query), indexed corpus Ranked results

Contract invariant: The Q-PNA encoder guarantees that $G$ is a valid NST (rooted tree, height function monotone, LCA computable). The NSG matcher guarantees that ranking respects the ultrametric inequality.

5.3 Cross-Linguistic Property

The Q-PNA encoder is language-neutral: the same proposition encoded in English, Mohawk, or Turkish produces isomorphic NSTs (0.2.md Theorem 2). Consequently, the NSG matcher is also language-neutral — a query in any language matches against a corpus in any language, because both are represented in the same invariant graph space.


6. Walkthrough: A Complete Query

6.1 Query

User input (English): "dog bit man" User input (Mohawk equivalent): Single verb form encoding the same proposition

6.2 Step 1 — Q-PNA Encoding

The surface text (either language) is parsed into a nested semantic tree:

                 BITE (h=3.0, ACTION, root)
                 /           \
          dog (ENTITY)    man (ENTITY)
          (h=0.0, agent)  (h=0.0, patient)

Token encoding:

  • dog → semantic prime 2, $P = 2^1 = 2$, $\vec{v} = (1)$
  • man → semantic prime 2, $P = 2^2 = 4$, $\vec{v} = (2)$
  • BITE → semantic prime 5, $P = 5^3 = 125$, $\vec{v} = (3)$

6.3 Step 2 — NSG Indexing

The graph is indexed with:

  • Nodes: bite (cat=ACTION, h=3.0), dog (cat=ENTITY, h=0.0, role=agent), man (cat=ENTITY, h=0.0, role=patient)
  • Edges: dog → bite [agent], man → bite [patient]
  • LCA table: $\text{lca}(dog, man) = bite$, $\text{lca}(dog, bite) = bite$, etc.

6.4 Step 3 — Matching

For each document $G_d^{(i)}$ in the corpus, FIND-BEST-MATCH attempts:

  1. Type I: Can we find an exact subtree isomorphism mapping {bite, dog, man} to nodes in $G_d^{(i)}$?

    • If the document contains a BITE event with dog-agent and man-patient → exact match, score = 1.0, distance = 0
    • If the document contains a BITE event but with dog-agent and cat-patient → partial match (man unmatched), score ≈ 0.67
    • If the document contains a CHASE event → approximate match via tree edit (relabel CHASE→BITE, then check arguments)
  2. Type II (if Type I fails): Compute maximum partial match:

    • Matched: {bite, dog} (2 of 3 nodes) → coverage = 2/3, distance = 0 → score = 0.67
  3. Type III (if Type II fails): Compute minimum edit distance:

    • If the document has {bite, cat, man} → relabel cat→dog (cost 1) → match achieved → score = $1 - \beta \cdot 1$

6.5 Step 4 — Ranking

Results sorted by score. A result cluster might look like:

Rank Doc ID Score Match Type Distance
1 42 1.00 Type I 0.0
2 17 1.00 Type I 0.0
3 88 0.95 Type I 0.5
---------- CUT --- --- ---
4 5 0.67 Type II 0.0
5 31 0.67 Type II 0.0
---------- CUT --- --- ---
6 12 0.33 Type III
7 73 0.33 Type III

Results 1-3 form one ultrametric cluster (exact or near-exact semantic matches). Results 4-5 form a second cluster (partial matches, missing some concepts). Results 6-7 form a third cluster (approximate matches via edit distance).


7. Implementation Considerations

7.1 Index Structure

The graph database index should support:

  • Node lookup by label and category$\mathcal{O}(1)$ hash-based
  • Edge traversal — parent/child navigation, $\mathcal{O}(b)$ per level
  • Precomputed LCA — binary lifting table, $\mathcal{O}(\log n)$ query
  • Height-indexed nodes — range queries for "all nodes at height $\leq h$"

7.2 Query Optimization

Optimization Description
Node pruning If a document has no ACTION node, skip — no event can match
Category filtering Build inverted index of $\text{category} \to \text{node_ids}$ for fast candidate generation
Height bounding Only examine document nodes with height $\geq$ query root height
Edge role indexing Pre-index agent/patient/modifier edges for fast structural verification
LCA caching All-pairs LCA for each document precomputed at indexing time

7.3 Scalability

Corpus Size Strategy
$\leq 10^4$ documents In-memory graph database with exhaustive matching
$10^4 - 10^6$ documents Sharded graph database with candidate filtering before matching
$\geq 10^6$ documents Distributed index with approximate nearest-neighbor in ultrametric space + re-ranking

8. Limitations and Future Work

Limitation Mitigation
Subtree isomorphism is NP-complete in general Semantic trees have bounded degree and small depth; practical instances are tractable
Partial matching requires enumerating subsets Greedy approximation with bounded backtracking
Tree edit distance is cubic in node count Bounded by query size, which is small ($\leq 20$ nodes)
Q-PNA integration requires a working encoder Use the ultrametric-ai-poc codebase as the initial encoder
No handling of negation or modality Future work: extend node categories to include ${\text{NEG}, \text{MODAL}}$
Cross-linguistic equivalence is assumed Verified for simple examples; needs large-scale validation
The tree-alignment lattice is not implemented Theoretical definition; practical implementation uses simpler distance measures
Real-time query latency not benchmarked Prototype stage; benchmarking planned for S4

References

  1. 0.1.md — Internal Literature Review: The Ultrametric-Language-AI Research Program
  2. 0.2.md — Formal Definitions: The Nested Semantic Graph
  3. Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374)
  4. Q-PNA Research Specification v2.0 $\S$2-3 (DOI: 10.5281/zenodo.20287742)
  5. Tree Distance Cophenetic $\S$2 (DOI: 10.5281/zenodo.20213043)
  6. ultrametric-ai-poc — Working proof-of-concept (github.com/rwnq8/ultrametric-ai-poc)

Sub-Graph Matching Search Specification v0.3 — Operationalizes the formalism from 0.2.md into a concrete search architecture. Python prototype in S4 (0.4.py).