Sub-Graph Matching Search: Formal Specification

Query Graphs, Matching Criteria, and Ultrametric Ranking on Nested Semantic Trees

Version: 0.3 Date: 2026-05-22 Status: Formal specification Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions) Grounded in: Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374), Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742)

1. Introduction

This document provides the formal specification for the sub-graph matching search architecture — the computational core of the Nested Semantic Graph (NSG) project. Where 0.2.md established the formal definitions (nodes, edges, ultrametric distance, tree-alignment lattice), this document specifies how those definitions are operationalized for information retrieval.

The search architecture described here is the engineering complement to:

Few Become One $\S$V.C — the conceptual proposal for sub-graph matching search
Q-PNA $\S$2-3 — the neural architecture providing the encoding/representation layer

All definitions in 0.2.md are prerequisite reading.

2. The Search Problem

2.1 Formal Statement

Definition 1 (Sub-Graph Search Problem). Given:

A query graph $G_q$, expressed as a nested semantic tree
A document corpus $\mathcal{D} = {G_d^{(1)}, G_d^{(2)}, \ldots, G_d^{(N)}}$, where each $G_d^{(i)}$ is a nested semantic tree representing the meaning of document $i$
A matching criterion $\mathcal{M}$ specifying how $G_q$ relates to $G_d^{(i)}$
A ranking function $\mathcal{R}: \mathcal{D} \to \mathbb{R}^N$ assigning a relevance score to each document

Find: The ordered list $\mathcal{R}(\mathcal{D}, G_q) = (i_1, i_2, \ldots, i_k)$ of document indices sorted by decreasing relevance to the query, where $k \leq N$ is the number of matches satisfying $\mathcal{M}$.

2.2 Scope and Assumptions

Assumption	Justification
All graphs are nested semantic trees (NSTs)	Per `0.2.md` Definition 10 — the tree constraint is fundamental. Non-tree semantic relationships are represented via secondary indexing.
Matching is subtree-based	The query $G_q$ is typically smaller than document graphs $G_d^{(i)}$ — we match the query as a subtree of the document
Node labels are semantic categories from $\mathcal{C}$	Per `0.2.md` Definition 3 — categories are language-neutral
Edge roles are preserved in matching	Agent/patient distinctions matter — structural equivalence requires edge-label preservation
The corpus is offline-indexed	Document graphs are pre-computed and stored; queries are processed online

3. Matching Criteria: Three Tiers

3.1 Type I: Exact Subtree Isomorphism

Definition 2 (Exact Match). A query graph $G_q = (V_q, E_q, r_q)$ exactly matches a document graph $G_d = (V_d, E_d, r_d)$ if there exists an injective mapping $\phi: V_q \to V_d$ such that:

Label preservation: $\lambda(\phi(v)) = \lambda(v)$ for all $v \in V_q$
Edge preservation: For every edge $a \to b$ in $E_q$, there is an edge $\phi(a) \to \phi(b)$ in $E_d$
Root preservation: $\phi(r_q)$ is in the subtree of $r_d$ (i.e., the query root maps to a node that is part of the document's event structure)
Structural equivalence: The induced subgraph on $\phi(V_q)$ is isomorphic to $G_q$

Notation: We write $G_q \sqsubseteq G_d$ when an exact match exists.

Interpretation: The proposition encoded by the query is fully contained within the proposition encoded by the document. All conceptual primitives are present with the correct scope relationships.

Example. Query: "dog bites man" → graph with BITE(root), dog(agent), man(patient). Document: "The large brown dog viciously bit the old man in the park yesterday" → graph with BITE(root), dog(agent)+modifiers, man(patient)+modifiers, park(locative), yesterday(tense). The query exactly matches the document: BITE→BITE, dog→dog, man→man, all edges preserved.

3.2 Type II: Partial Match with Distance

Definition 3 (Partial Match). A query graph $G_q$ partially matches a document graph $G_d$ if there exists an injective mapping $\phi: V_q' \to V_d$ for some $V_q' \subseteq V_q$ satisfying Type I criteria. The match quality is:

$$Q(G_q, G_d) = \frac{|\phi(V_q')|}{|V_q|}$$

where $|\cdot|$ denotes the number of nodes.

Definition 4 (Match Distance). For a partial match with mapping $\phi$, the match distance is:

$$D_{\text{match}}(G_q, G_d) = \min_{\phi \in \Phi} D_T(\text{lca}_{\mathcal{L}}(\phi(G_q'), \text{closest_match}_{\mathcal{L}}))$$

where $\Phi$ is the set of all valid partial match mappings and $\mathcal{L}$ is the tree-alignment lattice (0.2.md Definition 16).

Interpretation: Partial matches are ranked by two criteria: (1) what fraction of the query was matched (coverage), and (2) how much the matching subtree structurally diverges from the query (distance). A match covering 80% of query nodes with distance 0 ranks above a match covering 100% of query nodes with distance 5.

3.3 Type III: Approximate Match via Tree Edit Distance

Definition 5 (Approximate Match). When no Type I or Type II match exists, an approximate match is found by computing the minimum tree edit distance between $G_q$ and subtrees of $G_d$, where allowed edit operations are:

Relabel: Change a node's label to match the query — cost $c_{\text{relabel}} = 1$
Delete: Remove a node from the query — cost $c_{\text{delete}} = 2$
Insert: Add a node to the document subtree to match the query — cost $c_{\text{insert}} = 2$
Re-parent: Move a node to a different parent — cost $c_{\text{reparent}} = 3$

Definition 6 (Edit Distance Score). The edit distance score is:

$$S_{\text{edit}}(G_q, G_d) = \min_{\text{edit sequence}} \sum_{e \in \text{edits}} c_{\text{type}(e)}$$

where lower scores indicate better matches (0 = exact match).

Interpretation: Edit distance provides a fallback when structural differences are too large for partial matching. It allows the system to retrieve documents that are "close in meaning" even when the tree structures differ — e.g., a query about "dog chasing cat" might match a document about "cat fleeing dog" through relabel operations.

4. Ranking Algorithm

4.1 The Ranking Function

Definition 7 (Ranking Function). For a query $G_q$ and document corpus $\mathcal{D}$, the ranking function produces:

$$\mathcal{R}(G_q, \mathcal{D}) = \text{sort_descending}\left( \left{ (i, s_i) : s_i = \text{score}(G_q, G_d^{(i)}) \right} \right)$$

where the score function is:

$$\text{score}(G_q, G_d) = \begin{cases} 1.0 - \alpha \cdot D_{\text{match}}(G_q, G_d) & \text{if Type I/II match} \ 1.0 - \beta \cdot S_{\text{edit}}(G_q, G_d) & \text{if Type III match} \ 0 & \text{if no match} \end{cases}$$

with normalization constants $\alpha, \beta$ chosen so that scores fall in $[0, 1]$.

Definition 8 (Score Normalization).

$$\alpha = \frac{1}{H_{\text{max}}}, \quad \beta = \frac{1}{c_{\text{reparent}} \cdot |V_q|}$$

where $H_{\text{max}}$ is the maximum tree height in the corpus.

4.2 Ultrametric Properties of the Ranking

Theorem 1 (Nested Result Clusters). The ranking produced by $\mathcal{R}$ forms nested, ultrametric clusters of results. For any three documents $d_1, d_2, d_3$ with scores $s_1 \leq s_2 \leq s_3$, the score gaps satisfy:

$$\max(s_3 - s_2, s_2 - s_1) \geq s_3 - s_1$$

In particular, there exist natural "cut points" where the score gap jumps, forming discrete result clusters rather than a continuous ranking spectrum.

Proof. The ultrametric property of the tree-alignment lattice $\mathcal{L}$ (0.2.md Theorem 1) guarantees that $D_{\text{match}}$ satisfies the strong triangle condition. Since scores are a linear transformation of distances, the score gaps inherit the ultrametric clustering property. $\square$

Corollary 1 (Natural Cut Points). The search system can identify "natural result page boundaries" by detecting where $\Delta s_i = s_i - s_{i+1}$ exceeds a threshold $\tau$. Results above the cut are in one ultrametric ball (high relevance); results below are in a different ball (low relevance).

4.3 Algorithm: Rank-and-Cluster

Algorithm RANK-AND-CLUSTER(G_q, D, k):
  Input:  Query graph G_q, document corpus D, result limit k
  Output: Ranked list of k document indices with cluster boundaries

  matches = []
  for each G_d in D:
    m = FIND-BEST-MATCH(G_q, G_d)      # Type I, II, or III
    if m is not None:
      matches.append((d.index, m.score, m.match_type, m.distance))

  sort matches by score descending

  # Detect cluster boundaries (natural cut points)
  clusters = []
  for i from 0 to min(k, len(matches))-1:
    if i > 0 and (matches[i-1].score - matches[i].score) > tau:
      mark_cluster_boundary(i)

  return matches[0:k], clusters

4.4 Time Complexity

Operation	Complexity	Notes
Subtree isomorphism (Type I)	$\mathcal{O}(	V_q
Partial match (Type II)	$\mathcal{O}(2^{	V_q
Approximate match (Type III)	$\mathcal{O}(	V_q
Ranking (all documents)	$\mathcal{O}(N \cdot M)$	$M$ = per-document match cost above
Offline index construction	$\mathcal{O}(\sum_{d} E_d)$	$E_d$ = effort to parse document $d$ into NST

5. Connection to Q-PNA Encoding

5.1 The Full Pipeline

The sub-graph matching search ($0.3, this document) is the retrieval layer in a two-layer architecture with Q-PNA providing the encoding layer:

┌──────────────────────────────────────────────────────────────┐
│  QUERY INTERFACE                                              │
│  Surface text (any language) or visual query builder          │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  Q-PNA ENCODING LAYER (§3.2, DOI: 10.5281/zenodo.20287742)   │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ Text → morpheme segmentation → semantic prime assignment │ │
│  │ → prime product P(t) = Π p_i^{f_i} → valuation vector   │ │
│  │ → Bruhat-Tits tree T_p leaf activation                  │ │
│  └─────────────────────────────────────────────────────────┘ │
│  Output: Nested Semantic Tree G (nodes, edges, heights, LCA)  │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  NSG INDEXING LAYER (this architecture)                       │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ G → NST structure → indexed in graph database           │ │
│  │  • Nodes: category, label, height                        │ │
│  │  • Edges: (parent, child, role) with scope semantics     │ │
│  │  • LCA precomputed for O(1) distance queries             │ │
│  └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬───────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────────┐
│  NSG MATCHING LAYER (this architecture)                       │
│  ┌─────────────────────────────────────────────────────────┐ │
│  │ G_q (query) → subtree isomorphism search over corpus     │ │
│  │ → FIND-BEST-MATCH → score computation → RANK-AND-CLUSTER │ │
│  └─────────────────────────────────────────────────────────┘ │
│  Output: Ranked document list with cluster boundaries         │
└──────────────────────────────────────────────────────────────┘

5.2 Interface Contract

Layer	Input	Output
Q-PNA Encoder	Raw text (any language)	Nested Semantic Tree $G$ with node labels, heights, LCA preprocessed
NSG Indexer	$G$ from Q-PNA	Indexed graph database entry
NSG Matcher	$G_q$ (query), indexed corpus	Ranked results

Contract invariant: The Q-PNA encoder guarantees that $G$ is a valid NST (rooted tree, height function monotone, LCA computable). The NSG matcher guarantees that ranking respects the ultrametric inequality.

5.3 Cross-Linguistic Property

The Q-PNA encoder is language-neutral: the same proposition encoded in English, Mohawk, or Turkish produces isomorphic NSTs (0.2.md Theorem 2). Consequently, the NSG matcher is also language-neutral — a query in any language matches against a corpus in any language, because both are represented in the same invariant graph space.

6. Walkthrough: A Complete Query

6.1 Query

User input (English): "dog bit man" User input (Mohawk equivalent): Single verb form encoding the same proposition

6.2 Step 1 — Q-PNA Encoding

The surface text (either language) is parsed into a nested semantic tree:

                 BITE (h=3.0, ACTION, root)
                 /           \
          dog (ENTITY)    man (ENTITY)
          (h=0.0, agent)  (h=0.0, patient)

Token encoding:

dog → semantic prime 2, $P = 2^1 = 2$, $\vec{v} = (1)$
man → semantic prime 2, $P = 2^2 = 4$, $\vec{v} = (2)$
BITE → semantic prime 5, $P = 5^3 = 125$, $\vec{v} = (3)$

6.3 Step 2 — NSG Indexing

The graph is indexed with:

Nodes: bite (cat=ACTION, h=3.0), dog (cat=ENTITY, h=0.0, role=agent), man (cat=ENTITY, h=0.0, role=patient)
Edges: dog → bite [agent], man → bite [patient]
LCA table: $\text{lca}(dog, man) = bite$, $\text{lca}(dog, bite) = bite$, etc.

6.4 Step 3 — Matching

For each document $G_d^{(i)}$ in the corpus, FIND-BEST-MATCH attempts:

Type I: Can we find an exact subtree isomorphism mapping {bite, dog, man} to nodes in $G_d^{(i)}$?
- If the document contains a BITE event with dog-agent and man-patient → exact match, score = 1.0, distance = 0
- If the document contains a BITE event but with dog-agent and cat-patient → partial match (man unmatched), score ≈ 0.67
- If the document contains a CHASE event → approximate match via tree edit (relabel CHASE→BITE, then check arguments)
Type II (if Type I fails): Compute maximum partial match:
- Matched: {bite, dog} (2 of 3 nodes) → coverage = 2/3, distance = 0 → score = 0.67
Type III (if Type II fails): Compute minimum edit distance:
- If the document has {bite, cat, man} → relabel cat→dog (cost 1) → match achieved → score = $1 - \beta \cdot 1$

6.5 Step 4 — Ranking

Results sorted by score. A result cluster might look like:

Rank	Doc ID	Score	Match Type	Distance
1	42	1.00	Type I	0.0
2	17	1.00	Type I	0.0
3	88	0.95	Type I	0.5
----------	CUT	---	---	---
4	5	0.67	Type II	0.0
5	31	0.67	Type II	0.0
----------	CUT	---	---	---
6	12	0.33	Type III	—
7	73	0.33	Type III	—

Results 1-3 form one ultrametric cluster (exact or near-exact semantic matches). Results 4-5 form a second cluster (partial matches, missing some concepts). Results 6-7 form a third cluster (approximate matches via edit distance).

7. Implementation Considerations

7.1 Index Structure

The graph database index should support:

Node lookup by label and category — $\mathcal{O}(1)$ hash-based
Edge traversal — parent/child navigation, $\mathcal{O}(b)$ per level
Precomputed LCA — binary lifting table, $\mathcal{O}(\log n)$ query
Height-indexed nodes — range queries for "all nodes at height $\leq h$"

7.2 Query Optimization

Optimization	Description
Node pruning	If a document has no ACTION node, skip — no event can match
Category filtering	Build inverted index of $\text{category} \to \text{node_ids}$ for fast candidate generation
Height bounding	Only examine document nodes with height $\geq$ query root height
Edge role indexing	Pre-index agent/patient/modifier edges for fast structural verification
LCA caching	All-pairs LCA for each document precomputed at indexing time

7.3 Scalability

Corpus Size	Strategy
$\leq 10^4$ documents	In-memory graph database with exhaustive matching
$10^4 - 10^6$ documents	Sharded graph database with candidate filtering before matching
$\geq 10^6$ documents	Distributed index with approximate nearest-neighbor in ultrametric space + re-ranking

8. Limitations and Future Work

Limitation	Mitigation
Subtree isomorphism is NP-complete in general	Semantic trees have bounded degree and small depth; practical instances are tractable
Partial matching requires enumerating subsets	Greedy approximation with bounded backtracking
Tree edit distance is cubic in node count	Bounded by query size, which is small ($\leq 20$ nodes)
Q-PNA integration requires a working encoder	Use the `ultrametric-ai-poc` codebase as the initial encoder
No handling of negation or modality	Future work: extend node categories to include ${\text{NEG}, \text{MODAL}}$
Cross-linguistic equivalence is assumed	Verified for simple examples; needs large-scale validation
The tree-alignment lattice is not implemented	Theoretical definition; practical implementation uses simpler distance measures
Real-time query latency not benchmarked	Prototype stage; benchmarking planned for S4

References

0.1.md — Internal Literature Review: The Ultrametric-Language-AI Research Program
0.2.md — Formal Definitions: The Nested Semantic Graph
Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374)
Q-PNA Research Specification v2.0 $\S$2-3 (DOI: 10.5281/zenodo.20287742)
Tree Distance Cophenetic $\S$2 (DOI: 10.5281/zenodo.20213043)
ultrametric-ai-poc — Working proof-of-concept (github.com/rwnq8/ultrametric-ai-poc)

Sub-Graph Matching Search Specification v0.3 — Operationalizes the formalism from 0.2.md into a concrete search architecture. Python prototype in S4 (0.4.py).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sub-Graph Matching Search: Formal Specification

Query Graphs, Matching Criteria, and Ultrametric Ranking on Nested Semantic Trees

1. Introduction

2. The Search Problem

2.1 Formal Statement

2.2 Scope and Assumptions

3. Matching Criteria: Three Tiers

3.1 Type I: Exact Subtree Isomorphism

3.2 Type II: Partial Match with Distance

3.3 Type III: Approximate Match via Tree Edit Distance

4. Ranking Algorithm

4.1 The Ranking Function

4.2 Ultrametric Properties of the Ranking

4.3 Algorithm: Rank-and-Cluster

4.4 Time Complexity

5. Connection to Q-PNA Encoding

5.1 The Full Pipeline

5.2 Interface Contract

5.3 Cross-Linguistic Property

6. Walkthrough: A Complete Query

6.1 Query

6.2 Step 1 — Q-PNA Encoding

6.3 Step 2 — NSG Indexing

6.4 Step 3 — Matching

6.5 Step 4 — Ranking

7. Implementation Considerations

7.1 Index Structure

7.2 Query Optimization

7.3 Scalability

8. Limitations and Future Work

References

FilesExpand file tree

0.3.md

Latest commit

History

0.3.md

File metadata and controls

Sub-Graph Matching Search: Formal Specification

Query Graphs, Matching Criteria, and Ultrametric Ranking on Nested Semantic Trees

1. Introduction

2. The Search Problem

2.1 Formal Statement

2.2 Scope and Assumptions

3. Matching Criteria: Three Tiers

3.1 Type I: Exact Subtree Isomorphism

3.2 Type II: Partial Match with Distance

3.3 Type III: Approximate Match via Tree Edit Distance

4. Ranking Algorithm

4.1 The Ranking Function

4.2 Ultrametric Properties of the Ranking

4.3 Algorithm: Rank-and-Cluster

4.4 Time Complexity

5. Connection to Q-PNA Encoding

5.1 The Full Pipeline

5.2 Interface Contract

5.3 Cross-Linguistic Property

6. Walkthrough: A Complete Query

6.1 Query

6.2 Step 1 — Q-PNA Encoding

6.3 Step 2 — NSG Indexing

6.4 Step 3 — Matching

6.5 Step 4 — Ranking

7. Implementation Considerations

7.1 Index Structure

7.2 Query Optimization

7.3 Scalability

8. Limitations and Future Work

References