Version: 0.3
Date: 2026-05-22
Status: Formal specification
Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions)
Grounded in: Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374), Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742)
This document provides the formal specification for the sub-graph matching search architecture — the computational core of the Nested Semantic Graph (NSG) project. Where 0.2.md established the formal definitions (nodes, edges, ultrametric distance, tree-alignment lattice), this document specifies how those definitions are operationalized for information retrieval.
The search architecture described here is the engineering complement to:
- Few Become One $\S$V.C — the conceptual proposal for sub-graph matching search
- Q-PNA $\S$2-3 — the neural architecture providing the encoding/representation layer
All definitions in 0.2.md are prerequisite reading.
Definition 1 (Sub-Graph Search Problem). Given:
- A query graph $G_q$, expressed as a nested semantic tree
- A document corpus $\mathcal{D} = \{G_d^{(1)}, G_d^{(2)}, \ldots, G_d^{(N)}\}$, where each $G_d^{(i)}$ is a nested semantic tree representing the meaning of document $i$
- A matching criterion $\mathcal{M}$ specifying how $G_q$ relates to $G_d^{(i)}$
- A ranking function $\mathcal{R}: \mathcal{D} \to \mathbb{R}^N$ assigning a relevance score to each document
Find: The ordered list
| Assumption | Justification |
|---|---|
| All graphs are nested semantic trees (NSTs) | Per 0.2.md Definition 10 — the tree constraint is fundamental. Non-tree semantic relationships are represented via secondary indexing. |
| Matching is subtree-based | The query |
| Node labels are semantic categories | Per 0.2.md Definition 3: categories are language-neutral |
| Edge roles are preserved in matching | Agent/patient distinctions matter — structural equivalence requires edge-label preservation |
| The corpus is offline-indexed | Document graphs are pre-computed and stored; queries are processed online |
Definition 2 (Exact Match). A query graph $G_q = (V_q, E_q)$ with root $r_q$ exactly matches a document graph $G_d = (V_d, E_d)$ with root $r_d$ if there exists an injective mapping $\phi: V_q \to V_d$ such that:
- Label preservation: $\lambda(\phi(v)) = \lambda(v)$ for all $v \in V_q$
- Edge preservation: For every edge $a \to b$ in $E_q$, there is an edge $\phi(a) \to \phi(b)$ in $E_d$
- Root preservation: $\phi(r_q)$ is in the subtree of $r_d$ (i.e., the query root maps to a node that is part of the document's event structure)
- Structural equivalence: The induced subgraph on $\phi(V_q)$ is isomorphic to $G_q$
Notation: We write
Interpretation: The proposition encoded by the query is fully contained within the proposition encoded by the document. All conceptual primitives are present with the correct scope relationships.
Example. Query: "dog bites man" → graph with BITE(root), dog(agent), man(patient). Document: "The large brown dog viciously bit the old man in the park yesterday" → graph with BITE(root), dog(agent)+modifiers, man(patient)+modifiers, park(locative), yesterday(tense). The query exactly matches the document: BITE→BITE, dog→dog, man→man, all edges preserved.
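The exact-match test can be illustrated with a small backtracking search. This is a minimal sketch, not the project's matcher: the `Node` class and the functions `assign`, `embeds_at`, and `exact_match` are hypothetical names, edge roles are stored on the child node, and extra document children (modifiers, locatives) are simply ignored, as in the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str                      # semantic label, e.g. "BITE" or "dog"
    role: str = "root"              # role of the edge to the parent (agent, patient, ...)
    children: List["Node"] = field(default_factory=list)


def assign(q_children: List[Node], d_children: List[Node]) -> bool:
    """Backtracking search for an injective, role-preserving child assignment."""
    if not q_children:
        return True
    first, rest = q_children[0], q_children[1:]
    for i, cand in enumerate(d_children):
        if cand.role == first.role and embeds_at(first, cand):
            # Each document child may be used at most once (injectivity).
            if assign(rest, d_children[:i] + d_children[i + 1:]):
                return True
    return False


def embeds_at(q: Node, d: Node) -> bool:
    """True if the query subtree rooted at q maps onto d with labels preserved."""
    return q.label == d.label and assign(q.children, d.children)


def exact_match(q_root: Node, d_root: Node) -> bool:
    """Try every document node as the image of the query root."""
    stack = [d_root]
    while stack:
        node = stack.pop()
        if embeds_at(q_root, node):
            return True
        stack.extend(node.children)
    return False


query = Node("BITE", children=[Node("dog", "agent"), Node("man", "patient")])
document = Node("BITE", children=[
    Node("dog", "agent", [Node("large", "modifier"), Node("brown", "modifier")]),
    Node("man", "patient", [Node("old", "modifier")]),
    Node("park", "locative"),
    Node("yesterday", "tense"),
])
print(exact_match(query, document))  # True
```

Swapping the roles in the document (man as agent, dog as patient) makes the same call return `False`, which is the behaviour the edge-role assumption asks for.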
Definition 3 (Partial Match). A query graph
where
Definition 4 (Match Distance). For a partial match with mapping
where 0.2.md Definition 16).
Interpretation: Partial matches are ranked by two criteria: (1) what fraction of the query was matched (coverage), and (2) how much the matching subtree structurally diverges from the query (distance). A match covering 80% of query nodes with distance 0 ranks above a match covering 100% of query nodes with distance 5.
Definition 5 (Approximate Match). When no Type I or Type II match exists, an approximate match is found by computing the minimum tree edit distance between $G_q$ and a subtree of $G_d$, using the following edit operations:
- Relabel: Change a node's label to match the query (cost $c_{\text{relabel}} = 1$)
- Delete: Remove a node from the query (cost $c_{\text{delete}} = 2$)
- Insert: Add a node to the document subtree to match the query (cost $c_{\text{insert}} = 2$)
- Re-parent: Move a node to a different parent (cost $c_{\text{reparent}} = 3$)
Definition 6 (Edit Distance Score). The edit distance score is:
where lower scores indicate better matches (0 = exact match).
Interpretation: Edit distance provides a fallback when structural differences are too large for partial matching. It allows the system to retrieve documents that are "close in meaning" even when the tree structures differ — e.g., a query about "dog chasing cat" might match a document about "cat fleeing dog" through relabel operations.
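To make the cost model of Definition 5 concrete, the sketch below tallies the cost of one given edit script. It does not search for the minimum-cost script, which is what Type III matching actually requires; the function name `edit_script_cost` and the example script are hypothetical.

```python
# Operation costs as listed in Definition 5.
COSTS = {"relabel": 1, "delete": 2, "insert": 2, "reparent": 3}


def edit_script_cost(script):
    """Sum the per-operation costs over a sequence of (operation, node) pairs."""
    total = 0
    for op, _node in script:
        if op not in COSTS:
            raise ValueError(f"unknown edit operation: {op}")
        total += COSTS[op]
    return total


# Hypothetical script: two relabels, one insert, one re-parent.
script = [
    ("relabel", "CHASE"),
    ("relabel", "cat"),
    ("insert", "park"),
    ("reparent", "yesterday"),
]
print(edit_script_cost(script))  # 1 + 1 + 2 + 3 = 7
```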
Definition 7 (Ranking Function). For a query
where the score function is:
with normalization constants
Definition 8 (Score Normalization).
where
Theorem 1 (Nested Result Clusters). The ranking produced by
In particular, there exist natural "cut points" where the score gap jumps, forming discrete result clusters rather than a continuous ranking spectrum.
Proof. The ultrametric property of the tree-alignment lattice (0.2.md Theorem 1) guarantees that
Corollary 1 (Natural Cut Points). The search system can identify "natural result page boundaries" by detecting where

    Algorithm RANK-AND-CLUSTER(G_q, D, k, tau):
      Input:  Query graph G_q, document corpus D, result limit k, score-gap threshold tau
      Output: Top-k (index, score, match_type, distance) entries and cluster boundary positions

      matches = []
      for each (i, G_d) in enumerate(D):
          m = FIND-BEST-MATCH(G_q, G_d)      # Type I, II, or III
          if m is not None:
              matches.append((i, m.score, m.match_type, m.distance))

      sort matches by score descending
      top = matches[0:k]

      # Detect cluster boundaries (natural cut points)
      clusters = []
      for j from 1 to len(top)-1:
          if top[j-1].score - top[j].score > tau:
              clusters.append(j)             # a new cluster starts at position j

      return top, clusters

| Operation | Complexity | Notes |
|---|---|---|
| Subtree isomorphism (Type I) | $\mathcal{O}( | V_q |
| Partial match (Type II) | $\mathcal{O}(2^{ | V_q |
| Approximate match (Type III) | $\mathcal{O}( | V_q |
| Ranking (all documents) |
|
|
| Offline index construction |
|
The sub-graph matching search (0.3, this document) is the retrieval layer in a two-layer architecture, with Q-PNA providing the encoding layer:
┌──────────────────────────────────────────────────────────────┐
│ QUERY INTERFACE │
│ Surface text (any language) or visual query builder │
└──────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Q-PNA ENCODING LAYER (§3.2, DOI: 10.5281/zenodo.20287742) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Text → morpheme segmentation → semantic prime assignment │ │
│ │ → prime product P(t) = Π p_i^{f_i} → valuation vector │ │
│ │ → Bruhat-Tits tree T_p leaf activation │ │
│ └─────────────────────────────────────────────────────────┘ │
│ Output: Nested Semantic Tree G (nodes, edges, heights, LCA) │
└──────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ NSG INDEXING LAYER (this architecture) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ G → NST structure → indexed in graph database │ │
│ │ • Nodes: category, label, height │ │
│ │ • Edges: (parent, child, role) with scope semantics │ │
│ │ • LCA precomputed for O(1) distance queries │ │
│ └─────────────────────────────────────────────────────────┘ │
└──────────────────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ NSG MATCHING LAYER (this architecture) │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ G_q (query) → subtree isomorphism search over corpus │ │
│ │ → FIND-BEST-MATCH → score computation → RANK-AND-CLUSTER │ │
│ └─────────────────────────────────────────────────────────┘ │
│ Output: Ranked document list with cluster boundaries │
└──────────────────────────────────────────────────────────────┘

| Layer | Input | Output |
|---|---|---|
| Q-PNA Encoder | Raw text (any language) | Nested Semantic Tree |
| NSG Indexer | Nested Semantic Tree $G$ | Indexed graph database entry |
| NSG Matcher | Query tree $G_q$ and indexed corpus | Ranked results |
Contract invariant: The Q-PNA encoder guarantees that
The Q-PNA encoder is language-neutral: the same proposition encoded in English, Mohawk, or Turkish produces isomorphic NSTs (0.2.md Theorem 2). Consequently, the NSG matcher is also language-neutral — a query in any language matches against a corpus in any language, because both are represented in the same invariant graph space.
- User input (English): "dog bit man"
- User input (Mohawk equivalent): Single verb form encoding the same proposition
The surface text (either language) is parsed into a nested semantic tree:

            BITE (h=3.0, ACTION, root)
            /                    \
      dog (ENTITY)          man (ENTITY)
    (h=0.0, agent)        (h=0.0, patient)

Token encoding:
- `dog` → semantic prime 2, $P = 2^1 = 2$, $\vec{v} = (1)$
- `man` → semantic prime 2, $P = 2^2 = 4$, $\vec{v} = (2)$
- `BITE` → semantic prime 5, $P = 5^3 = 125$, $\vec{v} = (3)$
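The token encodings above can be reproduced with a few lines of integer arithmetic. This is a sketch under assumptions: the prime and exponent assigned to each token are taken from the example, the valuation vector is assumed to list the exponent of each of the token's own primes, and the function names `prime_product`, `valuation`, and `valuation_vector` are hypothetical rather than part of the Q-PNA specification.

```python
def prime_product(factors):
    """P(t) = product of p_i ** f_i over (prime, exponent) pairs."""
    product = 1
    for p, f in factors:
        product *= p ** f
    return product


def valuation(n, p):
    """p-adic valuation of a positive integer n: the exponent of p in n."""
    count = 0
    while n % p == 0:
        n //= p
        count += 1
    return count


def valuation_vector(n, primes):
    """Exponents of the given primes in n, in order."""
    return tuple(valuation(n, p) for p in primes)


# (prime, exponent) assignments copied from the worked example.
tokens = {"dog": [(2, 1)], "man": [(2, 2)], "BITE": [(5, 3)]}

for name, factors in tokens.items():
    P = prime_product(factors)
    primes = [p for p, _ in factors]
    print(name, P, valuation_vector(P, primes))
```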
The graph is indexed with:
- Nodes: `bite` (cat=ACTION, h=3.0), `dog` (cat=ENTITY, h=0.0, role=agent), `man` (cat=ENTITY, h=0.0, role=patient)
- Edges: `dog → bite [agent]`, `man → bite [patient]`
- LCA table: $\text{lca}(dog, man) = bite$, $\text{lca}(dog, bite) = bite$, etc.
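The index entry for this example can be sketched with plain dictionaries. The layout below (a node table, a list of `(child, parent, role)` edges, and an all-pairs LCA table) is an assumption made for illustration; the actual graph database schema is not fixed by this section, and `build_lca_table` is a hypothetical helper that assumes all nodes share one root.

```python
nodes = {
    "bite": {"cat": "ACTION", "h": 3.0, "role": None},
    "dog": {"cat": "ENTITY", "h": 0.0, "role": "agent"},
    "man": {"cat": "ENTITY", "h": 0.0, "role": "patient"},
}
edges = [("dog", "bite", "agent"), ("man", "bite", "patient")]  # (child, parent, role)


def build_lca_table(nodes, edges):
    """All-pairs lowest common ancestor by walking parent chains (fine for small trees)."""
    parent = {child: par for child, par, _role in edges}

    def ancestors(n):
        # Chain from n up to the root, n itself included.
        chain = [n]
        while chain[-1] in parent:
            chain.append(parent[chain[-1]])
        return chain

    table = {}
    for a in nodes:
        chain_a = ancestors(a)
        for b in nodes:
            chain_b = set(ancestors(b))
            # First ancestor of a that is also an ancestor of b.
            table[(a, b)] = next(x for x in chain_a if x in chain_b)
    return table


lca_table = build_lca_table(nodes, edges)
print(lca_table[("dog", "man")])   # bite
print(lca_table[("dog", "bite")])  # bite
```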
For each document $G_d^{(i)}$ in the corpus, FIND-BEST-MATCH attempts:
- Type I: Can we find an exact subtree isomorphism mapping `{bite, dog, man}` to nodes in $G_d^{(i)}$?
  - If the document contains a BITE event with dog-agent and man-patient → exact match, score = 1.0, distance = 0
  - If the document contains a BITE event but with dog-agent and cat-patient → partial match (man unmatched), score ≈ 0.67
  - If the document contains a CHASE event → approximate match via tree edit (relabel CHASE→BITE, then check arguments)
- Type II (if Type I fails): Compute maximum partial match:
  - Matched: `{bite, dog}` (2 of 3 nodes) → coverage = 2/3, distance = 0 → score = 0.67
- Type III (if Type II fails): Compute minimum edit distance:
  - If the document has `{bite, cat, man}` → relabel `cat→dog` (cost 1) → match achieved → score = $1 - \beta \cdot 1$
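The Type I → Type II → Type III cascade can be sketched on a deliberately simplified representation. The code below assumes one-level event frames (an event label plus role-keyed arguments) instead of full nested trees, a hypothetical weight `BETA = 0.2` (this section does not fix $\beta$), and a Type III step that only relabels or deletes; `find_best_match`, `Match`, and `slots` are illustrative names, not the prototype's API.

```python
from dataclasses import dataclass
from typing import Optional

BETA = 0.2        # hypothetical weight on edit cost; beta is not fixed in this section
COST_RELABEL = 1  # Definition 5
COST_DELETE = 2   # Definition 5


@dataclass
class Match:
    match_type: str
    score: float
    distance: float


def slots(frame: dict) -> dict:
    """Flatten a one-level event frame into {slot: label}, with the event under 'root'."""
    return {"root": frame["event"], **frame["args"]}


def find_best_match(query: dict, doc: dict) -> Optional[Match]:
    q, d = slots(query), slots(doc)
    matched = [s for s in q if d.get(s) == q[s]]
    if len(matched) == len(q):      # Type I: every query node is present
        return Match("Type I", 1.0, 0.0)
    if matched:                     # Type II: score by coverage of query nodes
        return Match("Type II", round(len(matched) / len(q), 2), 0.0)
    # Type III: relabel slots the document fills, delete query slots it lacks.
    cost = sum(COST_RELABEL if s in d else COST_DELETE for s in q)
    score = round(1 - BETA * cost, 2)
    # Assumption: the distance field carries the edit cost for Type III matches.
    return Match("Type III", score, float(cost)) if score > 0 else None


query = {"event": "BITE", "args": {"agent": "dog", "patient": "man"}}
exact_doc = {"event": "BITE", "args": {"agent": "dog", "patient": "man", "locative": "park"}}
partial_doc = {"event": "BITE", "args": {"agent": "dog", "patient": "cat"}}
approx_doc = {"event": "CHASE", "args": {"agent": "cat", "patient": "mouse"}}

for doc in (exact_doc, partial_doc, approx_doc):
    print(find_best_match(query, doc))
```

In this sketch a document that shares at least one node with the query is always scored as Type II, so Type III is reached only when nothing matches; whether the cascade should instead compare Type II and Type III scores is a design choice this section leaves open.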
Results sorted by score. A result cluster might look like:
| Rank | Doc ID | Score | Match Type | Distance |
|---|---|---|---|---|
| 1 | 42 | 1.00 | Type I | 0.0 |
| 2 | 17 | 1.00 | Type I | 0.0 |
| 3 | 88 | 0.95 | Type I | 0.5 |
| ---------- | CUT | --- | --- | --- |
| 4 | 5 | 0.67 | Type II | 0.0 |
| 5 | 31 | 0.67 | Type II | 0.0 |
| ---------- | CUT | --- | --- | --- |
| 6 | 12 | 0.33 | Type III | — |
| 7 | 73 | 0.33 | Type III | — |
Results 1-3 form one ultrametric cluster (exact or near-exact semantic matches). Results 4-5 form a second cluster (partial matches, missing some concepts). Results 6-7 form a third cluster (approximate matches via edit distance).
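The cut points in this table can be recovered by scanning the sorted scores for gaps larger than a threshold. The sketch below uses the scores from the table and a hypothetical threshold `tau = 0.1`; the function name `split_clusters` is illustrative.

```python
def split_clusters(ranked, tau):
    """Split a score-descending list of (doc_id, score) pairs wherever the gap exceeds tau."""
    if not ranked:
        return []
    clusters, current = [], [ranked[0]]
    for prev, item in zip(ranked, ranked[1:]):
        if prev[1] - item[1] > tau:
            clusters.append(current)   # close the current cluster at a natural cut point
            current = []
        current.append(item)
    clusters.append(current)
    return clusters


ranked = [(42, 1.00), (17, 1.00), (88, 0.95), (5, 0.67), (31, 0.67), (12, 0.33), (73, 0.33)]
for cluster in split_clusters(ranked, tau=0.1):
    print([doc_id for doc_id, _score in cluster])
# [42, 17, 88]
# [5, 31]
# [12, 73]
```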
The graph database index should support:
- Node lookup by label and category: $\mathcal{O}(1)$, hash-based
- Edge traversal: parent/child navigation, $\mathcal{O}(b)$ per level
- Precomputed LCA: binary lifting table, $\mathcal{O}(\log n)$ query
- Height-indexed nodes: range queries for "all nodes at height $\leq h$"
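The binary lifting table named in the list above can be sketched as follows. This is a generic textbook construction, not code from the prototype; the tree is assumed to be given as a child → parent dictionary with a known root, and the node names reuse the modifiers from the earlier "large brown dog" example.

```python
def build_lifting(parent, root):
    """Return (up, depth), where up[j][v] is the 2**j-th ancestor of v (capped at the root)."""
    children = {}
    for v, p in parent.items():
        children.setdefault(p, []).append(v)
    depth, order = {root: 0}, [root]
    for v in order:                      # breadth-first traversal; order grows as we go
        for c in children.get(v, []):
            depth[c] = depth[v] + 1
            order.append(c)
    levels = max(1, max(depth.values()).bit_length())
    up = [{v: parent.get(v, root) for v in order}]
    for j in range(1, levels):
        up.append({v: up[j - 1][up[j - 1][v]] for v in order})
    return up, depth


def lca(u, v, up, depth):
    """Lowest common ancestor in O(log n) using the lifting table."""
    if depth[u] < depth[v]:
        u, v = v, u
    diff = depth[u] - depth[v]
    for j in range(len(up)):             # lift the deeper node to the same depth
        if (diff >> j) & 1:
            u = up[j][u]
    if u == v:
        return u
    for j in reversed(range(len(up))):   # lift both while their ancestors differ
        if up[j][u] != up[j][v]:
            u, v = up[j][u], up[j][v]
    return up[0][u]


parent = {"dog": "bite", "man": "bite", "large": "dog", "brown": "dog", "old": "man"}
up, depth = build_lifting(parent, "bite")
print(lca("large", "brown", up, depth))  # dog
print(lca("large", "old", up, depth))    # bite
```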
| Optimization | Description |
|---|---|
| Node pruning | If a document has no ACTION node, skip — no event can match |
| Category filtering | Build inverted index of |
| Height bounding | Only examine document nodes with height |
| Edge role indexing | Pre-index agent/patient/modifier edges for fast structural verification |
| LCA caching | All-pairs LCA for each document precomputed at indexing time |
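Category filtering can be sketched as an inverted index consulted before any structural matching. The key choice here, `(category, label)` pairs mapped to document ids, is an assumption, as are the function names and the toy corpus contents.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """Map each (category, label) pair to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc_nodes in corpus.items():
        for category, label in doc_nodes:
            index[(category, label)].add(doc_id)
    return index


def exact_candidates(index, query_nodes):
    """Documents containing every query node; all others cannot be Type I matches."""
    sets = [index.get(key, set()) for key in query_nodes]
    return set.intersection(*sets) if sets else set()


# Hypothetical corpus: document id -> list of (category, label) node keys.
corpus = {
    42: [("ACTION", "BITE"), ("ENTITY", "dog"), ("ENTITY", "man"), ("ENTITY", "park")],
    5: [("ACTION", "BITE"), ("ENTITY", "dog"), ("ENTITY", "cat")],
    12: [("ACTION", "CHASE"), ("ENTITY", "cat"), ("ENTITY", "mouse")],
}
index = build_inverted_index(corpus)
query_nodes = [("ACTION", "BITE"), ("ENTITY", "dog"), ("ENTITY", "man")]
print(exact_candidates(index, query_nodes))  # {42}
```

Intersecting the posting sets only prunes for exact matching; a partial-match pass would need a union or a minimum-overlap count instead.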
| Corpus Size | Strategy |
|---|---|
| | In-memory graph database with exhaustive matching |
| | Sharded graph database with candidate filtering before matching |
| | Distributed index with approximate nearest-neighbor in ultrametric space + re-ranking |
| Limitation | Mitigation |
|---|---|
| Subgraph isomorphism is NP-complete in general | Restricting both query and document to trees makes the problem polynomial-time; semantic trees also have bounded degree and small depth, so practical instances are tractable |
| Partial matching requires enumerating subsets | Greedy approximation with bounded backtracking |
| Tree edit distance is cubic in node count | Bounded by query size, which is small ( |
| Q-PNA integration requires a working encoder | Use the ultrametric-ai-poc codebase as the initial encoder |
| No handling of negation or modality | Future work: extend node categories to include |
| Cross-linguistic equivalence is assumed | Verified for simple examples; needs large-scale validation |
| The tree-alignment lattice is not implemented | Theoretical definition; practical implementation uses simpler distance measures |
| Real-time query latency not benchmarked | Prototype stage; benchmarking planned for S4 |
- `0.1.md`, Internal Literature Review: The Ultrametric-Language-AI Research Program
- `0.2.md`, Formal Definitions: The Nested Semantic Graph
- Few Become One $\S$V.C (DOI: 10.5281/zenodo.20328374)
- Q-PNA Research Specification v2.0 $\S$2-3 (DOI: 10.5281/zenodo.20287742)
- Tree Distance Cophenetic $\S$2 (DOI: 10.5281/zenodo.20213043)
- `ultrametric-ai-poc`, working proof-of-concept (github.com/rwnq8/ultrametric-ai-poc)
Sub-Graph Matching Search Specification v0.3 — Operationalizes the formalism from 0.2.md into a concrete search architecture. Python prototype in S4 (0.4.py).