Version: 0.6
Date: 2026-05-22
Status: Architecture specification
Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions), 0.3.md (Search Spec), 0.4.md (Examples), 0.5.py (Prototype)
Grounded in: Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742), Few Become One $\S$V (DOI: 10.5281/zenodo.20328374)
This document specifies the complete computational pathway for the Nested Semantic Graph (NSG) sub-graph matching search system — the end-to-end architecture that converts surface text in any language into ranked, language-neutral search results. It is the synthesis document: it connects the formal definitions (0.2.md), the search specification (0.3.md), the cross-linguistic examples (0.4.md), and the working Python prototype (0.5.py) into a single, coherent architecture, mapped against the existing Q-PNA neural architecture and the broader ultrametric-language research corpus.
The document serves three purposes:
- Integration: Shows how all components fit together in a working system
- Feasibility Assessment: Evaluates what exists, what needs to be built, and the challenges
- Roadmap: Provides a phased implementation plan from prototype to production
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 1: INPUT │
│ Surface text (any language) or visual query builder │
│ "The dog bit the man yesterday" │
│ "Köpek adamı dün ısırdı" │
│ "Wahonwa'kahrá:ko' tsi'niyohseraká:te'" │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 2: MORPHOLOGICAL ANALYSIS │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Language-specific morphological analyzer │ │
│ │ • English: whitespace tokenization + lemmatization │ │
│ │ • Turkish: morpheme segmentation (agglutinative chains) │ │
│ │ • Mohawk: morpheme decomposition (polysynthetic fusion) │ │
│ │ Output: Morpheme sequence with grammatical annotations │ │
│ └───────────────────────────────────────────────────────────┘ │
│ Status: Partially exists — language-specific tools available │
│ for some languages (Inuktitut, Cree, Turkish); not Mohawk │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 3: SEMANTIC PARSING (Q-PNA ENCODER) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ Q-PNA §3.2 Token Encoding Pipeline │ │
│ │ Step 1: Semantic prime assignment per morpheme │ │
│ │ Step 2: Prime product P(t) = Π p_i^{f_i(t)} │ │
│ │ Step 3: Valuation vector v(t) = (v_{p1}(P), v_{p2}(P),…) │ │
│ │ Step 4: Leaf activation on Bruhat-Tits tree │ │
│ │ → Build Nested Semantic Tree: nodes, edges, heights, LCA │ │
│ └───────────────────────────────────────────────────────────┘ │
│ Status: Specified (Q-PNA §2-3). Working PoC: ultrametric-ai-poc│
│ uses WordNet hypernyms for semantic prime assignment. │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 4: GRAPH INDEXING (NSG INDEXER) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ NST → Graph database index │ │
│ │ • Nodes indexed by category and label (O(1) lookup) │ │
│ │ • Edges indexed by parent/child for traversal │ │
│ │ • LCA table precomputed via binary lifting (O(log n)) │ │
│ │ • Height-indexed nodes for range queries │ │
│ │ • Node category inverted index for candidate generation │ │
│ └───────────────────────────────────────────────────────────┘ │
│ Status: Specified (0.3.md §7.1). Prototype (0.5.py) in-memory.│
│ Production: graph database (Neo4j, ArangoDB) or custom index. │
└───────────────────────────┬─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────┐
│ STAGE 5: QUERY MATCHING & RANKING (NSG MATCHER) │
│ ┌───────────────────────────────────────────────────────────┐ │
│ │ For query graph G_q: │ │
│ │ 1. Candidate filtering: category indexes, height bounds │ │
│ │ 2. Type I: Exact subtree isomorphism search │ │
│ │ 3. Type II: Partial match with coverage scoring │ │
│ │ 4. Type III: Tree edit distance (fallback) │ │
│ │ 5. RANK-AND-CLUSTER: sort by score, detect cluster cuts │ │
│ │ → Ranked, clustered results │ │
│ └───────────────────────────────────────────────────────────┘ │
│ Status: Specified (0.3.md). Prototype (0.5.py) working for │
│ small graphs. Production needs scaling to corpus size. │
└─────────────────────────────────────────────────────────────────┘Query (any language) Document Corpus (any language)
│ │
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Morphological│ │ Morphological│
│ Analyzer │ │ Analyzer │
│ (Stage 2) │ │ (Stage 2) │
└──────┬───────┘ └──────┬───────┘
│ morpheme seq │ morpheme seq
▼ ▼
┌──────────────┐ ┌──────────────┐
│ Q-PNA Encoder│ │ Q-PNA Encoder│
│ (Stage 3) │ │ (Stage 3) │
└──────┬───────┘ └──────┬───────┘
│ NST G_q │ NSTs {G_d}
▼ ▼
│ ┌──────────────┐
│ │ NSG Indexer │
│ │ (Stage 4) │
│ └──────┬───────┘
│ │ indexed graphs
│ ▼
│ ┌──────────────┐
└──────────────────────────▶│ NSG Matcher │
│ (Stage 5) │
└──────┬───────┘
│ ranked results
▼
┌──────────────┐
│ Results │
│ (clustered) │
└──────────────┘| Stage | Mode | Latency Budget | Complexity |
|---|---|---|---|
| Morphological Analysis | Online (query) / Offline (documents) | <100ms per query | |
| Q-PNA Encoding | Online (query) / Offline (documents) | <50ms per morpheme | |
| Graph Indexing | Offline (documents only) | Batch | |
| Query Matching | Online | <200ms per query | See $\S$5 below |
For the production system, documents are pre-processed through Stages 2-4 offline. At query time, only the query goes through Stages 2-3, and then Stage 5 matches the query graph against the pre-indexed corpus. This provides sub-second query latency for corpora up to
| Aspect | Status | Details |
|---|---|---|
| English | SOLVED | Whitespace tokenization + lemmatization (NLTK, spaCy). Trivial. |
| Turkish | FEASIBLE | Turkish has mature morphological analyzers (TRmorph, Zemberek). Agglutinative chains are transparently segmentable. |
| Mohawk / Polysynthetic | CHALLENGING | No production-ready morphological analyzer exists for Mohawk. Academic tools exist for Inuktitut and Cree. Mohawk would require a custom Finite State Transducer (FST) trained on annotated data. |
| General solution | RESEARCH | A universal morphological analyzer for arbitrary polysynthetic languages does not exist. This is the primary bottleneck for Stage 2. |
Mitigation: The prototype (0.5.py) bypasses Stage 2 entirely by using hardcoded tree builders — the trees are constructed directly from semantic specifications. For production, morphological analysis is the language-specific front-end that must be built per language. This is consistent with the architectural principle: the tree is invariant; only the linearization → parsing path differs per language.
Effort estimate: 3-6 months for a Mohawk morphological analyzer (FST-based), assuming access to annotated data and native speaker expertise. 1-2 months for Turkish (leverage existing tools). 1 week for English.
| Aspect | Status | Details |
|---|---|---|
| Architecture | SPECIFIED | Q-PNA $\S$3.2 defines the full encoding pipeline: semantic prime assignment → prime product → valuation vector → leaf activation |
| Working PoC | EXISTS |
ultrametric-ai-poc (Streamlit app) implements the pipeline using WordNet hypernyms for semantic prime assignment. Working for English. |
| Cross-linguistic extension | DESIGNED | The encoding pipeline is language-agnostic: it takes morphemes as input and produces valuation vectors. The semantic prime assignment must be extended to non-English languages. |
| Bruhat-Tits tree implementation | PARTIAL | The ultrametric-ai-poc implements Bruhat-Tits tree distance computation. Full tree construction with leaf activation vectors is specified but not yet implemented at scale. |
| Token calculus verification | DESIGNED | The syntactic token calculus provides formal verification of every encoding decision. Integration with the encoding pipeline is specified but not yet implemented. |
Feasibility: HIGH for English. MEDIUM for Turkish (requires WordNet-equivalent for Turkish, or a cross-linguistic semantic prime ontology). MEDIUM-LOW for Mohawk (requires a semantic prime ontology for Iroquoian languages).
Mitigation: The "semantic prime" approach can be bootstrapped: start with a small set of universal primes (ENTITY, ACTION, TENSE, ASPECT, EVIDENTIAL, LOCATIVE, MANNER, LOGICAL as defined in 0.2.md Definition 3), assign primes based on morphological analysis output, and expand the prime set as the system matures. The Language-Info-Architecture project's four mandatory feature clusters (reference-tracking, source-tracking, categorical-judgment, spatial-coordinate) provide a principled starting point for prime selection.
| Aspect | Status | Details |
|---|---|---|
| Specification | COMPLETE |
0.3.md $\S$7.1 defines the index structure with complexity bounds |
| Prototype | WORKING |
0.5.py implements in-memory indexing (dict-based node/edge lookup). Works for <1000 nodes. |
| Production index | STANDARD | Graph databases (Neo4j, ArangoDB) natively support the required operations. LCA can be precomputed and stored. Height-indexed node lookups are |
| Scalability | ACHIEVABLE | For |
Feasibility: HIGH. Graph indexing is a solved problem with existing database technology. The NSG's specific requirements (LCA precomputation, category-based indexes, height range queries) are straightforward additions.
| Aspect | Status | Details |
|---|---|---|
| Type I (Exact) | WORKING |
0.5.py implements exact subtree isomorphism via recursive candidate matching. Works for small graphs ( |
| Type II (Partial) | WORKING |
0.5.py implements partial matching with coverage scoring and ultrametric distance penalty. |
| Type III (Edit Distance) | SPECIFIED |
0.3.md $\S$3.3 defines the edit distance framework. Not yet implemented. |
| Ranking | WORKING |
0.5.py implements coverage × distance penalty scoring with ultrametric cluster boundary detection. |
| Scalability | CHALLENGING | Subtree isomorphism is NP-complete in general. For semantic trees with bounded degree and small depth, it is tractable. Practical corpus sizes ( |
Feasibility: HIGH for prototype scale (<1000 documents). MEDIUM for production scale (
Complexity analysis:
| Component | Worst Case | Practical (Semantic Trees) |
|---|---|---|
| Subtree isomorphism | Tractable: |
|
| Partial matching | Tractable: |
|
| Ranking (per query) | Tractable with candidate filtering | |
| Offline indexing | Batch processing, not latency-sensitive |
The Q-PNA neural architecture (DOI: 10.5281/zenodo.20287742) provides Stages 3-4 (encoding and tree construction). The NSG provides Stage 5 (matching and ranking). The interface between them is the Nested Semantic Tree — a clean data structure contract:
Q-PNA OUTPUT → NST (nodes, edges, heights, LCA) → NSG INPUTInterface contract:
# Q-PNA produces:
nst = NestedSemanticTree(language=detected_language)
for morpheme in analyzed_morphemes:
primes = assign_semantic_primes(morpheme)
product = compute_prime_product(primes)
valuation = compute_valuation(product)
leaf_id = activate_leaf(valuation)
nst.add_node(leaf_id, morpheme.concept_label, morpheme.category, …)
# NSG consumes:
results = rank_results(query_nst, indexed_corpus)This clear separation means Q-PNA and NSG can evolve independently as long as the NST format is preserved.
The ultrametric-ai-poc (GitHub: rwnq8/ultrametric-ai-poc) provides a working reference implementation for the encoding pipeline. Its architecture maps directly:
| ultrametric-ai-poc Component | NSG Pipeline Stage |
|---|---|
| WordNet hypernym → semantic prime assignment | Stage 3 (semantic parsing) |
| Prime product computation | Stage 3 (Q-PNA encoding) |
| p-adic valuation extraction | Stage 3 (valuation vector) |
| Bruhat-Tits tree distance | Stage 4 (LCA precomputation) |
| Distinction calculus verification | Stage 3 (token calculus) |
The ultrametric-ai-poc does NOT implement search/retrieval — that is the NSG's contribution.
The Language-Info-Architecture project (DOI: 10.5281/zenodo.20137616) provides the quantitative grounding for cross-linguistic search:
| Finding | NSG Application |
|---|---|
| Entropy gradient (6.48 → 6.80 bits/word) | Validates the compression-tax trade-off — semantic information is invariant under recoding |
| Mutual exclusion of mandatory feature clusters | Constrains the semantic prime design space — certain feature combinations never co-occur |
| Four mandatory clusters | Provides the starting taxonomy for semantic prime categories |
The Syntactic Token Calculus (Archive: 2026/04/Syntactic Token Calculus/) provides the formal verification substrate:
| STC Concept | NSG Application |
|---|---|
| Mark (□) and Void (ε) | The primitive distinction operation for node creation |
| Cross-ratio invariance | The mathematical basis for entropy invariance under recoding |
| Cocycle condition (δω = 0) | Global consistency check for the semantic graph |
| Normal forms | Canonical representations of semantic propositions |
| Component | Status | Deliverable |
|---|---|---|
| Formal definitions | ✅ | 0.2.md + 0.2.py |
| Search specification | ✅ | 0.3.md |
| Cross-linguistic examples | ✅ | 0.4.md |
| Working pipeline | ✅ | 0.5.py |
| Component diagram | ✅ | This document (0.6.md) |
Achievement: Full pipeline demonstrated end-to-end for 3 languages with 5-document corpus. All trees ultrametric. Exact, partial, and cross-linguistic matching working.
| Component | Effort | Description |
|---|---|---|
| Tree edit distance implementation | 1 week | Implement Type III matching (Zhang-Shasha or AP-TED algorithm) |
| Larger corpus | 1 week | 50-100 documents with varied event structures |
| Morphological analyzer integration | 2 weeks | Connect to Turkish morphological analyzer (Zemberek) |
| Performance benchmarking | 1 week | Measure query latency vs. corpus size, identify bottlenecks |
| Component | Effort | Description |
|---|---|---|
| Graph database integration | 1 month | Migrate from in-memory to Neo4j/ArangoDB |
| Q-PNA encoder integration | 2 months | Connect to Q-PNA's encoding pipeline |
| Candidate filtering optimization | 1 month | Category indexes, height bounds, approximate filtering |
| Cross-linguistic corpus | 2 months | Parallel texts in 5+ languages with morphological annotations |
| Evaluation framework | 1 month | Precision/recall metrics, benchmark datasets |
| Component | Effort | Description |
|---|---|---|
| Production search engine | 6 months | Distributed index, caching, load balancing |
| Mohawk morphological analyzer | 3-6 months | FST-based analyzer trained on annotated data |
| Neural semantic parser | 6-12 months | Train or fine-tune parser for common representation |
| Polysynthetic language corpus | 6-12 months | Collect and annotate parallel corpora |
Per ADR-0001 (DECISIONS.md). Trees enforce the ultrametric property, which guarantees:
- Nested result clusters (natural cut points in ranking)
- Geometric error confinement (the Threshold Principle)
-
$\mathcal{O}(\log n)$ distance computation via LCA - Direct connection to the Bruhat-Tits tree (Q-PNA's substrate)
Non-tree semantic relationships (coreference, symmetric relations) are handled by secondary indexing — the primary search operates on the tree.
Per the prototype (0.5.py). Labels are language-specific surface forms. Categories (ENTITY, ACTION, TENSE, etc.) are language-neutral. Cross-linguistic matching operates at the category level. Same-language matching can additionally use label matching for semantic precision. This is the distinction between the check_label parameter in _subtree_matches_node.
Q-PNA handles encoding (text → NST). NSG handles matching (NST → results). This separation:
- Allows independent evolution of both components
- Enables different encoders for different languages (same NST format)
- Keeps the search architecture language-agnostic
- Mirrors the standard separation in information retrieval (indexer vs. query processor)
Documents are pre-processed through Stages 2-4 offline. Queries only go through Stages 2-3 online, then Stage 5 matches against the pre-indexed corpus. This provides sub-second query latency because:
- Morphological analysis of the corpus is done once, not per query
- Q-PNA encoding of the corpus is done once, not per query
- LCA preprocessing of the corpus is done once, not per query
- Only the query graph needs online processing
| Limitation | Severity | Mitigation |
|---|---|---|
| Morphological analysis for polysynthetic languages | HIGH | This is the primary bottleneck. No production-ready tools exist for most polysynthetic languages. Requires per-language FST development. Mitigated in prototype by hardcoded trees. |
| Scalability of subtree isomorphism | MEDIUM | NP-complete in general but tractable for semantic trees ( |
| Semantic prime ontology | MEDIUM | No universal semantic prime ontology exists for all languages. Bootstrappable from linguistic typology (Language-Info-Architecture's four mandatory clusters). |
| Real corpus validation | MEDIUM | All prototype examples use hand-constructed trees. Real-world corpora will have noise, ambiguity, and parsing errors. |
| No handling of negation, modality, or quantification | LOW | These are important for semantic precision but can be added as node categories without changing the architecture. |
| Q-PNA integration not yet tested | HIGH | The Q-PNA specification is published but not yet implemented at scale. The NSG prototype assumes the Q-PNA encoder exists. Mitigated by the clean interface contract. |
The computational pathway from surface text to ranked search results consists of five stages:
| Stage | Function | Status |
|---|---|---|
| 1. Input | Surface text (any language) | Trivial |
| 2. Morphological Analysis | Text → morpheme sequence | Partial (English solved, Turkish feasible, Mohawk challenging) |
| 3. Q-PNA Encoding | Morphemes → Nested Semantic Tree | Specified (Q-PNA $\S$3.2), PoC exists (ultrametric-ai-poc) |
| 4. Graph Indexing | NST → indexed graph database | Specified (0.3.md $\S$7.1), prototype working |
| 5. Query Matching | Query NST + indexed corpus → ranked results | Specified (0.3.md), prototype working (0.5.py) |
The two critical open challenges are:
- Morphological analysis for polysynthetic languages (Stage 2 bottlenecks) — requires per-language tools
- Q-PNA integration at scale (Stage 3 scaling) — specified but not yet implemented in production
All other stages are either solved (indexing) or demonstrated in prototype (matching, ranking). The architecture is sound, the interfaces are clean, and the mathematical foundation (ultrametric inequality) provides formal guarantees for correctness and performance bounds.
0.1.md— Internal Literature Review0.2.md— Formal Definitions0.3.md— Sub-Graph Matching Search Specification0.4.md— Cross-Linguistic Examples0.5.py— Python Prototype: Full Pipeline- Q-PNA Research Specification v2.0 (DOI:
10.5281/zenodo.20287742) - Few Become One (DOI:
10.5281/zenodo.20328374) - Language as Information Architecture (DOI:
10.5281/zenodo.20137616) - Syntactic Token Calculus (Archive:
2026/04/Syntactic Token Calculus/) - ultrametric-ai-poc (GitHub:
github.com/rwnq8/ultrametric-ai-poc)
Computational Pathway v0.6 — The full architecture, specified, assessed, and roadmapped.