Computational Pathway: From Surface Text to Ranked Search Results

The Full Pipeline Architecture for Sub-Graph Matching Search on Nested Semantic Graphs

Version: 0.6 Date: 2026-05-22 Status: Architecture specification Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions), 0.3.md (Search Spec), 0.4.md (Examples), 0.5.py (Prototype) Grounded in: Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742), Few Become One $\S$V (DOI: 10.5281/zenodo.20328374)

1. Introduction

This document specifies the complete computational pathway for the Nested Semantic Graph (NSG) sub-graph matching search system — the end-to-end architecture that converts surface text in any language into ranked, language-neutral search results. It is the synthesis document: it connects the formal definitions (0.2.md), the search specification (0.3.md), the cross-linguistic examples (0.4.md), and the working Python prototype (0.5.py) into a single, coherent architecture, mapped against the existing Q-PNA neural architecture and the broader ultrametric-language research corpus.

The document serves three purposes:

Integration: Shows how all components fit together in a working system
Feasibility Assessment: Evaluates what exists, what needs to be built, and the challenges
Roadmap: Provides a phased implementation plan from prototype to production

2. The Full Pipeline: Five-Stage Architecture

2.1 Pipeline Overview

┌─────────────────────────────────────────────────────────────────┐
│                     STAGE 1: INPUT                               │
│  Surface text (any language) or visual query builder             │
│  "The dog bit the man yesterday"                                 │
│  "Köpek adamı dün ısırdı"                                       │
│  "Wahonwa'kahrá:ko' tsi'niyohseraká:te'"                        │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                STAGE 2: MORPHOLOGICAL ANALYSIS                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Language-specific morphological analyzer                  │  │
│  │ • English: whitespace tokenization + lemmatization        │  │
│  │ • Turkish: morpheme segmentation (agglutinative chains)   │  │
│  │ • Mohawk: morpheme decomposition (polysynthetic fusion)   │  │
│  │ Output: Morpheme sequence with grammatical annotations    │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Partially exists — language-specific tools available   │
│  for some languages (Inuktitut, Cree, Turkish); not Mohawk      │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│              STAGE 3: SEMANTIC PARSING (Q-PNA ENCODER)          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Q-PNA §3.2 Token Encoding Pipeline                        │  │
│  │ Step 1: Semantic prime assignment per morpheme            │  │
│  │ Step 2: Prime product P(t) = Π p_i^{f_i(t)}              │  │
│  │ Step 3: Valuation vector v(t) = (v_{p1}(P), v_{p2}(P),…) │  │
│  │ Step 4: Leaf activation on Bruhat-Tits tree               │  │
│  │ → Build Nested Semantic Tree: nodes, edges, heights, LCA  │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (Q-PNA §2-3). Working PoC: ultrametric-ai-poc│
│  uses WordNet hypernyms for semantic prime assignment.          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│               STAGE 4: GRAPH INDEXING (NSG INDEXER)             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ NST → Graph database index                                │  │
│  │ • Nodes indexed by category and label (O(1) lookup)       │  │
│  │ • Edges indexed by parent/child for traversal             │  │
│  │ • LCA table precomputed via binary lifting (O(log n))     │  │
│  │ • Height-indexed nodes for range queries                  │  │
│  │ • Node category inverted index for candidate generation   │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (0.3.md §7.1). Prototype (0.5.py) in-memory.│
│  Production: graph database (Neo4j, ArangoDB) or custom index.  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│             STAGE 5: QUERY MATCHING & RANKING (NSG MATCHER)     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ For query graph G_q:                                       │  │
│  │ 1. Candidate filtering: category indexes, height bounds   │  │
│  │ 2. Type I: Exact subtree isomorphism search                │  │
│  │ 3. Type II: Partial match with coverage scoring            │  │
│  │ 4. Type III: Tree edit distance (fallback)                 │  │
│  │ 5. RANK-AND-CLUSTER: sort by score, detect cluster cuts   │  │
│  │ → Ranked, clustered results                                │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (0.3.md). Prototype (0.5.py) working for    │
│  small graphs. Production needs scaling to corpus size.         │
└─────────────────────────────────────────────────────────────────┘

2.2 Data Flow Diagram

Query (any language)              Document Corpus (any language)
       │                                      │
       ▼                                      ▼
┌──────────────┐                    ┌──────────────┐
│ Morphological│                    │ Morphological│
│ Analyzer     │                    │ Analyzer     │
│ (Stage 2)    │                    │ (Stage 2)    │
└──────┬───────┘                    └──────┬───────┘
       │ morpheme seq                      │ morpheme seq
       ▼                                   ▼
┌──────────────┐                    ┌──────────────┐
│ Q-PNA Encoder│                    │ Q-PNA Encoder│
│ (Stage 3)    │                    │ (Stage 3)    │
└──────┬───────┘                    └──────┬───────┘
       │ NST G_q                           │ NSTs {G_d}
       ▼                                   ▼
       │                           ┌──────────────┐
       │                           │ NSG Indexer  │
       │                           │ (Stage 4)    │
       │                           └──────┬───────┘
       │                                   │ indexed graphs
       │                                   ▼
       │                           ┌──────────────┐
       └──────────────────────────▶│ NSG Matcher  │
                                   │ (Stage 5)    │
                                   └──────┬───────┘
                                          │ ranked results
                                          ▼
                                   ┌──────────────┐
                                   │ Results      │
                                   │ (clustered)  │
                                   └──────────────┘

2.3 Offline vs. Online Processing

Stage	Mode	Latency Budget	Complexity
Morphological Analysis	Online (query) / Offline (documents)	<100ms per query	$\mathcal{O}(\lvert\text{text}\rvert)$
Q-PNA Encoding	Online (query) / Offline (documents)	<50ms per morpheme	$\mathcal{O}(k \cdot m)$
Graph Indexing	Offline (documents only)	Batch	$\mathcal{O}(N \cdot C_d)$
Query Matching	Online	<200ms per query	See $\S$5 below

For the production system, documents are pre-processed through Stages 2-4 offline. At query time, only the query goes through Stages 2-3, and then Stage 5 matches the query graph against the pre-indexed corpus. This provides sub-second query latency for corpora up to $\sim 10^5$ documents.

3. Stage-by-Stage Feasibility Assessment

3.1 Stage 2: Morphological Analysis

Aspect	Status	Details
English	SOLVED	Whitespace tokenization + lemmatization (NLTK, spaCy). Trivial.
Turkish	FEASIBLE	Turkish has mature morphological analyzers (TRmorph, Zemberek). Agglutinative chains are transparently segmentable.
Mohawk / Polysynthetic	CHALLENGING	No production-ready morphological analyzer exists for Mohawk. Academic tools exist for Inuktitut and Cree. Mohawk would require a custom Finite State Transducer (FST) trained on annotated data.
General solution	RESEARCH	A universal morphological analyzer for arbitrary polysynthetic languages does not exist. This is the primary bottleneck for Stage 2.

Mitigation: The prototype (0.5.py) bypasses Stage 2 entirely by using hardcoded tree builders — the trees are constructed directly from semantic specifications. For production, morphological analysis is the language-specific front-end that must be built per language. This is consistent with the architectural principle: the tree is invariant; only the linearization → parsing path differs per language.

Effort estimate: 3-6 months for a Mohawk morphological analyzer (FST-based), assuming access to annotated data and native speaker expertise. 1-2 months for Turkish (leverage existing tools). 1 week for English.

3.2 Stage 3: Q-PNA Semantic Parsing

Aspect	Status	Details
Architecture	SPECIFIED	Q-PNA $\S$3.2 defines the full encoding pipeline: semantic prime assignment → prime product → valuation vector → leaf activation
Working PoC	EXISTS	`ultrametric-ai-poc` (Streamlit app) implements the pipeline using WordNet hypernyms for semantic prime assignment. Working for English.
Cross-linguistic extension	DESIGNED	The encoding pipeline is language-agnostic: it takes morphemes as input and produces valuation vectors. The semantic prime assignment must be extended to non-English languages.
Bruhat-Tits tree implementation	PARTIAL	The ultrametric-ai-poc implements Bruhat-Tits tree distance computation. Full tree construction with leaf activation vectors is specified but not yet implemented at scale.
Token calculus verification	DESIGNED	The syntactic token calculus provides formal verification of every encoding decision. Integration with the encoding pipeline is specified but not yet implemented.

Feasibility: HIGH for English. MEDIUM for Turkish (requires WordNet-equivalent for Turkish, or a cross-linguistic semantic prime ontology). MEDIUM-LOW for Mohawk (requires a semantic prime ontology for Iroquoian languages).

Mitigation: The "semantic prime" approach can be bootstrapped: start with a small set of universal primes (ENTITY, ACTION, TENSE, ASPECT, EVIDENTIAL, LOCATIVE, MANNER, LOGICAL as defined in 0.2.md Definition 3), assign primes based on morphological analysis output, and expand the prime set as the system matures. The Language-Info-Architecture project's four mandatory feature clusters (reference-tracking, source-tracking, categorical-judgment, spatial-coordinate) provide a principled starting point for prime selection.

3.3 Stage 4: Graph Indexing

Aspect	Status	Details
Specification	COMPLETE	`0.3.md` $\S$7.1 defines the index structure with complexity bounds
Prototype	WORKING	`0.5.py` implements in-memory indexing (dict-based node/edge lookup). Works for <1000 nodes.
Production index	STANDARD	Graph databases (Neo4j, ArangoDB) natively support the required operations. LCA can be precomputed and stored. Height-indexed node lookups are $\mathcal{O}(1)$ with B-tree indexes.
Scalability	ACHIEVABLE	For $\sim 10^4$ documents, in-memory index is viable. For $10^4 - 10^6$, sharded graph database. For $> 10^6$, distributed index with approximate nearest-neighbor in ultrametric space.

Feasibility: HIGH. Graph indexing is a solved problem with existing database technology. The NSG's specific requirements (LCA precomputation, category-based indexes, height range queries) are straightforward additions.

3.4 Stage 5: Query Matching & Ranking

Aspect	Status	Details
Type I (Exact)	WORKING	`0.5.py` implements exact subtree isomorphism via recursive candidate matching. Works for small graphs ($\leq 20$ nodes).
Type II (Partial)	WORKING	`0.5.py` implements partial matching with coverage scoring and ultrametric distance penalty.
Type III (Edit Distance)	SPECIFIED	`0.3.md` $\S$3.3 defines the edit distance framework. Not yet implemented.
Ranking	WORKING	`0.5.py` implements coverage × distance penalty scoring with ultrametric cluster boundary detection.
Scalability	CHALLENGING	Subtree isomorphism is NP-complete in general. For semantic trees with bounded degree and small depth, it is tractable. Practical corpus sizes ($10^4 - 10^6$) require candidate filtering before matching.

Feasibility: HIGH for prototype scale (<1000 documents). MEDIUM for production scale ($10^4 - 10^6$). The primary scalability challenge is not the matching algorithm but the candidate filtering layer — efficiently narrowing the search space before running exact matching.

Complexity analysis:

Component	Worst Case	Practical (Semantic Trees)
Subtree isomorphism	$\mathcal{O}(\lvert V_q\rvert \cdot \lvert V_d\rvert \cdot b^2)$	Tractable: $b \leq 5$, $\lvert V_q\rvert \leq 10$
Partial matching	$\mathcal{O}(2^{\lvert V_q\rvert} \cdot \lvert V_d\rvert)$	Tractable: $\lvert V_q\rvert \leq 10$, $\lvert V_d\rvert \leq 50$
Ranking (per query)	$\mathcal{O}(\lvert\mathcal{D}\rvert \cdot M)$	Tractable with candidate filtering
Offline indexing	$\mathcal{O}(\sum_d \lvert V_d\rvert \cdot b^2)$	Batch processing, not latency-sensitive

4. Integration with Existing Systems

4.1 Q-PNA Integration

The Q-PNA neural architecture (DOI: 10.5281/zenodo.20287742) provides Stages 3-4 (encoding and tree construction). The NSG provides Stage 5 (matching and ranking). The interface between them is the Nested Semantic Tree — a clean data structure contract:

Q-PNA OUTPUT → NST (nodes, edges, heights, LCA) → NSG INPUT

Interface contract:

# Q-PNA produces:
nst = NestedSemanticTree(language=detected_language)
for morpheme in analyzed_morphemes:
    primes = assign_semantic_primes(morpheme)
    product = compute_prime_product(primes)
    valuation = compute_valuation(product)
    leaf_id = activate_leaf(valuation)
    nst.add_node(leaf_id, morpheme.concept_label, morpheme.category, …)

# NSG consumes:
results = rank_results(query_nst, indexed_corpus)

This clear separation means Q-PNA and NSG can evolve independently as long as the NST format is preserved.

4.2 ultrametric-ai-poc Integration

The ultrametric-ai-poc (GitHub: rwnq8/ultrametric-ai-poc) provides a working reference implementation for the encoding pipeline. Its architecture maps directly:

ultrametric-ai-poc Component	NSG Pipeline Stage
WordNet hypernym → semantic prime assignment	Stage 3 (semantic parsing)
Prime product computation	Stage 3 (Q-PNA encoding)
p-adic valuation extraction	Stage 3 (valuation vector)
Bruhat-Tits tree distance	Stage 4 (LCA precomputation)
Distinction calculus verification	Stage 3 (token calculus)

The ultrametric-ai-poc does NOT implement search/retrieval — that is the NSG's contribution.

4.3 Language-Info-Architecture Integration

The Language-Info-Architecture project (DOI: 10.5281/zenodo.20137616) provides the quantitative grounding for cross-linguistic search:

Finding	NSG Application
Entropy gradient (6.48 → 6.80 bits/word)	Validates the compression-tax trade-off — semantic information is invariant under recoding
Mutual exclusion of mandatory feature clusters	Constrains the semantic prime design space — certain feature combinations never co-occur
Four mandatory clusters	Provides the starting taxonomy for semantic prime categories

4.4 Syntactic Token Calculus Integration

The Syntactic Token Calculus (Archive: 2026/04/Syntactic Token Calculus/) provides the formal verification substrate:

STC Concept	NSG Application
Mark (□) and Void (ε)	The primitive distinction operation for node creation
Cross-ratio invariance	The mathematical basis for entropy invariance under recoding
Cocycle condition (δω = 0)	Global consistency check for the semantic graph
Normal forms	Canonical representations of semantic propositions

5. Phased Implementation Roadmap

5.1 Phase 1: Prototype (Current — COMPLETE)

Component	Status	Deliverable
Formal definitions	✅	`0.2.md` + `0.2.py`
Search specification	✅	`0.3.md`
Cross-linguistic examples	✅	`0.4.md`
Working pipeline	✅	`0.5.py`
Component diagram	✅	This document (`0.6.md`)

Achievement: Full pipeline demonstrated end-to-end for 3 languages with 5-document corpus. All trees ultrametric. Exact, partial, and cross-linguistic matching working.

5.2 Phase 2: Enhanced Prototype (Planned — BACKLOG P1)

Component	Effort	Description
Tree edit distance implementation	1 week	Implement Type III matching (Zhang-Shasha or AP-TED algorithm)
Larger corpus	1 week	50-100 documents with varied event structures
Morphological analyzer integration	2 weeks	Connect to Turkish morphological analyzer (Zemberek)
Performance benchmarking	1 week	Measure query latency vs. corpus size, identify bottlenecks

5.3 Phase 3: Production-Ready (Planned — BACKLOG P2)

Component	Effort	Description
Graph database integration	1 month	Migrate from in-memory to Neo4j/ArangoDB
Q-PNA encoder integration	2 months	Connect to Q-PNA's encoding pipeline
Candidate filtering optimization	1 month	Category indexes, height bounds, approximate filtering
Cross-linguistic corpus	2 months	Parallel texts in 5+ languages with morphological annotations
Evaluation framework	1 month	Precision/recall metrics, benchmark datasets

5.4 Phase 4: Full-Scale (Planned — BACKLOG P3)

Component	Effort	Description
Production search engine	6 months	Distributed index, caching, load balancing
Mohawk morphological analyzer	3-6 months	FST-based analyzer trained on annotated data
Neural semantic parser	6-12 months	Train or fine-tune parser for common representation
Polysynthetic language corpus	6-12 months	Collect and annotate parallel corpora

6. Key Architectural Decisions

6.1 Why Trees, Not General Graphs

Per ADR-0001 (DECISIONS.md). Trees enforce the ultrametric property, which guarantees:

Nested result clusters (natural cut points in ranking)
Geometric error confinement (the Threshold Principle)
$\mathcal{O}(\log n)$ distance computation via LCA
Direct connection to the Bruhat-Tits tree (Q-PNA's substrate)

Non-tree semantic relationships (coreference, symmetric relations) are handled by secondary indexing — the primary search operates on the tree.

6.2 Why Category-Level Matching for Cross-Linguistic Search

Per the prototype (0.5.py). Labels are language-specific surface forms. Categories (ENTITY, ACTION, TENSE, etc.) are language-neutral. Cross-linguistic matching operates at the category level. Same-language matching can additionally use label matching for semantic precision. This is the distinction between the check_label parameter in _subtree_matches_node.

6.3 Why Separation of Encoder and Matcher

Q-PNA handles encoding (text → NST). NSG handles matching (NST → results). This separation:

Allows independent evolution of both components
Enables different encoders for different languages (same NST format)
Keeps the search architecture language-agnostic
Mirrors the standard separation in information retrieval (indexer vs. query processor)

6.4 Why Offline Indexing + Online Querying

Documents are pre-processed through Stages 2-4 offline. Queries only go through Stages 2-3 online, then Stage 5 matches against the pre-indexed corpus. This provides sub-second query latency because:

Morphological analysis of the corpus is done once, not per query
Q-PNA encoding of the corpus is done once, not per query
LCA preprocessing of the corpus is done once, not per query
Only the query graph needs online processing

7. Limitations and Open Challenges

Limitation	Severity	Mitigation
Morphological analysis for polysynthetic languages	HIGH	This is the primary bottleneck. No production-ready tools exist for most polysynthetic languages. Requires per-language FST development. Mitigated in prototype by hardcoded trees.
Scalability of subtree isomorphism	MEDIUM	NP-complete in general but tractable for semantic trees ($b \leq 5$, $\lvert V_q\rvert \leq 10$). Candidate filtering reduces the search space.
Semantic prime ontology	MEDIUM	No universal semantic prime ontology exists for all languages. Bootstrappable from linguistic typology (Language-Info-Architecture's four mandatory clusters).
Real corpus validation	MEDIUM	All prototype examples use hand-constructed trees. Real-world corpora will have noise, ambiguity, and parsing errors.
No handling of negation, modality, or quantification	LOW	These are important for semantic precision but can be added as node categories without changing the architecture.
Q-PNA integration not yet tested	HIGH	The Q-PNA specification is published but not yet implemented at scale. The NSG prototype assumes the Q-PNA encoder exists. Mitigated by the clean interface contract.

8. Summary

The computational pathway from surface text to ranked search results consists of five stages:

Stage	Function	Status
1. Input	Surface text (any language)	Trivial
2. Morphological Analysis	Text → morpheme sequence	Partial (English solved, Turkish feasible, Mohawk challenging)
3. Q-PNA Encoding	Morphemes → Nested Semantic Tree	Specified (Q-PNA $\S$3.2), PoC exists (ultrametric-ai-poc)
4. Graph Indexing	NST → indexed graph database	Specified (`0.3.md` $\S$7.1), prototype working
5. Query Matching	Query NST + indexed corpus → ranked results	Specified (`0.3.md`), prototype working (`0.5.py`)

The two critical open challenges are:

Morphological analysis for polysynthetic languages (Stage 2 bottlenecks) — requires per-language tools
Q-PNA integration at scale (Stage 3 scaling) — specified but not yet implemented in production

All other stages are either solved (indexing) or demonstrated in prototype (matching, ranking). The architecture is sound, the interfaces are clean, and the mathematical foundation (ultrametric inequality) provides formal guarantees for correctness and performance bounds.

References

0.1.md — Internal Literature Review
0.2.md — Formal Definitions
0.3.md — Sub-Graph Matching Search Specification
0.4.md — Cross-Linguistic Examples
0.5.py — Python Prototype: Full Pipeline
Q-PNA Research Specification v2.0 (DOI: 10.5281/zenodo.20287742)
Few Become One (DOI: 10.5281/zenodo.20328374)
Language as Information Architecture (DOI: 10.5281/zenodo.20137616)
Syntactic Token Calculus (Archive: 2026/04/Syntactic Token Calculus/)
ultrametric-ai-poc (GitHub: github.com/rwnq8/ultrametric-ai-poc)

Computational Pathway v0.6 — The full architecture, specified, assessed, and roadmapped.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Computational Pathway: From Surface Text to Ranked Search Results

The Full Pipeline Architecture for Sub-Graph Matching Search on Nested Semantic Graphs

1. Introduction

2. The Full Pipeline: Five-Stage Architecture

2.1 Pipeline Overview

2.2 Data Flow Diagram

2.3 Offline vs. Online Processing

3. Stage-by-Stage Feasibility Assessment

3.1 Stage 2: Morphological Analysis

3.2 Stage 3: Q-PNA Semantic Parsing

3.3 Stage 4: Graph Indexing

3.4 Stage 5: Query Matching & Ranking

4. Integration with Existing Systems

4.1 Q-PNA Integration

4.2 ultrametric-ai-poc Integration

4.3 Language-Info-Architecture Integration

4.4 Syntactic Token Calculus Integration

5. Phased Implementation Roadmap

5.1 Phase 1: Prototype (Current — COMPLETE)

5.2 Phase 2: Enhanced Prototype (Planned — BACKLOG P1)

5.3 Phase 3: Production-Ready (Planned — BACKLOG P2)

5.4 Phase 4: Full-Scale (Planned — BACKLOG P3)

6. Key Architectural Decisions

6.1 Why Trees, Not General Graphs

6.2 Why Category-Level Matching for Cross-Linguistic Search

6.3 Why Separation of Encoder and Matcher

6.4 Why Offline Indexing + Online Querying

7. Limitations and Open Challenges

8. Summary

References

FilesExpand file tree

0.6.md

Latest commit

History

0.6.md

File metadata and controls

Computational Pathway: From Surface Text to Ranked Search Results

The Full Pipeline Architecture for Sub-Graph Matching Search on Nested Semantic Graphs

1. Introduction

2. The Full Pipeline: Five-Stage Architecture

2.1 Pipeline Overview

2.2 Data Flow Diagram

2.3 Offline vs. Online Processing

3. Stage-by-Stage Feasibility Assessment

3.1 Stage 2: Morphological Analysis

3.2 Stage 3: Q-PNA Semantic Parsing

3.3 Stage 4: Graph Indexing

3.4 Stage 5: Query Matching & Ranking

4. Integration with Existing Systems

4.1 Q-PNA Integration

4.2 ultrametric-ai-poc Integration

4.3 Language-Info-Architecture Integration

4.4 Syntactic Token Calculus Integration

5. Phased Implementation Roadmap

5.1 Phase 1: Prototype (Current — COMPLETE)

5.2 Phase 2: Enhanced Prototype (Planned — BACKLOG P1)

5.3 Phase 3: Production-Ready (Planned — BACKLOG P2)

5.4 Phase 4: Full-Scale (Planned — BACKLOG P3)

6. Key Architectural Decisions

6.1 Why Trees, Not General Graphs

6.2 Why Category-Level Matching for Cross-Linguistic Search

6.3 Why Separation of Encoder and Matcher

6.4 Why Offline Indexing + Online Querying

7. Limitations and Open Challenges

8. Summary

References