Skip to content

Latest commit

 

History

History
402 lines (314 loc) · 27.3 KB

File metadata and controls

402 lines (314 loc) · 27.3 KB

Computational Pathway: From Surface Text to Ranked Search Results

The Full Pipeline Architecture for Sub-Graph Matching Search on Nested Semantic Graphs

Version: 0.6 Date: 2026-05-22 Status: Architecture specification Depends on: 0.1.md (Literature Review), 0.2.md (Formal Definitions), 0.3.md (Search Spec), 0.4.md (Examples), 0.5.py (Prototype) Grounded in: Q-PNA $\S$2-3 (DOI: 10.5281/zenodo.20287742), Few Become One $\S$V (DOI: 10.5281/zenodo.20328374)


1. Introduction

This document specifies the complete computational pathway for the Nested Semantic Graph (NSG) sub-graph matching search system — the end-to-end architecture that converts surface text in any language into ranked, language-neutral search results. It is the synthesis document: it connects the formal definitions (0.2.md), the search specification (0.3.md), the cross-linguistic examples (0.4.md), and the working Python prototype (0.5.py) into a single, coherent architecture, mapped against the existing Q-PNA neural architecture and the broader ultrametric-language research corpus.

The document serves three purposes:

  1. Integration: Shows how all components fit together in a working system
  2. Feasibility Assessment: Evaluates what exists, what needs to be built, and the challenges
  3. Roadmap: Provides a phased implementation plan from prototype to production

2. The Full Pipeline: Five-Stage Architecture

2.1 Pipeline Overview

┌─────────────────────────────────────────────────────────────────┐
│                     STAGE 1: INPUT                               │
│  Surface text (any language) or visual query builder             │
│  "The dog bit the man yesterday"                                 │
│  "Köpek adamı dün ısırdı"                                       │
│  "Wahonwa'kahrá:ko' tsi'niyohseraká:te'"                        │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│                STAGE 2: MORPHOLOGICAL ANALYSIS                   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Language-specific morphological analyzer                  │  │
│  │ • English: whitespace tokenization + lemmatization        │  │
│  │ • Turkish: morpheme segmentation (agglutinative chains)   │  │
│  │ • Mohawk: morpheme decomposition (polysynthetic fusion)   │  │
│  │ Output: Morpheme sequence with grammatical annotations    │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Partially exists — language-specific tools available   │
│  for some languages (Inuktitut, Cree, Turkish); not Mohawk      │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│              STAGE 3: SEMANTIC PARSING (Q-PNA ENCODER)          │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ Q-PNA §3.2 Token Encoding Pipeline                        │  │
│  │ Step 1: Semantic prime assignment per morpheme            │  │
│  │ Step 2: Prime product P(t) = Π p_i^{f_i(t)}              │  │
│  │ Step 3: Valuation vector v(t) = (v_{p1}(P), v_{p2}(P),…) │  │
│  │ Step 4: Leaf activation on Bruhat-Tits tree               │  │
│  │ → Build Nested Semantic Tree: nodes, edges, heights, LCA  │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (Q-PNA §2-3). Working PoC: ultrametric-ai-poc│
│  uses WordNet hypernyms for semantic prime assignment.          │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│               STAGE 4: GRAPH INDEXING (NSG INDEXER)             │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ NST → Graph database index                                │  │
│  │ • Nodes indexed by category and label (O(1) lookup)       │  │
│  │ • Edges indexed by parent/child for traversal             │  │
│  │ • LCA table precomputed via binary lifting (O(log n))     │  │
│  │ • Height-indexed nodes for range queries                  │  │
│  │ • Node category inverted index for candidate generation   │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (0.3.md §7.1). Prototype (0.5.py) in-memory.│
│  Production: graph database (Neo4j, ArangoDB) or custom index.  │
└───────────────────────────┬─────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────────┐
│             STAGE 5: QUERY MATCHING & RANKING (NSG MATCHER)     │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │ For query graph G_q:                                       │  │
│  │ 1. Candidate filtering: category indexes, height bounds   │  │
│  │ 2. Type I: Exact subtree isomorphism search                │  │
│  │ 3. Type II: Partial match with coverage scoring            │  │
│  │ 4. Type III: Tree edit distance (fallback)                 │  │
│  │ 5. RANK-AND-CLUSTER: sort by score, detect cluster cuts   │  │
│  │ → Ranked, clustered results                                │  │
│  └───────────────────────────────────────────────────────────┘  │
│  Status: Specified (0.3.md). Prototype (0.5.py) working for    │
│  small graphs. Production needs scaling to corpus size.         │
└─────────────────────────────────────────────────────────────────┘

2.2 Data Flow Diagram

Query (any language)              Document Corpus (any language)
       │                                      │
       ▼                                      ▼
┌──────────────┐                    ┌──────────────┐
│ Morphological│                    │ Morphological│
│ Analyzer     │                    │ Analyzer     │
│ (Stage 2)    │                    │ (Stage 2)    │
└──────┬───────┘                    └──────┬───────┘
       │ morpheme seq                      │ morpheme seq
       ▼                                   ▼
┌──────────────┐                    ┌──────────────┐
│ Q-PNA Encoder│                    │ Q-PNA Encoder│
│ (Stage 3)    │                    │ (Stage 3)    │
└──────┬───────┘                    └──────┬───────┘
       │ NST G_q                           │ NSTs {G_d}
       ▼                                   ▼
       │                           ┌──────────────┐
       │                           │ NSG Indexer  │
       │                           │ (Stage 4)    │
       │                           └──────┬───────┘
       │                                   │ indexed graphs
       │                                   ▼
       │                           ┌──────────────┐
       └──────────────────────────▶│ NSG Matcher  │
                                   │ (Stage 5)    │
                                   └──────┬───────┘
                                          │ ranked results
                                          ▼
                                   ┌──────────────┐
                                   │ Results      │
                                   │ (clustered)  │
                                   └──────────────┘

2.3 Offline vs. Online Processing

Stage Mode Latency Budget Complexity
Morphological Analysis Online (query) / Offline (documents) <100ms per query $\mathcal{O}(\lvert\text{text}\rvert)$
Q-PNA Encoding Online (query) / Offline (documents) <50ms per morpheme $\mathcal{O}(k \cdot m)$
Graph Indexing Offline (documents only) Batch $\mathcal{O}(N \cdot C_d)$
Query Matching Online <200ms per query See $\S$5 below

For the production system, documents are pre-processed through Stages 2-4 offline. At query time, only the query goes through Stages 2-3, and then Stage 5 matches the query graph against the pre-indexed corpus. This provides sub-second query latency for corpora up to $\sim 10^5$ documents.


3. Stage-by-Stage Feasibility Assessment

3.1 Stage 2: Morphological Analysis

Aspect Status Details
English SOLVED Whitespace tokenization + lemmatization (NLTK, spaCy). Trivial.
Turkish FEASIBLE Turkish has mature morphological analyzers (TRmorph, Zemberek). Agglutinative chains are transparently segmentable.
Mohawk / Polysynthetic CHALLENGING No production-ready morphological analyzer exists for Mohawk. Academic tools exist for Inuktitut and Cree. Mohawk would require a custom Finite State Transducer (FST) trained on annotated data.
General solution RESEARCH A universal morphological analyzer for arbitrary polysynthetic languages does not exist. This is the primary bottleneck for Stage 2.

Mitigation: The prototype (0.5.py) bypasses Stage 2 entirely by using hardcoded tree builders — the trees are constructed directly from semantic specifications. For production, morphological analysis is the language-specific front-end that must be built per language. This is consistent with the architectural principle: the tree is invariant; only the linearization → parsing path differs per language.

Effort estimate: 3-6 months for a Mohawk morphological analyzer (FST-based), assuming access to annotated data and native speaker expertise. 1-2 months for Turkish (leverage existing tools). 1 week for English.

3.2 Stage 3: Q-PNA Semantic Parsing

Aspect Status Details
Architecture SPECIFIED Q-PNA $\S$3.2 defines the full encoding pipeline: semantic prime assignment → prime product → valuation vector → leaf activation
Working PoC EXISTS ultrametric-ai-poc (Streamlit app) implements the pipeline using WordNet hypernyms for semantic prime assignment. Working for English.
Cross-linguistic extension DESIGNED The encoding pipeline is language-agnostic: it takes morphemes as input and produces valuation vectors. The semantic prime assignment must be extended to non-English languages.
Bruhat-Tits tree implementation PARTIAL The ultrametric-ai-poc implements Bruhat-Tits tree distance computation. Full tree construction with leaf activation vectors is specified but not yet implemented at scale.
Token calculus verification DESIGNED The syntactic token calculus provides formal verification of every encoding decision. Integration with the encoding pipeline is specified but not yet implemented.

Feasibility: HIGH for English. MEDIUM for Turkish (requires WordNet-equivalent for Turkish, or a cross-linguistic semantic prime ontology). MEDIUM-LOW for Mohawk (requires a semantic prime ontology for Iroquoian languages).

Mitigation: The "semantic prime" approach can be bootstrapped: start with a small set of universal primes (ENTITY, ACTION, TENSE, ASPECT, EVIDENTIAL, LOCATIVE, MANNER, LOGICAL as defined in 0.2.md Definition 3), assign primes based on morphological analysis output, and expand the prime set as the system matures. The Language-Info-Architecture project's four mandatory feature clusters (reference-tracking, source-tracking, categorical-judgment, spatial-coordinate) provide a principled starting point for prime selection.

3.3 Stage 4: Graph Indexing

Aspect Status Details
Specification COMPLETE 0.3.md $\S$7.1 defines the index structure with complexity bounds
Prototype WORKING 0.5.py implements in-memory indexing (dict-based node/edge lookup). Works for <1000 nodes.
Production index STANDARD Graph databases (Neo4j, ArangoDB) natively support the required operations. LCA can be precomputed and stored. Height-indexed node lookups are $\mathcal{O}(1)$ with B-tree indexes.
Scalability ACHIEVABLE For $\sim 10^4$ documents, in-memory index is viable. For $10^4 - 10^6$, sharded graph database. For $&gt; 10^6$, distributed index with approximate nearest-neighbor in ultrametric space.

Feasibility: HIGH. Graph indexing is a solved problem with existing database technology. The NSG's specific requirements (LCA precomputation, category-based indexes, height range queries) are straightforward additions.

3.4 Stage 5: Query Matching & Ranking

Aspect Status Details
Type I (Exact) WORKING 0.5.py implements exact subtree isomorphism via recursive candidate matching. Works for small graphs ($\leq 20$ nodes).
Type II (Partial) WORKING 0.5.py implements partial matching with coverage scoring and ultrametric distance penalty.
Type III (Edit Distance) SPECIFIED 0.3.md $\S$3.3 defines the edit distance framework. Not yet implemented.
Ranking WORKING 0.5.py implements coverage × distance penalty scoring with ultrametric cluster boundary detection.
Scalability CHALLENGING Subtree isomorphism is NP-complete in general. For semantic trees with bounded degree and small depth, it is tractable. Practical corpus sizes ($10^4 - 10^6$) require candidate filtering before matching.

Feasibility: HIGH for prototype scale (<1000 documents). MEDIUM for production scale ($10^4 - 10^6$). The primary scalability challenge is not the matching algorithm but the candidate filtering layer — efficiently narrowing the search space before running exact matching.

Complexity analysis:

Component Worst Case Practical (Semantic Trees)
Subtree isomorphism $\mathcal{O}(\lvert V_q\rvert \cdot \lvert V_d\rvert \cdot b^2)$ Tractable: $b \leq 5$, $\lvert V_q\rvert \leq 10$
Partial matching $\mathcal{O}(2^{\lvert V_q\rvert} \cdot \lvert V_d\rvert)$ Tractable: $\lvert V_q\rvert \leq 10$, $\lvert V_d\rvert \leq 50$
Ranking (per query) $\mathcal{O}(\lvert\mathcal{D}\rvert \cdot M)$ Tractable with candidate filtering
Offline indexing $\mathcal{O}(\sum_d \lvert V_d\rvert \cdot b^2)$ Batch processing, not latency-sensitive

4. Integration with Existing Systems

4.1 Q-PNA Integration

The Q-PNA neural architecture (DOI: 10.5281/zenodo.20287742) provides Stages 3-4 (encoding and tree construction). The NSG provides Stage 5 (matching and ranking). The interface between them is the Nested Semantic Tree — a clean data structure contract:

Q-PNA OUTPUT → NST (nodes, edges, heights, LCA) → NSG INPUT

Interface contract:

# Q-PNA produces:
nst = NestedSemanticTree(language=detected_language)
for morpheme in analyzed_morphemes:
    primes = assign_semantic_primes(morpheme)
    product = compute_prime_product(primes)
    valuation = compute_valuation(product)
    leaf_id = activate_leaf(valuation)
    nst.add_node(leaf_id, morpheme.concept_label, morpheme.category, …)

# NSG consumes:
results = rank_results(query_nst, indexed_corpus)

This clear separation means Q-PNA and NSG can evolve independently as long as the NST format is preserved.

4.2 ultrametric-ai-poc Integration

The ultrametric-ai-poc (GitHub: rwnq8/ultrametric-ai-poc) provides a working reference implementation for the encoding pipeline. Its architecture maps directly:

ultrametric-ai-poc Component NSG Pipeline Stage
WordNet hypernym → semantic prime assignment Stage 3 (semantic parsing)
Prime product computation Stage 3 (Q-PNA encoding)
p-adic valuation extraction Stage 3 (valuation vector)
Bruhat-Tits tree distance Stage 4 (LCA precomputation)
Distinction calculus verification Stage 3 (token calculus)

The ultrametric-ai-poc does NOT implement search/retrieval — that is the NSG's contribution.

4.3 Language-Info-Architecture Integration

The Language-Info-Architecture project (DOI: 10.5281/zenodo.20137616) provides the quantitative grounding for cross-linguistic search:

Finding NSG Application
Entropy gradient (6.48 → 6.80 bits/word) Validates the compression-tax trade-off — semantic information is invariant under recoding
Mutual exclusion of mandatory feature clusters Constrains the semantic prime design space — certain feature combinations never co-occur
Four mandatory clusters Provides the starting taxonomy for semantic prime categories

4.4 Syntactic Token Calculus Integration

The Syntactic Token Calculus (Archive: 2026/04/Syntactic Token Calculus/) provides the formal verification substrate:

STC Concept NSG Application
Mark (□) and Void (ε) The primitive distinction operation for node creation
Cross-ratio invariance The mathematical basis for entropy invariance under recoding
Cocycle condition (δω = 0) Global consistency check for the semantic graph
Normal forms Canonical representations of semantic propositions

5. Phased Implementation Roadmap

5.1 Phase 1: Prototype (Current — COMPLETE)

Component Status Deliverable
Formal definitions 0.2.md + 0.2.py
Search specification 0.3.md
Cross-linguistic examples 0.4.md
Working pipeline 0.5.py
Component diagram This document (0.6.md)

Achievement: Full pipeline demonstrated end-to-end for 3 languages with 5-document corpus. All trees ultrametric. Exact, partial, and cross-linguistic matching working.

5.2 Phase 2: Enhanced Prototype (Planned — BACKLOG P1)

Component Effort Description
Tree edit distance implementation 1 week Implement Type III matching (Zhang-Shasha or AP-TED algorithm)
Larger corpus 1 week 50-100 documents with varied event structures
Morphological analyzer integration 2 weeks Connect to Turkish morphological analyzer (Zemberek)
Performance benchmarking 1 week Measure query latency vs. corpus size, identify bottlenecks

5.3 Phase 3: Production-Ready (Planned — BACKLOG P2)

Component Effort Description
Graph database integration 1 month Migrate from in-memory to Neo4j/ArangoDB
Q-PNA encoder integration 2 months Connect to Q-PNA's encoding pipeline
Candidate filtering optimization 1 month Category indexes, height bounds, approximate filtering
Cross-linguistic corpus 2 months Parallel texts in 5+ languages with morphological annotations
Evaluation framework 1 month Precision/recall metrics, benchmark datasets

5.4 Phase 4: Full-Scale (Planned — BACKLOG P3)

Component Effort Description
Production search engine 6 months Distributed index, caching, load balancing
Mohawk morphological analyzer 3-6 months FST-based analyzer trained on annotated data
Neural semantic parser 6-12 months Train or fine-tune parser for common representation
Polysynthetic language corpus 6-12 months Collect and annotate parallel corpora

6. Key Architectural Decisions

6.1 Why Trees, Not General Graphs

Per ADR-0001 (DECISIONS.md). Trees enforce the ultrametric property, which guarantees:

  • Nested result clusters (natural cut points in ranking)
  • Geometric error confinement (the Threshold Principle)
  • $\mathcal{O}(\log n)$ distance computation via LCA
  • Direct connection to the Bruhat-Tits tree (Q-PNA's substrate)

Non-tree semantic relationships (coreference, symmetric relations) are handled by secondary indexing — the primary search operates on the tree.

6.2 Why Category-Level Matching for Cross-Linguistic Search

Per the prototype (0.5.py). Labels are language-specific surface forms. Categories (ENTITY, ACTION, TENSE, etc.) are language-neutral. Cross-linguistic matching operates at the category level. Same-language matching can additionally use label matching for semantic precision. This is the distinction between the check_label parameter in _subtree_matches_node.

6.3 Why Separation of Encoder and Matcher

Q-PNA handles encoding (text → NST). NSG handles matching (NST → results). This separation:

  • Allows independent evolution of both components
  • Enables different encoders for different languages (same NST format)
  • Keeps the search architecture language-agnostic
  • Mirrors the standard separation in information retrieval (indexer vs. query processor)

6.4 Why Offline Indexing + Online Querying

Documents are pre-processed through Stages 2-4 offline. Queries only go through Stages 2-3 online, then Stage 5 matches against the pre-indexed corpus. This provides sub-second query latency because:

  • Morphological analysis of the corpus is done once, not per query
  • Q-PNA encoding of the corpus is done once, not per query
  • LCA preprocessing of the corpus is done once, not per query
  • Only the query graph needs online processing

7. Limitations and Open Challenges

Limitation Severity Mitigation
Morphological analysis for polysynthetic languages HIGH This is the primary bottleneck. No production-ready tools exist for most polysynthetic languages. Requires per-language FST development. Mitigated in prototype by hardcoded trees.
Scalability of subtree isomorphism MEDIUM NP-complete in general but tractable for semantic trees ($b \leq 5$, $\lvert V_q\rvert \leq 10$). Candidate filtering reduces the search space.
Semantic prime ontology MEDIUM No universal semantic prime ontology exists for all languages. Bootstrappable from linguistic typology (Language-Info-Architecture's four mandatory clusters).
Real corpus validation MEDIUM All prototype examples use hand-constructed trees. Real-world corpora will have noise, ambiguity, and parsing errors.
No handling of negation, modality, or quantification LOW These are important for semantic precision but can be added as node categories without changing the architecture.
Q-PNA integration not yet tested HIGH The Q-PNA specification is published but not yet implemented at scale. The NSG prototype assumes the Q-PNA encoder exists. Mitigated by the clean interface contract.

8. Summary

The computational pathway from surface text to ranked search results consists of five stages:

Stage Function Status
1. Input Surface text (any language) Trivial
2. Morphological Analysis Text → morpheme sequence Partial (English solved, Turkish feasible, Mohawk challenging)
3. Q-PNA Encoding Morphemes → Nested Semantic Tree Specified (Q-PNA $\S$3.2), PoC exists (ultrametric-ai-poc)
4. Graph Indexing NST → indexed graph database Specified (0.3.md $\S$7.1), prototype working
5. Query Matching Query NST + indexed corpus → ranked results Specified (0.3.md), prototype working (0.5.py)

The two critical open challenges are:

  1. Morphological analysis for polysynthetic languages (Stage 2 bottlenecks) — requires per-language tools
  2. Q-PNA integration at scale (Stage 3 scaling) — specified but not yet implemented in production

All other stages are either solved (indexing) or demonstrated in prototype (matching, ranking). The architecture is sound, the interfaces are clean, and the mathematical foundation (ultrametric inequality) provides formal guarantees for correctness and performance bounds.


References

  1. 0.1.md — Internal Literature Review
  2. 0.2.md — Formal Definitions
  3. 0.3.md — Sub-Graph Matching Search Specification
  4. 0.4.md — Cross-Linguistic Examples
  5. 0.5.py — Python Prototype: Full Pipeline
  6. Q-PNA Research Specification v2.0 (DOI: 10.5281/zenodo.20287742)
  7. Few Become One (DOI: 10.5281/zenodo.20328374)
  8. Language as Information Architecture (DOI: 10.5281/zenodo.20137616)
  9. Syntactic Token Calculus (Archive: 2026/04/Syntactic Token Calculus/)
  10. ultrametric-ai-poc (GitHub: github.com/rwnq8/ultrametric-ai-poc)

Computational Pathway v0.6 — The full architecture, specified, assessed, and roadmapped.