Version: 0.8
Date: 2026-05-22
Status: Research survey — LLM-inferred, requires verification against current tool versions
Depends on: 0.1.md (Literature Review), 0.6.md (Computational Pathway)
This document surveys existing morphological analysis tools for languages spanning the isolating-to-polysynthetic spectrum. The Nested Semantic Graph architecture requires Stage 2 (Morphological Analysis) to segment surface text into morphemes before Stage 3 (Semantic Parsing) can construct NSTs. This survey assesses what tools exist, what gaps remain, and what would be required to close those gaps.
Caveat: All tool names, version states, and capability claims in this document are [LLM-INFERRED] — based on the author's training-data knowledge. Verification against current documentation and live tool versions is required before any production dependency.
| Type | Example | Word Structure | Analyzer Complexity |
|---|---|---|---|
| Isolating | English, Mandarin | Each morpheme ≈ one word | Trivial for English, where whitespace tokenization is nearly sufficient; Mandarin is written without spaces and needs a word segmenter first |
| Agglutinative | Turkish, Japanese, Finnish | Morpheme chains with transparent boundaries | Moderate — FST-based segmenters exist |
| Fusional | Latin, Russian, Arabic | Morphemes fused; one affix encodes multiple categories | Hard — requires disambiguation of portmanteau morphemes |
| Polysynthetic | Mohawk, Inuktitut, Cree, Nahuatl | Entire proposition in one word; extensive affixation | Very hard — few tools exist, limited annotated data |
| Tool | Type | Status | Notes |
|---|---|---|---|
| NLTK WordNetLemmatizer | Lemmatizer | Production | [LLM-INFERRED] Maps inflected forms to base forms. Adequate for English. |
| spaCy | NLP pipeline | Production | Includes tokenization, POS tagging, lemmatization, dependency parsing. |
| Stanford CoreNLP | NLP pipeline | Production | Full pipeline including morphological analysis. |
Assessment: Trivial. English morphological analysis is a solved problem. Whitespace and punctuation tokenization covers ~95% of cases [LLM-INFERRED]; lemmatization handles the remainder, including irregular forms (bit, bitten → bite).
NSG integration: English text can be tokenized by any production NLP tool. Morphemes are extracted as base forms. Dependency parsing provides initial edge structure for NST construction.
| Tool | Type | Status | Notes |
|---|---|---|---|
| Zemberek | FST-based morphological analyzer | Production | [LLM-INFERRED] Open-source. Handles Turkish's complex agglutinative morphology including vowel harmony, consonant mutation, and extensive suffix chains. |
| TRmorph | Two-level morphology | Academic/Production | [LLM-INFERRED] Based on Koskenniemi's two-level morphology. Handles Turkish suffixation. |
| ITU Turkish NLP | Pipeline | Academic | [LLM-INFERRED] Includes morphological analysis, disambiguation, and dependency parsing. |
Assessment: Feasible. Turkish agglutinative morphology is well-studied and has production-ready tools. Suffix chains are transparently segmentable (each suffix carries one grammatical function).
NSG integration: Zemberek can segment a Turkish word like "ısırdı" (bite-PAST) into morphemes [ısır, -dı]. These morphemes map directly to NSG nodes: ısır → BITE(ACTION), -dı → PAST(TENSE).
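The morpheme-to-node mapping described above can be sketched as a plain lookup. This is a minimal illustration, not the NSG implementation: the `NSGNode` class, the `MORPHEME_TO_NODE` table, and `morphemes_to_nodes` are hypothetical names, and the segmentation is supplied by hand rather than by calling Zemberek.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSGNode:
    """Hypothetical node type: a concept label plus its category."""
    label: str  # e.g. "BITE"
    kind: str   # e.g. "ACTION" or "TENSE"


# Hypothetical lookup table holding only the two morphemes from the example.
MORPHEME_TO_NODE = {
    "ısır": NSGNode("BITE", "ACTION"),
    "-dı": NSGNode("PAST", "TENSE"),
}


def morphemes_to_nodes(morphemes):
    """Map segmented morphemes to nodes; unknown morphemes stay visible."""
    nodes = []
    for morpheme in morphemes:
        node = MORPHEME_TO_NODE.get(morpheme)
        if node is None:
            # Keep the raw morpheme so gaps in the table are easy to spot.
            node = NSGNode(morpheme, "UNKNOWN")
        nodes.append(node)
    return nodes


nodes = morphemes_to_nodes(["ısır", "-dı"])
print(nodes)
```

Marking unmapped morphemes as `UNKNOWN` instead of dropping them is a design choice for this sketch; it keeps lexicon gaps visible downstream.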
| Tool | Type | Status | Notes |
|---|---|---|---|
| Omorfi | FST-based analyzer | Production | [LLM-INFERRED] Open-source. Handles Finnish noun cases (15), verb conjugation, and possessive suffixes. |
| HFST (Helsinki Finite-State Technology) | FST framework | Production | [LLM-INFERRED] General framework for building morphological analyzers. Used for Finnish and other languages. |
Assessment: Solved. Finnish shares Turkish's agglutinative structure and has comparable tool support.
| Tool | Type | Status | Notes |
|---|---|---|---|
| Uqailaut | FST-based morphological analyzer | Research | [LLM-INFERRED] Developed by Benoît Farley at the National Research Council Canada. Handles Inuktitut's complex polysynthetic verb morphology. Reported ~85% coverage on test corpora. |
| Inuktitut Morphological Analyzer (NRC) | Rule-based | Research | [LLM-INFERRED] National Research Council Canada project. May be the same system as Uqailaut; deduplicate on verification. |
Assessment: Feasible but not production-ready. Uqailaut suggests that FST-based analysis of polysynthetic morphology is possible at a reported ~85% coverage. Coverage is not the same as accuracy: it counts words that receive any analysis, not words that receive the correct one. The remaining ~15% involves dialectal variation, irregular forms, and productive compounding that the FST rules do not yet cover.
NSG integration: For a word like Inuululauqsimanngittualuujunga ("I definitely didn't live for a very long time"), Uqailaut would segment it into morphemes like [inuu-, -lauq-, -sima-, -nngit-, -tualuu-, -junga]. Each morpheme maps to an NSG node.
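Since the assessment above rests on a coverage figure, it may help to pin down how coverage and accuracy would be computed when evaluating an analyzer. The sketch below assumes a hypothetical `analyze(token)` callable that returns a list of candidate segmentations (empty when analysis fails). The tokens are placeholders, not Inuktitut data.

```python
def coverage(tokens, analyze):
    """Fraction of tokens that receive at least one analysis."""
    if not tokens:
        return 0.0
    analyzed = sum(1 for token in tokens if analyze(token))
    return analyzed / len(tokens)


def accuracy(gold, analyze):
    """Fraction of gold tokens whose correct segmentation is among the candidates."""
    if not gold:
        return 0.0
    correct = sum(1 for token, seg in gold.items() if seg in analyze(token))
    return correct / len(gold)


# Placeholder analyzer output: token -> list of candidate segmentations.
TOY_ANALYSES = {
    "w1": [["a", "b"]],
    "w2": [["c"], ["c", "d"]],
    "w3": [["x"]],
}


def toy_analyze(token):
    return TOY_ANALYSES.get(token, [])


# Placeholder gold standard: token -> correct segmentation.
GOLD = {"w1": ["a", "b"], "w2": ["c", "d"], "w3": ["y"], "w4": ["z"]}

print(coverage(list(GOLD), toy_analyze))  # 0.75: w4 gets no analysis
print(accuracy(GOLD, toy_analyze))        # 0.5: w3 is analyzed, but wrongly
```

The toy numbers show why the two metrics should be reported separately: an analyzer can cover a word and still segment it incorrectly.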
| Tool | Type | Status | Notes |
|---|---|---|---|
| Plains Cree FST | FST-based analyzer | Research | [LLM-INFERRED] Developed by the Alberta Language Technology Lab (ALTLab), reportedly on the Giellatekno (GiellaLT) infrastructure. Handles Plains Cree verb morphology. |
| Cree Language Project | Resources | Academic | [LLM-INFERRED] Various dictionary and grammatical resources at the University of Alberta. |
Assessment: Similar to Inuktitut — research-stage tools exist but are not production-ready. Cree shares the polysynthetic pattern: verb roots with extensive prefix/suffix chains encoding arguments, tense, aspect, and modality.
| Tool | Type | Status | Notes |
|---|---|---|---|
| No known FST | — | Does not exist | [LLM-INFERRED] Based on the author's training data, no publicly available morphological analyzer exists for Mohawk. |
| Mohawk Language Resources | Dictionary/grammar | Academic | [LLM-INFERRED] Comprehensive descriptive grammars exist (Mithun, Bonvillain, Michelson). Dictionaries exist. But no computational morphological analyzer. |
Assessment: This is the primary bottleneck for the NSG architecture. No production-ready morphological analyzer exists for Mohawk. However, the linguistic resources needed to build one (descriptive grammars, dictionaries, annotated texts) do exist — the gap is computational, not linguistic.
| Language | Tools | Status |
|---|---|---|
| Nahuatl | [LLM-INFERRED] Classical Nahuatl has academic morphological analyzers. Modern variants less supported. | Mixed |
| Quechua | [LLM-INFERRED] Some computational work (RUNASIMI project). ~8-10 million speakers. | Research |
| Mapudungun | [LLM-INFERRED] Limited computational resources. | Gap |
| Tiwi, Warlpiri (Australian) | [LLM-INFERRED] Minimal computational resources. | Gap |
Finite State Transducers (FSTs) are the dominant approach to morphological analysis. An FST maps between two levels:
- Lexical level: morpheme sequence with grammatical tags (e.g., ısır+V+PAST, "bite" + verb + past)
- Surface level: the actual written/pronounced form (e.g., "ısırdı")
The FST encodes morphological rules as state transitions: vowel harmony, consonant mutation, affix ordering constraints, and morpheme boundary phonology.
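To make the lexical-to-surface mapping concrete, here is a hand-written rule function for one Turkish pattern, the third-person past tense. It is a plain-Python stand-in for the kind of rules an FST encodes, not a compiled transducer and not how Zemberek or TRmorph work internally. The function names are hypothetical. It covers only vowel harmony and d/t voicing assimilation for this one suffix.

```python
# Four-way vowel harmony: the suffix vowel is chosen by the stem's last vowel.
HARMONY = {
    "a": "ı", "ı": "ı",
    "e": "i", "i": "i",
    "o": "u", "u": "u",
    "ö": "ü", "ü": "ü",
}
# After a voiceless consonant the suffix consonant surfaces as "t", else "d".
VOICELESS = set("çfhkpsşt")


def past_tense_surface(stem):
    """Attach the past suffix (-dı/-di/-du/-dü or -tı/-ti/-tu/-tü) to a stem."""
    vowels = [ch for ch in stem if ch in HARMONY]
    if not vowels:
        raise ValueError(f"stem has no vowel: {stem!r}")
    suffix_vowel = HARMONY[vowels[-1]]
    suffix_consonant = "t" if stem[-1] in VOICELESS else "d"
    return stem + suffix_consonant + suffix_vowel


def generate(lexical):
    """Map a lexical string such as 'ısır+V+PAST' to its surface form."""
    stem, *tags = lexical.split("+")
    if tags != ["V", "PAST"]:
        raise ValueError(f"unsupported tag sequence: {tags}")
    return past_tense_surface(stem)


print(generate("ısır+V+PAST"))  # ısırdı
print(generate("git+V+PAST"))   # gitti
print(generate("gör+V+PAST"))   # gördü
```

A real FST runs in both directions (generation and analysis) from one rule set; this sketch only generates, which is the simpler half.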
| Resource | Status | Effort |
|---|---|---|
| Comprehensive descriptive grammar | Exists (Mithun, Bonvillain) | 0 — already available |
| Morpheme inventory (roots, prefixes, suffixes) | Exists in grammars/dictionaries | 1-2 months to digitize |
| Morphological rule set (affix ordering, allomorphy) | Described in grammars | 2-3 months to formalize as FST rules |
| Annotated corpus (morpheme-segmented text) | Does not exist | 6-12 months to create with native speaker collaboration |
| Native speaker verification | Requires community partnership | Ongoing |
| FST implementation (HFST toolkit) | Framework exists | 3-6 months development |
Estimated total: 12-24 months for a production-quality Mohawk morphological analyzer, assuming access to native speaker expertise and funding.
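As a sanity check on the estimate, the per-task ranges from the table can be summed. The sketch assumes the "Ongoing" and zero-effort rows contribute no fixed duration, and it compares a fully sequential schedule against a fully parallel one. Both are idealizations.

```python
# (task, min_months, max_months), copied from the effort table above.
EFFORT = [
    ("Digitize morpheme inventory", 1, 2),
    ("Formalize rule set as FST rules", 2, 3),
    ("Create annotated corpus", 6, 12),
    ("FST implementation (HFST)", 3, 6),
]

# Sequential schedule: every task waits for the previous one.
sequential = (sum(lo for _, lo, _ in EFFORT), sum(hi for _, _, hi in EFFORT))

# Fully parallel schedule: bounded by the longest single task.
parallel = (max(lo for _, lo, _ in EFFORT), max(hi for _, _, hi in EFFORT))

print(sequential)  # (12, 23)
print(parallel)    # (6, 12)
```

The sequential sum of 12 to 23 months is close to the stated 12-24 month total, which suggests the estimate assumes little overlap between tasks. Running corpus annotation in parallel with FST development could shorten it.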
Until a full FST exists, the NSG can operate with:
- Dictionary-based lookup: Parse known verb roots and affix combinations from a digitized dictionary
- Rule-based approximation: Use hand-crafted rules for the most common morphological patterns
- Human-in-the-loop: Accept manually-segmented text for indexing, with automated analysis for queries
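- Cross-linguistic transfer: Use the rule architecture of the Inuktitut and Cree analyzers as a template for Mohawk. The three languages belong to unrelated families, so only the general polysynthetic design (affix-slot ordering, boundary phonology handling) transfers, not the morphemes or rules themselves.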
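The first and third fallbacks above (dictionary-based lookup and human-in-the-loop) can be combined in a few lines. The sketch below is illustrative only: the function names are hypothetical, and because this document contains no Mohawk data, the tiny lexicon reuses the Turkish example as a stand-in.

```python
# Stand-in lexicon (Turkish, from the example earlier in this survey).
ROOTS = {"ısır": "BITE"}
SUFFIXES = {"dı": "PAST"}


def segment(word, roots=ROOTS, suffixes=SUFFIXES):
    """Return (root, [suffixes]) via longest-match lookup, or None on failure."""
    by_length = sorted(suffixes, key=len, reverse=True)
    for end in range(len(word), 0, -1):  # try the longest root first
        root = word[:end]
        if root not in roots:
            continue
        rest, found = word[end:], []
        while rest:
            match = next((s for s in by_length if rest.startswith(s)), None)
            if match is None:
                break
            found.append(match)
            rest = rest[len(match):]
        if not rest:  # the whole word was consumed
            return root, found
    return None


def analyze_or_queue(word, review_queue):
    """Use the dictionary when possible; otherwise queue for manual segmentation."""
    result = segment(word)
    if result is None:
        review_queue.append(word)
    return result


queue = []
print(analyze_or_queue("ısırdı", queue))  # ('ısır', ['dı'])
print(analyze_or_queue("xyz", queue))     # None
print(queue)                              # ['xyz']
```

Greedy longest-match lookup ignores allomorphy and affix-order constraints, so it should be read as a stopgap until a rule-based analyzer exists.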
- Integrate Zemberek for Turkish. Turkish is the easiest agglutinative language to add after English. Zemberek is production-ready and well-documented.
- Evaluate Uqailaut for Inuktitut. Test on the Inuktitut example from Few Become One. Assess accuracy.
- Document the Mohawk gap explicitly. The NSG architecture works — the bottleneck is per-language tooling, not architectural.
- Build a minimal Mohawk FST prototype. Start with the 10 most common verb roots and 5 most common affix patterns. Use HFST.
- Create a small annotated Mohawk corpus. 100-200 sentences, manually segmented and verified with a native speaker.
- Develop a Turkish NSG pipeline. Connect Zemberek → NSG builder → corpus index.
- Full Mohawk FST. Production-quality analyzer with >90% coverage.
- Extend to other polysynthetic languages. Nahuatl, Quechua, Cree, Mapudungun — prioritized by speaker population and available resources.
- Universal morphological analyzer. A single FST framework that can be configured for any language by providing a morpheme inventory and rule set.
The morphological analysis landscape is uneven. English, Turkish, and Finnish are solved. Inuktitut and Cree have research-stage tools. Mohawk has none. This is the primary bottleneck for the NSG architecture's production deployment, but it is a per-language bottleneck, not an architectural one. Each language that gains a morphological analyzer can be plugged into the same NST pipeline. The tree is invariant; only the parsing path differs.
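One way to read "plugged into the same NST pipeline" is a per-language analyzer registry in front of a shared downstream stage. The sketch below is an assumption about how that could look, not the project's design. The registry, the decorator, and both analyzers are hypothetical, and the Turkish analyzer is a hard-coded stand-in for a call to a real tool.

```python
from typing import Callable, Dict, List

Analyzer = Callable[[str], List[str]]
ANALYZERS: Dict[str, Analyzer] = {}


def register(lang: str):
    """Decorator that registers a morphological analyzer for a language code."""
    def decorator(fn: Analyzer) -> Analyzer:
        ANALYZERS[lang] = fn
        return fn
    return decorator


@register("en")
def analyze_english(text: str) -> List[str]:
    # Whitespace tokenization as a placeholder for a production NLP tool.
    return text.lower().split()


@register("tr")
def analyze_turkish(text: str) -> List[str]:
    # Hard-coded stand-in for an external analyzer such as Zemberek.
    table = {"ısırdı": ["ısır", "-dı"]}
    morphemes: List[str] = []
    for token in text.split():
        morphemes.extend(table.get(token, [token]))
    return morphemes


def morphemes_for(lang: str, text: str) -> List[str]:
    """Stage 2 entry point: the only language-specific step in the pipeline."""
    if lang not in ANALYZERS:
        raise LookupError(f"no morphological analyzer registered for {lang!r}")
    return ANALYZERS[lang](text)


print(morphemes_for("tr", "ısırdı"))  # ['ısır', '-dı']
```

Under this reading, the Mohawk gap shows up as a missing registry entry: `morphemes_for("moh", ...)` raises `LookupError` until an analyzer is registered, while everything after Stage 2 stays unchanged.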
- 0.6.md: Computational Pathway (this project)
- 0.1.md: Internal Literature Review
- Koskenniemi, K. (1983). "Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production." [LLM-INFERRED]
- HFST: Helsinki Finite-State Technology. [LLM-INFERRED]
- Mithun, M. (1999). "The Languages of Native North America." Cambridge University Press. [LLM-INFERRED]
Morphological Analyzer Survey v0.8 — LLM-inferred tool assessments. All tool names and status claims require verification against current documentation.