Morphological Analyzer Survey for Cross-Linguistic NST Parsing

Version: 0.8
Date: 2026-05-22
Status: Research survey; LLM-inferred, requires verification against current tool versions
Depends on: 0.1.md (Literature Review), 0.6.md (Computational Pathway)

1. Purpose

This document surveys existing morphological analysis tools for languages spanning the isolating-to-polysynthetic spectrum. The Nested Semantic Graph (NSG) architecture requires Stage 2 (Morphological Analysis) to segment surface text into morphemes before Stage 3 (Semantic Parsing) can construct NSTs. This survey assesses what tools exist, what gaps remain, and what would be required to close those gaps.

Caveat: All tool names, version states, and capability claims in this document are [LLM-INFERRED] — based on the author's training-data knowledge. Verification against current documentation and live tool versions is required before any production dependency.

2. The Morphological Spectrum

Type	Example	Word Structure	Analyzer Complexity
Isolating	English (largely analytic), Mandarin	Each morpheme ≈ one word	Low: whitespace tokenization is nearly sufficient for English; Mandarin is written without spaces and needs a word segmenter first
Agglutinative	Turkish, Japanese, Finnish	Morpheme chains with transparent boundaries	Moderate — FST-based segmenters exist
Fusional	Latin, Russian, Arabic	Morphemes fused; one affix encodes multiple categories	Hard — requires disambiguation of portmanteau morphemes
Polysynthetic	Mohawk, Inuktitut, Cree, Nahuatl	Entire proposition in one word; extensive affixation	Very hard — few tools exist, limited annotated data

3. Tool Survey by Language

3.1 English (Isolating) — SOLVED

Tool	Type	Status	Notes
NLTK WordNetLemmatizer	Lemmatizer	Production	`[LLM-INFERRED]` Maps inflected forms to base forms. Adequate for English.
spaCy	NLP pipeline	Production	Includes tokenization, POS tagging, lemmatization, dependency parsing.
Stanford CoreNLP	NLP pipeline	Production	Full pipeline including morphological analysis.

Assessment: Trivial. English morphological analysis is treated as a solved problem. Whitespace tokenization covers an estimated ~95% of cases `[LLM-INFERRED]`; lemmatization maps irregular inflected forms back to their base form (bit, bitten → bite).

NSG integration: English text can be tokenized by any production NLP tool. Morphemes are extracted as base forms. Dependency parsing provides initial edge structure for NST construction.
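The base-form extraction step can be sketched without committing to a particular NLP library. The snippet below is a minimal illustration, assuming a hand-written irregular-form table and a single naive plural rule; `IRREGULAR`, `tokenize`, and `base_forms` are hypothetical names, and a production pipeline would delegate this to one of the tools listed above.

```python
import re

# Hypothetical irregular-form table; a real lemmatizer supplies far more entries.
IRREGULAR = {"bit": "bite", "bitten": "bite", "men": "man"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def base_form(token):
    """Map one token to a base form using the table, then a naive plural rule."""
    if token in IRREGULAR:
        return IRREGULAR[token]
    # Naive assumption: a trailing single "s" on a longer word marks a plural.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def base_forms(text):
    """Tokenize a sentence and return the base form of every token."""
    return [base_form(tok) for tok in tokenize(text)]


print(base_forms("The dogs bit the men."))
# ['the', 'dog', 'bite', 'the', 'man']
```

The lookup table carries the irregular cases and the suffix rule carries the regular ones, which is the same division of labor the assessment above describes.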

3.2 Turkish (Agglutinative) — SOLVED

Tool	Type	Status	Notes
Zemberek	FST-based morphological analyzer	Production	`[LLM-INFERRED]` Open-source. Handles Turkish's complex agglutinative morphology including vowel harmony, consonant mutation, and extensive suffix chains.
TRmorph	Two-level morphology	Academic/Production	`[LLM-INFERRED]` Based on Koskenniemi's two-level morphology. Handles Turkish suffixation.
ITU Turkish NLP	Pipeline	Academic	`[LLM-INFERRED]` Includes morphological analysis, disambiguation, and dependency parsing.

Assessment: Solved. Turkish agglutinative morphology is well-studied and has production-ready tools. Suffix chains are transparently segmentable (each suffix typically carries one grammatical function).

NSG integration: Zemberek can segment a Turkish word like "ısırdı" (bite-PAST) into morphemes [ısır, -dı]. These morphemes map directly to NSG nodes: ısır → BITE(ACTION), -dı → PAST(TENSE).
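The morpheme-to-node mapping described above can be sketched as a table lookup. This is an illustration only: `NSGNode`, `GLOSSES`, and `morphemes_to_nodes` are hypothetical names, the gloss table covers just the one example from the text, and the segmentation is passed in by hand rather than obtained from Zemberek.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSGNode:
    """Hypothetical node type: a semantic label plus its category."""
    label: str
    category: str


# Hypothetical gloss table containing only the example from the text.
GLOSSES = {
    "ısır": NSGNode("BITE", "ACTION"),
    "-dı": NSGNode("PAST", "TENSE"),
}


def morphemes_to_nodes(morphemes):
    """Map an already-segmented word to NSG nodes, failing loudly on unknowns."""
    nodes = []
    for morpheme in morphemes:
        if morpheme not in GLOSSES:
            raise KeyError(f"no gloss for morpheme {morpheme!r}")
        nodes.append(GLOSSES[morpheme])
    return nodes


for node in morphemes_to_nodes(["ısır", "-dı"]):
    print(f"{node.label}({node.category})")
# BITE(ACTION)
# PAST(TENSE)
```

Raising on an unknown morpheme, rather than skipping it, keeps gaps in the gloss table visible during early integration.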

3.3 Finnish (Agglutinative) — SOLVED

Tool	Type	Status	Notes
Omorfi	FST-based analyzer	Production	`[LLM-INFERRED]` Open-source. Handles Finnish noun cases (15), verb conjugation, and possessive suffixes.
HFST (Helsinki Finite-State Technology)	FST framework	Production	`[LLM-INFERRED]` General framework for building morphological analyzers. Used for Finnish and other languages.

Assessment: Solved. Finnish shares Turkish's agglutinative structure and has comparable tool support.

3.4 Inuktitut (Polysynthetic) — RESEARCH-STAGE

Tool	Type	Status	Notes
Uqailaut	FST-based morphological analyzer	Research	`[LLM-INFERRED]` Developed by Benoît Farley at the National Research Council of Canada (NRC). Handles Inuktitut's complex polysynthetic verb morphology. Reported ~85% coverage on test corpora.
Inuktitut Morphological Analyzer (NRC)	Rule-based	Research	`[LLM-INFERRED]` National Research Council Canada project; possibly the same system as Uqailaut, to be confirmed during verification.

Assessment: Feasible but not production-ready. Uqailaut suggests that FST-based analysis of polysynthetic morphology is possible at roughly 85% coverage. Coverage here means the share of words that receive any analysis, which is an upper bound on accuracy. The uncovered ~15% involves dialectal variation, irregular forms, and productive compounding that the rules do not yet handle.
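Because coverage and accuracy are easy to conflate when evaluating an analyzer, the two measures can be sketched side by side. The snippet assumes an analyzer exposed as a function from a token to a list of candidate segmentations; `coverage`, `accuracy`, and the placeholder tokens `w1` to `w4` are hypothetical and do not represent Inuktitut data or Uqailaut's interface.

```python
def coverage(tokens, analyze):
    """Share of tokens for which the analyzer returns at least one analysis."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if analyze(tok)) / len(tokens)


def accuracy(tokens, analyze, gold):
    """Share of tokens whose gold segmentation is among the returned analyses."""
    if not tokens:
        return 0.0
    return sum(1 for tok in tokens if gold[tok] in analyze(tok)) / len(tokens)


# Stub analyzer over placeholder tokens (not real language data).
LEXICON = {
    "w1": [("a", "b")],
    "w2": [("c",)],
    "w3": [("d", "e")],
}


def stub_analyze(token):
    return LEXICON.get(token, [])


TOKENS = ["w1", "w2", "w3", "w4"]
GOLD = {"w1": ("a", "b"), "w2": ("x",), "w3": ("d", "e"), "w4": ("f",)}

print(coverage(TOKENS, stub_analyze))        # 0.75
print(accuracy(TOKENS, stub_analyze, GOLD))  # 0.5
```

In this toy run the analyzer covers three of four tokens but is correct on only two, which is why a coverage figure alone should not be read as an accuracy figure.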

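NSG integration: For a word like Inuululauqsimanngittualuujunga ("I definitely didn't live for a very long time"), Uqailaut would be expected to return a segmentation along the lines of [inuu-, -lauq-, -sima-, -nngit-, -tualuu-, -junga], with each morpheme mapping to an NSG node. The listed morphemes do not account for every segment of the surface form, so both the segmentation and the gloss should be checked against the analyzer's actual output.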

3.5 Cree (Polysynthetic) — RESEARCH-STAGE

Tool	Type	Status	Notes
Plains Cree FST	FST-based analyzer	Research	`[LLM-INFERRED]` Developed by the Alberta Language Technology Lab (ALTLab) at the University of Alberta, building on the Giellatekno infrastructure. Handles Plains Cree verb morphology.
Cree Language Project	Resources	Academic	`[LLM-INFERRED]` Various dictionary and grammatical resources at the University of Alberta.

Assessment: Similar to Inuktitut — research-stage tools exist but are not production-ready. Cree shares the polysynthetic pattern: verb roots with extensive prefix/suffix chains encoding arguments, tense, aspect, and modality.

3.6 Mohawk (Polysynthetic) — GAP

Tool	Type	Status	Notes
No known FST	—	Does not exist	`[LLM-INFERRED]` Based on the author's training data, no publicly available morphological analyzer exists for Mohawk.
Mohawk Language Resources	Dictionary/grammar	Academic	`[LLM-INFERRED]` Comprehensive descriptive grammars exist (Mithun, Bonvillain, Michelson). Dictionaries exist. But no computational morphological analyzer.

Assessment: This is the primary bottleneck for the NSG architecture. No production-ready morphological analyzer exists for Mohawk. However, the linguistic resources needed to build one (descriptive grammars, dictionaries, annotated texts) do exist — the gap is computational, not linguistic.

3.7 Other Polysynthetic Languages

Language	Tools	Status
Nahuatl	`[LLM-INFERRED]` Classical Nahuatl has academic morphological analyzers. Modern variants less supported.	Mixed
Quechua	`[LLM-INFERRED]` Some computational work (RUNASIMI project). ~8-10 million speakers.	Research
Mapudungun	`[LLM-INFERRED]` Limited computational resources.	Gap
Tiwi, Warlpiri (Australian)	`[LLM-INFERRED]` Minimal computational resources.	Gap

4. The FST Approach to Polysynthetic Analysis

4.1 Why Finite State Transducers

Finite State Transducers (FSTs) are the dominant rule-based approach to morphological analysis, particularly for morphologically rich languages. An FST maps between two levels:

Lexical level: morpheme sequence with grammatical tags (e.g., ısır+V+PAST, glossed as bite+V+PAST)
Surface level: the actual written/pronounced form (e.g., "ısırdı")

The FST encodes morphological rules as state transitions: vowel harmony, consonant mutation, affix ordering constraints, and morpheme boundary phonology.
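The two-level mapping can be illustrated with the Turkish past-tense suffix. The sketch below is a plain-Python stand-in for the lexical-to-surface relation, not an actual finite-state transducer and not the implementation used by any tool in this survey. It assumes lowercase input, handles only the third-person past (no person suffixes), and uses the hypothetical names `past_suffix`, `generate`, and `analyze`.

```python
from typing import List

# Vowel harmony: the suffix vowel is chosen by the last vowel of the stem.
HARMONY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
           "o": "u", "u": "u", "ö": "ü", "ü": "ü"}
# Consonant assimilation: the suffix starts with "t" after a voiceless consonant.
VOICELESS = set("çfhkpsşt")


def past_suffix(stem: str) -> str:
    """Pick the surface form of the past suffix for a given stem."""
    vowels = [ch for ch in stem if ch in HARMONY]
    if not vowels:
        raise ValueError(f"stem {stem!r} has no vowel to harmonize with")
    onset = "t" if stem[-1] in VOICELESS else "d"
    return onset + HARMONY[vowels[-1]]


def generate(lexical: str) -> str:
    """Lexical to surface, e.g. 'ısır+V+PAST' to 'ısırdı'."""
    stem, *tags = lexical.split("+")
    if tags != ["V", "PAST"]:
        raise ValueError(f"unsupported tag sequence in {lexical!r}")
    return stem + past_suffix(stem)


def analyze(surface: str) -> List[str]:
    """Surface to lexical: strip a two-letter suffix and check that it regenerates."""
    stem, suffix = surface[:-2], surface[-2:]
    if any(ch in HARMONY for ch in stem) and past_suffix(stem) == suffix:
        return [f"{stem}+V+PAST"]
    return []


print(generate("ısır+V+PAST"))  # ısırdı
print(generate("git+V+PAST"))   # gitti
print(analyze("ısırdı"))        # ['ısır+V+PAST']
```

Running the same rules in both directions is the property that makes transducers attractive here. Note that `analyze` has no lexicon, so it would also accept a non-verb that happens to end in a harmonizing suffix; a real analyzer constrains the stem against a root inventory.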

4.2 What Building a Mohawk FST Requires

Resource	Status	Effort
Comprehensive descriptive grammar	Exists (Mithun, Bonvillain)	0 — already available
Morpheme inventory (roots, prefixes, suffixes)	Exists in grammars/dictionaries	1-2 months to digitize
Morphological rule set (affix ordering, allomorphy)	Described in grammars	2-3 months to formalize as FST rules
Annotated corpus (morpheme-segmented text)	Does not exist	6-12 months to create with native speaker collaboration
Native speaker verification	Requires community partnership	Ongoing
FST implementation (HFST toolkit)	Framework exists	3-6 months development

Estimated total: 12-24 months for a production-quality Mohawk morphological analyzer, assuming access to native speaker expertise and funding.
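As a quick check on the estimate, the bounded rows of the table can be summed under the assumption that the work is done strictly in sequence. The task labels below are shortened from the table; the rows with zero or open-ended effort (existing grammar, ongoing verification) are left out, and `sequential_total` is a hypothetical helper.

```python
# (low, high) effort in months, taken from the bounded rows of the table above.
EFFORT_MONTHS = {
    "digitize morpheme inventory": (1, 2),
    "formalize rule set": (2, 3),
    "create annotated corpus": (6, 12),
    "implement FST": (3, 6),
}


def sequential_total(efforts):
    """Sum the low and high bounds, assuming no tasks overlap."""
    low = sum(lo for lo, _ in efforts.values())
    high = sum(hi for _, hi in efforts.values())
    return low, high


print(sequential_total(EFFORT_MONTHS))  # (12, 23)
```

The sequential sum of 12 to 23 months sits inside the stated 12-24 month range; any overlap between corpus creation and FST development would shorten it.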

4.3 Bootstrapping Without a Full Analyzer

Until a full FST exists, the NSG can operate with:

Dictionary-based lookup: Parse known verb roots and affix combinations from a digitized dictionary
Rule-based approximation: Use hand-crafted rules for the most common morphological patterns
Human-in-the-loop: Accept manually-segmented text for indexing, with automated analysis for queries
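Cross-linguistic transfer: Use the design of the Inuktitut and Cree analyzers (lexicon layout, affix-slot templates, test methodology) as templates for Mohawk. The three languages belong to unrelated families, so morpheme inventories and rules do not transfer directly; only the general polysynthetic architecture is shared.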
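The first and third options above can be combined in a small sketch: segment a word against digitized root and suffix inventories, and queue anything that cannot be segmented for manual review. This is an assumption-laden illustration. It handles only a root followed by suffixes (no prefixes, no allomorphy), `segment` and `segment_or_queue` are hypothetical names, and since the source contains no verified Mohawk forms the demo reuses the Turkish example from section 3.2.

```python
def segment(word, roots, suffixes):
    """Return every split of `word` into one known root plus known suffixes."""
    results = []

    def expand(rest, parts):
        # A complete analysis is found once the whole word is consumed.
        if not rest:
            results.append(parts)
            return
        for suffix in sorted(suffixes):
            if suffix and rest.startswith(suffix):
                expand(rest[len(suffix):], parts + ["-" + suffix])

    for root in sorted(roots):
        if root and word.startswith(root):
            expand(word[len(root):], [root])
    return results


def segment_or_queue(word, roots, suffixes, review_queue):
    """Segment automatically; fall back to the human-in-the-loop queue."""
    analyses = segment(word, roots, suffixes)
    if not analyses:
        review_queue.append(word)
    return analyses


ROOTS = {"ısır"}     # stand-in for a digitized root inventory
SUFFIXES = {"dı"}    # stand-in for a digitized suffix inventory
queue = []

print(segment_or_queue("ısırdı", ROOTS, SUFFIXES, queue))  # [['ısır', '-dı']]
print(segment_or_queue("xyz", ROOTS, SUFFIXES, queue))     # []
print(queue)                                               # ['xyz']
```

Returning all candidate splits instead of the first match keeps ambiguity visible, so a later disambiguation step or a human reviewer can choose among them.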

5. Recommendations

5.1 Immediate (Sprint 2-3)

Integrate Zemberek for Turkish. Turkish is the easiest agglutinative language to add after English. Zemberek is production-ready and well-documented.
Evaluate Uqailaut for Inuktitut. Test on the Inuktitut example from Few Become One. Assess accuracy.
Document the Mohawk gap explicitly. The NSG architecture works — the bottleneck is per-language tooling, not architectural.

5.2 Medium-Term (Sprint 4-5)

Build a minimal Mohawk FST prototype. Start with the 10 most common verb roots and 5 most common affix patterns. Use HFST.
Create a small annotated Mohawk corpus. 100-200 sentences, manually segmented and verified with a native speaker.
Develop a Turkish NSG pipeline. Connect Zemberek → NSG builder → corpus index.
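The analyzer → NSG builder → corpus index chain can be sketched with the analyzer passed in as a plain function, so that a different language only swaps that one argument. Everything here is a placeholder under stated assumptions: `stub_turkish_analyzer` stands in for a real Zemberek call, `build_nsg` builds a trivial linear graph instead of a real NST, and `build_index` is a simple inverted index from morpheme to document IDs.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Set

# An analyzer is any function from a surface word to a list of morphemes.
Analyzer = Callable[[str], List[str]]


def stub_turkish_analyzer(word: str) -> List[str]:
    """Placeholder for a real analyzer; knows only the example from the text."""
    table = {"ısırdı": ["ısır", "-dı"]}
    return table.get(word, [word])


def build_nsg(morphemes: List[str]) -> dict:
    """Placeholder builder: nodes are morphemes, edges link adjacent morphemes."""
    return {"nodes": list(morphemes), "edges": list(zip(morphemes, morphemes[1:]))}


def build_index(corpus: Dict[str, str], analyzer: Analyzer) -> Dict[str, Set[str]]:
    """Run analyzer and builder over each document and index nodes by document ID."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in text.split():
            graph = build_nsg(analyzer(word))
            for node in graph["nodes"]:
                index[node].add(doc_id)
    return dict(index)


index = build_index({"doc1": "ısırdı"}, stub_turkish_analyzer)
print(sorted(index))   # ['-dı', 'ısır']
print(index["ısır"])   # {'doc1'}
```

Keeping the analyzer behind a single function signature is one way to make the pipeline language-agnostic: the builder and index never need to know which tool produced the morphemes.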

5.3 Long-Term (Phase 4)

Full Mohawk FST. Production-quality analyzer with >90% coverage.
Extend to other polysynthetic languages. Nahuatl, Quechua, Cree, Mapudungun — prioritized by speaker population and available resources.
Universal morphological analyzer. A single FST framework that can be configured for any language by providing a morpheme inventory and rule set.

6. Conclusion

The morphological analysis landscape is uneven. English, Turkish, and Finnish are treated as solved. Inuktitut and Cree have research-stage tools. Mohawk has none. This is the primary bottleneck for the NSG architecture's production deployment, but it is a per-language bottleneck, not an architectural one. Each language that gains a morphological analyzer can be plugged into the same NST pipeline. The tree is invariant; only the parsing path differs.

References

0.6.md — Computational Pathway (this project)
0.1.md — Internal Literature Review
Koskenniemi, K. (1983). "Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production." [LLM-INFERRED]
HFST — Helsinki Finite-State Technology. [LLM-INFERRED]
Mithun, M. (1999). "The Languages of Native North America." Cambridge University Press. [LLM-INFERRED]

Morphological Analyzer Survey v0.8 — LLM-inferred tool assessments. All tool names and status claims require verification against current documentation.
