Morphological Analyzer Survey for Cross-Linguistic NST Parsing

Version: 0.8
Date: 2026-05-22
Status: Research survey; LLM-inferred, requires verification against current tool versions
Depends on: 0.1.md (Literature Review), 0.6.md (Computational Pathway)


1. Purpose

This document surveys existing morphological analysis tools for languages spanning the isolating-to-polysynthetic spectrum. The Nested Semantic Graph architecture requires Stage 2 (Morphological Analysis) to segment surface text into morphemes before Stage 3 (Semantic Parsing) can construct NSTs. This survey assesses what tools exist, what gaps remain, and what would be required to close those gaps.

Caveat: All tool names, version states, and capability claims in this document are [LLM-INFERRED] — based on the author's training-data knowledge. Verification against current documentation and live tool versions is required before any production dependency.


2. The Morphological Spectrum

| Type | Example | Word Structure | Analyzer Complexity |
|---|---|---|---|
| Isolating | English (largely analytic), Mandarin | Each morpheme ≈ one word | Low: whitespace tokenization is nearly sufficient for English; Mandarin is written without spaces and needs a word segmenter first |
| Agglutinative | Turkish, Japanese, Finnish | Morpheme chains with transparent boundaries | Moderate: FST-based segmenters exist |
| Fusional | Latin, Russian, Arabic | Morphemes fused; one affix encodes multiple categories | Hard: requires disambiguation of portmanteau morphemes |
| Polysynthetic | Mohawk, Inuktitut, Cree, Nahuatl | Entire proposition in one word; extensive affixation | Very hard: few tools exist, limited annotated data |

3. Tool Survey by Language

3.1 English (Isolating) — SOLVED

| Tool | Type | Status | Notes |
|---|---|---|---|
| NLTK WordNetLemmatizer | Lemmatizer | Production | [LLM-INFERRED] Maps inflected forms to base forms. Adequate for English. |
| spaCy | NLP pipeline | Production | Includes tokenization, POS tagging, lemmatization, dependency parsing. |
| Stanford CoreNLP | NLP pipeline | Production | Full pipeline including morphological analysis. |

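Assessment: Trivial. English morphological analysis is a solved problem. Whitespace tokenization with punctuation splitting covers most tokens (the ~95% figure is an estimate, not a measured result); lemmatization maps irregular forms back to their base form (bit, bitten → bite).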

NSG integration: English text can be tokenized by any production NLP tool. Morphemes are extracted as base forms. Dependency parsing provides initial edge structure for NST construction.
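The English path can be illustrated with the standard library alone. The following is a minimal sketch, not a replacement for spaCy, NLTK, or CoreNLP: the `IRREGULAR_LEMMAS` table and the function names are hypothetical, and only the bite/bit/bitten example comes from this survey.

```python
import re

# Hypothetical lookup table for irregular forms (assumption for illustration);
# a production lemmatizer would supply this mapping.
IRREGULAR_LEMMAS = {"bit": "bite", "bitten": "bite"}


def tokenize(text: str) -> list[str]:
    """Split on whitespace and strip punctuation at token edges."""
    stripped = (re.sub(r"^\W+|\W+$", "", word) for word in text.split())
    return [token for token in stripped if token]


def lemmatize(token: str) -> str:
    """Return the base form for known irregulars, else the lowercased token."""
    lowered = token.lower()
    return IRREGULAR_LEMMAS.get(lowered, lowered)


def base_forms(text: str) -> list[str]:
    """Tokenize a sentence and map each token to its base form."""
    return [lemmatize(token) for token in tokenize(text)]


print(base_forms("The dog has bitten the man."))
```

Regular inflection (for example "dogs" or "has") is deliberately left untouched here; that is the part a production lemmatizer handles.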

3.2 Turkish (Agglutinative) — SOLVED

| Tool | Type | Status | Notes |
|---|---|---|---|
| Zemberek | FST-based morphological analyzer | Production | [LLM-INFERRED] Open-source. Handles Turkish's complex agglutinative morphology including vowel harmony, consonant mutation, and extensive suffix chains. |
| TRmorph | Finite-state morphology | Academic/Production | [LLM-INFERRED] Open-source finite-state analyzer in the tradition of Koskenniemi's two-level morphology. Handles Turkish suffixation. |
| ITU Turkish NLP | Pipeline | Academic | [LLM-INFERRED] Includes morphological analysis, disambiguation, and dependency parsing. |

Assessment: Solved. Turkish agglutinative morphology is well-studied and has production-ready tools. Suffix chains are transparently segmentable (each suffix typically carries one grammatical function).

NSG integration: Zemberek can segment a Turkish word like "ısırdı" (bite-PAST) into morphemes [ısır, -dı]. These morphemes map directly to NSG nodes: ısır → BITE(ACTION), -dı → PAST(TENSE).
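The morpheme-to-node mapping described above can be sketched as a lookup from segmented morphemes to labelled nodes. This is a hedged illustration only: `NSGNode`, `GLOSSES`, and `morphemes_to_nodes` are hypothetical names, the segmentation is supplied by hand rather than by Zemberek, and a real gloss table would key on abstract suffixes rather than on one surface allomorph such as "-dı".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NSGNode:
    label: str     # semantic label, e.g. "BITE"
    category: str  # node category, e.g. "ACTION" or "TENSE"
    surface: str   # the morpheme as it appeared in the segmented word


# Hypothetical gloss table (assumption): only the two entries from the
# example in this section are included.
GLOSSES = {
    "ısır": ("BITE", "ACTION"),
    "-dı": ("PAST", "TENSE"),
}


def morphemes_to_nodes(morphemes: list[str]) -> list[NSGNode]:
    """Map each segmented morpheme to an NSG node, failing on unknown ones."""
    nodes = []
    for morpheme in morphemes:
        if morpheme not in GLOSSES:
            raise KeyError(f"no gloss for morpheme {morpheme!r}")
        label, category = GLOSSES[morpheme]
        nodes.append(NSGNode(label, category, morpheme))
    return nodes


for node in morphemes_to_nodes(["ısır", "-dı"]):
    print(f"{node.surface} -> {node.label}({node.category})")
```

Raising on an unknown morpheme, rather than skipping it, keeps gaps in the gloss table visible during integration.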

3.3 Finnish (Agglutinative) — SOLVED

| Tool | Type | Status | Notes |
|---|---|---|---|
| Omorfi | FST-based analyzer | Production | [LLM-INFERRED] Open-source. Handles Finnish noun cases (15), verb conjugation, and possessive suffixes. |
| HFST (Helsinki Finite-State Technology) | FST framework | Production | [LLM-INFERRED] General framework for building morphological analyzers. Used for Finnish and other languages. |

Assessment: Solved. Finnish shares Turkish's agglutinative structure and has comparable tool support.

3.4 Inuktitut (Polysynthetic) — RESEARCH-STAGE

| Tool | Type | Status | Notes |
|---|---|---|---|
| Uqailaut | Rule-based morphological analyzer | Research | [LLM-INFERRED] Developed by Benoît Farley at the National Research Council Canada (NRC). Handles Inuktitut's complex polysynthetic verb morphology. Reported ~85% coverage on test corpora (figure unverified). |
| Inuktitut Morphological Analyzer (NRC) | Rule-based | Research | [LLM-INFERRED] National Research Council Canada project; likely the same project as Uqailaut, so verify before counting it as a separate tool. |

Assessment: Feasible but not production-ready. Uqailaut suggests that automated analysis of polysynthetic morphology is possible at roughly 85% coverage. Coverage here means the share of words that receive any analysis, which is not the same as the share analyzed correctly. The remaining 15% involves dialectal variation, irregular forms, and productive compounding that the analyzer's rules do not yet cover.
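When Uqailaut is evaluated, coverage and accuracy are worth measuring separately. The sketch below shows one way to compute both; the function names, the `None`-for-no-analysis convention, and the placeholder tokens `w1` to `w4` are assumptions for illustration and are not Inuktitut data.

```python
from typing import Optional

Analysis = Optional[list[str]]  # None means the analyzer returned nothing


def coverage(analyses: dict[str, Analysis]) -> float:
    """Share of tokens for which the analyzer produced any analysis."""
    analyzed = sum(1 for result in analyses.values() if result is not None)
    return analyzed / len(analyses)


def accuracy(analyses: dict[str, Analysis], gold: dict[str, list[str]]) -> float:
    """Share of gold-annotated tokens whose analysis matches the gold one."""
    correct = sum(1 for token, expected in gold.items()
                  if analyses.get(token) == expected)
    return correct / len(gold)


# Placeholder tokens and morphemes, not real language data.
analyses = {"w1": ["a", "b"], "w2": ["c"], "w3": None, "w4": ["d", "e"]}
gold = {"w1": ["a", "b"], "w2": ["c", "x"], "w3": ["f"], "w4": ["d", "e"]}

print(coverage(analyses))        # 3 of 4 tokens analyzed
print(accuracy(analyses, gold))  # 2 of 4 tokens analyzed correctly
```

In this toy data the analyzer covers 75% of tokens but is correct on only 50%, which is why a coverage figure alone does not establish readiness.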

NSG integration: For a word like Inuululauqsimanngittualuujunga ("I definitely didn't live for a very long time"), Uqailaut would be expected to segment it into morphemes along the lines of [inuu-, -lauq-, -sima-, -nngit-, -tualuu-, -junga]. This segmentation is illustrative and has not been checked against actual analyzer output. Each morpheme maps to an NSG node.

3.5 Cree (Polysynthetic) — RESEARCH-STAGE

| Tool | Type | Status | Notes |
|---|---|---|---|
| Plains Cree FST | FST-based analyzer | Research | [LLM-INFERRED] Developed by the Alberta Language Technology Lab (ALTLab) at the University of Alberta, building on the Giellatekno finite-state infrastructure. Handles Plains Cree verb morphology. |
| Cree Language Project | Resources | Academic | [LLM-INFERRED] Various dictionary and grammatical resources at the University of Alberta. |

Assessment: Similar to Inuktitut — research-stage tools exist but are not production-ready. Cree shares the polysynthetic pattern: verb roots with extensive prefix/suffix chains encoding arguments, tense, aspect, and modality.

3.6 Mohawk (Polysynthetic) — GAP

| Tool | Type | Status | Notes |
|---|---|---|---|
| No known FST | n/a | Does not exist | [LLM-INFERRED] Based on the author's training data, no publicly available morphological analyzer exists for Mohawk. |
| Mohawk Language Resources | Dictionary/grammar | Academic | [LLM-INFERRED] Comprehensive descriptive grammars exist (Mithun, Bonvillain, Michelson). Dictionaries exist. But no computational morphological analyzer. |

Assessment: This is the primary bottleneck for the NSG architecture. No production-ready morphological analyzer exists for Mohawk. However, the linguistic resources needed to build one (descriptive grammars, dictionaries, annotated texts) do exist — the gap is computational, not linguistic.

3.7 Other Polysynthetic Languages

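| Language | Tools | Status |
|---|---|---|
| Nahuatl | [LLM-INFERRED] Classical Nahuatl has academic morphological analyzers. Modern variants less supported. | Mixed |
| Quechua | [LLM-INFERRED] Some computational work (RUNASIMI project). ~8-10 million speakers. Usually classified as agglutinative rather than polysynthetic; listed here because its long suffix chains raise similar tooling questions. | Research |
| Mapudungun | [LLM-INFERRED] Limited computational resources. | Gap |
| Tiwi, Warlpiri (Australian) | [LLM-INFERRED] Minimal computational resources. | Gap |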

4. The FST Approach to Polysynthetic Analysis

4.1 Why Finite State Transducers

Finite State Transducers (FSTs) are the dominant rule-based approach to morphological analysis. An FST maps between two levels:

  • Lexical level: morpheme sequence with grammatical tags (e.g., ısır+V+PAST)
  • Surface level: the actual written/pronounced form (e.g., "ısırdı")

The FST encodes morphological rules as state transitions: vowel harmony, consonant mutation, affix ordering constraints, and morpheme boundary phonology.
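To make the two levels concrete, the sketch below hand-codes the lexical-to-surface mapping for one Turkish suffix, the past tense, whose vowel follows four-way vowel harmony and whose initial d becomes t after a voiceless consonant. It is plain Python rather than a compiled transducer, and the four-root lexicon and the function names are assumptions for illustration; a toolkit such as HFST would compile equivalent rules into a real FST.

```python
# Four-way vowel harmony: the suffix vowel is chosen by the root's last vowel.
HARMONY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
           "o": "u", "u": "u", "ö": "ü", "ü": "ü"}
# After a voiceless consonant the suffix-initial d surfaces as t.
VOICELESS = set("fstkçşhp")
# Toy root lexicon (assumption): a real analyzer loads thousands of roots.
ROOTS = {"ısır", "gel", "koş", "gör"}


def past_suffix(root: str) -> str:
    """Pick the past-tense allomorph (dı/di/du/dü/tı/ti/tu/tü) for a root."""
    last_vowel = next(ch for ch in reversed(root) if ch in HARMONY)
    onset = "t" if root[-1] in VOICELESS else "d"
    return onset + HARMONY[last_vowel]


def generate(lexical: str) -> str:
    """Lexical to surface: 'ısır+V+PAST' becomes 'ısırdı'."""
    root, _, tags = lexical.partition("+")
    if root not in ROOTS or tags != "V+PAST":
        raise ValueError(f"unsupported lexical form: {lexical!r}")
    return root + past_suffix(root)


def analyze(surface: str) -> list[str]:
    """Surface to lexical: return every lexical form that generates `surface`."""
    return [f"{root}+V+PAST" for root in sorted(ROOTS)
            if root + past_suffix(root) == surface]


print(generate("ısır+V+PAST"))
print(analyze("koştu"))
```

Analysis is defined here as the inverse of generation, which mirrors how a transducer is run in either direction over the same rules.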

4.2 What Building a Mohawk FST Requires

| Resource | Status | Effort |
|---|---|---|
| Comprehensive descriptive grammar | Exists (Mithun, Bonvillain) | 0 (already available) |
| Morpheme inventory (roots, prefixes, suffixes) | Exists in grammars/dictionaries | 1-2 months to digitize |
| Morphological rule set (affix ordering, allomorphy) | Described in grammars | 2-3 months to formalize as FST rules |
| Annotated corpus (morpheme-segmented text) | Does not exist | 6-12 months to create with native speaker collaboration |
| Native speaker verification | Requires community partnership | Ongoing |
| FST implementation (HFST toolkit) | Framework exists | 3-6 months development |

Estimated total: 12-24 months for a production-quality Mohawk morphological analyzer, assuming access to native speaker expertise and funding.
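The total can be checked against the line items in the table. The short calculation below sums the bounded tasks under the assumption that they run one after another; the task labels are paraphrased from the table, and the open-ended "Ongoing" verification row is left out.

```python
# (low, high) effort in months, taken from the table above.
EFFORT_MONTHS = {
    "digitize morpheme inventory": (1, 2),
    "formalize rule set as FST rules": (2, 3),
    "create annotated corpus": (6, 12),
    "FST implementation": (3, 6),
}

sequential_low = sum(low for low, _ in EFFORT_MONTHS.values())
sequential_high = sum(high for _, high in EFFORT_MONTHS.values())
# If every task could overlap fully, the longest task would set the floor.
longest_low, longest_high = max(EFFORT_MONTHS.values(), key=lambda r: r[1])

print(f"sequential: {sequential_low}-{sequential_high} months")
print(f"longest single task: {longest_low}-{longest_high} months")
```

The sequential sum is 12-23 months, so the 12-24 month estimate corresponds to doing the work largely in sequence with a small margin; running corpus creation in parallel with rule formalization would shorten it.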

4.3 Bootstrapping Without a Full Analyzer

Until a full FST exists, the NSG can operate with:

  1. Dictionary-based lookup: Parse known verb roots and affix combinations from a digitized dictionary
  2. Rule-based approximation: Use hand-crafted rules for the most common morphological patterns
  3. Human-in-the-loop: Accept manually-segmented text for indexing, with automated analysis for queries
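  4. Cross-linguistic transfer: Use the design of the Inuktitut and Cree analyzers (affix-slot organization, lexicon layout, test methodology) as templates for a Mohawk analyzer. The three languages belong to unrelated families, so the rules themselves cannot be reused; only the polysynthetic pattern and the engineering approach are shared.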
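Options 1 and 3 above can be combined in a few lines: try a dictionary lookup first, and fall back to a manually segmented entry when no full parse exists. The sketch below is an assumption-laden illustration. It uses Turkish forms because this survey has no verified Mohawk morphemes to draw on, it handles only roots followed by suffixes (Mohawk would also need prefix handling), and all names are hypothetical.

```python
from typing import Optional

# Toy lexicon (assumption); real entries would come from a digitized dictionary.
ROOTS = {"ısır", "gel"}
SUFFIXES = {"dı", "di", "m"}


def _split_suffixes(tail: str) -> Optional[list[str]]:
    """Split `tail` into known suffixes, longest match first, or return None."""
    if not tail:
        return []
    for end in range(len(tail), 0, -1):
        if tail[:end] in SUFFIXES:
            rest = _split_suffixes(tail[end:])
            if rest is not None:
                return ["-" + tail[:end]] + rest
    return None


def segment(word: str) -> Optional[list[str]]:
    """Dictionary lookup: longest known root, then a chain of known suffixes."""
    for end in range(len(word), 0, -1):
        if word[:end] in ROOTS:
            suffixes = _split_suffixes(word[end:])
            if suffixes is not None:
                return [word[:end]] + suffixes
    return None


def analyze_with_fallback(word: str, manual: dict[str, list[str]]) -> Optional[list[str]]:
    """Use the automatic parse if one exists, else a human-supplied one."""
    return segment(word) or manual.get(word)


print(segment("ısırdım"))
print(analyze_with_fallback("bilinmeyen", {"bilinmeyen": ["bil", "-in", "-me", "-yen"]}))
```

This lookup does no phonological checking, so it would also accept ill-formed suffix combinations; that is the gap the rule-based approximation in option 2 and, eventually, the full FST are meant to close.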

5. Recommendations

5.1 Immediate (Sprint 2-3)

  1. Integrate Zemberek for Turkish. Among agglutinative languages, Turkish is the easiest to add after English: Zemberek is production-ready and well-documented.
  2. Evaluate Uqailaut for Inuktitut. Test on the Inuktitut example from Few Become One. Assess accuracy.
  3. Document the Mohawk gap explicitly. The NSG architecture works — the bottleneck is per-language tooling, not architectural.

5.2 Medium-Term (Sprint 4-5)

  1. Build a minimal Mohawk FST prototype. Start with the 10 most common verb roots and 5 most common affix patterns. Use HFST.
  2. Create a small annotated Mohawk corpus. 100-200 sentences, manually segmented and verified with a native speaker.
  3. Develop a Turkish NSG pipeline. Connect Zemberek → NSG builder → corpus index.
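The shape of the Turkish pipeline in item 3 can be sketched with a pluggable analyzer interface feeding a morpheme-level index. Everything here is an assumption for illustration: `MorphAnalyzer`, `StubTurkishAnalyzer`, and `build_index` are hypothetical names, the stub returns canned segmentations instead of calling Zemberek, and the NSG builder stage is omitted so that morphemes are indexed directly.

```python
from collections import defaultdict
from typing import Protocol


class MorphAnalyzer(Protocol):
    """Interface every per-language analyzer is assumed to satisfy."""

    def analyze(self, word: str) -> list[str]: ...


class StubTurkishAnalyzer:
    """Stand-in for a Zemberek wrapper; returns canned segmentations."""

    _table = {"ısırdı": ["ısır", "-dı"]}

    def analyze(self, word: str) -> list[str]:
        # Unknown words pass through unsegmented.
        return self._table.get(word, [word])


def build_index(docs: dict[str, str], analyzer: MorphAnalyzer) -> dict[str, set[str]]:
    """Map each morpheme to the set of document ids that contain it."""
    index: defaultdict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.split():
            for morpheme in analyzer.analyze(word):
                index[morpheme].add(doc_id)
    return dict(index)


index = build_index({"d1": "köpek ısırdı"}, StubTurkishAnalyzer())
print(sorted(index))
```

Because `build_index` depends only on the `analyze` method, swapping in an analyzer for another language would not require changes downstream, which is the per-language pluggability this survey argues for.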

5.3 Long-Term (Phase 4)

  1. Full Mohawk FST. Production-quality analyzer with >90% coverage.
  2. Extend to other polysynthetic languages. Nahuatl, Quechua, Cree, Mapudungun — prioritized by speaker population and available resources.
  3. Universal morphological analyzer. A single FST framework that can be configured for any language by providing a morpheme inventory and rule set.

6. Conclusion

The morphological analysis landscape is uneven. English, Turkish, and Finnish are solved. Inuktitut and Cree have research-stage tools. Mohawk has none, as far as this survey could establish. This is the primary bottleneck for the NSG architecture's production deployment, but it is a per-language bottleneck, not an architectural one. Each language that gains a morphological analyzer can be plugged into the same NST pipeline. The tree is invariant; only the parsing path differs.


References

  1. 0.6.md — Computational Pathway (this project)
  2. 0.1.md — Internal Literature Review
  3. Koskenniemi, K. (1983). "Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production." [LLM-INFERRED]
  4. HFST — Helsinki Finite-State Technology. [LLM-INFERRED]
  5. Mithun, M. (1999). "The Languages of Native North America." Cambridge University Press. [LLM-INFERRED]

Morphological Analyzer Survey v0.8 — LLM-inferred tool assessments. All tool names and status claims require verification against current documentation.