A rule-based Natural Language Interpreter for the Igbo language, built in Java. This project performs lexical analysis, POS tagging, syntax validation, and phonetic syllabification on Igbo sentences.
Built as a term paper project for CSC 331 (Programming Principles and Paradigms) at the University of Ibadan.
- Lexical Analysis: tokenizes input sentences and tags each token with its part of speech (POS) using a lexicon derived from the Masakhane Igbo dataset
- Diacritic-Insensitive Lookup: users can type words without diacritics (e.g. `gini` instead of `gịnị`) and the system resolves them correctly
- Syntax Validation: checks sentences against the common Igbo sentence patterns `V`, `V-O`, `S-V`, `S-V-O`, `S-V-C`, and `S-V-O-C`
- Phonetic Syllabification: breaks each token into syllables based on Igbo phonological rules, displayed using the original diacritic form from the lexicon
- Error Reporting: reports lexical errors (unknown words) and syntax errors (invalid sentence structure) immediately on detection
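
As an illustration of the diacritic-insensitive lookup listed above, here is a minimal sketch of one common way to strip diacritics in Java: decompose the text with `java.text.Normalizer` and drop the combining marks. The class and method names are hypothetical, and the project's own `DiacriticUtil` may work differently.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper; not the project's actual DiacriticUtil.
class DiacriticSketch {

    // Returns a lowercase, diacritic-free key suitable for lexicon lookup.
    static String strip(String text) {
        // NFD splits a character such as U+1ECB (i with dot below)
        // into the base letter "i" plus a combining mark.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{M} matches the combining marks left behind by the decomposition.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }
}
```

Indexing the lexicon by `strip(word)` while storing the original spelling as the value would let a plain-ASCII query such as `gini` find the diacritic form.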
- Java 21
- Maven
- Jackson Databind 2.21.3
- Lombok 1.18.46
igbo-nlp/
├── src/
│   ├── main/
│   │   ├── java/com/kenneth/
│   │   │   ├── Main.java                  # Entry point
│   │   │   ├── repository/
│   │   │   │   └── Lexicon.java           # Loads and queries the lexicon
│   │   │   ├── wordProcessor/
│   │   │   │   └── Word.java              # Model class for lexicon entries
│   │   │   └── utils/
│   │   │       ├── Tokenizer.java         # Splits sentence into tokens
│   │   │       ├── Lexer.java             # POS tags each token
│   │   │       ├── Parser.java            # Validates sentence structure
│   │   │       ├── PhoneticEngine.java    # Syllabifies tokens
│   │   │       └── DiacriticUtil.java     # Strips diacritics for lookup
│   │   └── resources/
│   │       └── masakhane.json             # Igbo lexicon dataset
- Java 21
- Maven 3.x
git clone https://github.com/kenndo127/IgboNLP.git
cd IgboNLP
mvn clean install
mvn exec:java -Dexec.mainClass="com.kenneth.Main"

Or run `Main.java` directly from IntelliJ IDEA.
The program accepts a sentence of up to 5 words typed in Igbo. Diacritics are optional as the system normalizes input before lookup.
Enter a sentence in Igbo: gini bu aha gi?
Output:
Tokens => [gini, bu, aha, gi]
Token : POS
gini : [PRON]
bu : [VERB, AUX]
aha : [NOUN]
gi : [PRON]
Valid: [S-V-O-C] -> Subject-Verb-Object-Complement
Phonetics => [Gị-nị, bu, a-ha, gị]
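
The `Tokens =>` line in the output above suggests that sentence punctuation is dropped and the input is split on whitespace. The sketch below shows one way that step could look, including the 5-token limit. The class name, the exact punctuation set, and the exception type are assumptions and are not taken from the project's `Tokenizer`.

```java
import java.util.List;

// Hypothetical tokenizer sketch; the real Tokenizer.java may differ.
class TokenizerSketch {

    static final int MAX_TOKENS = 5; // limit stated in this README

    static List<String> tokenize(String sentence) {
        // Replace common sentence punctuation with spaces (assumed set).
        String cleaned = sentence.replaceAll("[.,!?;:]+", " ").trim();
        if (cleaned.isEmpty()) {
            return List.of();
        }
        // Split on runs of whitespace.
        List<String> tokens = List.of(cleaned.split("\\s+"));
        if (tokens.size() > MAX_TOKENS) {
            throw new IllegalArgumentException(
                "Sentence has " + tokens.size() + " tokens; the maximum is " + MAX_TOKENS);
        }
        return tokens;
    }
}
```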
Lexical Error — when a word does not exist in the lexicon:
Enter a sentence in Igbo: aha m bu chiamaka
Lexical Error: chiamaka does not exist in the lexicon
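
A lexical error of this kind can be raised as soon as a token is missing from the lexicon. The following is a rough sketch under the assumption that the lexicon is available as a map from normalized word to its POS tags. The class name, the map shape, and the use of an exception are hypothetical; only the message format is copied from the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical lexer sketch; not the project's actual Lexer.java.
class LexerSketch {

    // Returns the POS tag list of each token, in token order.
    // Stops at the first unknown token, mirroring "immediately on detection".
    static List<List<String>> tag(List<String> tokens, Map<String, List<String>> lexicon) {
        List<List<String>> tagged = new ArrayList<>();
        for (String token : tokens) {
            List<String> tags = lexicon.get(token);
            if (tags == null) {
                throw new IllegalArgumentException(
                    "Lexical Error: " + token + " does not exist in the lexicon");
            }
            tagged.add(tags);
        }
        return tagged;
    }
}
```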
Syntax Error — when the sentence does not match any valid pattern:
Enter a sentence in Igbo: nwoke abali ututu bia
Syntax Error: Invalid: does not match S-V-O-C
| Pattern | Description |
|---|---|
| `V` | Imperative |
| `V-O` | Imperative with object |
| `S-V` | Subject-Verb |
| `S-V-O` | Subject-Verb-Object |
| `S-V-C` | Subject-Verb-Complement |
| `S-V-O-C` | Subject-Verb-Object-Complement |
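
To make the pattern check concrete, here is a simplified sketch that turns a sequence of POS tags into a role string and looks it up among the patterns listed above. The role-assignment heuristic (everything before the verb is `S`, the first nominal after the verb is `O`, anything later is `C`) is an assumption made for illustration. The README does not describe how the project's `Parser` actually assigns roles, and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical pattern validator; the real Parser.java may use other rules.
class PatternSketch {

    // Patterns and descriptions as listed in this README.
    static final Map<String, String> PATTERNS = Map.of(
        "V", "Imperative",
        "V-O", "Imperative with object",
        "S-V", "Subject-Verb",
        "S-V-O", "Subject-Verb-Object",
        "S-V-C", "Subject-Verb-Complement",
        "S-V-O-C", "Subject-Verb-Object-Complement");

    static final Set<String> VERBAL = Set.of("VERB", "AUX");
    static final Set<String> NOMINAL = Set.of("NOUN", "PRON", "PROPN");

    // Assumed heuristic: maps each POS tag to a role letter by position.
    static String roles(List<String> posTags) {
        List<String> roles = new ArrayList<>();
        boolean seenVerb = false;
        boolean seenObject = false;
        for (String pos : posTags) {
            if (VERBAL.contains(pos)) {
                roles.add("V");
                seenVerb = true;
            } else if (!seenVerb) {
                roles.add("S");
            } else if (!seenObject && NOMINAL.contains(pos)) {
                roles.add("O");
                seenObject = true;
            } else {
                roles.add("C");
            }
        }
        return String.join("-", roles);
    }

    // Returns the pattern description, or empty if no pattern matches.
    static Optional<String> validate(List<String> posTags) {
        return Optional.ofNullable(PATTERNS.get(roles(posTags)));
    }
}
```

Under this heuristic the tag sequence `PRON, VERB, NOUN, PRON` maps to `S-V-O-C`, while a sequence of three nouns followed by a verb maps to `S-S-S-V`, which is not a listed pattern and would be reported as a syntax error.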
| Tag | Meaning |
|---|---|
| `NOUN` | Noun |
| `VERB` | Verb |
| `PRON` | Pronoun |
| `ADJ` | Adjective |
| `ADV` | Adverb |
| `AUX` | Auxiliary verb |
| `ADP` | Adposition |
| `PROPN` | Proper noun |
Syllabification is rule-based, derived from Igbo phonological structure:
- A vowel alone forms a syllable: `a`, `e`, `i`, `o`, `u`, `ị`, `ọ`, `ụ`
- A consonant followed by a vowel forms a syllable: `ka`, `nọ`, `bu`
- Consonant clusters stay together before a vowel: `nw`, `kw`, `gb`, `ch`, `kp`, etc.
- A syllabic `n` before a consonant stands alone as its own syllable.
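
The rules above can be sketched as a single left-to-right pass over the word. This is an illustrative version, not the project's `PhoneticEngine`: the class name is hypothetical, the cluster list is filled out with the usual Igbo digraphs beyond the five named above (an assumption), and the input is assumed to carry no tone marks.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical syllabifier sketch; the real PhoneticEngine.java may differ.
class SyllabifierSketch {

    // a, e, i, o, u plus the dotted vowels, written as escapes
    // so the source file encoding does not matter.
    static final String VOWELS = "aeiou\u1ecb\u1ecd\u1ee5";

    // Assumed cluster list: the five from the README plus gh, gw, ny, sh.
    static final Set<String> CLUSTERS =
        Set.of("ch", "gb", "gh", "gw", "kp", "kw", "nw", "ny", "sh");

    static boolean isVowel(char c) {
        return VOWELS.indexOf(Character.toLowerCase(c)) >= 0;
    }

    static List<String> syllabify(String word) {
        // NFC keeps each dotted vowel as one precomposed character.
        String w = Normalizer.normalize(word, Normalizer.Form.NFC);
        List<String> syllables = new ArrayList<>();
        int i = 0;
        while (i < w.length()) {
            if (isVowel(w.charAt(i))) {
                // Rule 1: a vowel alone forms a syllable.
                syllables.add(w.substring(i, i + 1));
                i++;
                continue;
            }
            // Rule 3: a known cluster counts as a single onset.
            int onset = 1;
            if (i + 1 < w.length()
                    && CLUSTERS.contains(w.substring(i, i + 2).toLowerCase(Locale.ROOT))) {
                onset = 2;
            }
            int next = i + onset;
            if (next < w.length() && isVowel(w.charAt(next))) {
                // Rules 2 and 3: onset plus the following vowel.
                syllables.add(w.substring(i, next + 1));
                i = next + 1;
            } else {
                // Rule 4: a consonant with no vowel after it stands alone.
                // This covers syllabic n, and also any other stray consonant.
                syllables.add(w.substring(i, i + 1));
                i++;
            }
        }
        return syllables;
    }
}
```

Joining the result with `-` gives forms such as `a-ha` for `aha` and `nwo-ke` for `nwoke`.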
This project uses the Igbo corpus from the Masakhane project, converted from CoNLL format to JSON. The dataset provides word-level POS annotations using Universal Dependencies tags (NOUN, VERB, PROPN, PRON, ADP, etc.).
- Lexicon coverage is limited to words present in the Masakhane training corpus
- Word sense disambiguation always picks the first POS tag when a word has multiple senses
- Maximum sentence length is 5 tokens
- Multi-word lexicon entries (e.g. compound words) are not yet fully supported
- Expand lexicon coverage
- Support longer and complex sentences
- Move from a purely rule-based approach to a machine-learning-based one
- Improve ambiguity resolution
- Support compound words and phrases
- Build a graphical user interface (GUI)
© 2026 Okechukwu Kenneth Chidiebube