Igbo NLP

A rule-based Natural Language Interpreter for the Igbo language, built in Java. This project performs lexical analysis, POS tagging, syntax validation, and phonetic syllabification on Igbo sentences.

Built as a term paper project for CSC 331 (Programming Principles and Paradigms) at the University of Ibadan.

Features

Lexical Analysis — tokenizes input sentences and tags each token with its Part-of-Speech using a lexicon derived from the Masakhane Igbo dataset
Diacritic-Insensitive Lookup — users can type words without diacritics (e.g. gini instead of gịnị) and the system resolves them correctly
Syntax Validation — validates sentences against common Igbo sentence patterns: V, V-O, S-V, S-V-O, S-V-C, S-V-O-C
Phonetic Syllabification — breaks each token into syllables based on Igbo phonological rules, displayed using the original diacritic form from the lexicon
Error Reporting — reports lexical errors (unknown words) and syntax errors (invalid sentence structure) immediately on detection
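
The diacritic-insensitive lookup described above could be sketched roughly as follows. This is a minimal illustration using the JDK's `java.text.Normalizer`, not the project's actual `DiacriticUtil`; the class name `DiacriticSketch` and the method `toLookupKey` are hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical stand-in for DiacriticUtil; the real class may differ. */
final class DiacriticSketch {
    private DiacriticSketch() {}

    /** Returns a lowercase, diacritic-free lookup key, e.g. "gịnị" becomes "gini". */
    static String toLookupKey(String word) {
        // NFD splits each accented letter into a base letter plus combining marks.
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        // \p{M} matches combining marks (dot below, tone accents); remove them.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        return stripped.toLowerCase(Locale.ROOT);
    }
}
```

If both the lexicon keys and the user input pass through the same function, a typed `gini` and a stored `gịnị` map to the same key, while the original diacritic form can still be kept as the entry's display value.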

Tech Stack

Java 21
Maven
Jackson Databind 2.21.3
Lombok 1.18.46

Project Structure

igbo-nlp/
├── src/
│   ├── main/
│   │   ├── java/com/kenneth/
│   │   │   ├── Main.java                  # Entry point
│   │   │   ├── repository/
│   │   │   │   └── Lexicon.java           # Loads and queries the lexicon
│   │   │   ├── wordProcessor/
│   │   │   │   └── Word.java              # Model class for lexicon entries
│   │   │   └── utils/
│   │   │       ├── Tokenizer.java         # Splits sentence into tokens
│   │   │       ├── Lexer.java             # POS tags each token
│   │   │       ├── Parser.java            # Validates sentence structure
│   │   │       ├── PhoneticEngine.java    # Syllabifies tokens
│   │   │       └── DiacriticUtil.java     # Strips diacritics for lookup
│   │   └── resources/
│   │       └── masakhane.json             # Igbo lexicon dataset

Getting Started

Prerequisites

Java 21
Maven 3.x

Installation

git clone https://github.com/kenndo127/IgboNLP.git
cd IgboNLP
mvn clean install

Running the Program

mvn exec:java -Dexec.mainClass="com.kenneth.Main"

Or run Main.java directly from IntelliJ IDEA.

Usage

The program accepts an Igbo sentence of up to 5 words. Diacritics are optional because the system normalizes input before lookup.

Enter a sentence in Igbo: gini bu aha gi?

Output:

Tokens => [gini, bu, aha, gi]

Token : POS
gini  : [PRON]
bu    : [VERB, AUX]
aha   : [NOUN]
gi    : [PRON]

Valid: [S-V-O-C] -> Subject-Verb-Object-Complement

Phonetics => [Gị-nị, bu, a-ha, gị]
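
The first line of the output above (tokens with the trailing `?` removed) could come from a tokenizer along these lines. This is a sketch under assumptions: the class name `TokenizerSketch`, the choice to strip only leading and trailing punctuation, and the use of an exception for over-long input are all illustrative and not taken from the project's `Tokenizer.java`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer sketch; not the project's Tokenizer class. */
final class TokenizerSketch {
    static final int MAX_TOKENS = 5; // sentence limit stated in this README

    private TokenizerSketch() {}

    static List<String> tokenize(String sentence) {
        List<String> tokens = new ArrayList<>();
        for (String piece : sentence.trim().split("\\s+")) {
            // Strip punctuation from both ends; letters and combining marks stay.
            String cleaned = piece.replaceAll("^[^\\p{L}]+|[^\\p{L}\\p{M}]+$", "");
            if (!cleaned.isEmpty()) {
                tokens.add(cleaned);
            }
        }
        if (tokens.size() > MAX_TOKENS) {
            throw new IllegalArgumentException(
                "Sentence has " + tokens.size() + " tokens; the maximum is " + MAX_TOKENS);
        }
        return tokens;
    }
}
```

With this sketch, `TokenizerSketch.tokenize("gini bu aha gi?")` yields the four tokens `gini`, `bu`, `aha`, `gi`.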

Error Reporting

Lexical Error — when a word does not exist in the lexicon:

Enter a sentence in Igbo: aha m bu chiamaka

Lexical Error: chiamaka does not exist in the lexicon

Syntax Error — when the sentence does not match any valid pattern:

Enter a sentence in Igbo: nwoke abali ututu bia

Syntax Error: Invalid: does not match S-V-O-C
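
The lexical error case can be sketched as a fail-fast lookup that stops at the first unknown token. The sketch below uses a small in-memory map in place of `masakhane.json`; the names `LexerSketch`, `Tagged`, and `LexicalException` are hypothetical, and the input is assumed to be already tokenized and diacritic-free.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical lexer sketch; not the project's Lexer class. */
final class LexerSketch {
    /** A token paired with every POS tag the lexicon lists for it. */
    record Tagged(String token, List<String> tags) {}

    /** Raised for the first token that is missing from the lexicon. */
    static final class LexicalException extends RuntimeException {
        LexicalException(String token) {
            super("Lexical Error: " + token + " does not exist in the lexicon");
        }
    }

    private final Map<String, List<String>> lexicon; // key: diacritic-free, lowercase word

    LexerSketch(Map<String, List<String>> lexicon) {
        this.lexicon = lexicon;
    }

    List<Tagged> tag(List<String> tokens) {
        List<Tagged> result = new ArrayList<>();
        for (String token : tokens) {
            List<String> tags = lexicon.get(token.toLowerCase(Locale.ROOT));
            if (tags == null) {
                throw new LexicalException(token); // report immediately on detection
            }
            result.add(new Tagged(token, tags));
        }
        return result;
    }
}
```

Throwing on the first miss mirrors the "immediately on detection" behaviour described under Features; a variant that collects all unknown words before reporting would be an equally reasonable design.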

Sentence Patterns Supported

| Pattern | Description |
| --- | --- |
| `V` | Imperative |
| `V-O` | Imperative with object |
| `S-V` | Subject-Verb |
| `S-V-O` | Subject-Verb-Object |
| `S-V-C` | Subject-Verb-Complement |
| `S-V-O-C` | Subject-Verb-Object-Complement |
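
One way to picture the validation step is as a lookup against the pattern table above. The sketch below assumes that each token has already been assigned a grammatical role (`S`, `V`, `O`, or `C`); how the project's `Parser.java` derives those roles from POS tags is not described here, so that step is left out. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical pattern check; role assignment is assumed to happen elsewhere. */
final class PatternSketch {
    private static final Map<String, String> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("V", "Imperative");
        PATTERNS.put("V-O", "Imperative with object");
        PATTERNS.put("S-V", "Subject-Verb");
        PATTERNS.put("S-V-O", "Subject-Verb-Object");
        PATTERNS.put("S-V-C", "Subject-Verb-Complement");
        PATTERNS.put("S-V-O-C", "Subject-Verb-Object-Complement");
    }

    private PatternSketch() {}

    /** Returns the pattern description, or empty if the role sequence is not supported. */
    static Optional<String> describe(List<String> roles) {
        return Optional.ofNullable(PATTERNS.get(String.join("-", roles)));
    }
}
```

An empty result corresponds to the syntax error case; for example, a role sequence such as `S, S, S, V` matches none of the six patterns.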

POS Tags

| Tag | Meaning |
| --- | --- |
| `NOUN` | Noun |
| `VERB` | Verb |
| `PRON` | Pronoun |
| `ADJ` | Adjective |
| `ADV` | Adverb |
| `AUX` | Auxiliary verb |
| `ADP` | Adposition |
| `PROPN` | Proper noun |

Phonetic Rules

Syllabification is rule-based, derived from Igbo phonological structure:

A vowel alone forms a syllable — a, e, i, o, u, ị, ọ, ụ
A consonant followed by a vowel forms a syllable — ka, nọ, bu
Consonant clusters stay together before a vowel — nw, kw, gb, ch, kp etc.
A syllabic n before a consonant stands alone as its own syllable.
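
The four rules above can be sketched as a single left-to-right pass. This is an illustrative reading of the rules, not the project's `PhoneticEngine`: consonants accumulate as an onset until a vowel closes the syllable, which keeps clusters together, and an `n` directly before a consonant is emitted on its own unless it starts a digraph that is followed by a vowel. The class name and the inclusion of `ny` alongside `nw` are assumptions.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical syllabifier sketch based on the four rules listed above. */
final class SyllabifierSketch {
    // Digraphs starting with n; "nw" is listed above, "ny" is an assumption.
    private static final Set<String> N_DIGRAPHS = Set.of("nw", "ny");

    private SyllabifierSketch() {}

    static List<String> syllabify(String word) {
        String w = Normalizer.normalize(word, Normalizer.Form.NFC);
        List<String> syllables = new ArrayList<>();
        StringBuilder onset = new StringBuilder();
        int i = 0;
        while (i < w.length()) {
            char c = w.charAt(i);
            if (isVowel(c)) {
                // Rules 1-3: pending consonants (possibly none) plus this vowel.
                StringBuilder syllable = new StringBuilder(onset).append(c);
                onset.setLength(0);
                i++;
                // Keep any loose tone marks attached to the vowel.
                while (i < w.length() && isMark(w.charAt(i))) {
                    syllable.append(w.charAt(i++));
                }
                syllables.add(syllable.toString());
            } else if (onset.length() == 0 && isSyllabicN(w, i)) {
                syllables.add(String.valueOf(c)); // Rule 4: syllabic n stands alone
                i++;
            } else {
                onset.append(c);
                i++;
            }
        }
        if (onset.length() > 0) {
            syllables.add(onset.toString()); // trailing consonants, kept as-is
        }
        return syllables;
    }

    private static boolean isSyllabicN(String w, int i) {
        char c = w.charAt(i);
        if ((c != 'n' && c != 'N') || i + 1 >= w.length() || isVowel(w.charAt(i + 1))) {
            return false;
        }
        boolean digraph = N_DIGRAPHS.contains(w.substring(i, i + 2).toLowerCase(Locale.ROOT));
        boolean vowelAfter = i + 2 < w.length() && isVowel(w.charAt(i + 2));
        return !(digraph && vowelAfter);
    }

    private static boolean isVowel(char c) {
        // Decompose so that ị, ọ, ụ and tone-marked vowels reduce to a base letter.
        String base = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
        return "aeiou".indexOf(Character.toLowerCase(base.charAt(0))) >= 0;
    }

    private static boolean isMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }
}
```

Under this sketch, `gịnị` splits into `gị` and `nị`, `aha` into `a` and `ha`, `nwoke` into `nwo` and `ke`, and `nne` into `n` and `ne`. Other syllabic nasals are not handled because the rules above mention only `n`.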

Dataset

This project uses the Igbo corpus from the Masakhane project, converted from CoNLL format to JSON. The dataset provides word-level POS annotations using Universal Dependencies tags (NOUN, VERB, PROPN, PRON, ADP, etc.).
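
As a rough illustration of the conversion step, the sketch below collects word-to-tag mappings from CoNLL-style lines. It assumes a layout of one token per line with the word in the first column and the POS tag in the last, separated by whitespace, with blank lines between sentences; the actual corpus layout and the JSON schema of `masakhane.json` may differ, and the class name is hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch: builds a word -> POS tags map from CoNLL-style lines. */
final class ConllLexiconSketch {
    private ConllLexiconSketch() {}

    static Map<String, Set<String>> build(List<String> lines) {
        Map<String, Set<String>> lexicon = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // blank line marks a sentence boundary
            }
            String[] columns = trimmed.split("\\s+");
            if (columns.length < 2) {
                continue; // no tag column, skip (assumed malformed)
            }
            String word = columns[0].toLowerCase(Locale.ROOT);
            String tag = columns[columns.length - 1];
            // A word seen with several tags keeps all of them, in first-seen order.
            lexicon.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(tag);
        }
        return lexicon;
    }
}
```

Keeping every observed tag per word is what allows an entry such as `bu` to be reported with both `VERB` and `AUX`.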

Limitations

Lexicon coverage is limited to words present in the Masakhane training corpus
Word sense disambiguation is naive: when a word has multiple POS tags, the first one is always chosen
Maximum sentence length is 5 tokens
Multi-word lexicon entries (e.g. compound words) are not yet fully supported

Future Improvements

Expand lexicon coverage
Support longer and more complex sentences
Add a machine-learning-based tagger alongside or in place of the rule-based approach
Improve ambiguity resolution
Support compound words and phrases
Build a graphical user interface
