Igbo NLP

A rule-based Natural Language Interpreter for the Igbo language, built in Java. This project performs lexical analysis, POS tagging, syntax validation, and phonetic syllabification on Igbo sentences.

Built as a term paper project for CSC 331 (Programming Principles and Paradigms) at the University of Ibadan.


Features

  • Lexical Analysis — tokenizes input sentences and tags each token with its Part-of-Speech using a lexicon derived from the Masakhane Igbo dataset
  • Diacritic-Insensitive Lookup — users can type words without diacritics (e.g. gini instead of gịnị) and the system resolves them correctly
  • Syntax Validation — validates sentences against common Igbo sentence patterns: V, V-O, S-V, S-V-O, S-V-C, S-V-O-C
  • Phonetic Syllabification — breaks each token into syllables based on Igbo phonological rules, displayed using the original diacritic form from the lexicon
  • Error Reporting — reports lexical errors (unknown words) and syntax errors (invalid sentence structure) immediately on detection
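The diacritic-insensitive lookup can be illustrated with a small standard-library sketch. The class name `DiacriticSketch` and the method `strip` are hypothetical; the project's own `DiacriticUtil` may work differently. The idea is to decompose each letter into a base letter plus combining marks, then drop the marks.

```java
import java.text.Normalizer;
import java.util.Locale;

/** Hypothetical helper, not the project's DiacriticUtil. */
final class DiacriticSketch {

    /** Lowercases a word and removes its combining marks (for example the dot below). */
    static String strip(String word) {
        // NFD splits a precomposed letter into its base letter plus combining marks.
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        // \p{M} matches any combining mark; removing them leaves only base letters.
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // The escapes spell the dotted form of "gini" without relying on file encoding.
        System.out.println(strip("g\u1ECBn\u1ECB")); // prints gini
    }
}
```

Applying the same function to both the lexicon keys and the user input would let either spelling reach the same entry. One consequence worth keeping in mind is that two lexicon words differing only in diacritics would collapse to one key, so a key may need to hold several entries.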

Tech Stack

  • Java 21
  • Maven
  • Jackson Databind 2.21.3
  • Lombok 1.18.46

Project Structure

igbo-nlp/
├── src/
│   ├── main/
│   │   ├── java/com/kenneth/
│   │   │   ├── Main.java                  # Entry point
│   │   │   ├── repository/
│   │   │   │   └── Lexicon.java           # Loads and queries the lexicon
│   │   │   ├── wordProcessor/
│   │   │   │   └── Word.java              # Model class for lexicon entries
│   │   │   └── utils/
│   │   │       ├── Tokenizer.java         # Splits sentence into tokens
│   │   │       ├── Lexer.java             # POS tags each token
│   │   │       ├── Parser.java            # Validates sentence structure
│   │   │       ├── PhoneticEngine.java    # Syllabifies tokens
│   │   │       └── DiacriticUtil.java     # Strips diacritics for lookup
│   │   └── resources/
│   │       └── masakhane.json             # Igbo lexicon dataset

Getting Started

Prerequisites

  • Java 21
  • Maven 3.x

Installation

git clone https://github.com/kenndo127/IgboNLP.git
cd IgboNLP
mvn clean install

Running the Program

mvn exec:java -Dexec.mainClass="com.kenneth.Main"

Or run Main.java directly from IntelliJ IDEA.


Usage

The program accepts an Igbo sentence of up to 5 words. Diacritics are optional because the system normalizes input before lookup.

Enter a sentence in Igbo: gini bu aha gi?

Output:

Tokens => [gini, bu, aha, gi]

Token : POS
gini  : [PRON]
bu    : [VERB, AUX]
aha   : [NOUN]
gi    : [PRON]

Valid: [S-V-O-C] -> Subject-Verb-Object-Complement

Phonetics => [Gị-nị, bu, a-ha, gị]
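The first line of the output above comes from tokenization. A minimal sketch of that step, assuming tokens are separated by whitespace and that ASCII punctuation is discarded, might look like this. `TokenizerSketch` and `MAX_TOKENS` are illustrative names, not the project's `Tokenizer` API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical tokenizer sketch; the project's Tokenizer may behave differently. */
final class TokenizerSketch {
    static final int MAX_TOKENS = 5; // sentence limit stated in this README

    static List<String> tokenize(String sentence) {
        // Replace ASCII punctuation with spaces, then trim and lowercase.
        String cleaned = sentence.replaceAll("\\p{Punct}", " ")
                                 .trim()
                                 .toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return List.of();
        }
        List<String> tokens = Arrays.asList(cleaned.split("\\s+"));
        if (tokens.size() > MAX_TOKENS) {
            throw new IllegalArgumentException(
                "Sentence has " + tokens.size() + " tokens; the maximum is " + MAX_TOKENS);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println("Tokens => " + tokenize("gini bu aha gi?"));
    }
}
```

Treating every punctuation character as a separator is a simplification: it would also split hyphenated words, which may not be what the real tokenizer does.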

Error Reporting

Lexical Error — when a word does not exist in the lexicon:

Enter a sentence in Igbo: aha m bu chiamaka

Lexical Error: chiamaka does not exist in the lexicon

Syntax Error — when the sentence does not match any valid pattern:

Enter a sentence in Igbo: nwoke abali ututu bia

Syntax Error: Invalid: does not match S-V-O-C
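The fail-fast behaviour of the lexical check can be sketched as follows. This is an illustration under assumptions: the lexicon is modelled as a plain `Map` from a word to its POS tags, and `LexerSketch` and `LexicalException` are hypothetical names rather than classes from this repository.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of tagging with an immediate lexical error. */
final class LexerSketch {

    /** Raised for the first token that has no lexicon entry. */
    static final class LexicalException extends RuntimeException {
        LexicalException(String token) {
            super("Lexical Error: " + token + " does not exist in the lexicon");
        }
    }

    /** Returns one tag list per token, in order, or throws on the first unknown token. */
    static List<List<String>> tag(List<String> tokens, Map<String, List<String>> lexicon) {
        List<List<String>> tagged = new ArrayList<>();
        for (String token : tokens) {
            List<String> tags = lexicon.get(token);
            if (tags == null) {
                throw new LexicalException(token); // stop at the first failure
            }
            tagged.add(tags);
        }
        return tagged;
    }
}
```

Calling `tag` with the tokens `aha`, `m`, `bu`, `chiamaka` and a lexicon that lacks `chiamaka` would throw an exception whose message matches the lexical error shown above. Throwing on the first unknown word keeps later stages from running on partial input.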

Sentence Patterns Supported

Pattern    Description
V          Imperative
V-O        Imperative with object
S-V        Subject-Verb
S-V-O      Subject-Verb-Object
S-V-C      Subject-Verb-Complement
S-V-O-C    Subject-Verb-Object-Complement
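One way to check a tagged sentence against these patterns is to compare it slot by slot. The sketch below is not the project's `Parser`. In particular, the mapping from POS tags to the S, V, O, and C slots is an assumption made for illustration, and the input is one tag per token (the first tag, as described under Limitations).

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical pattern matcher; slot-to-tag rules are assumed, not taken from the project. */
final class PatternSketch {
    // ASSUMPTION: which POS tags may fill each slot.
    static final Set<String> NOMINAL = Set.of("NOUN", "PRON", "PROPN");
    static final Set<String> VERBAL = Set.of("VERB", "AUX");
    static final Set<String> COMPLEMENT = Set.of("NOUN", "PRON", "PROPN", "ADJ", "ADV", "ADP");

    // Patterns are tried in this order; the first match wins.
    static final List<String> PATTERNS = List.of("V", "V-O", "S-V", "S-V-O", "S-V-C", "S-V-O-C");

    static boolean fills(String slot, String tag) {
        if (slot.equals("S") || slot.equals("O")) return NOMINAL.contains(tag);
        if (slot.equals("V")) return VERBAL.contains(tag);
        if (slot.equals("C")) return COMPLEMENT.contains(tag);
        return false;
    }

    /** Returns the first pattern whose slots all accept the given tags, if any. */
    static Optional<String> match(List<String> tags) {
        for (String pattern : PATTERNS) {
            String[] slots = pattern.split("-");
            if (slots.length != tags.size()) continue; // length must agree
            boolean ok = true;
            for (int i = 0; i < slots.length && ok; i++) {
                ok = fills(slots[i], tags.get(i));
            }
            if (ok) return Optional.of(pattern);
        }
        return Optional.empty();
    }
}
```

Under these assumed rules, the tags PRON, VERB, NOUN, PRON match S-V-O-C, while NOUN, NOUN, NOUN, VERB match nothing and would be reported as a syntax error. Because S-V-O is tried before S-V-C, a three-token sentence ending in a noun is classified as S-V-O.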

POS Tags

Tag      Meaning
NOUN     Noun
VERB     Verb
PRON     Pronoun
ADJ      Adjective
ADV      Adverb
AUX      Auxiliary verb
ADP      Adposition
PROPN    Proper noun

Phonetic Rules

Syllabification is rule-based, derived from Igbo phonological structure:

  • A vowel alone forms a syllable: a, e, i, ị, o, ọ, u, ụ
  • A consonant followed by a vowel forms a syllable: ka, nọ, bu
  • Consonant digraphs such as nw, kw, gb, ch, and kp stay together before a vowel
  • A syllabic n before a consonant stands alone as its own syllable
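The rules above can be turned into a short left-to-right scan. The following is a sketch, not the project's `PhoneticEngine`. It assumes each vowel carries at most one diacritic, and its digraph list (the five named above plus gh, gw, ny, and sh) is an assumption. Any consonant not followed by a vowel, including syllabic n, is emitted on its own.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical syllabifier sketch based on the four rules listed in this README. */
final class SyllableSketch {
    // ASSUMPTION: digraphs treated as a single consonant.
    static final Set<String> DIGRAPHS =
        Set.of("ch", "gb", "gh", "gw", "kp", "kw", "nw", "ny", "sh");

    /** A letter counts as a vowel if its base letter, without diacritics, is a, e, i, o, or u. */
    static boolean isVowel(char c) {
        String base = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
        return "aeiou".indexOf(Character.toLowerCase(base.charAt(0))) >= 0;
    }

    static List<String> syllabify(String input) {
        // NFC keeps each dotted vowel as one char so substring boundaries stay simple.
        String word = Normalizer.normalize(input, Normalizer.Form.NFC);
        List<String> syllables = new ArrayList<>();
        int i = 0;
        while (i < word.length()) {
            if (isVowel(word.charAt(i))) {
                syllables.add(word.substring(i, i + 1)); // a vowel alone
                i++;
                continue;
            }
            // The onset is a digraph when the next two letters form one, else one consonant.
            int onsetEnd = i + 1;
            if (i + 2 <= word.length()
                    && DIGRAPHS.contains(word.substring(i, i + 2).toLowerCase(Locale.ROOT))) {
                onsetEnd = i + 2;
            }
            if (onsetEnd < word.length() && isVowel(word.charAt(onsetEnd))) {
                syllables.add(word.substring(i, onsetEnd + 1)); // onset plus vowel
                i = onsetEnd + 1;
            } else {
                syllables.add(word.substring(i, i + 1)); // syllabic n or a stray consonant
                i++;
            }
        }
        return syllables;
    }

    public static void main(String[] args) {
        System.out.println(String.join("-", syllabify("nwoke"))); // prints nwo-ke
        System.out.println(String.join("-", syllabify("aha")));   // prints a-ha
    }
}
```

The scan keeps the original characters, so a word passed in with its diacritics and capitalization comes back with them intact, which matches the way the output is displayed using the lexicon form.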

Dataset

This project uses the Igbo corpus from the Masakhane project, converted from CoNLL format to JSON. The dataset provides word-level POS annotations using Universal Dependencies tags (NOUN, VERB, PROPN, PRON, ADP, etc.).
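As a rough illustration of the conversion step, the sketch below collects every tag seen for each word from CoNLL-style text. It assumes a two-column layout (token, then tag, separated by whitespace) with blank lines between sentences; the real files and the project's actual conversion script may differ. `ConllSketch` is a hypothetical name, and writing the result to JSON (for example with Jackson) is left out so the example needs only the standard library.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical CoNLL-to-lexicon sketch; the column layout is an assumption. */
final class ConllSketch {

    /** Maps each lowercased word to its tags, in the order they were first seen. */
    static Map<String, Set<String>> buildLexicon(String conll) {
        Map<String, Set<String>> lexicon = new LinkedHashMap<>();
        for (String line : conll.split("\\R")) {
            String[] columns = line.trim().split("\\s+");
            if (columns.length != 2) {
                continue; // blank sentence separators and malformed lines
            }
            String word = columns[0].toLowerCase(Locale.ROOT);
            lexicon.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(columns[1]);
        }
        return lexicon;
    }

    public static void main(String[] args) {
        // Made-up sample lines for illustration, not taken from the dataset.
        String sample = "bu VERB\naha NOUN\n\nbu AUX\n";
        System.out.println(buildLexicon(sample));
    }
}
```

Keeping tags in first-seen order matters for any consumer that uses only the first tag of a word, since the order then decides which reading is used.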


Limitations

  • Lexicon coverage is limited to words present in the Masakhane training corpus
  • Ambiguity is not resolved: when a word has multiple POS tags, the first tag is always used
  • Maximum sentence length is 5 tokens
  • Multi-word lexicon entries (e.g. compound words) are not yet fully supported

Future Improvements

  • Expand lexicon coverage
  • Support longer and more complex sentences
  • Move from rule-based to machine-learning-based analysis
  • Improve ambiguity resolution
  • Support compound words and phrases
  • Build a graphical user interface

Author

© 2026 Okechukwu Kenneth Chidiebube
