A benchmarking study across Nigerian Pidgin, Hausa, Yoruba, and Igbo using the AfriHate Corpus
This project is the implementation component of my final year dissertation at Miva Open University, Abuja (DTS 497). It conducts the first Nigeria-focused benchmarking analysis of automated hate speech detection across the four major Nigerian indigenous languages present in the AfriHate corpus (Muhammad et al., 2025).
Two classification models are evaluated under identical experimental conditions for each language:
- TF-IDF + Logistic Regression: a surface-level lexical baseline
- Fine-tuned AfriBERTa: a transformer pre-trained on 11 African languages
The study addresses a gap left by the AfriHate paper itself: no prior work had conducted a Nigeria-specific comparative analysis of these four language subsets, and no TF-IDF baseline had been reported for them.
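As a rough illustration of the lexical baseline, the sketch below wires a TF-IDF vectoriser into a class-weighted logistic regression using scikit-learn. The n-gram range, `sublinear_tf`, `max_iter`, the `build_baseline` helper, and the toy texts (including the `placeholder_slur` token) are assumptions made for this example; they are not the notebook's actual settings or data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def build_baseline():
    """TF-IDF features feeding a class-weighted logistic regression."""
    return make_pipeline(
        # Unigrams and bigrams; an assumed setting, not taken from the notebook.
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # class_weight="balanced" up-weights the rarer class during fitting.
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )


# Hypothetical toy data standing in for one language subset.
train_texts = [
    "good morning my friend",
    "thank you for the food",
    "placeholder_slur go away",
    "placeholder_slur people are bad",
]
train_labels = ["Normal", "Normal", "Hate", "Hate"]

test_texts = ["thank you my friend", "placeholder_slur again"]
test_labels = ["Normal", "Hate"]

model = build_baseline()
model.fit(train_texts, train_labels)
preds = model.predict(test_texts)

# Macro averaging weights the Hate and Normal classes equally.
macro_f1 = f1_score(test_labels, preds, average="macro")
print(list(preds), round(macro_f1, 3))
```

Because the model only sees token and n-gram weights, any score it reaches reflects surface vocabulary rather than context, which is what makes it a useful reference point for the transformer.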
| Language | Code | Subset Size | Hate Class | IAA (kappa) |
|---|---|---|---|---|
| Nigerian Pidgin | pcm | 10,599 | 11.1% | 0.65 |
| Hausa | hau | 6,644 | 5.2% | 0.75 |
| Igbo | ibo | 5,003 | 5.0% | 0.80 |
| Yoruba | yor | 4,879 | 3.1% | 0.68 |
| Language | TF-IDF LR (F1-Macro) | AfriBERTa (F1-Macro) |
|---|---|---|
| Hausa | 0.785 | 0.888 |
| Igbo | 0.956 | 0.949 |
| Nigerian Pidgin | 0.741 | 0.739 |
| Yoruba | 0.510 | 0.465 |
| Average | 0.748 | 0.760 |
Main finding: AfriBERTa's contextual pre-training provides meaningful gains only for Hausa (+0.103 F1-Macro). For the other three languages, the surface-level TF-IDF baseline performs comparably or better. Yoruba shows complete Hate-class detection failure across both models due to severe class imbalance (30 Hate test instances).
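To see why a two-class F1-Macro close to 0.5 indicates that the minority class is being missed, the sketch below computes macro F1 from scratch for a classifier that predicts Normal for every tweet. Only the figure of 30 Hate test instances comes from the results above; the 940 Normal instances and the helper names `f1_for_label` and `macro_f1` are hypothetical, chosen for illustration.

```python
def f1_for_label(label, y_true, y_pred):
    """Per-class F1 computed as 2TP / (2TP + FP + FN); 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(lab, y_true, y_pred) for lab in labels) / len(labels)


# 30 Hate instances as reported; 940 Normal instances is an assumed count.
y_true = ["Normal"] * 940 + ["Hate"] * 30
# A degenerate classifier that never predicts Hate.
y_pred = ["Normal"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("accuracy:", round(accuracy, 3))
print("Hate F1:", f1_for_label("Hate", y_true, y_pred))
print("macro F1:", round(macro_f1(y_true, y_pred), 3))
```

In this hypothetical, accuracy stays above 0.96 while macro F1 falls just below 0.5, which is why F1-Macro rather than accuracy is the headline metric for imbalanced subsets.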
```
nigeria-multilingual-hate-speech-detection/
│
├── DTS497_Project_Implementation.ipynb   # Main Colab notebook (17 cells)
├── README.md
│
└── outputs/                              # Generated after running the notebook
    ├── all_results.json                  # Full results for all language-model combinations
    ├── results_table.csv                 # Formatted results table
    ├── comparative_results.png           # F1-Macro and Accuracy bar charts
    ├── perclass_f1_heatmap.png           # Per-class F1 heatmap
    ├── class_distribution.png            # Class distribution by language
    ├── training_curves.png               # AfriBERTa val F1 and loss per epoch
    ├── cm_Hausa_TF-IDF_LR.png            # Confusion matrices (one per language)
    ├── cm_Igbo_TF-IDF_LR.png
    ├── cm_Nigerian_Pidgin_TF-IDF_LR.png
    ├── cm_Yoruba_TF-IDF_LR.png
    ├── error_analysis_hau.csv            # Misclassified instances per language
    ├── error_analysis_ibo.csv
    ├── error_analysis_pcm.csv
    └── error_analysis_yor.csv
```
- Google account (for Colab)
- HuggingFace account with access to the AfriHate dataset
- T4 GPU runtime (free tier on Colab is sufficient)
Visit https://huggingface.co/datasets/afrihate/afrihate, log in, and agree to the dataset terms. Then generate a token at https://huggingface.co/settings/tokens.
Click the Colab badge at the top of this README, or open it directly:
https://colab.research.google.com/drive/1cF3pkDbho45M8muN540H2VoA1aTLiy0g
Go to Runtime > Change Runtime Type > Hardware Accelerator > T4 GPU.
| Cell | Purpose | Time |
|---|---|---|
| 0 | Create outputs folder | Instant |
| 1 | Install dependencies, restart runtime | 2 min |
| 2 | Imports, config, HuggingFace login | Instant |
| 3 | Load AfriHate language subsets | 3-5 min |
| 4 | Preprocess all four languages | 3 min |
| 5-6 | Evaluation and visualisation helpers | Instant |
| 7 | TF-IDF + LR training (all 4 languages) | 10-15 min |
| 8-9 | AfriBERTa tokenisation and metrics helper | 5 min |
| 10 | AfriBERTa fine-tuning (all 4 languages) | 90-120 min |
| 11-16 | Results, charts, error analysis | 10 min |
After Cell 16 completes, run this to copy everything to Google Drive:
```python
import os
import shutil
from google.colab import drive

# Mount Drive if an earlier cell has not already done so; without the mount,
# the path below would just be a local folder on the Colab VM.
drive.mount('/content/drive')

drive_path = '/content/drive/MyDrive/DTS497_Outputs'
os.makedirs(drive_path, exist_ok=True)
shutil.copytree('outputs', drive_path, dirs_exist_ok=True)  # needs Python 3.8+
print("Saved to Google Drive.")
```

Why TF-IDF before AfriBERTa?
The AfriHate paper (Muhammad et al., 2025) evaluated several large transformer and LLM-based models but did not include a TF-IDF baseline. This study fills that gap to quantify how much detection performance comes from surface-level lexical patterns and how much from contextual pre-trained representations.
Class imbalance handling
Both models use balanced class weights during training. For AfriBERTa, a custom WeightedTrainer subclass overrides compute_loss() to apply class-weighted cross-entropy, giving the Hate class proportionally higher loss weight.
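The effect of class weighting on the loss can be sketched without any deep learning framework. The example below assumes the weights follow the common "balanced" heuristic, n_samples / (n_classes * class_count), and that the weighted loss is normalised by the sum of the weights in the batch, which is how `torch.nn.CrossEntropyLoss` behaves with a `weight` argument and mean reduction. The label counts and the function names are hypothetical, and this is not the notebook's `WeightedTrainer` code.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(logits, targets, weights):
    """Class-weighted cross-entropy, averaged over the summed sample weights."""
    total, norm = 0.0, 0.0
    for row, y in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += weights[y] * (log_z - row[y])
        norm += weights[y]
    return total / norm


# Hypothetical training labels: 0 = Normal, 1 = Hate (5% Hate).
train_labels = [0] * 950 + [1] * 50
weights = balanced_class_weights(train_labels)

# Both rows predict Normal confidently; the second row is actually Hate.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]

uniform = {0: 1.0, 1: 1.0}
unweighted_loss = weighted_cross_entropy(logits, targets, uniform)
weighted_loss = weighted_cross_entropy(logits, targets, weights)
print(weights, unweighted_loss, weighted_loss)
```

With these assumed counts the Hate weight is roughly nineteen times the Normal weight, so the missed Hate example dominates the batch loss instead of being averaged away.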
A dataset quirk worth knowing
The AfriHate Nigerian subsets, as distributed on HuggingFace at the time of this study, contained only two effective label classes (Normal and Hate). The Abusive class was absent from all four language splits, which made the study effectively a binary classification task. This is documented in the thesis and reported here for reproducibility.
Pidgin preprocessing
Nigerian Pidgin has non-standardised orthography. A 30-item spelling variant dictionary normalises common variants (e.g. 'wia' to 'where', 'dem dem' to 'dem') before feature extraction. This reduces vocabulary fragmentation and improves TF-IDF coverage for Pidgin tweets.
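A minimal sketch of dictionary-based normalisation is shown below. It contains only the two variants quoted above, since the full 30-item dictionary is not reproduced in this README; the names `PIDGIN_VARIANTS` and `build_normaliser`, the case-insensitive matching, and the whole-word regex approach are assumptions for illustration rather than the notebook's implementation.

```python
import re

# Only the two mappings mentioned in this README; the real dictionary is larger.
PIDGIN_VARIANTS = {
    "wia": "where",
    "dem dem": "dem",
}


def build_normaliser(variants):
    """Return a function that replaces whole-word variants with canonical forms."""
    # Longest keys first so multi-word variants are tried before shorter ones.
    keys = sorted(variants, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keys) + r")\b",
        re.IGNORECASE,
    )

    def normalise(text):
        # Look up the lower-cased match so "Wia" and "wia" map identically.
        return pattern.sub(lambda m: variants[m.group(1).lower()], text)

    return normalise


normalise = build_normaliser(PIDGIN_VARIANTS)
print(normalise("Wia dem dem dey go?"))
```

Matching on word boundaries keeps longer words that merely contain a variant untouched, so the TF-IDF vocabulary shrinks only where spellings genuinely overlap.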
AfriHate Corpus
Muhammad, S. H., Abdulmumin, I., Ayele, A. A., Adelani, D. I., et al. (2025). AfriHate: A multilingual collection of hate speech and abusive language datasets for African languages. In Proceedings of NAACL 2025 (pp. 1854-1871). https://aclanthology.org/2025.naacl-long.92/

AfriBERTa
Ogueji, K., Zhu, Y., & Lin, J. (2021). Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Proceedings of the First Workshop on Multilingual Representation Learning (pp. 116-126). https://aclanthology.org/2021.mrl-1.11/
Author: Solomon Ayuba
Institution: Department of Data Science, School of Computing, Miva Open University, Abuja, Nigeria
Programme: BSc (Hons) Data Science
Year: 2026
Thesis title: Multilingual Abusive Language Detection in Nigerian Social Media Using NLP: A Focused Benchmarking Study on Nigerian Pidgin, Hausa, Yoruba, and Igbo Using the AfriHate Corpus
Thanks to Dr. Saminu Mohammad Aliyu and the AfriHate team for making their dataset publicly available, and to the Masakhane community for the African NLP infrastructure that made this work possible.
This project is released under the MIT License. The AfriHate dataset is subject to its own terms at https://huggingface.co/datasets/afrihate/afrihate.