Sheng-Sentiment-Base

Project Sheng V1: The foundational NLP infrastructure for Kenya's informal economy.

"Building the engine, not just the car."

Overview

Sheng-Sentiment-Base is a high-fidelity NLP preprocessing engine for Sheng—Nairobi's dynamic urban slang that blends Swahili, English, and local dialects. This infrastructure layer enables accurate sentiment analysis, text classification, and language understanding where generic LLMs fail.

The Problem

Global AI models (GPT-4, Gemini) struggle with:

Code-switching patterns: "Hii mbogi ni fiti" → "This crew is good"
Rapidly evolving slang: New terms emerge weekly in Nairobi's streets
Context-dependent sentiment: "Kudunda" can mean partying (positive) or failing (negative)

The Solution

A specialized "Cultural API" that:

Normalizes Sheng variant spellings ('ronga' → 'rongai')
Detects code-switching boundaries
Applies contextual sentiment rules
Provides training-ready datasets for fine-tuning

Quick Start

# Clone the repository
git clone https://github.com/okech-christopher/sheng-sentiment-base.git
cd sheng-sentiment-base

# Install dependencies
pip install -r requirements.txt

# Run the tokenizer
python -m src.tokenizers.sheng_tokenizer

Usage

API Usage (FastAPI Service)

Start the API server:

# Development mode
python -m src.api.main

# Production mode
uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --workers 4

Analyze Sheng text via API:

curl -X POST http://localhost:8000/v1/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Karao wako mabs, jam imetupa",
    "include_logistics": true,
    "include_code_switches": true
  }'

Response:

{
  "original_text": "Karao wako mabs, jam imetupa",
  "normalized_text": "karao wako mabs jam imetupa",
  "tokens": ["karao", "wako", "mabs", "jam", "imetupa"],
  "slang_terms": ["karao", "mabs", "jam"],
  "code_switches": [],
  "sentiment_score": -0.5,
  "sentiment_label": "negative",
  "logistics_intent": {
    "intent": "police_report",
    "severity": "medium",
    "description": "Police checkpoint or presence detected"
  },
  "is_logistics": true,
  "confidence": 0.6,
  "metadata": {
    "token_count": 5,
    "slang_count": 3,
    "switch_count": 0,
    "has_logistics_intent": true
  }
}

Run the demo script:

# Start the API in a terminal
python -m src.api.main

# In another terminal, run the demo
python demo_api.pyIn Progrss

###xTokenize Sheng Text

from src.tokenizers.sheng_tokenizer import ShengTokenizer

tokenizer = ShengTokenizer()

result = tokenizer.tokenize("Buda hana chapaa, amekudunda")
# Output: normalized text, slang terms detected, contextual sentiment

print(f"Sentiment: {result.sentiment_label}")  # negative
print(f"Slang: {result.slang_terms}")  # ['buda', 'chapaa', 'kudunda']

Scrape and Process

from src.pipelines.sheng_pipeline import ShengPipeline

pipeline = ShengPipeline()

# Scrape and process in one step
results = pipeline.scrape_and_process("sheng", count=100)

# Export for LLM training
pipeline.export_to_training_format("training_data.jsonl")

Project Structure

sheng_sentiment_base/
├── src/
│   ├── scrapers/          # Data collection (X/TikTok)
│   ├── tokenizers/        # NLP preprocessing
│   ├── pipelines/         # End-to-end workflows
│   └── utils/             # Helpers
├── data/
│   ├── raw/               # Scraped content
│   ├── processed/         # Cleaned datasets
│   └── dictionaries/      # Sheng slang mappings
├── tests/                 # Unit tests
├── docs/                  # Documentation
└── notebooks/             # Exploration

Architecture

See ARCH.md for detailed system design.

Roadmap

Phase 1: Foundation (Complete)

Sheng tokenizer with normalization
Basic scraper framework
Contextual sentiment engine
Dictionary with 75+ slang mappings

Phase 2: Scale (Next)

Collect 10,000+ labeled samples
Fine-tune Llama-3/Mistral on Sheng
REST API for real-time inference
Boda-Pulse GIS integration

Phase 3: Ecosystem (Future)

Chama-Chain credit scoring
Trust-Grid escrow protocol
Multi-dialect expansion (Lagos, Accra)

Contributing

This project follows strict coding standards:

Type hints on all functions
Google-style docstrings
Conventional commits (feat(nlp): add tokenizer)
PEP 8 compliance

The Vision

"Software is eating the world, but context is eating software."

In Kenya, the informal economy represents 83% of employment and $1B+ in gig sector value. By instrumenting this "invisible" knowledge, we create the infrastructure layer for Africa's next 100 million digital users.

Target Markets:

Contextual AI Data: $20B (20-year projection)
Shadow Logistics: $12B
Social Commerce: $15B
Digital Chamas: $50B

Connect

Builder: @okech-christopher
Status: 0.91 Production Hardening (Day 3 of 7300)
Location: Nairobi, Kenya

Compliance

Kenya AI Bill 2026 Alignment:

Transparency: All sentiment and intent decisions are explainable through rule-based logic
Bias Mitigation: RecursiveSlangResolver prevents single-intent bias through weighting matrix
Data Privacy: No personal data collection; all processing is on-premise
Accountability: Full audit trail with confidence scores and reasoning metadata

License

MIT License - See LICENSE for details.

Built with the Nvidia mindset: architect systems that outlast the code.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
data		data
docs		docs
src		src
tests		tests
.gitignore		.gitignore
.windsurfrules		.windsurfrules
ARCH.md		ARCH.md
Dockerfile		Dockerfile
EVALUATION_REPORT.md		EVALUATION_REPORT.md
FINAL_VERIFICATION_REPORT.md		FINAL_VERIFICATION_REPORT.md
LICENSE		LICENSE
README.md		README.md
WINDSURF_PRD.md		WINDSURF_PRD.md
demo_api.py		demo_api.py
demo_app.py		demo_app.py
docker-compose.yml		docker-compose.yml
misclassifications.csv		misclassifications.csv
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
stress_test_api.py		stress_test_api.py
validation_set_v2.jsonl		validation_set_v2.jsonl

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Sheng-Sentiment-Base

Overview

The Problem

The Solution

Quick Start

Usage

API Usage (FastAPI Service)

Scrape and Process

Project Structure

Architecture

Roadmap

Phase 1: Foundation (Complete)

Phase 2: Scale (Next)

Phase 3: Ecosystem (Future)

Contributing

The Vision

Connect

Compliance

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Sheng-Sentiment-Base

Overview

The Problem

The Solution

Quick Start

Usage

API Usage (FastAPI Service)

Scrape and Process

Project Structure

Architecture

Roadmap

Phase 1: Foundation (Complete)

Phase 2: Scale (Next)

Phase 3: Ecosystem (Future)

Contributing

The Vision

Connect

Compliance

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages