hajirufai/readme.md

Hey, I'm Haji 👋

Data engineer who builds infrastructure from scratch to understand how it actually works.

Based in Mombasa, Kenya. I write Python, SQL, and whatever gets the pipeline running.

LinkedIn Dev.to Email


What I build

Data pipelines and infrastructure. Most of my recent work is a series of from-scratch implementations of tools I use daily. No frameworks, no dependencies, just the core algorithms:

| Project | What it is |
| --- | --- |
| streamlite | Stream processing engine - windowing, watermarks, keyed state, checkpoints. Flink internals demystified. |
| brokerlite | Message broker with pub/sub, consumer groups, WAL, dead letter queues. Kafka-inspired. |
| raftkv | Distributed key-value store with Raft consensus - leader election, log replication, strong consistency. |
| queryforge | SQL query engine - lexer, parser, optimizer, executor. SELECT, JOIN, GROUP BY, subqueries over CSV/JSON. |
| searchlite | Full-text search engine - inverted index, BM25 scoring, Porter stemmer, faceted search. |
| cachelite | In-memory cache with LRU/LFU/FIFO eviction, TTL, snapshots, HTTP API. |
| cronlite | Task scheduler - POSIX cron syntax, priority queues, DAG dependencies, retry strategies, SQLite persistence. |
| vaultlite | Secrets manager with AES-128 from scratch. Envelope encryption, seal/unseal, audit logging, versioning. |
| gatelite | API gateway - routing, rate limiting, JWT auth, circuit breaking, load balancing, caching. |
| tracelite | Distributed tracing - W3C Trace Context, sampling, critical path analysis, waterfall visualization. |
| servekit | HTTP/1.1 server built from raw TCP sockets. |
| tinylang | Programming language interpreter - lexer, parser, AST, closures, first-class functions. |

Every one of these has zero dependencies and uses only the Python standard library.
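To give a flavor of what a stdlib-only implementation can look like, here is a minimal sketch of LRU eviction with per-entry TTL, the kind of logic a cache like cachelite covers. It is not cachelite's actual code: the `LRUCache` class, its method names, and the injectable `clock` parameter are assumptions made for illustration.

```python
from collections import OrderedDict
import time


class LRUCache:
    """Tiny LRU cache with optional per-entry TTL (illustrative sketch only)."""

    def __init__(self, capacity, clock=time.monotonic):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items = OrderedDict()  # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, expires_at)
        self._items.move_to_end(key)  # mark as most recently used
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict the least recently used entry

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._items[key]  # lazily drop expired entries on read
            return default
        self._items.move_to_end(key)  # a hit refreshes recency
        return value
```

`OrderedDict` keeps insertion order and supports `move_to_end` and `popitem(last=False)`, so recency tracking and eviction need no extra data structure.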


Production data work

| Project | Stack |
| --- | --- |
| afridata-pipeline | World Bank API to DuckDB star-schema warehouse. Dimensional modeling, data quality checks, Vercel dashboard. |
| realtime-event-pipeline | Kafka + DuckDB streaming pipeline. Ingestion, transformation, enrichment, OLAP analytics. |
| dbt-ecommerce-warehouse | dbt + DuckDB analytics warehouse. Star schema, 50+ tests, custom macros, incremental models. |
| stock-market-data-pipeline | Real-time stock tracking. Airflow, Spark, Slack alerts, Metabase dashboards. |
| datapact | Data quality and contract validation library. Declare expectations, enforce in pipelines and CI. |
| datadrift | Drift detection framework - schema changes, distribution shifts, statistical testing, HTML reports. |

Tools and AI

| Project | What it does |
| --- | --- |
| documind | RAG document Q&A. Hybrid search (BM25 + TF-IDF), cited answers, pluggable LLMs. |
| ai-agent-toolkit | Composable agent framework - tool use, memory, multi-agent orchestration. Under 1000 lines of core. |
| pipeforge | CI/CD pipeline generator - analyzes codebases and outputs GitHub Actions, GitLab CI, Docker configs. |
| vectorlite | Vector search engine - Flat, IVF, HNSW indexes with cosine/euclidean/dot product. |
| airbnb-clone | Full-stack MERN app. MongoDB, Express, React, Node. Auth, search, bookings, image upload. |

Background

  • BSc Mathematics and Computer Science, JKUAT
  • Data Engineering certs from ExploreAI Academy and Wizeline Academy
  • AWS Certified Cloud Practitioner
  • Day-to-day: Python, SQL, dbt, Airflow, Spark, Kafka, DuckDB, BigQuery, Docker, GCP, Azure
