Multilingual OCR Toolkit

Multilingual OCR with per-region script routing. Detects whether each text region is Arabic or Latin and sends it to the engine that handles it best. Built for MENA documents that freely mix Arabic and English — packaging, ads, signage, government forms.

Why this exists

Most OCR projects treat OCR as a single-engine problem: pass the image to Tesseract, or pass it to EasyOCR, get text back. That works for single-language documents. It falls apart on real MENA content.

A typical Arabic newspaper ad has an Arabic headline, an English brand name, a phone number, and a French luxury tagline. A government form has Arabic field labels and English placeholder text. A product package has Arabic ingredients on one side and English nutrition facts on the other.

Tesseract is faster and historically more accurate on Latin script. EasyOCR is much stronger on Arabic and handles RTL natively. The right answer is to run both, but only on the regions where each actually wins. That's what this toolkit does.

How it works

Image
  │
  ▼
[Detect text regions]      ← EasyOCR's CRAFT detector (RTL-aware)
  │
  ▼
[Classify per region]      ← Unicode script counting (no ML)
  │
  ├── Latin  → Tesseract        (fast, accurate on Latin)
  ├── Arabic → EasyOCR ar+en    (RTL + diacritics handled)
  └── Mixed  → EasyOCR ar+en    (single pass over both scripts)
  │
  ▼
[Optional: Arabic normalization]   ← diacritics, alef/yaa folding
  │
  ▼
Regions: text + bbox + script + confidence + engine

The script-detection step is deliberately script-level (Arabic vs Latin), not language-level (English vs French vs Spanish). The OCR engine doesn't care if Latin text is English or French; same characters, same recognition. Distinguishing those is a separate language-ID problem solved AFTER OCR, not before.
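The classify-and-route step described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the toolkit's actual code: the names `script_of`, `classify_script`, `choose_engine`, `ROUTES`, and the 0.2 threshold are all assumptions made for the example. It shows the general idea of counting letters per script and mapping the resulting label to an engine.

```python
import unicodedata
from collections import Counter
from typing import Optional

# Hypothetical threshold: both scripts must hold at least this share of the
# letters in a region before the region is labelled "mixed".
MIXED_THRESHOLD = 0.2

# Hypothetical routing table mirroring the diagram above.
ROUTES = {"latin": "tesseract", "arabic": "easyocr", "mixed": "easyocr"}


def script_of(ch: str) -> Optional[str]:
    """Map one character to a coarse script label, or None for non-letters."""
    if not ch.isalpha():
        return None  # digits, punctuation, spaces and diacritics are ignored
    name = unicodedata.name(ch, "")
    if name.startswith("ARABIC"):
        return "arabic"
    if name.startswith("LATIN"):
        return "latin"
    return "other"


def classify_script(text: str, mixed_threshold: float = MIXED_THRESHOLD) -> str:
    """Label a text region by counting letters per script."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    if total == 0:
        return "unknown"  # no letters at all, e.g. a phone number
    arabic_share = counts["arabic"] / total
    latin_share = counts["latin"] / total
    if arabic_share >= mixed_threshold and latin_share >= mixed_threshold:
        return "mixed"
    # Otherwise the most frequent script wins.
    return counts.most_common(1)[0][0]


def choose_engine(script: str) -> str:
    """Pick an engine name; anything unrouted falls through to EasyOCR."""
    return ROUTES.get(script, "easyocr")
```

Because the classifier only looks at code points, French and English text land in the same `latin` bucket, which matches the script-level (not language-level) design described above.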

Quickstart

git clone https://github.com/Abd-alrhman1/multilingual-ocr-toolkit
cd multilingual-ocr-toolkit
pip install -r requirements.txt

# Install the Tesseract binary (one-time):
#   macOS:   brew install tesseract
#   Ubuntu:  sudo apt-get install tesseract-ocr
#   Windows: https://github.com/UB-Mannheim/tesseract/wiki

python tests/test_core.py             # 18/18 passing, no model downloads
streamlit run scripts/streamlit_app.py # first run downloads ~600 MB EasyOCR weights

Library usage

import numpy as np
from PIL import Image
from mlocr import MultilingualOCR, detect_from_text

# 1) Pure-text script detection (no models needed)
detect_from_text("Hello مرحبا")        # → "mixed"
detect_from_text("هذا نص عربي طويل")    # → "arabic"
detect_from_text("Hello, World!")       # → "latin"

# 2) Full pipeline on an image
img = np.array(Image.open("ad.jpg").convert("RGB"))
pipeline = MultilingualOCR(
    prefer_tesseract_for_latin=True,
    normalize_arabic=False,
)
result = pipeline.run(img)

for r in result.regions:
    print(f"[{r.script}] ({r.engine}, conf={r.confidence:.2f}) {r.text}")

print(f"\n{result.elapsed_ms:.0f} ms total")
print(f"Routing: {result.routing_summary}")

What's inside

Script detection: Unicode-based, no ML, no network. Counts code points per script (Arabic, Latin, CJK, Cyrillic) with configurable thresholds for "mixed" detection. Fully tested.
Pluggable OCR engine adapters: TesseractEngine and EasyOCREngine behind a common OCREngine interface. Lazy-loaded so importing the package is fast.
Routing pipeline: a first pass through EasyOCR provides region detection and candidate text. The script classifier runs on that candidate text, and regions classified as Latin are re-routed to Tesseract for a second pass when it's the better choice.
Optional Arabic normalization: strip diacritics, fold alef and yaa variants, normalize Arabic-Indic digits. Off by default; available for downstream search use cases.
Synthetic image generator: render arbitrary text on plain backgrounds with the right font for the script. Used by tests and demos so CI doesn't need real images.
Streamlit demo: upload an image, see boxes color-coded by script, get the extracted text with engine + confidence per region.
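As an illustration of the optional Arabic normalization listed above, here is a minimal standard-library sketch. The function name `normalize_arabic` and the exact folding choices are assumptions for this example (folding alef maksura to yaa is one common convention; the toolkit may fold differently).

```python
import re

# Arabic short-vowel and related marks: tanween, fatha, damma, kasra,
# shadda, sukun (U+064B..U+0652) plus superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumed folding table: hamza/madda alef variants -> bare alef,
# alef maksura -> yaa.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
    "\u0649": "\u064A",  # alef maksura -> yaa
})

# Arabic-Indic digits U+0660..U+0669 -> ASCII 0..9.
_DIGITS = str.maketrans({chr(0x0660 + i): str(i) for i in range(10)})


def normalize_arabic(text: str) -> str:
    """Strip diacritics, fold alef/yaa variants, and convert digits."""
    text = _DIACRITICS.sub("", text)
    text = text.translate(_FOLD)
    return text.translate(_DIGITS)
```

Normalization like this is lossy, which is a reasonable argument for keeping it off by default and applying it only where search recall matters more than exact transcription.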

Project structure

multilingual-ocr-toolkit/
├── src/mlocr/
│   ├── script_detect.py    # Unicode-based script classification
│   ├── engines.py          # Tesseract + EasyOCR adapters
│   ├── pipeline.py         # detection → routing → recognition
│   └── synthetic.py        # test image generator
├── scripts/
│   └── streamlit_app.py    # interactive demo
├── tests/
│   └── test_core.py        # 18 unit tests, no network
└── benchmarks/             # accuracy/latency comparison (WIP)
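The lazy-loaded adapter idea behind engines.py can be sketched as follows. This is an assumption-laden illustration rather than the project's code: the `recognize` method name, the `LazyEngine` class, and the two loader functions are hypothetical. Only `pytesseract.image_to_string`, `easyocr.Reader`, and `Reader.readtext(..., detail=0)` are real library calls, and they are imported only when an engine is first used.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class OCREngine(ABC):
    """Common interface: every adapter turns an image into text."""

    @abstractmethod
    def recognize(self, image: Any) -> str:
        ...


class LazyEngine(OCREngine):
    """Hypothetical adapter that defers the heavy import until first use."""

    def __init__(self, name: str, loader: Callable[[], Callable[[Any], str]]):
        self.name = name
        self._loader = loader
        self._backend: Optional[Callable[[Any], str]] = None

    def recognize(self, image: Any) -> str:
        if self._backend is None:
            # The loader runs once; later calls reuse the cached backend.
            self._backend = self._loader()
        return self._backend(image)


def load_tesseract() -> Callable[[Any], str]:
    import pytesseract  # imported only when the engine is first used
    return pytesseract.image_to_string


def load_easyocr() -> Callable[[Any], str]:
    import easyocr  # heavy import and model load happen here, not at startup
    reader = easyocr.Reader(["ar", "en"])
    # detail=0 makes readtext return a plain list of recognized strings.
    return lambda image: " ".join(reader.readtext(image, detail=0))
```

With this shape, `LazyEngine("tesseract", load_tesseract)` costs nothing at import time, and a test can pass a stub loader instead of a real engine.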

What this is not trying to do

Beat the state of the art on a published benchmark. This is pragmatic engineering, not research. The contribution is the routing strategy, not a new model.
Handle every script. v1 routes Arabic and Latin. CJK, Cyrillic, Devanagari etc. fall through to EasyOCR's multi-script model. PRs welcome to add specialized routing.
Detect language inside Latin script. A region of "Hola amigos" classifies as latin and goes to Tesseract, which reads this unaccented example correctly even with the English language pack; accented characters such as ñ or é may need the matching language pack (for example spa or fra). Distinguishing English from Spanish from French is a post-OCR language-ID problem, out of scope here.
Be the fastest possible OCR. EasyOCR's first-run model download is ~600 MB. Inference is real-time on GPU, slower on CPU. For throughput-critical pipelines, switch to a paid API or train a task-specific model.

Roadmap

Curated Arabic+English benchmark set (CER per script, latency p95)
Comparison with paid APIs (Google Vision, AWS Textract)
Confidence-weighted ensemble when both engines disagree
Document-structure preservation (paragraphs, tables)
FastAPI service for batch jobs

License

Apache 2.0 — see LICENSE.

Author

Built by Abdalrhman Qasim as part of an AI/ML engineering portfolio focused on Arabic enterprise AI in MENA.

Companion projects:

OCI Arabic RAG Toolkit — Arabic RAG for Oracle Cloud Infrastructure
Transaction Fraud Detector — cost-aware fraud detection with Optuna tuning
