Multilingual OCR with per-region script routing. Detects whether each text region is Arabic or Latin and sends it to the engine that handles it best. Built for MENA documents that freely mix Arabic and English — packaging, ads, signage, government forms.
Most OCR projects treat OCR as a single-engine problem: pass the image to Tesseract, or pass it to EasyOCR, get text back. That works for single-language documents. It falls apart on real MENA content.
A typical Arabic newspaper ad has an Arabic headline, an English brand name, a phone number, and a French luxury tagline. A government form has Arabic field labels and English placeholder text. A product package has Arabic ingredients on one side and English nutrition facts on the other.
Tesseract is faster and historically more accurate on Latin script. EasyOCR is much stronger on Arabic and handles RTL natively. The right answer is to run both, but only on the regions where each actually wins. That's what this toolkit does.
Image
│
▼
[Detect text regions] ← EasyOCR's CRAFT detector (RTL-aware)
│
▼
[Classify per region] ← Unicode script counting (no ML)
│
├── Latin → Tesseract (fast, accurate on Latin)
├── Arabic → EasyOCR ar+en (RTL + diacritics handled)
└── Mixed → EasyOCR ar+en (single pass over both scripts)
│
▼
[Optional: Arabic normalization] ← diacritics, alef/yaa folding
│
▼
Regions: text + bbox + script + confidence + engine
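The flow above can be sketched as a small dispatch function. This is an illustration under stated assumptions, not the toolkit's actual `pipeline.py`: `Region`, `pick_engine`, and `route_regions` are hypothetical names, and each recognizer is a stand-in callable that maps an image crop to `(text, confidence)`. Only the `prefer_tesseract_for_latin` flag and the region fields are taken from this README.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

@dataclass
class Region:
    # Mirrors the fields listed in the diagram: text + bbox + script + confidence + engine.
    text: str
    bbox: Tuple[int, int, int, int]
    script: str
    confidence: float
    engine: str

# Assumption: a recognizer takes an image crop and returns (text, confidence).
Recognizer = Callable[[Any], Tuple[str, float]]

def pick_engine(script: str, prefer_tesseract_for_latin: bool = True) -> str:
    """Latin goes to Tesseract when preferred; Arabic and mixed stay on EasyOCR."""
    if script == "latin" and prefer_tesseract_for_latin:
        return "tesseract"
    return "easyocr"

def route_regions(
    detections: List[Tuple[Tuple[int, int, int, int], str, Any]],
    recognizers: Dict[str, Recognizer],
    prefer_tesseract_for_latin: bool = True,
) -> Tuple[List[Region], Dict[str, int]]:
    """Run each (bbox, script, crop) through the engine chosen for its script."""
    regions = []
    for bbox, script, crop in detections:
        name = pick_engine(script, prefer_tesseract_for_latin)
        text, confidence = recognizers[name](crop)
        regions.append(Region(text, bbox, script, confidence, name))
    # Count how many regions each engine handled.
    summary = dict(Counter(r.engine for r in regions))
    return regions, summary
```

Keeping the engine choice in one pure function means the routing rule can be unit-tested with fake recognizers, without loading any OCR model.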
The script-detection step is deliberately script-level (Arabic vs Latin), not language-level (English vs French vs Spanish). The OCR engine doesn't care if Latin text is English or French; same characters, same recognition. Distinguishing those is a separate language-ID problem solved AFTER OCR, not before.
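To make "Unicode script counting" concrete, here is a minimal sketch of a script-level classifier. It is not the toolkit's `detect_from_text`: the function name `classify_script`, the code point ranges, and the 20% `mixed_threshold` are assumptions chosen for illustration.

```python
from collections import Counter

# Assumed ranges for illustration; the toolkit's own tables may differ.
ARABIC_RANGES = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFC),  # Arabic Presentation Forms-B
]
LATIN_RANGES = [
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x00C0, 0x024F),  # Latin-1 letters plus Latin Extended-A/B (approximate)
]

def _in_ranges(code_point, ranges):
    return any(lo <= code_point <= hi for lo, hi in ranges)

def classify_script(text, mixed_threshold=0.2):
    """Return 'arabic', 'latin', 'mixed', or 'unknown' by counting code points."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        if _in_ranges(cp, ARABIC_RANGES):
            counts["arabic"] += 1
        elif _in_ranges(cp, LATIN_RANGES):
            counts["latin"] += 1
        # ASCII digits, spaces, and punctuation are script-neutral and ignored.
    total = counts["arabic"] + counts["latin"]
    if total == 0:
        return "unknown"
    # "mixed" when the minority script holds at least mixed_threshold of the letters.
    minority_share = min(counts["arabic"], counts["latin"]) / total
    if minority_share >= mixed_threshold:
        return "mixed"
    return "arabic" if counts["arabic"] > counts["latin"] else "latin"
```

Because the classifier only looks at code points, French, Spanish, and English text all land in the same `latin` bucket, which is the script-level behavior described above.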
git clone https://github.com/Abd-alrhman1/multilingual-ocr-toolkit
cd multilingual-ocr-toolkit
pip install -r requirements.txt
# Install the Tesseract binary (one-time):
# macOS: brew install tesseract
# Ubuntu: sudo apt-get install tesseract-ocr
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
python tests/test_core.py # 18/18 passing, no model downloads
streamlit run scripts/streamlit_app.py # first run downloads ~600 MB EasyOCR weights
import numpy as np
from PIL import Image
from mlocr import MultilingualOCR, detect_from_text
# 1) Pure-text script detection (no models needed)
detect_from_text("Hello مرحبا") # → "mixed"
detect_from_text("هذا نص عربي طويل") # → "arabic"
detect_from_text("Hello, World!") # → "latin"
# 2) Full pipeline on an image
img = np.array(Image.open("ad.jpg").convert("RGB"))
pipeline = MultilingualOCR(
    prefer_tesseract_for_latin=True,
    normalize_arabic=False,
)
result = pipeline.run(img)
for r in result.regions:
    print(f"[{r.script}] ({r.engine}, conf={r.confidence:.2f}) {r.text}")
print(f"\n{result.elapsed_ms:.0f} ms total")
print(f"Routing: {result.routing_summary}")
- Script detection — Unicode-based, no ML, no network. Counts code points per script (Arabic, Latin, CJK, Cyrillic) with configurable thresholds for "mixed" detection. Fully tested.
- Pluggable OCR engine adapters — TesseractEngine and EasyOCREngine behind a common OCREngine interface. Lazy-loaded so importing the package is fast.
- Routing pipeline — first pass through EasyOCR for region detection + candidate text, then re-route Latin regions to Tesseract for the second pass when it's the better choice.
- Optional Arabic normalization — strip diacritics, fold alef and yaa variants, normalize Arabic-Indic digits. Off by default; available for downstream search use cases.
- Synthetic image generator — render arbitrary text on plain backgrounds with the right font for the script. Used by tests and demos so CI doesn't need real images.
- Streamlit demo — upload an image, see boxes color-coded by script, get the extracted text with engine + confidence per region.
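The optional Arabic normalization step listed above could look roughly like this. This is a sketch, not the toolkit's implementation: the function name `normalize_arabic_text`, the exact set of diacritics, and the folding direction (alef variants to bare alef, alef maksura to yaa) are assumptions, and other search stacks fold differently.

```python
import re

# Tashkeel marks U+064B..U+0652 (fathatan through sukun) plus superscript alef U+0670.
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")

# Assumed folding table: hamza/madda alef variants to bare alef, alef maksura to yaa.
_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0622": "\u0627",  # alef with madda above -> alef
    "\u0649": "\u064A",  # alef maksura -> yaa
})

# Arabic-Indic digits U+0660..U+0669 to ASCII 0-9.
_DIGITS = str.maketrans(
    "\u0660\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669",
    "0123456789",
)

def normalize_arabic_text(text):
    """Strip diacritics, fold alef/yaa variants, and convert Arabic-Indic digits."""
    text = _DIACRITICS.sub("", text)
    text = text.translate(_FOLD)
    return text.translate(_DIGITS)
```

Normalization of this kind loses information (the diacritics cannot be recovered), which is one reason to keep it off by default and apply it only when building a search index or comparing strings.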
multilingual-ocr-toolkit/
├── src/mlocr/
│ ├── script_detect.py # Unicode-based script classification
│ ├── engines.py # Tesseract + EasyOCR adapters
│ ├── pipeline.py # detection → routing → recognition
│ └── synthetic.py # test image generator
├── scripts/
│ └── streamlit_app.py # interactive demo
├── tests/
│ └── test_core.py # 18 unit tests, no network
└── benchmarks/ # accuracy/latency comparison (WIP)
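The lazy-loaded adapter idea behind `engines.py` can be pictured roughly as follows. Only the name `OCREngine` comes from this README; the `recognize` method, the `LazyEngine` class, and the loader callable are hypothetical, shown to illustrate how a heavy backend can be deferred until the first call.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional, Tuple

# Assumption: a loaded backend is a callable mapping an image to (text, confidence).
Backend = Callable[[Any], Tuple[str, float]]

class OCREngine(ABC):
    """Common interface (hypothetical shape): image in, (text, confidence) out."""

    name: str = "base"

    @abstractmethod
    def recognize(self, image: Any) -> Tuple[str, float]:
        ...

class LazyEngine(OCREngine):
    """Defers building the backend until recognize() is first called."""

    def __init__(self, name: str, loader: Callable[[], Backend]):
        self.name = name
        self._loader = loader
        self._backend: Optional[Backend] = None

    def _ensure_loaded(self) -> Backend:
        if self._backend is None:
            # A real loader would import the OCR library and load weights here,
            # so that importing this module stays cheap.
            self._backend = self._loader()
        return self._backend

    def recognize(self, image: Any) -> Tuple[str, float]:
        return self._ensure_loaded()(image)
```

With this shape, constructing an engine costs nothing; the loader runs once on the first `recognize` call and the backend is reused afterwards.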
What this toolkit is not trying to do:
- Beat the state of the art on a published benchmark. This is pragmatic engineering, not research. The contribution is the routing strategy, not a new model.
- Handle every script. v1 routes Arabic and Latin. CJK, Cyrillic, Devanagari etc. fall through to EasyOCR's multi-script model. PRs welcome to add specialized routing.
- Detect language inside Latin script. A region of "Hola amigos" classifies as latin and goes to Tesseract — which reads the shared Latin letters correctly even with the English language pack (accented characters such as ñ are more reliable with the matching language pack installed). Distinguishing English-vs-Spanish-vs-French is a post-OCR language-ID problem, out of scope here.
- Be the fastest possible OCR. EasyOCR's first-run model download is ~600 MB. Inference is real-time on GPU, slower on CPU. For throughput-critical pipelines, switch to a paid API or train a task-specific model.
Roadmap:
- Curated Arabic+English benchmark set (CER per script, latency p95)
- Comparison with paid APIs (Google Vision, AWS Textract)
- Confidence-weighted ensemble when both engines disagree
- Document-structure preservation (paragraphs, tables)
- FastAPI service for batch jobs
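For reference, character error rate (CER) is conventionally the character-level edit distance between the reference text and the OCR output, divided by the reference length. The sketch below shows one way the per-script CER and p95 latency named in the benchmark item could be computed. The function names and the `(script, reference, hypothesis)` sample format are assumptions, not part of the toolkit's `benchmarks/` code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute = 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer_by_script(samples):
    """Aggregate CER per script from (script, reference, hypothesis) tuples."""
    errors, lengths = {}, {}
    for script, ref, hyp in samples:
        errors[script] = errors.get(script, 0) + edit_distance(ref, hyp)
        lengths[script] = lengths.get(script, 0) + len(ref)
    # Scripts whose references are all empty are skipped to avoid dividing by zero.
    return {s: errors[s] / lengths[s] for s in errors if lengths[s] > 0}

def p95(latencies_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]
```

Aggregating edit distance and reference length per script, rather than averaging per-sample CER, keeps short regions such as phone numbers from dominating the score.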
Apache 2.0 — see LICENSE.
Built by Abdalrhman Qasim as part of an AI/ML engineering portfolio focused on Arabic enterprise AI in MENA.
Companion projects:
- OCI Arabic RAG Toolkit — Arabic RAG for Oracle Cloud Infrastructure
- Transaction Fraud Detector — cost-aware fraud detection with Optuna tuning