In this repository, OCR-related datasets are available.
Updated Jan 6, 2026
Arabic Chat with PDF is a user-friendly application that lets you interact with Arabic PDF documents. Powered by advanced language models, OCR, and vector search, it allows you to upload PDFs, ask questions, and receive accurate Arabic responses 🚀
This research aims to fine-tune an Arabic OCR model using Tesseract 5.0, enhancing text recognition accuracy through extensive data collection, preprocessing, and image generation. By leveraging advanced training techniques and data augmentation, we achieve significant reductions in word error rate (WER).
Alef-OCR-Image2Html is an OCR model designed to transform Arabic documents, including historical texts, scanned pages, and handwritten materials, into structured, semantic HTML.
Official code for "Ketaba-OCR at AR-MS NakbaNLP 2026": QLoRA fine-tuning of a specialized HTR model with a Linear+Boost ensemble for Arabic manuscript recognition. 1st place per-line (CER 0.082) and 3rd place on the official leaderboard at NakbaNLP 2026 (LREC 2026).
Optical Character Recognition, OCR pipeline, Arabic OCR, Deep Learning OCR, Computer Vision text extraction, Text recognition system, AI document processing, Multilingual OCR, Transformer OCR, OCR benchmarking, Bounding box detection, Ground truth evaluation.
Additional experimental model for NakbaNLP 2026 Shared Task (AR-MS) — LoRA/DoRA fine-tuning of Qari-OCR (Qwen2-VL-2B) for Arabic handwritten manuscript recognition on the Omar Al-Saleh Memoir Collection (1951-1965).
Nassij V3: High-accuracy Arabic PDF-to-DOCX converter with direct digital extraction (NassijScanner) and cryptographic linguistic integrity verification (Merkle proofs).
Multilingual OCR with per-region script routing for Arabic + Latin. Built for MENA documents.
OCR-first Arabic book corpus platform with citation-grade APIs
Local Python pipeline + bilingual SPA archiving the @AqmarTofan Telegram channel — Telethon, ffmpeg, EasyOCR (Arabic+English), openpyxl, Alpine.js.
A deep learning-based handwritten Arabic OCR system using ResNet50 + BiLSTM + Attention with CTC decoding. Achieves 96.3% character accuracy and 80% word accuracy on the IFN/ENIT dataset, featuring a PyQt6 desktop GUI for real-time inference. Supports both greedy and beam search decoding.
Local Arabic OCR field extraction for utility bills with PaddleOCR, FastAPI, CLI, and validation.
Arabic Plate Recognition System
An AI-powered OCR and document processing system designed to convert Arabic PDF books and images into high-quality, editable scientific text layouts
Fork of h9-tec/Manazir-OCR — Arabic-first multi-model OCR framework. Patched for API-only install (torch/transformers moved to optional local-models extra).
End-to-end Arabic manuscript digitization and AI summarization pipeline for digital humanities research.
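Several of the descriptions above report character error rate (CER) or word error rate (WER). As a rough illustration of how such metrics are commonly computed, here is a minimal sketch based on plain Levenshtein edit distance. It assumes no text normalization (for example, no diacritic stripping), and the function names `edit_distance`, `cer`, and `wer` are hypothetical, not taken from any of the listed repositories, which may compute their scores differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from the reference
                curr[j - 1] + 1,     # insertion into the reference
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # One missing character out of four reference characters.
    print(cer("كتاب", "كتب"))
    # One substituted word and one missing word out of four reference words.
    print(wer("a b c d", "a x c"))
```

Under these assumptions, a reported CER of 0.082 would correspond to roughly eight character edits per hundred reference characters.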