Ancient Sanskrit and Hindi manuscripts—tantras, stotras, and sacred texts—are being lost to time. Existing OCR tools:
OCR-Devnagari combines local OCR speed with Gemini AI accuracy:

| | Feature | Description |
|---|---|---|
| 🔀 | Multi-Engine Support | 5 OCR backends to choose from |
| 🧠 | Smart Hybrid Mode | EasyOCR + Gemini for optimal results |
| 🕉️ | Mantra Detection | Auto-detect and preserve sacred text |
| ⚡ | High Performance | Async concurrent workers |
| 💾 | Crash-Safe | Resume from any interruption |
| 📊 | Live Progress | Real-time tracking with ETA |
| 🛡️ | Graceful Shutdown | Ctrl+C saves all work |
| 🧹 | Memory Efficient | Handles 1000+ page PDFs |
| ✅ | Response Validation | Rejects invalid OCR results |
# Clone the repository
git clone https://github.com/rajeshkanaka/OCR-Devnagari.git
cd OCR-Devnagari
# Install with UV (recommended)
uv sync && uv pip install easyocr
# Or with pip
pip install -r requirements.txt && pip install easyocr

# Option A: Vertex AI (Recommended for production)
export GOOGLE_CLOUD_PROJECT="your-project"
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_GENAI_USE_VERTEXAI=1
# Option B: API Key (Quick setup)
export GEMINI_API_KEY="your-key"

# 🔥 Hybrid mode - 90% savings, maximum accuracy
python -m ocr_hindi ocr manuscript.pdf --pages "all"
# 🆓 100% FREE local processing
python -m ocr_hindi ocr manuscript.pdf -e easyocr
# 💎 Premium Gemini mode for critical documents
python -m ocr_hindi ocr manuscript.pdf -e gemini
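As a rough illustration of how the two authentication options (Vertex AI or API key) might be told apart at startup, the sketch below inspects the same environment variables that the export lines set. The helper name `detect_auth_mode` and the precedence (Vertex AI checked first) are assumptions made for this example, not the project's actual logic.

```python
import os
from typing import Mapping, Optional


def detect_auth_mode(env: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Return "vertex", "api_key", or None.

    Hypothetical helper for illustration; it is not part of ocr_hindi.
    """
    env = os.environ if env is None else env
    # Option A: Vertex AI needs the opt-in flag and a project ID.
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").strip().lower() in {"1", "true"}
    if use_vertex and env.get("GOOGLE_CLOUD_PROJECT"):
        return "vertex"
    # Option B: a plain API key.
    if env.get("GEMINI_API_KEY"):
        return "api_key"
    return None


if __name__ == "__main__":
    print(detect_auth_mode() or "no Gemini credentials found")
```

For a real check of dependencies and credentials, `python -m ocr_hindi validate` is the command to rely on.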
| Engine | Cost | Accuracy | Speed | Best For |
|---|---|---|---|---|
| 🔀 hybrid | ~$0.30/1K | ⭐⭐⭐⭐⭐ | ⚡⚡⚡ | Recommended |
| 🆓 easyocr | FREE | ⭐⭐⭐⭐ | ⚡⚡ | Budget-conscious |
| 🆓 marker | FREE | ⭐⭐⭐⭐⭐ | ⚡⚡⚡ | Structured PDFs |
| 🆓 tesseract | FREE | ⭐⭐⭐ | ⚡⚡⚡⚡ | Simple documents |
| 💎 gemini | ~$2/1K | ⭐⭐⭐⭐⭐ | ⚡⚡⚡⚡ | Critical accuracy |
"Write once, crash anywhere, resume everywhere"
┌────────────────────────────────────────────────────────────┐
│                        📄 PDF Input                        │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│                   🔀 INTELLIGENT ROUTING                   │
├────────────────────────────────────────────────────────────┤
│   hybrid     easyocr      marker    tesseract     gemini   │
│  DEFAULT       FREE        FREE        FREE      PREMIUM   │
│     │                                                      │
│     ▼                                                      │
│                 🧠 HYBRID DECISION ENGINE                  │
│                                                            │
│ EasyOCR (FREE) ──▶ Confidence < 85%? ──▶ Mantra detected?  │
│                            │                    │          │
│                            ▼                    ▼          │
│                    💎 Gemini 2.0 Flash                     │
│                    • thinking_level: "low"                 │
│                    • media_resolution: "high"              │
│                    • Token tracking for cost               │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│                   🛡️ CRASH-SAFE PIPELINE                   │
├────────────────────────────────────────────────────────────┤
│  OCR     ──▶ Cache        ──▶ Progress ──▶ Release         │
│  Process     Atomic Write     Update       Memory          │
│              page_NNN.txt                                  │
│                                                            │
│  On interrupt (Ctrl+C) or crash:                           │
│  ✓ All cached pages preserved                              │
│  ✓ Resume skips completed pages                            │
│  ✓ No duplicate API charges                                │
│  ✓ Output merged from cache                                │
└─────────────────────────────┬──────────────────────────────┘
                              │
                              ▼
┌────────────────────────────────────────────────────────────┐
│            📝 Markdown Output + 💰 Cost Report             │
└────────────────────────────────────────────────────────────┘
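The hybrid decision in the diagram can be read as a small predicate: a page goes to Gemini when EasyOCR's confidence falls below the threshold, or when a mantra is detected and mantra verification is enabled. The sketch below is a minimal reading of that flow. The `PageResult` type, the `needs_gemini` name, and the idea of a single per-page confidence score are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PageResult:
    """Hypothetical container for one page of local OCR output."""
    text: str
    confidence: float  # assumed: one aggregate score in [0.0, 1.0] per page
    has_mantra: bool   # assumed: set by a separate mantra detector


def needs_gemini(page: PageResult, threshold: float = 0.85,
                 verify_mantras: bool = True) -> bool:
    """Decide whether a page should be re-processed by Gemini."""
    # Low-confidence pages are always escalated.
    if page.confidence < threshold:
        return True
    # Confident pages are escalated only if they contain mantras
    # and verification has not been switched off.
    return verify_mantras and page.has_mantra


if __name__ == "__main__":
    page = PageResult(text="...", confidence=0.91, has_mantra=True)
    print(needs_gemini(page))                        # escalated: mantra page
    print(needs_gemini(page, verify_mantras=False))  # kept: confident, verification off
```

The defaults mirror the documented `--confidence 0.85` and `--verify-mantras` settings; raising the threshold sends more pages to Gemini.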
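Pattern-based detection of sacred text flags mantra pages so that, by default, they are verified with Gemini for maximum accuracy. The detector looks for four categories:

| Pattern | Meaning |
|---|---|
| बीज मन्त्र | Seed (bija) mantras |
| मन्त्र समाप्ति | Mantra endings |
| श्लोक चिह्न | Verse (shloka) markers |
| विभाग सूचक | Section indicators |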
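To make the four categories concrete, here is a minimal sketch of pattern-based detection. The specific syllables and words in each pattern are illustrative assumptions (common examples of each category); the project's actual pattern lists are not shown in this README.

```python
import re

# Illustrative patterns only: one small, assumed sample per category.
PATTERNS = {
    "bija_mantra": re.compile("ॐ|ह्रीं|श्रीं|क्लीं"),      # seed syllables
    "mantra_ending": re.compile("स्वाहा|नमः|फट्"),        # closing words
    "verse_marker": re.compile(r"॥\s*[०-९]+\s*॥"),     # numbered double danda
    "section_marker": re.compile("अध्याय"),              # "chapter"
}


def detect_mantra_patterns(text: str) -> set[str]:
    """Return the names of all categories that occur in the text."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(text)}


if __name__ == "__main__":
    print(sorted(detect_mantra_patterns("ॐ ह्रीं नमः")))
```

Plain substring-style patterns are used instead of `\b` word boundaries because Devanagari vowel signs and other combining marks are not treated as word characters by Python's `re`, which makes boundary matching unreliable.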
# Process entire manuscript with intelligent routing
python -m ocr_hindi ocr sacred_text.pdf --pages "all"
# Adjust confidence threshold (higher = more Gemini verification)
python -m ocr_hindi ocr sacred_text.pdf --confidence 0.90
# Disable mantra verification for faster processing
python -m ocr_hindi ocr sacred_text.pdf --no-verify-mantras
# Process specific page ranges
python -m ocr_hindi ocr sacred_text.pdf --pages "1-100,200-250"
# Use more workers for faster processing
python -m ocr_hindi ocr sacred_text.pdf --workers 10

# EasyOCR - Good Hindi/Devanagari support, no API needed
python -m ocr_hindi ocr book.pdf -e easyocr
# Marker — Best for structured books and PDFs
python -m ocr_hindi ocr book.pdf -e marker
# Tesseract — Fast, requires system installation
python -m ocr_hindi ocr book.pdf -e tesseract

# Maximum accuracy for critical manuscripts
python -m ocr_hindi ocr rare_manuscript.pdf -e gemini
# With high concurrency
python -m ocr_hindi ocr rare_manuscript.pdf -e gemini --workers 15

# List all available engines with details
python -m ocr_hindi engines
# Validate your setup (dependencies + authentication)
python -m ocr_hindi validate
# View PDF information
python -m ocr_hindi info manuscript.pdf
# Dry run — see what would be processed
python -m ocr_hindi ocr manuscript.pdf --dry-run
# Resume interrupted processing
python -m ocr_hindi ocr manuscript.pdf --resume

| Option | Description | Default |
|---|---|---|
| -e, --engine | OCR engine (hybrid, easyocr, marker, tesseract, gemini) | hybrid |
| -p, --pages | Page range (all, 1-50, 1,5,10-20) | interactive |
| -w, --workers | Concurrent workers (1-20) | 5 |
| -c, --confidence | Hybrid threshold (0.0-1.0) | 0.85 |
| --verify-mantras | Verify mantra pages with Gemini | true |
| -r, --resume | Resume from previous progress | false |
| -n, --dry-run | Preview without processing | false |
| --dpi | PDF rendering quality | 200 |
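The `--pages` formats listed in the table (`all`, `1-50`, `1,5,10-20`) can be expanded into explicit page numbers along the following lines. This is a sketch of one reasonable parser, not the project's own; the function name `parse_pages` and the choice to reject out-of-range input with `ValueError` are assumptions.

```python
def parse_pages(spec: str, total_pages: int) -> list[int]:
    """Expand "all", "1-50", or "1,5,10-20" into sorted 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))

    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        start_text, sep, end_text = part.partition("-")
        start = int(start_text)
        end = int(end_text) if sep else start  # single page when no dash
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1,5,10-12", total_pages=20))
```

Collecting pages in a set means overlapping ranges such as `1-10,5-15` are processed once per page rather than twice.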
your_manuscript/
├── 📄 manuscript.pdf                        # Original file
├── 📝 manuscript_unicode.md                 # ✨ Final output (Devanagari text)
├── 📋 ocr_manuscript_20240120_143022.log    # Processing log
├── 📊 .ocr_progress_manuscript.json         # Resume state
└── 📂 .ocr_cache_manuscript/                # 🛡️ Crash-safe cache
    ├── page_0001.txt                        # Individual page cache
    ├── page_0001.meta.json                  # Page metadata
    ├── page_0002.txt
    └── ...
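One common way to get a crash-safe per-page cache like the one in this layout is to write each page to a temporary file and rename it into place, so a `page_NNNN.txt` file either exists completely or not at all. The sketch below assumes that technique; the function names are hypothetical and the project's actual implementation may differ.

```python
import os
import tempfile
from pathlib import Path


def cache_path(cache_dir: Path, page: int) -> Path:
    """Cache file for a 1-based page number, e.g. page_0001.txt."""
    return cache_dir / f"page_{page:04d}.txt"


def write_page_atomic(cache_dir: Path, page: int, text: str) -> Path:
    """Write one page so that a crash never leaves a half-written cache file."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_path(cache_dir, page)
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, target)  # atomic swap into the final name
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


def pending_pages(cache_dir: Path, pages: list[int]) -> list[int]:
    """Pages that still need OCR, i.e. those without a cache file (resume logic)."""
    return [p for p in pages if not cache_path(cache_dir, p).exists()]
```

On resume, only the pages returned by `pending_pages` would need to be sent to an OCR engine, which is what keeps an interrupted run from paying for the same API call twice.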
| Mode | 1000 Pages | Throughput | Cost | Notes |
|---|---|---|---|---|
| 🔀 Hybrid | ~90 min | ~11 ppm | ~$1 | Best value |
| 🆓 EasyOCR | ~120 min | ~8 ppm | $0 | 100% free |
| 🆓 Marker | ~60 min | ~16 ppm | $0 | Structured PDFs |
| 💎 Gemini | ~45 min | ~22 ppm | ~$10 | Max accuracy |
ppm = pages per minute • Tested on M1 MacBook Pro with 10 workers
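The throughput and savings figures follow from simple arithmetic on the table, which the short sketch below reproduces. The numbers are copied from the table; the dictionary layout and function names are only for illustration.

```python
# (minutes per 1000 pages, approximate cost in USD), copied from the table.
BENCHMARKS = {
    "hybrid": (90, 1.0),
    "easyocr": (120, 0.0),
    "marker": (60, 0.0),
    "gemini": (45, 10.0),
}


def pages_per_minute(minutes_per_1000: float) -> float:
    """Throughput implied by the time taken for 1000 pages."""
    return 1000 / minutes_per_1000


def savings_vs_gemini(mode: str) -> float:
    """Fraction of the Gemini-only cost saved by running `mode` instead."""
    gemini_cost = BENCHMARKS["gemini"][1]
    return 1 - BENCHMARKS[mode][1] / gemini_cost


if __name__ == "__main__":
    for mode, (minutes, _cost) in BENCHMARKS.items():
        print(f"{mode}: {pages_per_minute(minutes):.1f} ppm, "
              f"{savings_vs_gemini(mode):.0%} cheaper than Gemini-only")
```

With these inputs, hybrid mode works out to roughly 11 pages per minute at about one tenth of the Gemini-only cost.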
❌ "poppler not found"
# macOS
brew install poppler
# Ubuntu/Debian
sudo apt-get install poppler-utils
# Windows - Download from:
# https://github.com/oschwartz10612/poppler-windows/releases

❌ "EasyOCR not installed"
uv pip install easyocr
# or
pip install easyocr

❌ "Tesseract not installed"
# macOS
brew install tesseract tesseract-lang
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-hin tesseract-ocr-san
# Windows - Download installer from:
# https://github.com/UB-Mannheim/tesseract/wiki

❌ Authentication errors
# Verify Vertex AI setup
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID
# Or use API key instead
export GEMINI_API_KEY="your-api-key-here"
# Test authentication
python -m ocr_hindi validate

❌ Rate limiting (429 errors)
# Reduce concurrent workers
python -m ocr_hindi ocr book.pdf --workers 3
# The system will automatically retry with exponential backoff

❌ High memory usage
# Reduce workers (each worker holds images in memory)
python -m ocr_hindi ocr book.pdf --workers 2
# Or process in smaller batches
python -m ocr_hindi ocr book.pdf --pages "1-100"
python -m ocr_hindi ocr book.pdf --pages "101-200" --resume

Contributions are what make the open source community amazing!

🐛 Bug Reports • 💡 Feature Ideas • 🔧 Pull Requests • 📖 Documentation

# Fork, clone, and create a branch
git clone https://github.com/YOUR_USERNAME/OCR-Devnagari.git
cd OCR-Devnagari
git checkout -b feature/amazing-feature
# Make your changes, then
git commit -m "Add amazing feature"
git push origin feature/amazing-feature
# Open a Pull Request 🎉

MIT License - Free for personal and commercial use.
See LICENSE for details.