GitHub - rajeshkanaka/OCR-Devnagari: Production-grade OCR for Hindi, Sanskrit & Devanagari manuscripts

॥ श्री गणेशाय नमः ॥ (Salutations to Shri Ganesha)

Processing a 1000-page tantric manuscript with crash-safe resume capability

🌟 Why OCR-Devnagari?

😫 The Problem

Ancient Sanskrit and Hindi manuscripts (tantras, stotras, and other sacred texts) are being lost to time, and existing OCR tools fall short in four ways:

Issue	Impact
❌ Can't handle complex conjuncts	संयुक्ताक्षर destroyed
❌ Destroys mantras	ॐ ह्रीं श्रीं corrupted
❌ Costs a fortune	$10+ per manuscript
❌ Crashes lose work	Hours of progress gone

🎯 The Solution

OCR-Devnagari combines local OCR speed with Gemini AI accuracy:

┌─────────────────────────────┐
│                             │
│   📜 1000-page Manuscript   │
│                             │
│   Before: $10+ cost         │
│   After:  $1 cost           │
│                             │
│   ✨ 90% Savings ✨          │
│                             │
│   Zero data loss on         │
│   crash or interrupt        │
│                             │
└─────────────────────────────┘

✨ Features

	Feature	Description
🔀	Multi-Engine Support	5 OCR backends to choose from
🧠	Smart Hybrid Mode	EasyOCR + Gemini for optimal results
🕉️	Mantra Detection	Auto-detect and preserve sacred text
⚡	High Performance	Async concurrent workers
💾	Crash-Safe	Resume from any interruption
📊	Live Progress	Real-time tracking with ETA
🛡️	Graceful Shutdown	Ctrl+C saves all work
🧹	Memory Efficient	Handles 1000+ page PDFs
✅	Response Validation	Rejects invalid OCR results

⚡ Quick Start

📦 Installation

# Clone the repository
git clone https://github.com/rajeshkanaka/OCR-Devnagari.git
cd OCR-Devnagari

# Install with UV (recommended)
uv sync && uv pip install easyocr

# Or with pip
pip install -r requirements.txt && pip install easyocr

🔑 Configure API (for Gemini features)

# Option A: Vertex AI (Recommended for production)
export GOOGLE_CLOUD_PROJECT="your-project"
export GOOGLE_CLOUD_LOCATION="global"
export GOOGLE_GENAI_USE_VERTEXAI=1

# Option B: API Key (Quick setup)
export GEMINI_API_KEY="your-key"
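
As a rough sketch of how a setup check could tell the two options apart, the helper below inspects the same environment variables. The name `detect_auth_mode` is hypothetical and not part of the `ocr_hindi` package; the real `validate` command may work differently.

```python
import os
from typing import Mapping, Optional


def detect_auth_mode(env: Optional[Mapping[str, str]] = None) -> str:
    """Return 'vertex', 'api_key', or 'none' based on the variables above.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    use_vertex = env.get("GOOGLE_GENAI_USE_VERTEXAI", "").lower() in ("1", "true")
    # Option A: Vertex AI needs the switch plus a project ID.
    if use_vertex and env.get("GOOGLE_CLOUD_PROJECT"):
        return "vertex"
    # Option B: a plain API key.
    if env.get("GEMINI_API_KEY"):
        return "api_key"
    return "none"
```

Calling `detect_auth_mode()` with no arguments reads the current process environment.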

🚀 Run!

# 🔥 Hybrid mode — 90% savings, maximum accuracy
python -m ocr_hindi ocr manuscript.pdf --pages "all"

# 🆓 100% FREE local processing
python -m ocr_hindi ocr manuscript.pdf -e easyocr

# 💎 Premium Gemini mode for critical documents
python -m ocr_hindi ocr manuscript.pdf -e gemini

💰 Cost Comparison

💸 How much can you save?

❌ Traditional Approach

Metric	Value
📄 Pages	1000
💸 Cost	~$10-15
🔄 API Calls	1000
⏱️ Time	~45 min
🛡️ On Crash	LOSE ALL

→

✅ With OCR-Devnagari

Metric	Value
📄 Pages	1000
💸 Cost	~$1-2
🔄 API Calls	~100-150
⏱️ Time	~90 min
🛡️ On Crash	Resume ✓
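
The saving follows from the number of paid API calls. The sketch below reproduces the arithmetic, assuming a flat per-call price derived from the figures above ($10-15 per 1000 calls). Real pricing is token-based, so treat the numbers as a back-of-the-envelope estimate; the function names are illustrative.

```python
def estimate_cost(api_calls: int, cost_per_call: float) -> float:
    """Estimated API spend in dollars, assuming a flat price per call."""
    return api_calls * cost_per_call


def savings_fraction(baseline: float, hybrid: float) -> float:
    """Fraction of the baseline cost that is saved."""
    return 1.0 - hybrid / baseline


# Assumed per-call prices implied by "~$10-15 for 1000 calls".
LOW_RATE, HIGH_RATE = 10 / 1000, 15 / 1000

baseline = estimate_cost(1000, LOW_RATE)      # every page goes to Gemini
hybrid_low = estimate_cost(100, LOW_RATE)     # ~10% of pages escalated
hybrid_high = estimate_cost(150, HIGH_RATE)   # ~15% escalated, higher price

print(f"baseline ${baseline:.2f}, hybrid ${hybrid_low:.2f}-{hybrid_high:.2f}")
print(f"savings at the low end: {savings_fraction(baseline, hybrid_low):.0%}")
```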

🏆 Engine Comparison

Engine	Cost	Accuracy	Speed	Best For
🔀 hybrid	~$0.30/1K	⭐⭐⭐⭐⭐	⚡⚡⚡	Recommended
🆓 easyocr	FREE	⭐⭐⭐⭐	⚡⚡	Budget-conscious
🆓 marker	FREE	⭐⭐⭐⭐⭐	⚡⚡⚡	Structured PDFs
🆓 tesseract	FREE	⭐⭐⭐	⚡⚡⚡⚡	Simple documents
💎 gemini	~$2/1K	⭐⭐⭐⭐⭐	⚡⚡⚡⚡	Critical accuracy

🏗️ Architecture

"Write once, crash anywhere, resume everywhere"

                              ┌─────────────────────────────────────────┐
                              │           📄 PDF Input                  │
                              └───────────────────┬─────────────────────┘
                                                  │
                                                  ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              🔀 INTELLIGENT ROUTING                                 │
├─────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     │
│    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐     │
│    │  hybrid  │    │ easyocr  │    │  marker  │    │tesseract │    │  gemini  │     │
│    │ DEFAULT  │    │   FREE   │    │   FREE   │    │   FREE   │    │ PREMIUM  │     │
│    └────┬─────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘     │
│         │                                                                           │
│         ▼                                                                           │
│    ┌─────────────────────────────────────────────────────────────────────────┐      │
│    │                     🧠 HYBRID DECISION ENGINE                           │      │
│    │                                                                         │      │
│    │   ┌─────────────┐         ┌─────────────────┐         ┌─────────────┐   │      │
│    │   │  EasyOCR    │ ──────▶ │ Confidence Check│ ──────▶ │   Mantra    │   │      │
│    │   │    FREE     │         │     < 85% ?     │         │  Detected?  │   │      │
│    │   └─────────────┘         └─────────────────┘         └──────┬──────┘   │      │
│    │                                    │                         │          │      │
│    │                                    ▼                         ▼          │      │
│    │                           ┌───────────────────────────────────────┐     │      │
│    │                           │        💎 Gemini 2.0 Flash            │     │      │
│    │                           │   • thinking_level: "low"             │     │      │
│    │                           │   • media_resolution: "high"          │     │      │
│    │                           │   • Token tracking for cost           │     │      │
│    │                           └───────────────────────────────────────┘     │      │
│    └─────────────────────────────────────────────────────────────────────────┘      │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                                  │
                                                  ▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│                              🛡️ CRASH-SAFE PIPELINE                                 │
├─────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     │
│    ┌──────────┐     ┌──────────────┐     ┌──────────────┐     ┌──────────────┐      │
│    │   OCR    │────▶│    Cache     │────▶│   Progress   │────▶│   Release    │      │
│    │ Process  │     │ Atomic Write │     │   Update     │     │   Memory     │      │
│    └──────────┘     │ page_NNN.txt │     └──────────────┘     └──────────────┘      │
│                     └──────────────┘                                                │
│                                                                                     │
│    On interrupt (Ctrl+C) or crash:                                                  │
│    ┌─────────────────────────────────────────────────────────────────────────┐      │
│    │  ✓ All cached pages preserved    ✓ Resume skips completed pages         │      │
│    │  ✓ No duplicate API charges      ✓ Output merged from cache             │      │ 
│    └─────────────────────────────────────────────────────────────────────────┘      │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘
                                                  │
                                                  ▼
                              ┌─────────────────────────────────────────┐
                              │   📝 Markdown Output + 💰 Cost Report    │
                              └─────────────────────────────────────────┘

📚 Read the Full Architecture Documentation →
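
The hybrid decision step in the diagram can be sketched as a small predicate: a page is escalated to Gemini when the local OCR confidence is below the threshold, or when mantra text is detected. This is a minimal illustration, not the project's actual code. `PageResult`, `needs_gemini`, and the short marker tuple are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PageResult:
    text: str          # text returned by the local OCR engine
    confidence: float  # mean confidence in the range 0.0-1.0


def needs_gemini(
    page: PageResult,
    threshold: float = 0.85,
    verify_mantras: bool = True,
    mantra_markers: Sequence[str] = ("ॐ", "स्वाहा", "नमः"),
) -> bool:
    """Decide whether a page should be re-processed by Gemini."""
    # Step 1: low-confidence pages are always escalated.
    if page.confidence < threshold:
        return True
    # Step 2: confident pages are still escalated when they contain mantras.
    if verify_mantras and any(marker in page.text for marker in mantra_markers):
        return True
    return False
```

With these defaults, raising `threshold` sends more pages to Gemini, which matches the behaviour described for `--confidence`.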

🕉️ Mantra Detection

Pages that contain the sacred-text patterns below are detected automatically and, in hybrid mode, routed to Gemini for verification (controlled by `--verify-mantras`, on by default).

बीज मन्त्र (Seed Syllables)

ॐ    ह्रीं   श्रीं
क्लीं   ऐं    हुं

मन्त्र समाप्ति (Sacred Endings)

स्वाहा   नमः   फट्
वौषट्   हुं   ठः

श्लोक चिह्न (Verse Markers)

॥१॥  ॥२॥  ॥३॥
॥ इति ॥

विभाग सूचक (Section Indicators)

विनियोग  न्यास
ध्यान   कवच
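
One simple way to detect the four pattern groups above is substring and regular-expression matching on normalized text. The sketch below is an assumption about how such a detector could look, not the project's implementation, and the function names are hypothetical. Plain substring checks can also match inside longer words, so a production detector would likely need word-boundary handling.

```python
import re
import unicodedata
from typing import Dict, List

SEED_SYLLABLES = ("ॐ", "ह्रीं", "श्रीं", "क्लीं", "ऐं", "हुं")
MANTRA_ENDINGS = ("स्वाहा", "नमः", "फट्", "वौषट्", "हुं", "ठः")
SECTION_WORDS = ("विनियोग", "न्यास", "ध्यान", "कवच")
# Matches numbered verse markers such as ॥१॥ as well as ॥ इति ॥.
VERSE_MARKER = re.compile(r"॥\s*(?:[०-९]+|इति)\s*॥")


def find_mantra_markers(text: str) -> Dict[str, List[str]]:
    """Return the markers from each pattern group found in the text."""
    # Normalize so visually identical strings compare equal.
    text = unicodedata.normalize("NFC", text)
    return {
        "seed_syllables": [s for s in SEED_SYLLABLES if s in text],
        "endings": [e for e in MANTRA_ENDINGS if e in text],
        "verse_markers": VERSE_MARKER.findall(text),
        "sections": [w for w in SECTION_WORDS if w in text],
    }


def looks_like_mantra(text: str) -> bool:
    """True when at least one marker from any group is present."""
    return any(find_mantra_markers(text).values())
```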

📖 Usage Examples

🔀 Hybrid Mode (Recommended)

# Process entire manuscript with intelligent routing
python -m ocr_hindi ocr sacred_text.pdf --pages "all"

# Adjust confidence threshold (higher = more Gemini verification)
python -m ocr_hindi ocr sacred_text.pdf --confidence 0.90

# Disable mantra verification for faster processing
python -m ocr_hindi ocr sacred_text.pdf --no-verify-mantras

# Process specific page ranges
python -m ocr_hindi ocr sacred_text.pdf --pages "1-100,200-250"

# Use more workers for faster processing
python -m ocr_hindi ocr sacred_text.pdf --workers 10

🆓 Free Local Processing

# EasyOCR — Good Hindi/Devanagari support, no API needed
python -m ocr_hindi ocr book.pdf -e easyocr

# Marker — Best for structured books and PDFs
python -m ocr_hindi ocr book.pdf -e marker

# Tesseract — Fast, requires system installation
python -m ocr_hindi ocr book.pdf -e tesseract

💎 Premium Gemini Mode

# Maximum accuracy for critical manuscripts
python -m ocr_hindi ocr rare_manuscript.pdf -e gemini

# With high concurrency
python -m ocr_hindi ocr rare_manuscript.pdf -e gemini --workers 15
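
To illustrate what a `--workers` limit means for async processing, the sketch below bounds concurrency with an `asyncio.Semaphore`. It is a generic pattern under stated assumptions, not the project's worker pool: `process_pages` and the `ocr_page` callback are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def process_pages(
    pages: Iterable[int],
    ocr_page: Callable[[int], Awaitable[str]],
    workers: int = 5,
) -> List[str]:
    """Run ocr_page for every page with at most `workers` calls in flight."""
    semaphore = asyncio.Semaphore(workers)

    async def bounded(page: int) -> str:
        # Each task waits here until one of the `workers` slots is free.
        async with semaphore:
            return await ocr_page(page)

    # gather() returns results in the same order as the input pages.
    return list(await asyncio.gather(*(bounded(p) for p in pages)))
```

A caller would run it with `asyncio.run(process_pages(range(1, 101), ocr_page, workers=15))`, where `ocr_page` is any coroutine function that takes a page number.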

🛠️ Utility Commands

# List all available engines with details
python -m ocr_hindi engines

# Validate your setup (dependencies + authentication)
python -m ocr_hindi validate

# View PDF information
python -m ocr_hindi info manuscript.pdf

# Dry run — see what would be processed
python -m ocr_hindi ocr manuscript.pdf --dry-run

# Resume interrupted processing
python -m ocr_hindi ocr manuscript.pdf --resume

⚙️ Configuration

Option	Description	Default
`-e, --engine`	OCR engine (`hybrid`, `easyocr`, `marker`, `tesseract`, `gemini`)	`hybrid`
`-p, --pages`	Page range (`all`, `1-50`, `1,5,10-20`)	interactive
`-w, --workers`	Concurrent workers (1-20)	`5`
`-c, --confidence`	Hybrid threshold (0.0-1.0)	`0.85`
`--verify-mantras / --no-verify-mantras`	Verify mantra pages with Gemini (hybrid mode)	`true`
`-r, --resume`	Resume from previous progress	`false`
`-n, --dry-run`	Preview without processing	`false`
`--dpi`	PDF rendering quality	`200`
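
The page-range syntax accepted by `--pages` (`all`, `1-50`, `1,5,10-20`) can be parsed along these lines. This is an illustrative parser, assuming 1-based inclusive ranges; `parse_pages` is a hypothetical name and the tool's own validation rules may differ.

```python
from typing import List


def parse_pages(spec: str, total_pages: int) -> List[int]:
    """Expand a page specification into a sorted list of 1-based page numbers."""
    spec = spec.strip().lower()
    if spec == "all":
        return list(range(1, total_pages + 1))

    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that are reversed or fall outside the document.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)
```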

📁 Output Structure

your_manuscript/
├── 📄 manuscript.pdf                        # Original file
├── 📝 manuscript_unicode.md                 # ✨ Final output (Devanagari text)
├── 📋 ocr_manuscript_20240120_143022.log    # Processing log
├── 📊 .ocr_progress_manuscript.json         # Resume state
└── 📂 .ocr_cache_manuscript/                # 🛡️ Crash-safe cache
    ├── page_0001.txt                        #    Individual page cache
    ├── page_0001.meta.json                  #    Page metadata
    ├── page_0002.txt
    └── ...
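
A per-page cache like the one above stays crash-safe if each file is written to a temporary name and then renamed into place, so a reader never sees a half-written page. The sketch below shows that pattern with the standard library. The function names are hypothetical and the project's own cache code may differ.

```python
import os
import tempfile
from pathlib import Path
from typing import Set


def write_page_atomic(cache_dir: Path, page: int, text: str) -> Path:
    """Write page text to page_NNNN.txt via a temp file and an atomic rename."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    final_path = cache_dir / f"page_{page:04d}.txt"
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_name, final_path)  # atomic on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return final_path


def completed_pages(cache_dir: Path) -> Set[int]:
    """Page numbers that already have a cached result and can be skipped on resume."""
    return {int(p.stem.split("_")[1]) for p in cache_dir.glob("page_*.txt")}
```

On resume, any page number returned by `completed_pages` can be skipped, which is what avoids duplicate API charges.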

📊 Performance Benchmarks

Mode	1000 Pages	Throughput	Cost	Notes
🔀 Hybrid	~90 min	~11 ppm	~$1	Best value
🆓 EasyOCR	~120 min	~8 ppm	$0	100% free
🆓 Marker	~60 min	~16 ppm	$0	Structured PDFs
💎 Gemini	~45 min	~22 ppm	~$10	Max accuracy

ppm = pages per minute • Tested on M1 MacBook Pro with 10 workers
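
The throughput column is simply pages divided by elapsed minutes. The short check below recomputes it from the durations in the table; the table's figures correspond to these values rounded down.

```python
def pages_per_minute(pages: int, minutes: float) -> float:
    """Throughput in pages per minute."""
    return pages / minutes


# Durations for 1000 pages, taken from the benchmark table.
BENCHMARK_MINUTES = {"hybrid": 90, "easyocr": 120, "marker": 60, "gemini": 45}

for mode, minutes in BENCHMARK_MINUTES.items():
    print(f"{mode}: {pages_per_minute(1000, minutes):.1f} ppm")
```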

🔧 Troubleshooting

❌ "poppler not found"

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows - Download from:
# https://github.com/oschwartz10612/poppler-windows/releases

❌ "EasyOCR not installed"

uv pip install easyocr
# or
pip install easyocr

❌ "Tesseract not installed"

# macOS
brew install tesseract tesseract-lang

# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-hin tesseract-ocr-san

# Windows - Download installer from:
# https://github.com/UB-Mannheim/tesseract/wiki

❌ Authentication errors

# Verify Vertex AI setup
gcloud auth application-default login
gcloud config set project YOUR_PROJECT_ID

# Or use API key instead
export GEMINI_API_KEY="your-api-key-here"

# Test authentication
python -m ocr_hindi validate

❌ Rate limiting (429 errors)

# Reduce concurrent workers
python -m ocr_hindi ocr book.pdf --workers 3

# The system will automatically retry with exponential backoff
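
For readers unfamiliar with the term, exponential backoff means doubling the wait after each rate-limited attempt. The sketch below shows the general technique with a small random jitter. It is not the project's retry code: `RateLimitError` is a stand-in for an HTTP 429 response and all parameter values are assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Stand-in for an HTTP 429 response from the API (hypothetical)."""


def call_with_backoff(
    func: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError with delays of 1s, 2s, 4s, ..."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries, surface the error
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to 10% jitter keeps parallel workers from retrying in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("unreachable")  # loop always returns or raises
```

The `sleep` parameter exists so the delay can be replaced in tests; normal callers can leave it at the default.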

❌ High memory usage

# Reduce workers (each worker holds images in memory)
python -m ocr_hindi ocr book.pdf --workers 2

# Or process in smaller batches
python -m ocr_hindi ocr book.pdf --pages "1-100"
python -m ocr_hindi ocr book.pdf --pages "101-200" --resume

🤝 Contributing

Contributions are what make the open source community amazing!

🐛 Bug Reports: Open an Issue
💡 Feature Ideas: Start a Discussion
🔧 Pull Requests: Fork & Submit PR
📖 Documentation: Help improve docs

# Fork, clone, and create a branch
git clone https://github.com/YOUR_USERNAME/OCR-Devnagari.git
cd OCR-Devnagari
git checkout -b feature/amazing-feature

# Make your changes, then
git commit -m "Add amazing feature"
git push origin feature/amazing-feature

# Open a Pull Request 🎉

📜 License

MIT License — Free for personal and commercial use

See LICENSE for details

🙏 Acknowledgments

This project stands on the shoulders of giants

॥ सर्वे भवन्तु सुखिनः ॥

May all beings be happy

Built with ❤️ for the Sanskrit & Vaidik community, by RajeshKanaka
