zbze_ocr

Tesseract Training Infrastructure for Kabardian Language

Features • Quick Start • Training • Related Projects • Documentation

Overview

zbze_ocr is the training infrastructure and development repository for Tesseract OCR models supporting the Kabardian language (kbd). This repository contains the complete toolchain for training, fine-tuning, evaluating, and deploying high-quality OCR models.

What This Repository Provides

🎓 Complete Training Pipeline - Automated Airflow DAGs for model training
📊 Data Preparation Tools - Corpus cleaning, wordlist generation, image synthesis
📓 Jupyter Notebooks - Interactive workflows for experimentation
🔬 Model Evaluation - Quality metrics, CER/WER analysis, comparison tools
🏗️ Legacy CLI Tools - PDF preprocessing (superseded by zbze_ocr_cli)

Project Ecosystem

This repository is part of a three-project ecosystem:

zbze_ocr (this repo)              → Training Infrastructure
    ├── Models & Configs           ↓
    └── Training Scripts      tesseract-kbd-model → Distributable Models
                                   ↓
                              zbze_ocr_cli → Production OCR Tool

Related Projects

Project	Purpose	When to Use	Repository
zbze_ocr (this)	Training infrastructure, notebooks, development tools	Want to train models, experiment with training data	This repository
tesseract-kbd-model	Distributable Tesseract models for Kabardian	Just need the models	zbze-org/tesseract-kbd-model
zbze_ocr_cli	Production-ready OCR CLI tool	Need to process documents with OCR	zbze-org/zbze_ocr_cli

Features

Training Infrastructure

Automated Training Pipeline (Airflow)
- Complete model training workflow with DAGs
- Automated ground truth generation
- Model evaluation and deployment
- Version tracking and checkpointing

Data Preparation Tools
- Text corpus cleaning (dags/src/text_cleaner.py)
- Synthetic training image generation (dags/src/image_generator.py)
- Ground truth preparation (box/LSTMF files)
- Wordlist and bigram extraction from corpora

Interactive Notebooks (Jupyter)
- 01_create_wlist_for_tesseract.ipynb - Wordlist generation
- 00_extract_collocation.ipynb - Bigram extraction
- 02_generate_image.ipynb - Image generation pipeline
- 12_test_tesseract_models.ipynb - Model evaluation
- Plus 20+ more notebooks for experimentation
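
To make the corpus cleaning step more concrete, here is a minimal sketch of the kind of normalization such a step might perform. It is not the code in dags/src/text_cleaner.py: the function names and rules are assumptions for illustration. The one Kabardian-specific rule rests on the fact that the palochka letter (Ӏ, U+04C0) is often typed as a Latin "I" or "l", so the sketch rewrites those look-alikes when they touch a Cyrillic letter.

```python
import re
import unicodedata

PALOCHKA = "\u04c0"        # Ӏ, CYRILLIC LETTER PALOCHKA
CYRILLIC = "\u0400-\u04ff"  # basic Cyrillic block, used inside character classes

# Assumption: a Latin "I" or "l" directly next to a Cyrillic letter is a mistyped palochka.
_LOOKALIKE = re.compile(rf"(?<=[{CYRILLIC}])[Il]|[Il](?=[{CYRILLIC}])")
_SPACES = re.compile(r"[ \t\u00a0]+")


def clean_line(line: str) -> str:
    """Normalize one corpus line: NFC form, palochka look-alikes, whitespace."""
    line = unicodedata.normalize("NFC", line)
    line = _LOOKALIKE.sub(PALOCHKA, line)
    return _SPACES.sub(" ", line).strip()


def clean_corpus(text: str) -> list[str]:
    """Clean every line and drop the ones that end up empty."""
    cleaned = (clean_line(raw) for raw in text.splitlines())
    return [line for line in cleaned if line]
```

A real cleaner would likely need more rules (for example the digit "1" used as a palochka, or mixed-script words), which is why this is only a starting point.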

Trained Models

Two production-ready models (exported to tesseract-kbd-model):

Model	Size	Accuracy	CER	WER	Best For
kbd.traineddata	16 MB	93.8%	6.2%	11.8%	Modern texts, general purpose
kbd_finetuned.traineddata	26 MB	95.9%	4.1%	8.3%	Historical documents, newspapers

Training Strategy

Base Model Training
- Transfer learning from Russian (rus) base model
- 100,000 iterations (~2-3 days on 8+ core CPU)
- Learning rate: 0.0001
- Training corpus: 600K+ words, 25K lines

Fine-tuning
- Start from kbd base model
- 50,000 iterations (~1-2 days)
- Learning rate: 0.00005 (half the base rate)
- Specialized corpus: historical texts, newspapers
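
The two stages above differ only in a handful of parameters, which can be captured in a small helper that builds the tesstrain command line. This is a sketch under assumptions: MODEL_NAME, START_MODEL, MAX_ITERATIONS and LEARNING_RATE are tesstrain Makefile variables, but the commands in this README do not pass LEARNING_RATE, and the run name kbd_finetuned is inferred from the model file name. TrainingRun and make_args are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingRun:
    model_name: str
    start_model: str
    max_iterations: int
    learning_rate: str  # kept as a string so it reaches make verbatim


# Values taken from the Training Strategy section.
BASE = TrainingRun("kbd", "rus", 100_000, "0.0001")
# Assumption: the fine-tuned run is named after kbd_finetuned.traineddata.
FINETUNE = TrainingRun("kbd_finetuned", "kbd", 50_000, "0.00005")


def make_args(run: TrainingRun, make: str = "gmake") -> list[str]:
    """Build the argument list for a tesstrain `training` invocation."""
    return [
        make,
        "training",
        f"MODEL_NAME={run.model_name}",
        f"START_MODEL={run.start_model}",
        f"MAX_ITERATIONS={run.max_iterations}",
        f"LEARNING_RATE={run.learning_rate}",
    ]
```

The resulting list can be passed to subprocess.run(..., cwd="tesseract/tesstrain") once the ground truth for the run is in place.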

Quick Start

Setup Environment

# Clone repository
git clone https://github.com/zbze-org/zbze_ocr.git
cd zbze_ocr

# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Install External Tools

macOS:

brew install tesseract imagemagick unpaper

Ubuntu/Debian:

sudo apt-get install tesseract-ocr imagemagick unpaper

Verify Installation:

tesseract --version  # Should be 5.0+
magick --version     # ImageMagick 7.1.1+ (on ImageMagick 6, run: convert --version)
unpaper --version    # 7.0.0+

Train Your First Model

# Navigate to tesstrain directory
cd tesseract/tesstrain

# Train base model (this will take 2-3 days)
# gmake is GNU Make (on macOS: brew install make); ts comes from moreutils.
# TESSDATA below is the Homebrew path on Apple Silicon; adjust it for your system.
# 2>&1 sends error messages to the log along with regular output.
nohup gmake training \
  MODEL_NAME=kbd \
  START_MODEL=rus \
  MAX_ITERATIONS=100000 \
  TESSDATA=/opt/homebrew/share/tessdata \
  WORDLIST_FILE=kbd.wordlist \
  NUMBERS_FILE=kbd.numbers \
  PUNC_FILE=kbd.punc \
  2>&1 | ts '[%Y-%m-%d %H:%M:%S]' | tee plot/output.log &

# Monitor training progress
tail -f plot/output.log

# Look for: "New best BCER" to see improvements
grep "New best" plot/output.log

See detailed training guide: tesseract/tesstrain/kbd/docs/training.md
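
The grep for "New best" can also be done in Python when you want the values themselves, for example to plot them. The sketch below assumes each improvement is logged as "New best BCER = <number>"; only the phrase "New best BCER" comes from this README, so check the pattern against your own plot/output.log. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: improvements are logged as "New best BCER = <number>".
BEST_BCER = re.compile(r"New best BCER = (\d+(?:\.\d+)?)")


def best_bcer_history(lines):
    """Return every reported best BCER value, in log order."""
    return [float(m.group(1)) for line in lines if (m := BEST_BCER.search(line))]


def read_history(log_path="plot/output.log"):
    """Read a training log from disk and extract the best-BCER values."""
    text = Path(log_path).read_text(encoding="utf-8", errors="replace")
    return best_bcer_history(text.splitlines())
```

Calling read_history() from tesseract/tesstrain returns a list of floats; if training is improving, the list should be decreasing.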

Use Training Notebooks

# Launch Jupyter
cd notebooks
jupyter notebook

# Open notebooks for:
# - Wordlist generation (01_create_wlist_for_tesseract.ipynb)
# - Bigram extraction (00_extract_collocation.ipynb)
# - Training image generation (02_generate_image.ipynb)
# - Model evaluation (12_test_tesseract_models.ipynb)

Training Workflow

Complete End-to-End Process

1. Data Preparation

# Step 1: Clean text corpus
python dags/src/text_cleaner.py --input corpus.txt --output cleaned.txt

# Step 2: Generate wordlist (use Jupyter notebook)
jupyter notebook notebooks/01_create_wlist_for_tesseract.ipynb
# Output: kbd.wordlist (10K most frequent words)

# Step 3: Extract bigrams (use Jupyter notebook)
jupyter notebook notebooks/00_extract_collocation.ipynb
# Output: kbd.bigrams (common word pairs)

# Step 4: Generate synthetic training images
python dags/src/image_generator.py \
    --text cleaned.txt \
    --fonts fonts/ \
    --output ground_truth/

# Step 5: Prepare ground truth (box + lstmf files)
python dags/src/prepare_tessdata.py --dir ground_truth/
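
Steps 2 and 3 run in notebooks, so as a rough illustration of what they produce, here is a frequency-based sketch of wordlist and bigram extraction. It is not the notebooks' code: the tokenizer, the min_count threshold and the function names are assumptions. Only the "10K most frequent words" cutoff comes from the comments above. Case is left untouched on purpose, because lowercasing would turn the palochka (Ӏ) into its rarely used lowercase form.

```python
import re
from collections import Counter

# Runs of letters, optionally joined by hyphens; digits and punctuation are skipped.
TOKEN = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*")


def tokenize(line: str) -> list[str]:
    return TOKEN.findall(line)


def build_wordlist(lines, top_n=10_000):
    """Return the top_n most frequent words, most frequent first."""
    counts = Counter(tok for line in lines for tok in tokenize(line))
    return [word for word, _ in counts.most_common(top_n)]


def build_bigrams(lines, min_count=2):
    """Return adjacent word pairs seen at least min_count times."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        counts.update(zip(tokens, tokens[1:]))  # pairs never cross line boundaries
    return [pair for pair, n in counts.most_common() if n >= min_count]
```

Writing "\n".join(build_wordlist(lines)) to kbd.wordlist gives a one-word-per-line file of the kind the WORDLIST_FILE setting points at.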

2. Model Training

cd tesseract/tesstrain

# Train base model
gmake training \
  MODEL_NAME=kbd \
  START_MODEL=rus \
  MAX_ITERATIONS=100000

# Convert checkpoint to traineddata
gmake traineddata MODEL_NAME=kbd

# Install to system tessdata
cp data/kbd.traineddata /opt/homebrew/share/tessdata/

3. Model Evaluation

# Use evaluation notebook
jupyter notebook notebooks/12_test_tesseract_models.ipynb

# Or use lstmeval command line tool
lstmeval --model data/kbd/kbd.lstm \
         --eval_listfile data/test.txt \
         --verbosity 2
# Output: Character Error Rate (CER), Word Error Rate (WER)
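
For spot checks outside lstmeval, CER and WER can be computed directly from a reference transcription and the OCR output. The sketch below uses the textbook definition (Levenshtein distance divided by reference length); the exact figures lstmeval reports may be computed differently, so treat the two as comparable rather than identical. The function names are hypothetical.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate relative to the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both functions return a fraction; multiply by 100 to compare with the percentages in the model tables.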

4. Model Distribution

Export trained models to tesseract-kbd-model:

# Copy trained models
cp tesseract/trained_data/kbd.traineddata \
   ../tesseract-kbd-model/trained_data/

cp tesseract/trained_data/kbd_finetuned.traineddata \
   ../tesseract-kbd-model/trained_data/

# Copy configuration files
cp tesseract/tesstrain/kbd/configs/* \
   ../tesseract-kbd-model/configs/

# Generate checksums
# (on macOS without GNU coreutils, use: shasum -a 256 *.traineddata > checksums.txt)
cd ../tesseract-kbd-model/trained_data
sha256sum *.traineddata > checksums.txt
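
If you prefer a platform-independent way to produce and check the same file, the checksum step can be sketched in Python with hashlib. The output lines follow the "<hex digest>  <file name>" layout that sha256sum writes; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20) -> str:
    """Hash a file in chunks so large models are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_lines(directory, pattern="*.traineddata"):
    """Return sha256sum-style lines for every matching file, sorted by name."""
    return [f"{sha256_of(p)}  {p.name}" for p in sorted(Path(directory).glob(pattern))]


def find_mismatches(directory, checksum_file="checksums.txt"):
    """Return the names of files whose current hash differs from the recorded one."""
    base = Path(directory)
    mismatched = []
    for line in (base / checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with a leading "*"
        if sha256_of(base / name) != expected:
            mismatched.append(name)
    return mismatched
```

Running find_mismatches on a downloaded copy of trained_data/ is a quick way to confirm the models arrived intact.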

Repository Structure

zbze_ocr/
├── dags/                       # Airflow DAGs for automation
│   ├── 00_train_tesseract.py   # Complete training workflow
│   ├── 00_image_generate.py    # Ground truth generation
│   └── src/                    # Python modules
│       ├── text_cleaner.py     # Corpus cleaning
│       ├── image_generator.py  # Synthetic image generation
│       └── prepare_tessdata.py # Ground truth preparation
│
├── notebooks/                  # Jupyter notebooks (25+)
│   ├── 01_create_wlist_for_tesseract.ipynb
│   ├── 00_extract_collocation.ipynb
│   ├── 02_generate_image.ipynb
│   ├── 12_test_tesseract_models.ipynb
│   └── ... (20+ more)
│
├── tesseract/                  # Models and training workspace
│   ├── trained_data/           # Production models
│   │   ├── kbd.traineddata
│   │   └── kbd_finetuned.traineddata
│   └── tesstrain/kbd/          # Active training directory
│       ├── configs/            # Tesseract configurations
│       ├── docs/               # Training documentation
│       └── fonts/              # 48 fonts for training
│
├── cli/                        # Legacy PDF preprocessing (use zbze_ocr_cli instead)
├── docs/                       # Documentation (Russian: main.md)
└── data/                       # Working directory (gitignored)

Documentation

Getting Started

Main Guide (Russian) - Complete setup and usage guide
Training Guide - Step-by-step training instructions
Quality Metrics - Model performance analysis

Training Resources

Tesstrain Commands - Training commands reference
Model History - Model versions and evolution
Jupyter Notebooks - Interactive workflows in notebooks/

For Model Users

If you just want to use the Kabardian OCR models:

📦 Download models: tesseract-kbd-model
🔧 Process documents: zbze_ocr_cli

Airflow Integration

Setup Airflow

# Copy configuration template
cp airflow.cfg.template airflow/airflow.cfg

# Edit configuration as needed
vim airflow/airflow.cfg

# Initialize database (Airflow 2.7+ deprecates this in favor of: airflow db migrate)
airflow db init

# Start scheduler
airflow scheduler

# Start webserver (in another terminal)
airflow webserver

Available DAGs

train_tesseract_kbd - Complete training pipeline
generate_training_images - Automated image generation

Access Airflow UI at http://localhost:8080

Model Performance

Benchmark Results

Document Type	kbd.traineddata	kbd_finetuned.traineddata
Modern books (2000-2020)	4.5% CER	2.9% CER
Historical texts (1960-1990)	7.8% CER	5.2% CER
Newspapers	6.9% CER	4.8% CER
Degraded scans	~12% CER	~9% CER
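
One way to read the table is as a relative reduction in character errors from fine-tuning. The sketch below simply redoes that arithmetic with the figures above; the degraded-scan values are approximate in the table, so that row is only indicative.

```python
# (base CER %, fine-tuned CER %) per document type, copied from the table above.
BENCHMARK_CER = {
    "Modern books (2000-2020)": (4.5, 2.9),
    "Historical texts (1960-1990)": (7.8, 5.2),
    "Newspapers": (6.9, 4.8),
    "Degraded scans": (12.0, 9.0),  # the table gives these as approximate (~12%, ~9%)
}


def relative_reduction(base_cer: float, finetuned_cer: float) -> float:
    """Fraction of the base model's character errors removed by fine-tuning."""
    return (base_cer - finetuned_cer) / base_cer


for doc_type, (base, tuned) in BENCHMARK_CER.items():
    print(f"{doc_type}: {relative_reduction(base, tuned):.1%} fewer character errors")
```

With these numbers the reduction ranges from about 25% on degraded scans to about 36% on modern books.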

Training Data Statistics

Total corpus: ~600,000 words
Unique words: ~85,000
Training lines: ~25,000
Validation lines: ~3,000
Data sources:
- Web corpora (oshamaho.ru, adyghepsale.ru)
- Digital books (1960-2020)
- Manual transcriptions of historical documents
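
The line counts above suggest a split of roughly 25,000 training lines to 3,000 validation lines. How the project actually draws that split is not stated here; the sketch below shows one reproducible way to do it, assuming the two sets are disjoint and come from a single pool of lines. tesstrain also has its own RATIO_TRAIN variable for this purpose. The function name and seed are hypothetical.

```python
import random


def split_lines(lines, n_validation, seed=0):
    """Shuffle deterministically and hold out n_validation lines for validation."""
    if not 0 <= n_validation <= len(lines):
        raise ValueError("n_validation must be between 0 and len(lines)")
    shuffled = list(lines)                  # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed makes the split repeatable
    return shuffled[n_validation:], shuffled[:n_validation]
```

For a pool of 28,000 lines, split_lines(lines, 3000) returns 25,000 training lines and 3,000 validation lines, and repeating the call with the same seed returns the same split.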

Contributing

Contributions are welcome! This is a development repository for training infrastructure.

How to Contribute

Training Data:

Share cleaned Kabardian text corpora
Provide manual transcriptions of documents
Report issues with training data quality

Model Improvements:

Experiment with training parameters
Test different learning rates or iteration counts
Share results and findings

Tools & Scripts:

Improve data preparation scripts
Add new Jupyter notebooks
Enhance automation workflows

Documentation:

Add guides and tutorials
Translate documentation
Share training experiences

Contribution Guidelines

For model distribution improvements → contribute to tesseract-kbd-model
For OCR tool improvements → contribute to zbze_ocr_cli
For training infrastructure → contribute to this repository

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

Tesseract OCR team for the excellent OCR engine and training tools
tesstrain framework for simplified training workflow
Kabardian language community for corpus data and validation
Web sources: oshamaho.ru, adyghepsale.ru for corpus materials

Contact & Support

Issues: GitHub Issues
Author: Adam Panagov
Email: a.panagoa@gmail.com
Organization: zbze-org

Part of the zbze-org initiative to preserve and digitize Kabardian language materials

