Tesseract Training Infrastructure for Kabardian Language
zbze_ocr is the training infrastructure and development repository for Tesseract OCR models supporting the Kabardian language (kbd). This repository contains the complete toolchain for training, fine-tuning, evaluating, and deploying high-quality OCR models.
- 🎓 Complete Training Pipeline - Automated Airflow DAGs for model training
- 📊 Data Preparation Tools - Corpus cleaning, wordlist generation, image synthesis
- 📓 Jupyter Notebooks - Interactive workflows for experimentation
- 🔬 Model Evaluation - Quality metrics, CER/WER analysis, comparison tools
- 🏗️ Legacy CLI Tools - PDF preprocessing (superseded by zbze_ocr_cli)
This repository is part of a three-project ecosystem:
zbze_ocr (this repo)        →  Training Infrastructure
    ├── Models & Configs            ↓
    └── Training Scripts     tesseract-kbd-model  →  Distributable Models
                                    ↓
                             zbze_ocr_cli  →  Production OCR Tool
| Project | Purpose | When to Use | Repository |
|---|---|---|---|
| zbze_ocr (this) | Training infrastructure, notebooks, development tools | Want to train models, experiment with training data | This repository |
| tesseract-kbd-model | Distributable Tesseract models for Kabardian | Just need the models | zbze-org/tesseract-kbd-model |
| zbze_ocr_cli | Production-ready OCR CLI tool | Need to process documents with OCR | zbze-org/zbze_ocr_cli |
- Automated Training Pipeline (Airflow)
  - Complete model training workflow with DAGs
  - Automated ground truth generation
  - Model evaluation and deployment
  - Version tracking and checkpointing
- Data Preparation Tools
  - Text corpus cleaning (dags/src/text_cleaner.py)
  - Synthetic training image generation (dags/src/image_generator.py)
  - Ground truth preparation (box/LSTMF files)
  - Wordlist and bigram extraction from corpora
- Interactive Notebooks (Jupyter)
  - 01_create_wlist_for_tesseract.ipynb - Wordlist generation
  - 00_extract_collocation.ipynb - Bigram extraction
  - 02_generate_image.ipynb - Image generation pipeline
  - 12_test_tesseract_models.ipynb - Model evaluation
  - Plus 20+ more notebooks for experimentation
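The wordlist and bigram extraction steps can be sketched in a few lines of Python. This is a minimal illustration and not the code from the notebooks: the tokenizer regex, the lowercase folding, and the function names `build_wordlist` and `build_bigrams` are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters. The Kabardian palochka
# (U+04C0 / U+04CF) is itself a letter, so it stays inside the word.
WORD_RE = re.compile(r"[^\W\d_]+")


def tokenize(line):
    """Return the lowercase words found in one line of corpus text."""
    return [word.lower() for word in WORD_RE.findall(line)]


def build_wordlist(lines, top_n=10_000):
    """Return the top_n most frequent words across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return [word for word, _ in counts.most_common(top_n)]


def build_bigrams(lines, top_n=10_000):
    """Return the top_n most frequent adjacent word pairs (within a line)."""
    counts = Counter()
    for line in lines:
        tokens = tokenize(line)
        counts.update(zip(tokens, tokens[1:]))
    return [pair for pair, _ in counts.most_common(top_n)]
```

Writing the result of `build_wordlist` one word per line would give a plain-text list of the kind a wordlist file normally holds. Pairs are counted only within a line, so a bigram never spans a line break.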
Two production-ready models (exported to tesseract-kbd-model):
| Model | Size | Accuracy | CER | WER | Best For |
|---|---|---|---|---|---|
| kbd.traineddata | 16 MB | 93.8% | 6.2% | 11.8% | Modern texts, general purpose |
| kbd_finetuned.traineddata | 26 MB | 95.9% | 4.1% | 8.3% | Historical documents, newspapers |
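In the table above, the Accuracy column appears to be 100% minus CER (93.8 = 100 − 6.2, 95.9 = 100 − 4.1). For readers unfamiliar with the metrics, here is a minimal sketch of how CER and WER are commonly defined, as edit distance divided by reference length. It is an illustration only; the evaluation notebooks and Tesseract's own tools may normalize text or count errors differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Because a single wrong character makes a whole word wrong, WER is usually higher than CER for the same output, which matches the pattern in the table.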
- Base Model Training
  - Transfer learning from Russian (rus) base model
  - 100,000 iterations (~2-3 days on 8+ core CPU)
  - Learning rate: 0.0001
  - Training corpus: 600K+ words, 25K lines
- Fine-tuning
  - Start from kbd base model
  - 50,000 iterations (~1-2 days)
  - Learning rate: 0.00005 (half the base rate)
  - Specialized corpus: historical texts, newspapers
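The two configurations above differ only in a handful of parameters, which a small helper can make explicit. This is a sketch under assumptions: the helper `tesstrain_command` is hypothetical, the output model name `kbd_finetuned` is inferred from the model file name, and `LEARNING_RATE` is used as I understand tesstrain's Makefile variable. Check the names against the tesstrain version in this repository before relying on them.

```python
import shlex


def tesstrain_command(model_name, start_model, max_iterations, learning_rate,
                      tessdata="/opt/homebrew/share/tessdata"):
    """Build a `gmake training` argument list (hypothetical helper).

    learning_rate is passed as a string so it is not reformatted
    into scientific notation (e.g. 5e-05).
    """
    variables = {
        "MODEL_NAME": model_name,
        "START_MODEL": start_model,
        "MAX_ITERATIONS": str(max_iterations),
        "LEARNING_RATE": learning_rate,
        "TESSDATA": tessdata,
    }
    return ["gmake", "training"] + [f"{k}={v}" for k, v in variables.items()]


# Base model: transfer learning from Russian.
base_cmd = tesstrain_command("kbd", "rus", 100_000, "0.0001")

# Fine-tuning: start from the kbd base model at half the learning rate.
# Assumes kbd.traineddata is already present in the TESSDATA directory.
finetune_cmd = tesstrain_command("kbd_finetuned", "kbd", 50_000, "0.00005")

print(shlex.join(base_cmd))
print(shlex.join(finetune_cmd))
```

The argument lists can be passed to `subprocess.run` from inside the tesstrain directory, or printed as above and pasted into a shell.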
# Clone repository
git clone https://github.com/zbze-org/zbze_ocr.git
cd zbze_ocr
# Create virtual environment
python3.10 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
macOS:
brew install tesseract imagemagick unpaper
Ubuntu/Debian:
sudo apt-get install tesseract-ocr imagemagick unpaper
Verify Installation:
tesseract --version # Should be 5.0+
convert --version # ImageMagick 7.1.1+
unpaper --version # 7.0.0+
# Navigate to tesstrain directory
cd tesseract/tesstrain
# Train base model (this will take 2-3 days)
nohup gmake training \
MODEL_NAME=kbd \
START_MODEL=rus \
MAX_ITERATIONS=100000 \
TESSDATA=/opt/homebrew/share/tessdata \
WORDLIST_FILE=kbd.wordlist \
NUMBERS_FILE=kbd.numbers \
PUNC_FILE=kbd.punc \
| ts '[%Y-%m-%d %H:%M:%S]' | tee plot/output.log &
# Monitor training progress
tail -f plot/output.log
# Look for: "New best BCER" to see improvements
grep "New best" plot/output.log
See detailed training guide: tesseract/tesstrain/kbd/docs/training.md
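To turn the `grep "New best"` check into something plottable, the log can be parsed into (iteration, BCER) pairs. The sketch below assumes each such log line contains `At iteration N/...` followed later by `New best BCER = X`; the exact wording varies between Tesseract versions, so treat the regex as an assumption and adjust it to the actual contents of plot/output.log. The function name `best_bcer_history` is made up for this example.

```python
import re

# Assumed line shape (illustrative, not copied from a real log):
#   [2024-01-01 10:00:00] At iteration 1400/1400/1400, ..., New best BCER = 30.1 wrote ...
BEST_RE = re.compile(
    r"At iteration (\d+)/.*?New best BCER = (\d+(?:\.\d+)?)"
)


def best_bcer_history(log_lines):
    """Return [(iteration, bcer_percent), ...] for every 'New best' line."""
    history = []
    for line in log_lines:
        match = BEST_RE.search(line)
        if match:
            history.append((int(match.group(1)), float(match.group(2))))
    return history


def read_history(path="plot/output.log"):
    """Parse a log file on disk; the default path follows the commands above."""
    with open(path, encoding="utf-8") as log_file:
        return best_bcer_history(log_file)
```

The `ts` timestamp prefix does not interfere, since the pattern is searched anywhere in the line.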
# Launch Jupyter
cd notebooks
jupyter notebook
# Open notebooks for:
# - Wordlist generation (01_create_wlist_for_tesseract.ipynb)
# - Bigram extraction (00_extract_collocation.ipynb)
# - Training image generation (02_generate_image.ipynb)
# - Model evaluation (12_test_tesseract_models.ipynb)
# Step 1: Clean text corpus
python dags/src/text_cleaner.py --input corpus.txt --output cleaned.txt
# Step 2: Generate wordlist (use Jupyter notebook)
jupyter notebook notebooks/01_create_wlist_for_tesseract.ipynb
# Output: kbd.wordlist (10K most frequent words)
# Step 3: Extract bigrams (use Jupyter notebook)
jupyter notebook notebooks/00_extract_collocation.ipynb
# Output: kbd.bigrams (common word pairs)
# Step 4: Generate synthetic training images
python dags/src/image_generator.py \
--text cleaned.txt \
--fonts fonts/ \
--output ground_truth/
# Step 5: Prepare ground truth (box + lstmf files)
python dags/src/prepare_tessdata.py --dir ground_truth/
cd tesseract/tesstrain
# Train base model
gmake training \
MODEL_NAME=kbd \
START_MODEL=rus \
MAX_ITERATIONS=100000
# Convert checkpoint to traineddata
gmake traineddata MODEL_NAME=kbd
# Install to system tessdata
cp data/kbd.traineddata /opt/homebrew/share/tessdata/
# Use evaluation notebook
jupyter notebook notebooks/12_test_tesseract_models.ipynb
# Or use lstmeval command line tool
lstmeval --model data/kbd/kbd.lstm \
--eval_listfile data/test.txt \
--verbosity 2
# Output: Character Error Rate (CER), Word Error Rate (WER)
Export trained models to tesseract-kbd-model:
# Copy trained models
cp tesseract/trained_data/kbd.traineddata \
../tesseract-kbd-model/trained_data/
cp tesseract/trained_data/kbd_finetuned.traineddata \
../tesseract-kbd-model/trained_data/
# Copy configuration files
cp tesseract/tesstrain/kbd/configs/* \
../tesseract-kbd-model/configs/
# Generate checksums
cd ../tesseract-kbd-model/trained_data
sha256sum *.traineddata > checksums.txt
zbze_ocr/
├── dags/ # Airflow DAGs for automation
│ ├── 00_train_tesseract.py # Complete training workflow
│ ├── 00_image_generate.py # Ground truth generation
│ └── src/ # Python modules
│ ├── text_cleaner.py # Corpus cleaning
│ ├── image_generator.py # Synthetic image generation
│ └── prepare_tessdata.py # Ground truth preparation
│
├── notebooks/ # Jupyter notebooks (25+)
│ ├── 01_create_wlist_for_tesseract.ipynb
│ ├── 00_extract_collocation.ipynb
│ ├── 02_generate_image.ipynb
│ ├── 12_test_tesseract_models.ipynb
│ └── ... (20+ more)
│
├── tesseract/ # Models and training workspace
│ ├── trained_data/ # Production models
│ │ ├── kbd.traineddata
│ │ └── kbd_finetuned.traineddata
│ └── tesstrain/kbd/ # Active training directory
│ ├── configs/ # Tesseract configurations
│ ├── docs/ # Training documentation
│ └── fonts/ # 48 fonts for training
│
├── cli/ # Legacy PDF preprocessing (use zbze_ocr_cli instead)
├── docs/ # Documentation (Russian: main.md)
└── data/ # Working directory (gitignored)
- Main Guide (Russian) - Complete setup and usage guide
- Training Guide - Step-by-step training instructions
- Quality Metrics - Model performance analysis
- Tesstrain Commands - Training commands reference
- Model History - Model versions and evolution
- Jupyter Notebooks - Interactive workflows in notebooks/
If you just want to use the Kabardian OCR models:
- 📦 Download models: tesseract-kbd-model
- 🔧 Process documents: zbze_ocr_cli
# Copy configuration template
cp airflow.cfg.template airflow/airflow.cfg
# Edit configuration as needed
vim airflow/airflow.cfg
# Initialize database
airflow db init
# Start scheduler
airflow scheduler
# Start webserver (in another terminal)
airflow webserver
DAGs:
- train_tesseract_kbd - Complete training pipeline
- generate_training_images - Automated image generation
Access Airflow UI at http://localhost:8080
| Document Type | kbd.traineddata | kbd_finetuned.traineddata |
|---|---|---|
| Modern books (2000-2020) | 4.5% CER | 2.9% CER |
| Historical texts (1960-1990) | 7.8% CER | 5.2% CER |
| Newspapers | 6.9% CER | 4.8% CER |
| Degraded scans | ~12% CER | ~9% CER |
- Total corpus: ~600,000 words
- Unique words: ~85,000
- Training lines: ~25,000
- Validation lines: ~3,000
- Data sources:
- Web corpora (oshamaho.ru, adyghepsale.ru)
- Digital books (1960-2020)
- Manual transcriptions of historical documents
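The figures above imply roughly a 25,000 / 3,000 split between training and validation lines, i.e. about 89% training. How the split was actually made is not stated here; the sketch below shows one common approach, a seeded shuffle followed by a cut. The function name `split_lines` and the seed are assumptions for illustration.

```python
import random


def split_lines(lines, train_ratio, seed=0):
    """Shuffle deterministically, then cut into (train, validation) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


# Ratio implied by the corpus statistics: 25,000 of 28,000 lines for training.
train, validation = split_lines(range(28_000), train_ratio=25_000 / 28_000)
print(len(train), len(validation))
```

Fixing the seed matters when comparing models: the validation set must stay the same between runs for CER figures to be comparable.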
Contributions are welcome! This is a development repository for training infrastructure.
Training Data:
- Share cleaned Kabardian text corpora
- Provide manual transcriptions of documents
- Report issues with training data quality
Model Improvements:
- Experiment with training parameters
- Test different learning rates or iteration counts
- Share results and findings
Tools & Scripts:
- Improve data preparation scripts
- Add new Jupyter notebooks
- Enhance automation workflows
Documentation:
- Add guides and tutorials
- Translate documentation
- Share training experiences
- For model distribution improvements → contribute to tesseract-kbd-model
- For OCR tool improvements → contribute to zbze_ocr_cli
- For training infrastructure → contribute to this repository
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Tesseract OCR team for the excellent OCR engine and training tools
- tesstrain framework for simplified training workflow
- Kabardian language community for corpus data and validation
- Web sources: oshamaho.ru, adyghepsale.ru for corpus materials
- Issues: GitHub Issues
- Author: Adam Panagov
- Email: a.panagoa@gmail.com
- Organization: zbze-org
Part of the zbze-org initiative to preserve and digitize Kabardian language materials