Semi-automated digitization of historical handwritten tabular records.
Reference implementation for the JCDL'24 paper The BeeProject: Advanced Digitisation and Creation of a Dataset for the Monitoring of Beehives [1]. BeeProject combines feature-based image alignment, Hough-transform grid detection, and cloud OCR to recover structured records from scanned paper forms — a setting where layout is irregular, ink degrades, and modern table-extraction models trained on born-digital PDFs fall over.
The pipeline was developed to digitize beekeeping observation records collected across five German states (Lower Saxony, Hesse, Mecklenburg-Vorpommern, Thuringia, Brandenburg) by the Institute of Bee Protection (JKI) under the MonViA project. The released ground-truth covers 3,819 scans and 30,552 annotated cells from 1998–2017. On this benchmark the pipeline achieves CER ≈ 5% and WER ≈ 13% with TrOCR or Google Vision as the OCR backend, with SIFT-based alignment correctly registering 95% of forms.

---

Status: maintained for reproducibility of the JCDL'24 paper. Issues and PRs welcome.
- How it works
- Quick start
- Installation
- CLI reference
- Output format
- OCR credentials
- Project structure
- Dataset
- References
Digitization runs in two steps:
Step 1 — Template extraction (bee extract)
A clean, averaged template is recovered from a batch of handwritten sample scans using feature matching (SIFT or ORB). The Hough line transform then detects the table grid and produces a cell map in JSON.
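The averaging idea can be illustrated with a minimal sketch. It assumes the scans are already aligned and represented as nested lists of grayscale values (0 = black, 255 = white); the function name `average_template` and the `ink_threshold` parameter are hypothetical and not part of BeeProject. Printed form lines are dark in every scan and survive the mean, while handwriting appears at a given pixel in only a few scans and is washed out.

```python
from statistics import mean


def average_template(aligned_scans, ink_threshold=128):
    """Per-pixel mean over aligned grayscale scans, followed by binarization.

    aligned_scans: list of images, each a list of rows of ints in 0..255.
    ink_threshold: hypothetical cut-off; averages below it count as ink.
    """
    height = len(aligned_scans[0])
    width = len(aligned_scans[0][0])
    template = []
    for y in range(height):
        row = []
        for x in range(width):
            avg = mean(scan[y][x] for scan in aligned_scans)
            # Ink present in (almost) every scan stays black; sparse ink fades.
            row.append(0 if avg < ink_threshold else 255)
        template.append(row)
    return template


# Three 1x4 "scans": pixel 0 is a printed line in every scan,
# pixels 1 and 3 carry handwriting in only one scan each.
scans = [
    [[0, 0, 255, 255]],
    [[0, 255, 255, 255]],
    [[0, 255, 255, 0]],
]
template = average_template(scans)
print(template)
```

More sample scans make the separation cleaner, which is presumably why the extraction step accepts a batch rather than a single image.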
Step 2 — Digitization (bee digitize)
Each scan is aligned to the template, preprocessed to separate the handwriting from the printed form background, and passed to one or more OCR services. Recognized text is mapped back to individual cells and exported as a structured JSON record.
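Mapping recognized text back to cells can be sketched as a point-in-rectangle lookup. The cell dictionaries below follow the `pt1`/`pt2` layout of the table definition file; the shape of the OCR word entries (a `box` tuple of left, top, right, bottom) and the helper names are assumptions made for illustration, not BeeProject's actual data model.

```python
def find_cell(cells, x, y):
    """Return the identifier of the cell containing point (x, y), or None."""
    for cell in cells:
        p1, p2 = cell["pt1"], cell["pt2"]
        if p1["x"] <= x < p2["x"] and p1["y"] <= y < p2["y"]:
            return cell["text"]
    return None


def assign_words(cells, words):
    """Group OCR words by the cell that contains each word's box centre."""
    by_cell = {}
    for word in words:
        left, top, right, bottom = word["box"]
        cx, cy = (left + right) / 2, (top + bottom) / 2
        cell_id = find_cell(cells, cx, cy)
        if cell_id is not None:  # words outside the grid are dropped
            by_cell.setdefault(cell_id, []).append(word["text"])
    return {cell_id: " ".join(parts) for cell_id, parts in by_cell.items()}


cells = [
    {"text": "A0", "pt1": {"x": 0, "y": 0}, "pt2": {"x": 71, "y": 287}},
    {"text": "B0", "pt1": {"x": 71, "y": 0}, "pt2": {"x": 211, "y": 287}},
]
words = [  # hypothetical OCR output
    {"text": "12.4", "box": (80, 10, 120, 30)},
    {"text": "kg", "box": (125, 10, 150, 30)},
    {"text": "stray", "box": (400, 400, 450, 420)},
]
result = assign_words(cells, words)
print(result)
```

Using the box centre rather than requiring full containment tolerates handwriting that slightly crosses a grid line.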
Run the bundled sample dataset in three steps (install, extract, digitize):
git clone https://github.com/mertova/BeeProject.git
cd BeeProject
pip install -e .

# Step 1: extract template and table structure
bee extract \
  --dataset resources/play-data/test_data_2014 \
  --reference resources/form1/reference.png \
  --output resources/play-data/extracted_form

# Step 2: digitize the filled forms
bee digitize \
  --dataset resources/play-data/test_data_2014 \
  --output resources/play-data/results \
  --credentials resources/credentials/credentials_google.json \
  --table resources/play-data/extracted_form/table.json

Requires Python 3.10 or newer.

pip install -e .

This registers the bee command globally in your environment. Verify with:

bee --help

Recovers a clean empty template from a batch of sample scans and detects the table grid.
bee extract -d DIR -r FILE [options]
| Flag | Long form | Type | Default | Description |
|---|---|---|---|---|
| -d | --dataset | path | required | Directory of sample scan images (.png) |
| -r | --reference | path | required | Representative reference image |
| -o | --output | path | ./resources/data/extraction | Output directory |
| -ev | --eps-v | int | 15 | Epsilon for vertical grid lines |
| -eh | --eps-h | int | 20 | Epsilon for horizontal grid lines |
| -l | --limit | int | 15 | Maximum number of sample images to use |
| -a | --algo | sift or orb | sift | Feature matching algorithm |
| | --transform / --no-transform | flag | on | Align samples to reference image |
| | --averaging / --no-averaging | flag | on | Enable pen elimination via averaging |
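The table does not spell out how the epsilon values are used. One common reading, offered here only as an assumption, is that they act as a merge tolerance in pixels: a Hough transform typically reports several near-duplicate lines for each printed rule, and detections closer than epsilon are collapsed into one grid line. The helper `merge_lines` below is a hypothetical sketch of that idea, not BeeProject's code.

```python
def merge_lines(positions, eps):
    """Collapse line positions that lie within eps pixels of their neighbour.

    positions: x-coordinates of vertical lines (or y-coordinates of
    horizontal ones), in any order.
    Returns one rounded mean position per cluster, in ascending order.
    """
    merged = []
    cluster = []
    for pos in sorted(positions):
        # A gap larger than eps closes the current cluster.
        if cluster and pos - cluster[-1] > eps:
            merged.append(round(sum(cluster) / len(cluster)))
            cluster = []
        cluster.append(pos)
    if cluster:
        merged.append(round(sum(cluster) / len(cluster)))
    return merged


# Seven raw detections that belong to three printed vertical rules.
raw_x = [0, 4, 69, 71, 73, 210, 212]
grid_x = merge_lines(raw_x, eps=15)
print(grid_x)
```

Under this reading, a larger epsilon merges more aggressively and suits thick or doubled rules, while a smaller one preserves narrowly spaced columns.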
Outputs written to --output:
| File | Description |
|---|---|
| template.png | Clean averaged form image |
| table.json | Cell map with coordinates for every detected cell |
Aligns, preprocesses, and OCR-processes each scan. Maps recognized text to table cells.
bee digitize -d DIR -o DIR -c FILE -t FILE [options]
| Flag | Long form | Type | Default | Description |
|---|---|---|---|---|
| -d | --dataset | path | required | Directory of filled form images (.png) |
| -o | --output | path | required | Output directory for results |
| -c | --credentials | path | required | OCR credentials .json file |
| -t | --table | path | required | Table definition .json from the extract step |
| | --no-transform | flag | off | Skip alignment to reference |
| -D | --debug | flag | off | Save intermediate images to output/debug/ |
Output written to --output:
| File | Description |
|---|---|
| out_<dataset>.json | Digitized records, keyed by image ID |
| debug/ | Preprocessed and annotated images (only with -D) |
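When several OCR services are run on the same scan, the digitized records hold one list of readings per service. The sketch below shows one possible way to post-process such a file: for each cell, keep the reading with the highest confidence. It assumes the record layout documented in the output format examples (image ID, then service name, then entries with `cell`, `text`, and `confidence`); the function `best_readings` is hypothetical and not part of the `bee` CLI.

```python
import json


def best_readings(records, image_id):
    """For one image, pick the highest-confidence reading for every cell."""
    best = {}
    for service, entries in records[image_id].items():
        for entry in entries:
            current = best.get(entry["cell"])
            if current is None or entry["confidence"] > current["confidence"]:
                best[entry["cell"]] = {
                    "text": entry["text"],
                    "confidence": entry["confidence"],
                    "service": service,
                }
    return best


# Inline sample in the documented layout; a real run would instead load
# the out_<dataset>.json file with json.load().
records = json.loads("""
{
  "1": {
    "google": [
      {"cell": "F3", "text": "12.4", "confidence": 0.91},
      {"cell": "F4", "text": "11.8", "confidence": 0.87}
    ],
    "azure": [
      {"cell": "F3", "text": "12.4", "confidence": 0.95}
    ]
  }
}
""")
best = best_readings(records, "1")
print(best)
```

Confidence scores from different vendors are not necessarily calibrated against each other, so this selection rule is a heuristic rather than a guarantee of the better transcription.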
{
"template": "resources/play-data/extracted_form/template.png",
"shape": [8, 35],
"cells": [
{ "text": "A0", "pt1": { "x": 0, "y": 0 }, "pt2": { "x": 71, "y": 287 } },
{ "text": "B0", "pt1": { "x": 71, "y": 0 }, "pt2": { "x": 211, "y": 287 } }
]
}

Cell identifiers follow spreadsheet notation: column letter + row number (A0, B3, F12, …).
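The notation can be generated and parsed with a few lines of code. The sketch below assumes zero-based column and row indices (consistent with the identifier A0) and extends to multi-letter columns such as AA in the usual spreadsheet manner; that extension and the helper names are assumptions, since only single-letter examples are documented.

```python
import re


def column_letters(col):
    """Convert a zero-based column index to letters: 0 -> A, 25 -> Z, 26 -> AA."""
    letters = ""
    col += 1  # switch to one-based for bijective base-26
    while col > 0:
        col, rem = divmod(col - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return letters


def cell_id(col, row):
    """Build an identifier such as F12 from zero-based column and row."""
    return f"{column_letters(col)}{row}"


def parse_cell_id(identifier):
    """Split an identifier such as F12 into (column index, row index)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", identifier)
    if match is None:
        raise ValueError(f"not a cell identifier: {identifier!r}")
    letters, row = match.groups()
    col = 0
    for ch in letters:
        col = col * 26 + (ord(ch) - ord("A") + 1)
    return col - 1, int(row)


print(cell_id(0, 0), cell_id(5, 12), parse_cell_id("B3"))
```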
{
"1": {
"google": [
{ "cell": "F3", "text": "12.4", "confidence": 0.91 },
{ "cell": "F4", "text": "11.8", "confidence": 0.87 }
],
"azure": [
{ "cell": "F3", "text": "12.4", "confidence": 0.95 }
]
}
}

BeeProject supports three cloud OCR services (Google Cloud Vision, Azure Computer Vision, AWS Textract) and one local engine (Tesseract). Place credential files anywhere and point to them with --credentials.
- Create a project at Google Cloud Console
- Enable the Cloud Vision API
- Create a service account and download the JSON key file
- Documentation: Cloud Vision — Handwriting
{
"type": "service_account",
"project_id": "your_project",
"private_key_id": "...",
"private_key": "-----BEGIN RSA PRIVATE KEY-----\n...",
"client_email": "google-vision@your_project.iam.gserviceaccount.com",
"client_id": "...",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://oauth2.googleapis.com/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "",
"universe_domain": "googleapis.com"
}

- Create a Computer Vision resource in the Azure Portal
- Copy your subscription key and endpoint
- Guide: Transcribing handwritten text with Azure
{
"microsoft_api_key": {
"SUBSCRIPTION_KEY": "your_subscription_key",
"ENDPOINT": "https://your_resource.cognitiveservices.azure.com/"
}
}

- Sign in to the AWS Console
- Set up IAM with Textract permissions
- Generate an access key pair
- Documentation: Getting started with Textract
{
"aws_access_key_id": "YOUR_KEY_ID",
"aws_secret_access_key": "YOUR_SECRET",
"region_name": "eu-central-1"
}

Install Tesseract OCR locally. No credential file is needed; pass any valid .json file as a placeholder.
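The three credential layouts above differ in their top-level keys, so a file can be sanity-checked before a long digitization run. The sketch below guesses the target service from those keys; `detect_service` is a hypothetical helper written against the examples in this README and does not reflect how BeeProject itself selects a connector.

```python
import json


def detect_service(credentials):
    """Guess the OCR service from the top-level keys of a credentials dict.

    Returns "google", "azure", "aws", or None when no known layout matches
    (for example, the placeholder file used with Tesseract).
    """
    if credentials.get("type") == "service_account":
        return "google"
    if "microsoft_api_key" in credentials:
        return "azure"
    if {"aws_access_key_id", "aws_secret_access_key"}.issubset(credentials):
        return "aws"
    return None


azure_example = json.loads("""
{
  "microsoft_api_key": {
    "SUBSCRIPTION_KEY": "your_subscription_key",
    "ENDPOINT": "https://your_resource.cognitiveservices.azure.com/"
  }
}
""")
print(detect_service(azure_example))
```

A check like this catches the common mistake of passing the wrong file to --credentials before any API request is made.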
BeeProject/
├── resources/
│ ├── credentials/ # OCR service credential files
│ ├── data/ # Extraction and digitization output
│ └── play-data/ # Bundled sample dataset
│ ├── test_data_2014/ # Sample scans
│ └── extracted_form/ # Pre-computed template and table
├── src/
│ ├── cli.py # CLI entry point (bee command)
│ ├── tsr.py # Table structure recognition
│ ├── form_analysis.py # Template extraction pipeline
│ ├── digitize.py # Digitization pipeline
│ ├── geometry/ # Line, rectangle, vertex primitives
│ ├── image_processing/ # Image, form, and reference processing
│ ├── ocr_services/ # Google, Azure, AWS, Tesseract connectors
│ └── table/ # Table and cell data model
├── test/ # Unit tests
├── pyproject.toml # Package config and CLI entry point
└── README.md
The dataset contains beekeeping observation records collected by the Institute of Bee Protection (JKI) from beekeeper associations in Lower Saxony, Hesse, Mecklenburg-Vorpommern, Thuringia, and Brandenburg as part of the MonViA project.
| Resource | Link |
|---|---|
| Sample dataset | GitHub — TheBeeProjectCollection |
| Full dataset | FAIRDOMHub |
[1] Lukrécia Mertová, Severin Polreich, Oleg Lewkowski, and Wolfgang Müller. 2024. The BeeProject: Advanced Digitisation and Creation of a Dataset for the Monitoring of Beehives. In The 2024 ACM/IEEE Joint Conference on Digital Libraries (JCDL ’24), December 16–20, 2024, Hong Kong, China. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3677389.3702599
[2] Lukrécia Mertová, Oleg Lewkowski, Severin Polreich, and Wolfgang Müller. 2024. BeeProject-collection [Data set]. FAIRDOMHub. https://doi.org/10.15490/FAIRDOMHUB.1.DATAFILE.7415.1

