
| Area | Count | Detail |
|---|---|---|
| RECOGNITION | 45 | Languages, 2 tiers |
| PIPELINE | 31 | Concurrent threads |
| OUTPUTS | 14 | Sidecar schemas |
| TOPOLOGIES | 7 | T1 single-host → T7 air-gap |
| HELM | 26 | K8s templates |
| OUTBOUND | 0 | Calls (air-gap ready) |

EDCOCR is a production-grade Optical Character Recognition platform purpose-built for forensic, legal, and high-volume document processing. It turns scans, PDFs, images, and videos into searchable, auditable outputs — without the hallucinations, drift, or evidence loss that come with generative-AI OCR.
It is the work product of years of pipeline iteration. Every design decision tilts toward one outcome: a usable, defensible document at the end of the pipeline, even when the inputs are awful.

```mermaid
flowchart LR
    A[Source Documents<br/>PDF · TIFF · JPEG · Video] -->|Ingest| B[Scheduler]
    B -->|Chunk| C[CPU Extractors<br/>8 threads]
    C -->|300 DPI Image| D[GPU OCR Workers<br/>12 threads<br/>PaddleOCR + Tesseract]
    D -->|Page Result| E[Assembler]
    E -->|Searchable PDF<br/>+ Text + Sidecars| F[Output]
    E -->|Audit Trail| G[Chain of Custody]
    E -->|Metrics| H[Prometheus / Grafana]
    style A fill:#0ea5e9,color:#fff
    style D fill:#10b981,color:#fff
    style F fill:#f59e0b,color:#fff
    style G fill:#ef4444,color:#fff
```

Spin up the full stack with Docker (GPU optional — CPU works too):

```bash
git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
cp .env.example .env  # set OCR_API_KEY before starting
docker compose up -d
```

Drop a PDF into ocr_source/ and watch it appear under ocr_output/EXPORT/PDF/.

Or call the REST API directly:

```bash
curl -X POST http://localhost:8000/api/v1/jobs \
  -H "X-API-Key: $OCR_API_KEY" \
  -F "file=@/path/to/document.pdf"
```

Or use the Python SDK:

```python
from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job("/path/to/document.pdf")
job.wait_until_complete()
print(job.text)                  # OCR'd plain text
print(job.searchable_pdf_path)   # Path to the rendered PDF
```

Or use the TypeScript SDK:

```typescript
import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob("/path/to/document.pdf");
await job.waitUntilComplete();
console.log(job.text);
```

For the full installation walkthrough, see INSTALL.md. For the 5-minute getting-started guide, see docs/02-QUICKSTART-5-MINUTE-SUCCESS.md.
EDCOCR is a layered system. Clients talk to a thin FastAPI ingress that delegates to a Django coordinator; workers consume jobs from a RabbitMQ broker through capability-based routing; outputs land in PDF, plain text, and 14 structured sidecar formats backed by a tamper-evident custody log.

```mermaid
flowchart TB
    subgraph Clients["Clients"]
        C1[Python SDK]
        C2[TypeScript SDK]
        C3[REST API direct]
        C4[Webhook consumers]
    end
    subgraph Ingestion["Ingestion Layer"]
        API[FastAPI<br/>REST + WebSocket + SSE]
        Watcher[File Watcher<br/>local + FTP/SFTP]
        Object[Object Storage<br/>S3 · MinIO · Azure · GCS]
    end
    subgraph Coordination["Coordination Layer"]
        Coord[Django Coordinator]
        DB[(PostgreSQL)]
        Broker[(RabbitMQ)]
        Redis[(Redis<br/>Sentinel HA)]
    end
    subgraph Workers["Worker Layer"]
        WG[GPU OCR Workers]
        WC[CPU OCR Workers<br/>ONNX]
        WN[NLP Workers<br/>NER · UIE]
        WX[Compression Workers]
    end
    subgraph Output["Output Layer"]
        OutPDF[Searchable PDFs]
        OutTxt[Plain Text]
        OutSide[14 Sidecar JSONs<br/>NER · Tables · Classification ·<br/>Handwriting · Language · ...]
        Custody[Custody Log<br/>JSONL hash chain]
    end
    C1 --> API
    C2 --> API
    C3 --> API
    Watcher --> API
    Object --> API
    API --> Coord
    Coord <--> DB
    Coord <--> Broker
    Coord <--> Redis
    Broker --> WG
    Broker --> WC
    Broker --> WN
    Broker --> WX
    WG --> Output
    WC --> Output
    WN --> Output
    WX --> OutPDF
    Output -->|completion| C4
    style Workers fill:#10b981,stroke:#065f46,color:#fff
    style Output fill:#f59e0b,stroke:#92400e,color:#fff
    style Custody fill:#ef4444,stroke:#7f1d1d,color:#fff
```

For the full architecture walkthrough — deployment topologies (T1 single-host through T7 air-gap), failure modes, security model, and the custody hash-chain design — see ARCHITECTURE.md.
Four self-contained briefings live under presentation/. Open any HTML file in a browser — no build step, no server, no analytics.

| Briefing | Audience | What it covers | Reading time · Readers |
|---|---|---|---|
| Executive Summary | For decision-makers | The one-pager explaining why a forensic-grade OCR platform exists and what it costs to ignore the difference. | ~5 min · Legal, compliance, ops leadership |
| Technical Brief | For engineers | Pipeline internals, deployment topologies, API surface, SDK examples, observability stack, security posture. | ~15 min · Integrators, SRE, platform |
| White Paper | For evaluators | Twelve sections covering motivation, design principles, output schema, translation policy, and admissibility posture. | ~25 min · Architects, evaluators, counsel |
| Use Cases | For product / legal | Seven worked scenarios with recommended topology, feature flags, and operational outcome, plus where EDCOCR is not a fit. | ~10 min · Product, legal, sales engineering |

Plus three interactive decks: presentation/index.html (marketing landing) · presentation/slides.html (keyboard-navigable slides) · presentation/architecture.html (architecture deep-dive).
| Concern | How EDCOCR Handles It |
|---|---|
| Hallucinations | CTC-only recognition (PaddleOCR). No generative model anywhere in the recognition path. |
| Lost evidence | OCR failure never discards the source image. Failed pages survive into the output PDF as image-only pages with an audit entry. |
| Crash recovery | Page-level temp files with deterministic resume. Kill the process mid-job, restart, no rework. |
| Tamper detection | SHA-256 hash-chained JSONL custody log. Append-only, replayable, signature-verifiable. |
| Chain of custody | Every document, every page, every transformation gets a custody event. Filesystem path, hash, processor identity. |
| Language drift | Two-pass adaptive detection (FastText) with per-span BCP-47 sidecar (opt-in). |
| Mixed scripts | Language re-detection without re-running OCR. |
| Privileged content | Privilege detection during structured extraction; soft-warning posture and policy-enforced redaction. |
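
The crash-recovery row in the table above can be illustrated with a minimal sketch. It assumes one checkpoint file per page, written under a temporary name and then renamed atomically, so a restart skips every page that already has a checkpoint. The function name `process_job`, the file naming scheme, and the JSON layout are hypothetical and are not taken from the EDCOCR codebase.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def process_job(page_count: int, work_dir: Path, ocr_page: Callable[[int], str]) -> List[int]:
    """OCR every page that has no checkpoint yet; return the pages done in this run."""
    work_dir.mkdir(parents=True, exist_ok=True)
    processed = []
    for page in range(1, page_count + 1):
        final = work_dir / f"page_{page:06d}.json"
        if final.exists():
            continue  # checkpointed by an earlier run, no rework needed
        partial = work_dir / f"page_{page:06d}.json.part"
        partial.write_text(json.dumps({"page": page, "text": ocr_page(page)}))
        os.replace(partial, final)  # atomic rename: a page is either complete or absent
        processed.append(page)
    return processed


with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "job-0001"

    def crash_on_page_3(page: int) -> str:
        # Simulates the process dying partway through the job.
        if page == 3:
            raise RuntimeError("simulated crash")
        return f"text of page {page}"

    try:
        process_job(5, work, crash_on_page_3)
    except RuntimeError:
        pass  # pages 1 and 2 are already checkpointed on disk

    # The second run resumes at the first page without a checkpoint.
    resumed = process_job(5, work, lambda page: f"text of page {page}")
    print(resumed)
```

Because the set of finished pages is derived only from which checkpoint files exist, the resume point is the same no matter when the process was interrupted.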

```mermaid
flowchart LR
    P1[Page 1<br/>SHA-256] -->|prev_hash| C1[Custody Event 1<br/>INGEST]
    C1 -->|event_hash| C2[Custody Event 2<br/>OCR_COMPLETE]
    C2 -->|event_hash| C3[Custody Event 3<br/>TRANSFORM_REDACT]
    C3 -->|event_hash| C4[Custody Event 4<br/>EXPORT_PDF]
    C4 -->|terminal_hash| V[Verifier:<br/>replay & recompute<br/>each hash]
    style P1 fill:#0ea5e9,color:#fff
    style C4 fill:#10b981,color:#fff
    style V fill:#f59e0b,color:#fff
```

Each event in custody.py writes a JSONL record whose event_hash is SHA-256(prev_hash || canonical_event_json). Tampering with any record in the chain invalidates every record after it. The chain is append-only and replayable by anyone who has the file.
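
The hash rule above can be sketched in a few lines of Python. This is an illustration of the general technique, not the contents of custody.py: the record field layout, the all-zero genesis hash, and the use of sorted-key JSON as the canonical form are all assumptions made for the example.

```python
import hashlib
import json
from typing import List, Optional

GENESIS_HASH = "0" * 64  # assumed prev_hash for the first event


def canonical_json(event: dict) -> str:
    # Sorted keys and fixed separators give a stable serialization (assumed canonical form).
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def event_hash(prev_hash: str, event: dict) -> str:
    # SHA-256(prev_hash || canonical_event_json)
    payload = prev_hash.encode("utf-8") + canonical_json(event).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_event(chain: List[dict], event: dict) -> dict:
    prev = chain[-1]["event_hash"] if chain else GENESIS_HASH
    record = {"prev_hash": prev, "event": event, "event_hash": event_hash(prev, event)}
    chain.append(record)
    return record


def verify_chain(chain: List[dict]) -> Optional[int]:
    """Replay the chain; return the index of the first invalid record, or None if intact."""
    prev = GENESIS_HASH
    for index, record in enumerate(chain):
        recomputed = event_hash(prev, record["event"])
        if record["prev_hash"] != prev or record["event_hash"] != recomputed:
            return index
        prev = record["event_hash"]
    return None


chain: List[dict] = []
for kind in ("INGEST", "OCR_COMPLETE", "TRANSFORM_REDACT", "EXPORT_PDF"):
    append_event(chain, {"type": kind, "document": "example.pdf"})

# One JSON object per line, as in a JSONL log.
jsonl = "\n".join(json.dumps(record, sort_keys=True) for record in chain)
print(verify_chain(chain))
```

Editing an event makes its own hash check fail, and recomputing that one hash to hide the edit breaks the `prev_hash` link of the next record, which is why a verifier only needs the file itself to detect tampering.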
| | EDCOCR | Generative-AI OCR | Open-Source CTC Toolkits |
|---|---|---|---|
| Recognition model | CTC (PaddleOCR 2.9.1) | LLM-decoded | CTC, varies |
| Hallucination risk | None by design | Documented and material | None by design |
| Audit trail | Hash-chained JSONL custody | None standard | None standard |
| Crash resume | Page-level deterministic | Job-level at best | Manual |
| Air-gap deployable | Yes — pre-baked models | No (calls external APIs) | Yes, but you build it |
| Distributed at scale | Helm chart + KEDA + Celery | Hosted only | DIY |
| Forensic preservation | Image-only fallback embedded | Returns "error" | Returns "error" |
| Per-tenant isolation | Built in | Hosted account boundary | DIY |
| License | Apache 2.0 | Proprietary | Mixed |
| Operational maturity | 9,000+ unit tests, 53 Grafana panels | Black box | Varies |
The forensic-vs-AI boundary is enforced in code: see docs/architecture/forensic-ai-boundary-contract.md and scripts/validate_feature_boundary.py.
- 45 languages in a tiered registry (34 core + 11 extended)
- CTC-only recognition — no hallucinations possible by design
- Adaptive DPI escalation — auto-retry low-confidence pages at 450/600 DPI
- Image preprocessing — OpenCV-based deskew, denoise, binarize for degraded scans
- Smart engine selection — quality-based routing between Tesseract and PaddleOCR
- CJK vertical text — reading-order analysis for vertical Chinese, Japanese, Korean
- 6 concurrent stages, 31 threads — async producer-consumer model
- Page-level crash resume — deterministic recovery from any failure
- 300 DPI default — configurable per-job
- PDF + 18 image formats — TIFF, JPEG, PNG, BMP, GIF, WebP, JP2, etc.
- Video ingestion — sample frames at configurable intervals
- Searchable PDFs with embedded text layer
- Plain text extraction (UTF-8)
- Document Intelligence sidecars — layout regions, table HTML/CSV (opt-in)
- Structured extraction — dates, amounts, names, addresses (UIE + regex)
- Named Entity Recognition — case numbers, Bates numbers, PII/PHI with spatial bboxes
- Document classification — text rules + layout ensemble
- Handwriting detection — confidence + geometry heuristics
- Signature detection — experimental, advisory-only
- Barcode/QR extraction + OMR checkbox detection
- Per-span language sidecar with BCP-47 codes and confidence
- Docker — single-host with GPU passthrough
- Kubernetes — production Helm chart with KEDA autoscaling
- High availability — Redis Sentinel, PostgreSQL backup CronJob, RabbitMQ quorum queues
- Air-gapped — pre-baked language models in Docker images, bundle/deploy scripts
- CPU or GPU — ONNX Runtime + OpenVINO for 4-7x CPU speedup
- Multi-GPU — per-GPU queue affinity with round-robin dispatch
- REST API with API-key auth, rate limiting, Pydantic validation, 413 on oversize
- SSE streaming + WebSocket progress for real-time job updates
- HMAC-SHA256 signed webhooks with retry + SSRF protection
- Python SDK (pip install edcocr-sdk)
- TypeScript SDK (npm install @edcocr/sdk)
- Object storage: S3, MinIO, Azure Blob, GCS with presigned URLs
- Event-driven — Kafka, SNS/SQS hooks
- Distributed tracing — OpenTelemetry
- Prometheus metrics — custom ORM-backed collector, 7 metric families
- Grafana dashboard — 53 panels covering throughput, queues, GPU, costs, SLA
- Alert rules — 5 PrometheusRule alerts shipped in Helm
- Hash-chained audit log — JSONL custody trail
- Per-tenant cost tracking + SLA monitoring
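
For integrators consuming the HMAC-SHA256 signed webhooks listed above, a receiver would typically recompute the signature over the raw request body and compare it in constant time. The sketch below shows that general pattern with the Python standard library. How EDCOCR transmits the signature (header name, hex versus base64 encoding) is not stated here, so the hex encoding and the function names are assumptions; check docs/API-REFERENCE.md for the actual contract.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    # HMAC-SHA256 over the raw body, hex-encoded (encoding is an assumption).
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received_signature)


secret = b"example-shared-secret"  # hypothetical shared secret
body = b'{"job_id": "job-123", "status": "completed"}'  # hypothetical payload
signature = sign_payload(secret, body)

print(verify_signature(secret, body, signature))         # untouched body
print(verify_signature(secret, body + b" ", signature))  # body altered in transit
```

Verify against the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace or key order and invalidate an otherwise genuine signature.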

```mermaid
flowchart TB
    subgraph T1[T1 · Single GPU]
        D1[Docker Compose<br/>1 host, 1 GPU]
    end
    subgraph T2[T2 · Single CPU]
        D2[Docker Compose<br/>1 host, ONNX]
    end
    subgraph T3[T3 · Multi-GPU]
        D3[Per-GPU queues<br/>round-robin dispatch]
    end
    subgraph T4[T4 · Distributed]
        D4[Celery + RabbitMQ<br/>Multi-VPS workers]
    end
    subgraph T6[T6 · Kubernetes]
        D6[Helm + KEDA<br/>Sentinel + quorum queues]
    end
    subgraph T7[T7 · Air-Gapped]
        D7[Pre-baked images<br/>bundle/deploy scripts]
    end
    style T1 fill:#0ea5e9,color:#fff
    style T2 fill:#10b981,color:#fff
    style T3 fill:#f59e0b,color:#fff
    style T4 fill:#8b5cf6,color:#fff
    style T6 fill:#ef4444,color:#fff
    style T7 fill:#0ea5e9,color:#fff
```

See docs/DEPLOYMENT-DECISION-GUIDE.md for the full topology decision tree.

Docker Compose:

```bash
git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
docker compose up -d --build
docker logs -f ocr_gpu_processor
# Drop PDFs in ./ocr_source/; searchable PDFs land in ./ocr_output/EXPORT/PDF/
```

Kubernetes (Helm):

```bash
helm install edcocr ./helm/ocr-local \
  -f helm/ocr-local/values-production.yaml \
  --set secrets.djangoSecretKey=$(openssl rand -hex 32)
```

Python SDK:

```python
from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job(file="invoice.pdf")
result = client.wait_for_completion(job.id)
print(result.text)
```

TypeScript SDK:

```typescript
import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob({ file: "invoice.pdf" });
const result = await client.waitForCompletion(job.id);
console.log(result.text);
```

See INSTALL.md for the full installation guide and docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for a guided walkthrough.
EDCOCR is built for environments where OCR quality is non-negotiable and document volume is high.

| Use case | What EDCOCR provides |
|---|---|
| Electronic discovery (eDiscovery) | Searchable production sets with Bates stamping and chain of custody. Privilege detection during structured extraction. |
| Digital forensic investigation | Tamper-evident audit trails with replayable hash chains. Image-only fallback ensures no evidence is discarded. |
| Government records digitization | Air-gapped deployment with pre-baked language models. FOIA backlog reduction with multi-language support. |
| Healthcare records | Per-tenant isolation with PII/PHI spatial extraction. HIPAA-adjacent workflow support. |
| Insurance claims processing | High-volume batch processing with handwriting detection. Structured extraction for dates, amounts, addresses. |
| Compliance archiving | Long-term retention with deterministic re-OCR. SOC 2 / HIPAA / FedRAMP readiness documentation. |

See presentation/use-cases.html for the visual treatment with topology recommendations, or docs/04-USE-CASES.md for the detailed markdown version.
Reference numbers from a single host with one NVIDIA A6000 (48 GB VRAM):
| Workload | Throughput | Notes |
|---|---|---|
| Clean PDF (text-heavy) | ~120 pages/min | 12 GPU workers, 300 DPI |
| Mixed (text + tables + figures) | ~70 pages/min | Same hardware |
| Scanned with degradation | ~40 pages/min | After preprocessing |
| Video frame extraction | 1 fps default | Configurable |
CPU-only deployments with ONNX Runtime achieve roughly 25-30% of GPU throughput at much lower per-page cost. See docs/cpu-vs-gpu-analysis.md for the full benchmark table and TCO analysis.
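
To turn these reference rates into a rough sizing estimate, the arithmetic is simply pages divided by pages per minute. The sketch below applies it to a hypothetical backlog of one million pages, using the GPU rates from the table and the 25-30% CPU ratio quoted above. It assumes sustained throughput on a single host with no queueing or I/O stalls, so treat the results as a lower bound on wall-clock time rather than a benchmark.

```python
def hours_to_process(pages: int, pages_per_minute: float) -> float:
    """Wall-clock hours at a sustained rate, ignoring startup and I/O stalls."""
    return pages / pages_per_minute / 60


# Approximate single-host GPU rates in pages/min, copied from the table above.
GPU_RATES = {"clean": 120, "mixed": 70, "degraded": 40}
# CPU-only throughput as a fraction of GPU throughput (25-30% per the text above).
CPU_FRACTION = (0.25, 0.30)

backlog_pages = 1_000_000  # hypothetical backlog size

for workload, rate in GPU_RATES.items():
    gpu_hours = hours_to_process(backlog_pages, rate)
    cpu_best = hours_to_process(backlog_pages, rate * CPU_FRACTION[1])
    cpu_worst = hours_to_process(backlog_pages, rate * CPU_FRACTION[0])
    print(f"{workload:>8}: GPU {gpu_hours:7.1f} h | CPU {cpu_best:7.1f} to {cpu_worst:7.1f} h")
```

The same function can be reused with measured rates from your own hardware once a pilot batch has run.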
Core tier (34, default, air-gapped): English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Danish, Finnish, Romanian, Polish, Czech, Hungarian, Turkish, Vietnamese, Russian, Ukrainian, Belarusian, Bulgarian, Simplified Chinese, Traditional Chinese, Japanese, Korean, Arabic, Persian, Urdu, Uyghur, Hindi, Tamil, Telugu, Kannada, Greek, Georgian.
Extended tier (+11, opt-in): Croatian, Slovak, Norwegian, Lithuanian, Latvian, Estonian, Serbian (Latin), Bengali, Marathi, Nepali, Thai.
Activate the extended tier with OCR_LANGUAGE_TIERS=core,extended.
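
As an illustration of how a comma-separated tier setting like this can be interpreted, here is a minimal sketch. It is not the EDCOCR registry code: the helper names are hypothetical, the fallback to `core` when the variable is unset follows the "default" label above but is still an assumption, and whitespace trimming and case folding are added conveniences.

```python
import os
from typing import List, Mapping, Optional

# Language counts per tier, taken from the two lists above.
TIER_SIZES = {"core": 34, "extended": 11}


def active_tiers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Parse OCR_LANGUAGE_TIERS into a list of tier names (assumed default: core)."""
    source = os.environ if env is None else env
    raw = source.get("OCR_LANGUAGE_TIERS", "core")
    tiers = [name.strip().lower() for name in raw.split(",") if name.strip()]
    unknown = [name for name in tiers if name not in TIER_SIZES]
    if unknown:
        raise ValueError(f"unknown language tier(s): {', '.join(unknown)}")
    return tiers


def language_count(tiers: List[str]) -> int:
    # A set guards against a tier being listed twice.
    return sum(TIER_SIZES[name] for name in set(tiers))


tiers = active_tiers({"OCR_LANGUAGE_TIERS": "core,extended"})
print(tiers, language_count(tiers))
```

With both tiers active the count matches the 45 languages advertised at the top of this README; with the variable unset, the sketch falls back to the 34 core languages.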
- INSTALL.md — Install on Docker, Kubernetes, or bare metal
- ARCHITECTURE.md — Top-level architecture with diagrams
- docs/DEPLOYMENT-DECISION-GUIDE.md — Pick the right topology (decision tree)
- docs/WHITE-PAPER.md — Technical white paper (Markdown source)
- docs/02-QUICKSTART-5-MINUTE-SUCCESS.md — 5-minute walkthrough
- docs/04-USE-CASES.md — When to use EDCOCR
- docs/00-SYSTEM-BLUEPRINT.md — System architecture
- docs/01-TECH-STACK-DNA.md — Technology stack
- docs/03-INFORMATION-FLOWS.md — End-to-end data flow
- docs/05-INTERACTIVE-WALKTHROUGH.md — Guided tour
- docs/06-CONFIGURATION-REFERENCE.md — All env vars
- docs/API-REFERENCE.md — REST API
- docs/08-SDK-REFERENCE.md — Python + TypeScript SDKs
- docs/07-TRANSFORMS-STAMPING.md — Bates stamping, redaction
- docs/10-MONITORING-OPERATIONS.md — Prometheus + Grafana
- docs/FAILOVER-RUNBOOK.md — HA failover procedures
- docs/09-TROUBLESHOOTING.md — Common issues
- docs/cpu-vs-gpu-analysis.md — Deployment sizing
- docs/11-ML-TRAINING-GUIDE.md — LayoutLMv3 fine-tuning
- docs/benchmarking-methodology.md — Performance benchmarks
- docs/security-audit-checklist.md — Security review
- CONTRIBUTING.md — How to contribute
- DEVELOPMENT.md — Development guide
- SECURITY.md — Reporting security issues
- CHANGELOG.md — Release history
- presentation/executive-summary.html — Decision-maker briefing
- presentation/technical-brief.html — Engineer-audience deep dive
- presentation/white-paper.html — Rendered HTML white paper
- presentation/use-cases.html — Worked scenarios
- presentation/index.html — Marketing landing page
- presentation/slides.html — Slide deck (keyboard nav)
- presentation/architecture.html — Architecture walkthrough
We are explicit about non-goals so nobody buys the wrong tool:
- Pure document understanding without provenance. If you just want a chat-with-your-PDF demo, a generative LLM with built-in OCR will get you there faster.
- Real-time consumer scanning. EDCOCR optimizes for sustained throughput on a queue, not millisecond latency on a phone.
- Fixed-template form auto-fill. Form-field-aware tools that understand a specific tax form's structure will out-extract a generic OCR pipeline.
See presentation/use-cases.html#not-fit for the longer treatment.
Version 4.1.0 — Production-ready public release.
EDCOCR has been deployed in high-volume environments processing page counts in the six- to seven-digit range. The pipeline, distributed coordinator, REST API, SDKs, Helm chart, and observability stack are all considered stable. Translation and per-span language detection are feature-flagged and default to OFF.
See CHANGELOG.md for release history and docs/known-issues.md for current open issues.
Apache License 2.0. See LICENSE for the full text and NOTICE for third-party attributions.
EDCOCR ships pre-built integrations with several third-party OCR, NLP, and ML libraries. Each retains its original license; restrictive license families (e.g. NLLB's CC-BY-NC-4.0) are flagged and gated by tenant policy.
EDCOCR is an open, community-driven project. The fastest path forward is more eyes, more deployments, and more contributions from people outside the original team.
Ways to participate:
- File a public issue. Bugs, unexpected behavior, feature ideas, documentation gaps — open an issue from the Issues tab. Templates guide you through what to include.
- Start a discussion. Open-ended questions, design ideas, "how are you running this in production?" — those belong in Discussions. It's the lowest-friction surface for community Q&A.
- Send a pull request. See CONTRIBUTING.md for the contribution workflow, coding conventions, and testing expectations.
- Report a security issue privately. Use GitHub Security Advisories; do not file a public issue. See SECURITY.md for the full disclosure policy.
Want to join the team as a regular contributor? Send a direct message to @mattmre on GitHub. There is no application form — just tell us what you want to work on and roughly how much time you have. See §10 "Joining the Team" in CONTRIBUTING.md for what to include.
Every commit, issue, review, and Discussion thread makes the project better. Thank you.
Documentation · API Reference · Architecture · Changelog · Presentation Suite · Discussions
EDCOCR v4.1.0 · Apache License 2.0 · Forensic-grade OCR for the day someone asks "prove it"
