mattmre/EDCOCR-PUBLIC

EDCOCR — Forensic-Grade OCR

EDCOCR

Forensic-Grade OCR Platform for Electronic Discovery & Document Processing


Zero-hallucination OCR · 45 languages · GPU + CPU · Chain of custody · Distributed at scale

Quick Start  ·  Executive Summary  ·  Technical Brief  ·  White Paper  ·  Use Cases

Documentation  ·  Architecture  ·  API Reference  ·  Live Demo


Most OCR tools were built for digitizing brochures.
EDCOCR was built for the day someone asks "where did this text come from, and can you prove it?"


At a Glance

| Area | Count | Detail |
|---|---|---|
| Recognition | 45 | Languages, 2 tiers |
| Pipeline | 31 | Concurrent threads |
| Outputs | 14 | Sidecar schemas |
| Topologies | 7 | T1 single-host → T7 air-gap |
| Helm | 26 | K8s templates |
| Outbound | 0 | Calls (air-gap ready) |

What Is EDCOCR?

EDCOCR is a production-grade Optical Character Recognition platform purpose-built for forensic, legal, and high-volume document processing. It turns scans, PDFs, images, and videos into searchable, auditable outputs — without the hallucinations, drift, or evidence loss that come with generative-AI OCR.

It is the work product of years of pipeline iteration. Every design decision tilts toward one outcome: a usable, defensible document at the end of the pipeline, even when the inputs are awful.

flowchart LR
    A[Source Documents<br/>PDF · TIFF · JPEG · Video] -->|Ingest| B[Scheduler]
    B -->|Chunk| C[CPU Extractors<br/>8 threads]
    C -->|300 DPI Image| D[GPU OCR Workers<br/>12 threads<br/>PaddleOCR + Tesseract]
    D -->|Page Result| E[Assembler]
    E -->|Searchable PDF<br/>+ Text + Sidecars| F[Output]
    E -->|Audit Trail| G[Chain of Custody]
    E -->|Metrics| H[Prometheus / Grafana]

    style A fill:#0ea5e9,color:#fff
    style D fill:#10b981,color:#fff
    style F fill:#f59e0b,color:#fff
    style G fill:#ef4444,color:#fff

Try It in 30 Seconds

Spin up the full stack with Docker (GPU optional — CPU works too):

git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
cp .env.example .env                # set OCR_API_KEY before starting
docker compose up -d

Drop a PDF into ocr_source/ and watch it appear under ocr_output/EXPORT/PDF/.

Or call the REST API directly:

curl -X POST http://localhost:8000/api/v1/jobs \
  -H "X-API-Key: $OCR_API_KEY" \
  -F "file=@/path/to/document.pdf"

Or use the Python SDK:

from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job("/path/to/document.pdf")
job.wait_until_complete()
print(job.text)                     # OCR'd plain text
print(job.searchable_pdf_path)      # Path to the rendered PDF

Or use the TypeScript SDK:

import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob("/path/to/document.pdf");
await job.waitUntilComplete();
console.log(job.text);

For the full installation walkthrough, see INSTALL.md. For the 5-minute getting-started guide, see docs/02-QUICKSTART-5-MINUTE-SUCCESS.md.


System Overview

EDCOCR is a layered system. Clients talk to a thin FastAPI ingress that delegates to a Django coordinator. Workers consume tasks from a RabbitMQ broker, routed by capability (GPU OCR, CPU OCR, NLP, compression). Outputs land as searchable PDFs, plain text, and 14 structured sidecar formats, backed by a tamper-evident custody log.

flowchart TB
    subgraph Clients["Clients"]
        C1[Python SDK]
        C2[TypeScript SDK]
        C3[REST API direct]
        C4[Webhook consumers]
    end

    subgraph Ingestion["Ingestion Layer"]
        API[FastAPI<br/>REST + WebSocket + SSE]
        Watcher[File Watcher<br/>local + FTP/SFTP]
        Object[Object Storage<br/>S3 · MinIO · Azure · GCS]
    end

    subgraph Coordination["Coordination Layer"]
        Coord[Django Coordinator]
        DB[(PostgreSQL)]
        Broker[(RabbitMQ)]
        Redis[(Redis<br/>Sentinel HA)]
    end

    subgraph Workers["Worker Layer"]
        WG[GPU OCR Workers]
        WC[CPU OCR Workers<br/>ONNX]
        WN[NLP Workers<br/>NER · UIE]
        WX[Compression Workers]
    end

    subgraph Output["Output Layer"]
        OutPDF[Searchable PDFs]
        OutTxt[Plain Text]
        OutSide[14 Sidecar JSONs<br/>NER · Tables · Classification ·<br/>Handwriting · Language · ...]
        Custody[Custody Log<br/>JSONL hash chain]
    end

    C1 --> API
    C2 --> API
    C3 --> API
    Watcher --> API
    Object --> API
    API --> Coord
    Coord <--> DB
    Coord <--> Broker
    Coord <--> Redis
    Broker --> WG
    Broker --> WC
    Broker --> WN
    Broker --> WX
    WG --> Output
    WC --> Output
    WN --> Output
    WX --> OutPDF
    Output -->|completion| C4

    style Workers fill:#10b981,stroke:#065f46,color:#fff
    style Output fill:#f59e0b,stroke:#92400e,color:#fff
    style Custody fill:#ef4444,stroke:#7f1d1d,color:#fff

For the full architecture walkthrough — deployment topologies (T1 single-host through T7 air-gap), failure modes, security model, and the custody hash-chain design — see ARCHITECTURE.md.


Presentation Suite

Four self-contained briefings live under presentation/. Open any HTML file in a browser — no build step, no server, no analytics.

| Briefing | Written for | Contents | Reading time · Audience |
|---|---|---|---|
| Executive Summary | Decision-makers | The one-pager explaining why a forensic-grade OCR platform exists and what it costs to ignore the difference. | ~5 min · Legal, compliance, ops leadership |
| Technical Brief | Engineers | Pipeline internals, deployment topologies, API surface, SDK examples, observability stack, security posture. | ~15 min · Integrators, SRE, platform |
| White Paper | Evaluators | Twelve sections covering motivation, design principles, output schema, translation policy, and admissibility posture. | ~25 min · Architects, evaluators, counsel |
| Use Cases | Product / legal | Seven worked scenarios with recommended topology, feature flags, and operational outcome, plus where EDCOCR is not a fit. | ~10 min · Product, legal, sales engineering |

Plus three interactive decks: presentation/index.html (marketing landing) · presentation/slides.html (keyboard-navigable slides) · presentation/architecture.html (architecture deep-dive).


Why Forensic-Grade?

| Concern | How EDCOCR Handles It |
|---|---|
| Hallucinations | CTC-only recognition (PaddleOCR). No generative model anywhere in the recognition path. |
| Lost evidence | OCR failure never discards the source image. Failed pages survive into the output PDF as image-only pages with an audit entry. |
| Crash recovery | Page-level temp files with deterministic resume. Kill the process mid-job, restart, no rework. |
| Tamper detection | SHA-256 hash-chained JSONL custody log. Append-only, replayable, signature-verifiable. |
| Chain of custody | Every document, every page, every transformation gets a custody event. Filesystem path, hash, processor identity. |
| Language drift | Two-pass adaptive detection (FastText) with per-span BCP-47 sidecar (opt-in). |
| Mixed scripts | Language re-detection without re-running OCR. |
| Privileged content | Privilege detection during structured extraction; soft-warning posture and policy-enforced redaction. |

How the Custody Chain Works

flowchart LR
    P1[Page 1<br/>SHA-256] -->|prev_hash| C1[Custody Event 1<br/>INGEST]
    C1 -->|event_hash| C2[Custody Event 2<br/>OCR_COMPLETE]
    C2 -->|event_hash| C3[Custody Event 3<br/>TRANSFORM_REDACT]
    C3 -->|event_hash| C4[Custody Event 4<br/>EXPORT_PDF]
    C4 -->|terminal_hash| V[Verifier:<br/>replay &amp; recompute<br/>each hash]

    style P1 fill:#0ea5e9,color:#fff
    style C4 fill:#10b981,color:#fff
    style V fill:#f59e0b,color:#fff

Each event in custody.py writes a JSONL record whose event_hash is SHA-256(prev_hash || canonical_event_json). Tampering with any record in the chain invalidates every record after it. The chain is append-only and replayable by anyone who has the file.
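
The chaining rule above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in custody.py: the record field names (`event`, `prev_hash`, `event_hash`), the all-zero starting hash, and the choice of sorted-key compact JSON as the canonical form are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed prev_hash for the first event in a chain


def canonical(event: dict) -> bytes:
    # Assumed canonical form: sorted keys, compact separators, UTF-8 bytes.
    return json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")


def append_event(chain: list, event: dict) -> dict:
    """Append a record whose hash covers the previous hash plus this event."""
    prev_hash = chain[-1]["event_hash"] if chain else GENESIS
    digest = hashlib.sha256(prev_hash.encode("ascii") + canonical(event)).hexdigest()
    record = {"event": event, "prev_hash": prev_hash, "event_hash": digest}
    chain.append(record)
    return record


def verify_chain(chain: list) -> int:
    """Replay the chain; return the index of the first bad record, or -1 if it verifies."""
    prev_hash = GENESIS
    for index, record in enumerate(chain):
        expected = hashlib.sha256(
            prev_hash.encode("ascii") + canonical(record["event"])
        ).hexdigest()
        if record["prev_hash"] != prev_hash or record["event_hash"] != expected:
            return index
        prev_hash = record["event_hash"]
    return -1
```

Editing an event makes its own recomputed hash disagree with the stored `event_hash`, and rewriting that stored hash breaks the `prev_hash` link of the next record, so a verifier that replays from the start finds the first point of tampering either way.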


EDCOCR vs the Alternatives

| | EDCOCR | Generative-AI OCR | Open-Source CTC Toolkits |
|---|---|---|---|
| Recognition model | CTC (PaddleOCR 2.9.1) | LLM-decoded | CTC, varies |
| Hallucination risk | None by design | Documented and material | None by design |
| Audit trail | Hash-chained JSONL custody | None standard | None standard |
| Crash resume | Page-level deterministic | Job-level at best | Manual |
| Air-gap deployable | Yes (pre-baked models) | No (calls external APIs) | Yes, but you build it |
| Distributed at scale | Helm chart + KEDA + Celery | Hosted only | DIY |
| Forensic preservation | Image-only fallback embedded | Returns "error" | Returns "error" |
| Per-tenant isolation | Built in | Hosted account boundary | DIY |
| License | Apache 2.0 | Proprietary | Mixed |
| Operational maturity | 9,000+ unit tests, 53 Grafana panels | Black box | Varies |

The forensic-vs-AI boundary is enforced in code: see docs/architecture/forensic-ai-boundary-contract.md and scripts/validate_feature_boundary.py.


Core Capabilities

Recognition

  • 45 languages in a tiered registry (34 core + 11 extended)
  • CTC-only recognition — no hallucinations possible by design
  • Adaptive DPI escalation — auto-retry low-confidence pages at 450/600 DPI
  • Image preprocessing — OpenCV-based deskew, denoise, binarize for degraded scans
  • Smart engine selection — quality-based routing between Tesseract and PaddleOCR
  • CJK vertical text — reading-order analysis for vertical Chinese, Japanese, Korean
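
The adaptive DPI escalation described above can be pictured as a short retry loop. The sketch below is illustrative only: the `ocr_at_dpi` callable, the 0.80 confidence threshold, and the keep-the-best-result policy are assumptions, not EDCOCR's actual implementation. Only the 300/450/600 DPI steps come from this README.

```python
from typing import Callable, Tuple

DPI_LADDER = (300, 450, 600)  # default DPI, then the two escalation steps
MIN_CONFIDENCE = 0.80         # assumed threshold, not taken from EDCOCR


def ocr_with_escalation(
    page_id: str,
    ocr_at_dpi: Callable[[str, int], Tuple[str, float]],
    min_confidence: float = MIN_CONFIDENCE,
) -> Tuple[str, float, int]:
    """Try each DPI in order and return (text, confidence, dpi) for the best attempt."""
    best = ("", -1.0, DPI_LADDER[0])
    for dpi in DPI_LADDER:
        text, confidence = ocr_at_dpi(page_id, dpi)  # hypothetical OCR hook
        if confidence > best[1]:
            best = (text, confidence, dpi)
        if confidence >= min_confidence:
            break  # good enough; skip the more expensive renders
    return best
```

Keeping the best attempt rather than the last one means a page that never reaches the threshold still returns its most confident text.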

Pipeline

  • 6 concurrent stages, 31 threads — async producer-consumer model
  • Page-level crash resume — deterministic recovery from any failure
  • 300 DPI default — configurable per-job
  • PDF + 18 image formats — TIFF, JPEG, PNG, BMP, GIF, WebP, JP2, etc.
  • Video ingestion — sample frames at configurable intervals
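
One way to picture page-level crash resume is a loop that persists each page result as its own file and skips pages already on disk after a restart. The sketch below makes several assumptions: the `page_NNNNNN.json` naming, the `ocr_page` callable, and the write-then-rename step are hypothetical and stand in for whatever EDCOCR does internally.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, List


def process_job(work_dir: Path, page_count: int, ocr_page: Callable[[int], str]) -> List[int]:
    """Process pages 0..page_count-1, skipping any page whose result already exists.

    Returns the page numbers processed during this call.
    """
    work_dir.mkdir(parents=True, exist_ok=True)
    processed_now = []
    for page in range(page_count):
        target = work_dir / f"page_{page:06d}.json"  # hypothetical naming scheme
        if target.exists():
            continue  # finished before the interruption; no rework
        payload = json.dumps({"page": page, "text": ocr_page(page)})
        # Write to a temp file and rename it into place, so an interrupted
        # write never leaves a half-written result that looks complete.
        fd, tmp_name = tempfile.mkstemp(dir=work_dir, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(payload)
        os.replace(tmp_name, target)
        processed_now.append(page)
    return processed_now
```

Because the set of remaining pages is derived only from what is on disk, rerunning the same job after a kill picks up at the first missing page.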

Output

  • Searchable PDFs with embedded text layer
  • Plain text extraction (UTF-8)
  • Document Intelligence sidecars — layout regions, table HTML/CSV (opt-in)
  • Structured extraction — dates, amounts, names, addresses (UIE + regex)
  • Named Entity Recognition — case numbers, Bates numbers, PII/PHI with spatial bboxes
  • Document classification — text rules + layout ensemble
  • Handwriting detection — confidence + geometry heuristics
  • Signature detection — experimental, advisory-only
  • Barcode/QR extraction + OMR checkbox detection
  • Per-span language sidecar with BCP-47 codes and confidence

Deployment

  • Docker — single-host with GPU passthrough
  • Kubernetes — production Helm chart with KEDA autoscaling
  • High availability — Redis Sentinel, PostgreSQL backup CronJob, RabbitMQ quorum queues
  • Air-gapped — pre-baked language models in Docker images, bundle/deploy scripts
  • CPU or GPU — ONNX Runtime + OpenVINO for 4-7x CPU speedup
  • Multi-GPU — per-GPU queue affinity with round-robin dispatch
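
Round-robin dispatch across per-GPU queues can be illustrated with a few lines of Python. This is a toy sketch: the `ocr.gpu{n}` queue names and the in-memory lists are assumptions made for the example, and a real deployment would publish to broker queues instead.

```python
from itertools import cycle
from typing import Dict, Iterable, List


def assign_round_robin(page_ids: Iterable[str], gpu_count: int) -> Dict[str, List[str]]:
    """Spread pages across one queue per GPU, in strict rotation."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    queue_names = [f"ocr.gpu{i}" for i in range(gpu_count)]  # hypothetical names
    queues: Dict[str, List[str]] = {name: [] for name in queue_names}
    for page_id, name in zip(page_ids, cycle(queue_names)):
        queues[name].append(page_id)
    return queues
```

With five pages and two GPUs, the first queue receives pages 1, 3, and 5, and the second receives pages 2 and 4.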

Integration

  • REST API with API-key auth, rate limiting, Pydantic validation, 413 on oversize
  • SSE streaming + WebSocket progress for real-time job updates
  • HMAC-SHA256 signed webhooks with retry + SSRF protection
  • Python SDK (pip install edcocr-sdk)
  • TypeScript SDK (npm install @edcocr/sdk)
  • Object storage — S3, MinIO, Azure Blob, GCS with presigned URLs
  • Event-driven — Kafka, SNS/SQS hooks
  • Distributed tracing — OpenTelemetry
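
A webhook consumer typically verifies an HMAC-SHA256 signature by recomputing it over the raw request body and comparing in constant time. The sketch below shows that general technique with Python's standard library. How EDCOCR transmits the signature (header name, hex versus base64 encoding) is not stated in this README, so the hex encoding here is an assumption to check against the API reference.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex-encoded HMAC-SHA256 over the raw body (encoding is assumed)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Return True only if the received signature matches the recomputed one."""
    expected = sign_payload(secret, body)
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(expected, received_signature)
```

Verify against the exact bytes received, before any JSON parsing or re-serialization, since a re-encoded body will usually not hash to the same value.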

Observability

  • Prometheus metrics — custom ORM-backed collector, 7 metric families
  • Grafana dashboard — 53 panels covering throughput, queues, GPU, costs, SLA
  • Alert rules — 5 PrometheusRule alerts shipped in Helm
  • Hash-chained audit log — JSONL custody trail
  • Per-tenant cost tracking + SLA monitoring

Deployment Topologies

flowchart TB
    subgraph T1[T1 · Single GPU]
        D1[Docker Compose<br/>1 host, 1 GPU]
    end
    subgraph T2[T2 · Single CPU]
        D2[Docker Compose<br/>1 host, ONNX]
    end
    subgraph T3[T3 · Multi-GPU]
        D3[Per-GPU queues<br/>round-robin dispatch]
    end
    subgraph T4[T4 · Distributed]
        D4[Celery + RabbitMQ<br/>Multi-VPS workers]
    end
    subgraph T6[T6 · Kubernetes]
        D6[Helm + KEDA<br/>Sentinel + quorum queues]
    end
    subgraph T7[T7 · Air-Gapped]
        D7[Pre-baked images<br/>bundle/deploy scripts]
    end

    style T1 fill:#0ea5e9,color:#fff
    style T2 fill:#10b981,color:#fff
    style T3 fill:#f59e0b,color:#fff
    style T4 fill:#8b5cf6,color:#fff
    style T6 fill:#ef4444,color:#fff
    style T7 fill:#0ea5e9,color:#fff

See docs/DEPLOYMENT-DECISION-GUIDE.md for the full topology decision tree.


Quick Start

Docker (recommended)

git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
docker compose up -d --build
docker logs -f ocr_gpu_processor

# Drop PDFs in ./ocr_source/ — searchable PDFs land in ./ocr_output/EXPORT/PDF/

Kubernetes

helm install edcocr ./helm/ocr-local \
  -f helm/ocr-local/values-production.yaml \
  --set secrets.djangoSecretKey=$(openssl rand -hex 32)

Python SDK

from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job(file="invoice.pdf")
result = client.wait_for_completion(job.id)
print(result.text)

TypeScript SDK

import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob({ file: "invoice.pdf" });
const result = await client.waitForCompletion(job.id);
console.log(result.text);

See INSTALL.md for the full installation guide and docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for a guided walkthrough.


Use Cases

EDCOCR is built for environments where OCR quality is non-negotiable and document volume is high.

  • Electronic discovery (eDiscovery). Searchable production sets with Bates stamping and chain of custody. Privilege detection during structured extraction.
  • Digital forensic investigation. Tamper-evident audit trails with replayable hash chains. Image-only fallback ensures no evidence is discarded.
  • Government records digitization. Air-gapped deployment with pre-baked language models. FOIA backlog reduction with multi-language support.
  • Healthcare records. Per-tenant isolation with PII/PHI spatial extraction. HIPAA-adjacent workflow support.
  • Insurance claims processing. High-volume batch processing with handwriting detection. Structured extraction for dates, amounts, addresses.
  • Compliance archiving. Long-term retention with deterministic re-OCR. SOC 2 / HIPAA / FedRAMP readiness documentation.

See presentation/use-cases.html for the visual treatment with topology recommendations, or docs/04-USE-CASES.md for the detailed markdown version.


Performance

Reference numbers from a single host with one NVIDIA A6000 (48 GB VRAM):

| Workload | Throughput | Notes |
|---|---|---|
| Clean PDF (text-heavy) | ~120 pages/min | 12 GPU workers, 300 DPI |
| Mixed (text + tables + figures) | ~70 pages/min | Same hardware |
| Scanned with degradation | ~40 pages/min | After preprocessing |
| Video frame extraction | 1 fps default | Configurable |

CPU-only deployments with ONNX Runtime achieve roughly 25-30% of GPU throughput at much lower per-page cost. See docs/cpu-vs-gpu-analysis.md for the full benchmark table and TCO analysis.
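
For rough capacity planning, the reference figures above translate into wall-clock estimates with simple arithmetic. The sketch below is a back-of-the-envelope helper, not a benchmark: it assumes throughput stays constant for the whole job, and it assumes the 25-30% CPU ratio applies to the same workload's GPU figure.

```python
# Approximate reference throughput from the table above, in pages per minute.
REFERENCE_PAGES_PER_MIN = {"clean": 120, "mixed": 70, "degraded": 40}


def hours_to_process(pages: int, pages_per_min: float) -> float:
    """Wall-clock hours for a job, assuming constant throughput."""
    return pages / pages_per_min / 60


def cpu_hours_range(pages: int, gpu_pages_per_min: float) -> tuple:
    """(best, worst) hours on CPU, assuming 30% and 25% of the GPU figure."""
    best = hours_to_process(pages, gpu_pages_per_min * 0.30)
    worst = hours_to_process(pages, gpu_pages_per_min * 0.25)
    return best, worst


if __name__ == "__main__":
    for workload, rate in REFERENCE_PAGES_PER_MIN.items():
        print(f"{workload}: {hours_to_process(1_000_000, rate):.1f} h per million pages")
```

Under these assumptions, one million clean pages take about 139 hours (roughly 5.8 days) on the reference GPU host, and between about 463 and 556 hours on a single CPU-only host.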


Supported Languages

Core tier (34, default, air-gapped): English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Danish, Finnish, Romanian, Polish, Czech, Hungarian, Turkish, Vietnamese, Russian, Ukrainian, Belarusian, Bulgarian, Simplified Chinese, Traditional Chinese, Japanese, Korean, Arabic, Persian, Urdu, Uyghur, Hindi, Tamil, Telugu, Kannada, Greek, Georgian.

Extended tier (+11, opt-in): Croatian, Slovak, Norwegian, Lithuanian, Latvian, Estonian, Serbian (Latin), Bengali, Marathi, Nepali, Thai.

Activate the extended tier with OCR_LANGUAGE_TIERS=core,extended.


Documentation Map

Start Here

Architecture & Design

Reference

Operations

Advanced

For Contributors

Presentation Suite


Where EDCOCR Is Not a Fit

We are explicit about non-goals so nobody buys the wrong tool:

  • Pure document understanding without provenance. If you just want a chat-with-your-PDF demo, a generative LLM with built-in OCR will get you there faster.
  • Real-time consumer scanning. EDCOCR optimizes for sustained throughput on a queue, not millisecond latency on a phone.
  • Fixed-template form auto-fill. Form-field-aware tools that understand a specific tax form's structure will out-extract a generic OCR pipeline.

See presentation/use-cases.html#not-fit for the longer treatment.


Project Status

Version 4.1.0 — Production-ready public release.

EDCOCR has been deployed in high-volume environments with page counts in the six- to seven-digit range. The pipeline, distributed coordinator, REST API, SDKs, Helm chart, and observability stack are all considered stable. Translation and per-span language detection are feature-flagged and default to OFF.

See CHANGELOG.md for release history and docs/known-issues.md for current open issues.


License

Apache License 2.0. See LICENSE for the full text and NOTICE for third-party attributions.

EDCOCR ships pre-built integrations with several third-party OCR, NLP, and ML libraries. Each retains its original license; restrictive license families (e.g. NLLB's CC-BY-NC-4.0) are flagged and gated by tenant policy.


Community

EDCOCR is an open, community-driven project. The fastest path forward is more eyes, more deployments, and more contributions from people outside the original team.

Ways to participate:

  • File a public issue. Bugs, unexpected behavior, feature ideas, documentation gaps — open an issue from the Issues tab. Templates guide you through what to include.
  • Start a discussion. Open-ended questions, design ideas, "how are you running this in production?" — those belong in Discussions. It's the lowest-friction surface for community Q&A.
  • Send a pull request. See CONTRIBUTING.md for the contribution workflow, coding conventions, and testing expectations.
  • Report a security issue privately. Use GitHub Security Advisories — do not file a public issue. See SECURITY.md for the full disclosure policy.

Want to join the team as a regular contributor? Send a direct message to @mattmre on GitHub. There is no application form — just tell us what you want to work on and roughly how much time you have. See §10 "Joining the Team" in CONTRIBUTING.md for what to include.


Contributors

Every commit, issue, review, and Discussion thread makes the project better. Thank you.


Documentation  ·  API Reference  ·  Architecture  ·  Changelog  ·  Presentation Suite  ·  Discussions

EDCOCR v4.1.0 · Apache License 2.0 · Forensic-grade OCR for the day someone asks "prove it"

About

Public version of the OCR pipeline, usable for agentic OCR, as an attachment to other systems, or as a standalone OCR service. It aims to combine GPU and CPU processing for maximum throughput. The repo is set up for both large-scale distributed OCR and local processing on a single machine.
