EDCOCR

Forensic-Grade OCR Platform for Electronic Discovery & Document Processing

Zero-hallucination OCR · 45 languages · GPU + CPU · Chain of custody · Distributed at scale

Quick Start · Executive Summary · Technical Brief · White Paper · Use Cases

Documentation · Architecture · API Reference · Live Demo

Most OCR tools were built for digitizing brochures.
EDCOCR was built for the day someone asks "where did this text come from, and can you prove it?"

At a Glance

Recognition: 45 languages in 2 tiers
Pipeline: 31 concurrent threads
Outputs: 14 sidecar schemas
Topologies: 7 (T1 single-host → T7 air-gap)
Helm: 26 Kubernetes templates
Outbound calls: 0 (air-gap ready)

What Is EDCOCR?

EDCOCR is a production-grade Optical Character Recognition platform purpose-built for forensic, legal, and high-volume document processing. It turns scans, PDFs, images, and videos into searchable, auditable outputs — without the hallucinations, drift, or evidence loss that come with generative-AI OCR.

It is the work product of years of pipeline iteration. Every design decision tilts toward one outcome: a usable, defensible document at the end of the pipeline, even when the inputs are awful.

flowchart LR
    A[Source Documents<br/>PDF · TIFF · JPEG · Video] -->|Ingest| B[Scheduler]
    B -->|Chunk| C[CPU Extractors<br/>8 threads]
    C -->|300 DPI Image| D[GPU OCR Workers<br/>12 threads<br/>PaddleOCR + Tesseract]
    D -->|Page Result| E[Assembler]
    E -->|Searchable PDF<br/>+ Text + Sidecars| F[Output]
    E -->|Audit Trail| G[Chain of Custody]
    E -->|Metrics| H[Prometheus / Grafana]

    style A fill:#0ea5e9,color:#fff
    style D fill:#10b981,color:#fff
    style F fill:#f59e0b,color:#fff
    style G fill:#ef4444,color:#fff

Try It in 30 Seconds

Spin up the full stack with Docker (GPU optional — CPU works too):

git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
cp .env.example .env                # set OCR_API_KEY before starting
docker compose up -d

Drop a PDF into ocr_source/ and watch it appear under ocr_output/EXPORT/PDF/.

Or call the REST API directly:

curl -X POST http://localhost:8000/api/v1/jobs \
  -H "X-API-Key: $OCR_API_KEY" \
  -F "file=@/path/to/document.pdf"

Or use the Python SDK:

from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job("/path/to/document.pdf")
job.wait_until_complete()
print(job.text)                     # OCR'd plain text
print(job.searchable_pdf_path)      # Path to the rendered PDF

Or use the TypeScript SDK:

import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob("/path/to/document.pdf");
await job.waitUntilComplete();
console.log(job.text);

For the full installation walkthrough, see INSTALL.md. For the 5-minute getting-started guide, see docs/02-QUICKSTART-5-MINUTE-SUCCESS.md.

System Overview

EDCOCR is a layered system. Clients talk to a thin FastAPI ingress that delegates to a Django coordinator; a RabbitMQ broker routes work to workers by capability; outputs land in PDF, plain text, and 14 structured sidecar formats, backed by a tamper-evident custody log.

flowchart TB
    subgraph Clients["Clients"]
        C1[Python SDK]
        C2[TypeScript SDK]
        C3[REST API direct]
        C4[Webhook consumers]
    end

    subgraph Ingestion["Ingestion Layer"]
        API[FastAPI<br/>REST + WebSocket + SSE]
        Watcher[File Watcher<br/>local + FTP/SFTP]
        Object[Object Storage<br/>S3 · MinIO · Azure · GCS]
    end

    subgraph Coordination["Coordination Layer"]
        Coord[Django Coordinator]
        DB[(PostgreSQL)]
        Broker[(RabbitMQ)]
        Redis[(Redis<br/>Sentinel HA)]
    end

    subgraph Workers["Worker Layer"]
        WG[GPU OCR Workers]
        WC[CPU OCR Workers<br/>ONNX]
        WN[NLP Workers<br/>NER · UIE]
        WX[Compression Workers]
    end

    subgraph Output["Output Layer"]
        OutPDF[Searchable PDFs]
        OutTxt[Plain Text]
        OutSide[14 Sidecar JSONs<br/>NER · Tables · Classification ·<br/>Handwriting · Language · ...]
        Custody[Custody Log<br/>JSONL hash chain]
    end

    C1 --> API
    C2 --> API
    C3 --> API
    Watcher --> API
    Object --> API
    API --> Coord
    Coord <--> DB
    Coord <--> Broker
    Coord <--> Redis
    Broker --> WG
    Broker --> WC
    Broker --> WN
    Broker --> WX
    WG --> Output
    WC --> Output
    WN --> Output
    WX --> OutPDF
    Output -->|completion| C4

    style Workers fill:#10b981,stroke:#065f46,color:#fff
    style Output fill:#f59e0b,stroke:#92400e,color:#fff
    style Custody fill:#ef4444,stroke:#7f1d1d,color:#fff

For the full architecture walkthrough — deployment topologies (T1 single-host through T7 air-gap), failure modes, security model, and the custody hash-chain design — see ARCHITECTURE.md.

Presentation Suite

Four self-contained briefings live under presentation/. Open any HTML file in a browser — no build step, no server, no analytics.

Executive Summary (for decision-makers). The one-pager explaining why a forensic-grade OCR platform exists and what it costs to ignore the difference. About 5 minutes; for legal, compliance, and ops leadership.
Technical Brief (for engineers). Pipeline internals, deployment topologies, API surface, SDK examples, observability stack, security posture. About 15 minutes; for integrators, SRE, and platform teams.
White Paper (for evaluators). Twelve sections covering motivation, design principles, output schema, translation policy, and admissibility posture. About 25 minutes; for architects, evaluators, and counsel.
Use Cases (for product / legal). Seven worked scenarios with recommended topology, feature flags, and operational outcome, plus where EDCOCR is not a fit. About 10 minutes; for product, legal, and sales engineering.

Plus three interactive decks: presentation/index.html (marketing landing) · presentation/slides.html (keyboard-navigable slides) · presentation/architecture.html (architecture deep-dive).

Why Forensic-Grade?

Concern	How EDCOCR Handles It
Hallucinations	CTC-only recognition (PaddleOCR). No generative model anywhere in the recognition path.
Lost evidence	OCR failure never discards the source image. Failed pages survive into the output PDF as image-only pages with an audit entry.
Crash recovery	Page-level temp files with deterministic resume. Kill the process mid-job, restart, no rework.
Tamper detection	SHA-256 hash-chained JSONL custody log. Append-only, replayable, signature-verifiable.
Chain of custody	Every document, every page, every transformation gets a custody event. Filesystem path, hash, processor identity.
Language drift	Two-pass adaptive detection (FastText) with per-span BCP-47 sidecar (opt-in).
Mixed scripts	Language re-detection without re-running OCR.
Privileged content	Privilege detection during structured extraction; soft-warning posture and policy-enforced redaction.
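
The "Crash recovery" row can be illustrated with a small sketch. This is not EDCOCR's actual implementation: the file naming, the JSON result format, and the `process_job` helper are all hypothetical. It only shows the general idea of writing one result file per page and skipping pages whose file already exists when a job is restarted.

```python
import json
import tempfile
from pathlib import Path


def process_job(pages, work_dir: Path, ocr_page):
    """OCR each page once; pages with an existing result file are skipped on restart."""
    processed_now = []
    for page_no in pages:
        done_file = work_dir / f"page_{page_no:05d}.json"  # hypothetical naming scheme
        if done_file.exists():
            continue  # this page finished before the interruption
        result = {"page": page_no, "text": ocr_page(page_no)}
        tmp_file = done_file.with_name(done_file.name + ".tmp")
        tmp_file.write_text(json.dumps(result), encoding="utf-8")
        tmp_file.replace(done_file)  # rename last, so a result file is complete or absent
        processed_now.append(page_no)
    return processed_now


with tempfile.TemporaryDirectory() as d:
    work_dir = Path(d)
    # First run stops after page 3 (standing in for a killed process).
    first_run = process_job([1, 2, 3], work_dir, lambda n: f"text {n}")
    # Restart with the full page list: only pages 4 and 5 are processed.
    second_run = process_job([1, 2, 3, 4, 5], work_dir, lambda n: f"text {n}")
```

Writing to a temporary name and renaming at the end is what makes the resume deterministic in this sketch: a half-written page never looks finished.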

How the Custody Chain Works

flowchart LR
    P1[Page 1<br/>SHA-256] -->|prev_hash| C1[Custody Event 1<br/>INGEST]
    C1 -->|event_hash| C2[Custody Event 2<br/>OCR_COMPLETE]
    C2 -->|event_hash| C3[Custody Event 3<br/>TRANSFORM_REDACT]
    C3 -->|event_hash| C4[Custody Event 4<br/>EXPORT_PDF]
    C4 -->|terminal_hash| V[Verifier:<br/>replay &amp; recompute<br/>each hash]

    style P1 fill:#0ea5e9,color:#fff
    style C4 fill:#10b981,color:#fff
    style V fill:#f59e0b,color:#fff

Each event in custody.py writes a JSONL record whose event_hash is SHA-256(prev_hash || canonical_event_json). Tampering with any record in the chain invalidates every record after it. The chain is append-only and replayable by anyone who has the file.
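
A minimal sketch of the hash-chain idea follows. It is an illustration, not the contents of custody.py: the canonical form (sorted keys, compact separators), the all-zero genesis value, the hex-string concatenation, and the record layout are assumptions. Only the `prev_hash` and `event_hash` names and the SHA-256(prev_hash || canonical_event_json) rule come from the description above.

```python
import hashlib
import json
from typing import Optional

GENESIS = "0" * 64  # assumed prev_hash for the first record


def canonical_json(event: dict) -> str:
    """Serialize with sorted keys and no extra whitespace (assumed canonical form)."""
    return json.dumps(event, sort_keys=True, separators=(",", ":"))


def event_digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over prev_hash concatenated with the canonical event JSON."""
    return hashlib.sha256((prev_hash + canonical_json(event)).encode("utf-8")).hexdigest()


def append_event(chain: list, event: dict) -> dict:
    """Append an event, linking it to the previous record's event_hash."""
    prev_hash = chain[-1]["event_hash"] if chain else GENESIS
    record = {"prev_hash": prev_hash, "event": event,
              "event_hash": event_digest(prev_hash, event)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> Optional[int]:
    """Replay the chain; return the index of the first invalid record, or None."""
    prev_hash = GENESIS
    for i, record in enumerate(chain):
        expected = event_digest(prev_hash, record["event"])
        if record["prev_hash"] != prev_hash or record["event_hash"] != expected:
            return i
        prev_hash = record["event_hash"]
    return None


chain = []
for kind in ("INGEST", "OCR_COMPLETE", "TRANSFORM_REDACT", "EXPORT_PDF"):
    append_event(chain, {"type": kind, "page": 1})

intact = verify_chain(chain)       # None: every hash recomputes correctly
chain[1]["event"]["page"] = 2      # tamper with the second record
first_bad = verify_chain(chain)    # 1: the chain breaks at the altered record
```

Because each record's hash feeds the next, a verifier needs nothing but the file itself to detect where a chain stops being trustworthy.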

EDCOCR vs the Alternatives

	EDCOCR	Generative-AI OCR	Open-Source CTC Toolkits
Recognition model	CTC (PaddleOCR 2.9.1)	LLM-decoded	CTC, varies
Hallucination risk	None by design	Documented and material	None by design
Audit trail	Hash-chained JSONL custody	None standard	None standard
Crash resume	Page-level deterministic	Job-level at best	Manual
Air-gap deployable	Yes — pre-baked models	No (calls external APIs)	Yes, but you build it
Distributed at scale	Helm chart + KEDA + Celery	Hosted only	DIY
Forensic preservation	Image-only fallback embedded	Returns "error"	Returns "error"
Per-tenant isolation	Built in	Hosted account boundary	DIY
License	Apache 2.0	Proprietary	Mixed
Operational maturity	9,000+ unit tests, 53 Grafana panels	Black box	Varies

The forensic-vs-AI boundary is enforced in code: see docs/architecture/forensic-ai-boundary-contract.md and scripts/validate_feature_boundary.py.

Core Capabilities

Recognition

45 languages in a tiered registry (34 core + 11 extended)
CTC-only recognition — no hallucinations possible by design
Adaptive DPI escalation — auto-retry low-confidence pages at 450/600 DPI
Image preprocessing — OpenCV-based deskew, denoise, binarize for degraded scans
Smart engine selection — quality-based routing between Tesseract and PaddleOCR
CJK vertical text — reading-order analysis for vertical Chinese, Japanese, Korean
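
The adaptive DPI escalation item can be sketched as a retry ladder. This is a hypothetical illustration rather than the logic in dpi_escalation.py: the 0.80 confidence threshold, the `recognize` callback signature, and the "keep the best result" policy are assumptions; only the 300 DPI default and the 450/600 retry steps come from this document.

```python
from typing import Callable, Tuple

DPI_LADDER = (300, 450, 600)  # 300 default, then the 450/600 escalation steps


def ocr_with_escalation(
    recognize: Callable[[int], Tuple[str, float]],
    min_confidence: float = 0.80,  # hypothetical threshold
) -> Tuple[str, float, int]:
    """Retry at higher DPI until confidence clears the threshold; keep the best result."""
    best = ("", -1.0, DPI_LADDER[0])
    for dpi in DPI_LADDER:
        text, confidence = recognize(dpi)
        if confidence > best[1]:
            best = (text, confidence, dpi)
        if confidence >= min_confidence:
            break  # good enough, no need to render at a higher DPI
    return best


def fake_recognize(dpi: int) -> Tuple[str, float]:
    """Stand-in for a real OCR call: confidence improves as DPI rises."""
    return f"text@{dpi}", {300: 0.55, 450: 0.72, 600: 0.91}[dpi]


text, confidence, dpi = ocr_with_escalation(fake_recognize)
```

Pages that already clear the threshold at 300 DPI exit after one pass, so the extra rendering cost is paid only by low-confidence pages.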

Pipeline

6 concurrent stages, 31 threads — async producer-consumer model
Page-level crash resume — deterministic recovery from any failure
300 DPI default — configurable per-job
PDF + 18 image formats — TIFF, JPEG, PNG, BMP, GIF, WebP, JP2, etc.
Video ingestion — sample frames at configurable intervals
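
To make the producer-consumer model concrete, here is a reduced sketch with two stages instead of six. The stage functions, queue layout, and shutdown protocol are hypothetical; only the 8 extractor and 12 OCR worker counts are taken from the pipeline diagram earlier in this README.

```python
import queue
import threading

SENTINEL = None  # tells a worker thread to exit


def run_stage(fn, inbox, outbox, workers):
    """Start `workers` threads that apply `fn` to inbox items and emit to outbox."""
    def loop():
        while True:
            item = inbox.get()
            if item is SENTINEL:
                break
            outbox.put(fn(item))
    threads = [threading.Thread(target=loop) for _ in range(workers)]
    for t in threads:
        t.start()
    return threads


def stop_stage(threads, inbox):
    """Queue one sentinel per worker behind the pending work, then wait for exit."""
    for _ in threads:
        inbox.put(SENTINEL)
    for t in threads:
        t.join()


pages, images, results = queue.Queue(), queue.Queue(), queue.Queue()

extractors = run_stage(lambda page: (page, f"image-{page}"), pages, images, workers=8)
ocr_workers = run_stage(lambda item: (item[0], f"text-{item[0]}"), images, results, workers=12)

for page in range(1, 21):
    pages.put(page)

stop_stage(extractors, pages)    # stage 1 drains fully before stage 2 is stopped
stop_stage(ocr_workers, images)

# Workers finish out of order, so reassemble by page number.
collected = []
while not results.empty():
    collected.append(results.get())
ordered = [text for _, text in sorted(collected)]
```

The final sort stands in for the assembler's job: pages complete in arbitrary order and have to be put back in sequence before output.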

Output

Searchable PDFs with embedded text layer
Plain text extraction (UTF-8)
Document Intelligence sidecars — layout regions, table HTML/CSV (opt-in)
Structured extraction — dates, amounts, names, addresses (UIE + regex)
Named Entity Recognition — case numbers, Bates numbers, PII/PHI with spatial bboxes
Document classification — text rules + layout ensemble
Handwriting detection — confidence + geometry heuristics
Signature detection — experimental, advisory-only
Barcode/QR extraction + OMR checkbox detection
Per-span language sidecar with BCP-47 codes and confidence

Deployment

Docker — single-host with GPU passthrough
Kubernetes — production Helm chart with KEDA autoscaling
High availability — Redis Sentinel, PostgreSQL backup CronJob, RabbitMQ quorum queues
Air-gapped — pre-baked language models in Docker images, bundle/deploy scripts
CPU or GPU — ONNX Runtime + OpenVINO for 4-7x CPU speedup
Multi-GPU — per-GPU queue affinity with round-robin dispatch
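
Round-robin dispatch across per-GPU queues can be sketched in a few lines. The queue naming (`ocr.gpu0`, `ocr.gpu1`, ...) and the `make_dispatcher` helper are hypothetical and are not taken from the EDCOCR codebase.

```python
from itertools import cycle


def make_dispatcher(gpu_ids):
    """Return a function that assigns each task to the next GPU queue in rotation."""
    queue_names = cycle([f"ocr.gpu{gpu_id}" for gpu_id in gpu_ids])  # hypothetical names
    return lambda task: (next(queue_names), task)


dispatch = make_dispatcher([0, 1, 2])
assignments = [dispatch(f"page-{n}") for n in range(1, 6)]
```

With three GPUs, the fourth task wraps back to the first queue, so each GPU's dedicated workers see an even share of pages.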

Integration

REST API with API-key auth, rate limiting, Pydantic validation, 413 on oversize
SSE streaming + WebSocket progress for real-time job updates
HMAC-SHA256 signed webhooks with retry + SSRF protection
Python SDK (pip install edcocr-sdk)
TypeScript SDK (npm install @edcocr/sdk)
Object storage — S3, MinIO, Azure Blob, GCS with presigned URLs
Event-driven — Kafka, SNS/SQS hooks
Distributed tracing — OpenTelemetry
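
On the receiving side, an HMAC-SHA256 signed webhook is typically checked as sketched below. The details here are assumptions: this README does not state how EDCOCR encodes the signature or which header carries it, so the sketch assumes a hex digest computed over the raw request body with a shared secret.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 digest over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_signature: str) -> bool:
    """Compare in constant time so timing does not leak how much of the digest matched."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_signature)


secret = b"example-shared-secret"                     # hypothetical shared secret
body = b'{"job_id": "123", "status": "completed"}'    # hypothetical payload
signature = sign_payload(secret, body)                # what the sender would attach

ok = verify_signature(secret, body, signature)
tampered = verify_signature(secret, body + b" ", signature)
```

Verify against the exact bytes received, before any JSON parsing or re-serialization, since even a whitespace change alters the digest.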

Observability

Prometheus metrics — custom ORM-backed collector, 7 metric families
Grafana dashboard — 53 panels covering throughput, queues, GPU, costs, SLA
Alert rules — 5 PrometheusRule alerts shipped in Helm
Hash-chained audit log — JSONL custody trail
Per-tenant cost tracking + SLA monitoring

Deployment Topologies

flowchart TB
    subgraph T1[T1 · Single GPU]
        D1[Docker Compose<br/>1 host, 1 GPU]
    end
    subgraph T2[T2 · Single CPU]
        D2[Docker Compose<br/>1 host, ONNX]
    end
    subgraph T3[T3 · Multi-GPU]
        D3[Per-GPU queues<br/>round-robin dispatch]
    end
    subgraph T4[T4 · Distributed]
        D4[Celery + RabbitMQ<br/>Multi-VPS workers]
    end
    subgraph T6[T6 · Kubernetes]
        D6[Helm + KEDA<br/>Sentinel + quorum queues]
    end
    subgraph T7[T7 · Air-Gapped]
        D7[Pre-baked images<br/>bundle/deploy scripts]
    end

    style T1 fill:#0ea5e9,color:#fff
    style T2 fill:#10b981,color:#fff
    style T3 fill:#f59e0b,color:#fff
    style T4 fill:#8b5cf6,color:#fff
    style T6 fill:#ef4444,color:#fff
    style T7 fill:#0ea5e9,color:#fff

See docs/DEPLOYMENT-DECISION-GUIDE.md for the full topology decision tree.

Quick Start

Docker (recommended)

git clone https://github.com/mattmre/EDCOCR-PUBLIC.git
cd EDCOCR-PUBLIC
cp .env.example .env                # set OCR_API_KEY before starting
docker compose up -d --build
docker logs -f ocr_gpu_processor    # follow the container's log output

# Drop PDFs in ./ocr_source/ — searchable PDFs land in ./ocr_output/EXPORT/PDF/

Kubernetes

helm install edcocr ./helm/ocr-local \
  -f helm/ocr-local/values-production.yaml \
  --set secrets.djangoSecretKey="$(openssl rand -hex 32)"

Python SDK

from edcocr_sdk import Client

client = Client(base_url="http://localhost:8000", api_key="...")
job = client.submit_job(file="invoice.pdf")
result = client.wait_for_completion(job.id)
print(result.text)

TypeScript SDK

import { Client } from "@edcocr/sdk";

const client = new Client({ baseUrl: "http://localhost:8000", apiKey: "..." });
const job = await client.submitJob({ file: "invoice.pdf" });
const result = await client.waitForCompletion(job.id);
console.log(result.text);

See INSTALL.md for the full installation guide and docs/02-QUICKSTART-5-MINUTE-SUCCESS.md for a guided walkthrough.

Use Cases

EDCOCR is built for environments where OCR quality is non-negotiable and document volume is high.

Electronic discovery (eDiscovery). Searchable production sets with Bates stamping and chain of custody. Privilege detection during structured extraction.
Digital forensic investigation. Tamper-evident audit trails with replayable hash chains. Image-only fallback ensures no evidence is discarded.
Government records digitization. Air-gapped deployment with pre-baked language models. FOIA backlog reduction with multi-language support.
Healthcare records. Per-tenant isolation with PII/PHI spatial extraction. HIPAA-adjacent workflow support.
Insurance claims processing. High-volume batch processing with handwriting detection. Structured extraction for dates, amounts, addresses.
Compliance archiving. Long-term retention with deterministic re-OCR. SOC 2 / HIPAA / FedRAMP readiness documentation.

See presentation/use-cases.html for the visual treatment with topology recommendations, or docs/04-USE-CASES.md for the detailed markdown version.

Performance

Reference numbers from a single host with one NVIDIA A6000 (48 GB VRAM):

Workload	Throughput	Notes
Clean PDF (text-heavy)	~120 pages/min	12 GPU workers, 300 DPI
Mixed (text + tables + figures)	~70 pages/min	Same hardware
Scanned with degradation	~40 pages/min	After preprocessing
Video frame extraction	1 fps default	Configurable

CPU-only deployments with ONNX Runtime achieve roughly 25-30% of GPU throughput at much lower per-page cost. See docs/cpu-vs-gpu-analysis.md for the full benchmark table and TCO analysis.
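
As a rough worked example, the figures above can be turned into capacity estimates. The sketch assumes the 25-30% ratio applies uniformly to every workload row, which this README does not state, and the one-million-page job is a hypothetical size; treat the results as back-of-envelope numbers, not benchmarks.

```python
GPU_PAGES_PER_MIN = {"clean": 120, "mixed": 70, "degraded": 40}  # from the table above
CPU_FRACTION = (0.25, 0.30)  # CPU-only share of GPU throughput quoted above


def cpu_range(gpu_rate):
    """Return the (low, high) CPU-only estimate in pages/min, rounded to 0.1."""
    return tuple(round(gpu_rate * fraction, 1) for fraction in CPU_FRACTION)


def hours_for(pages, pages_per_min):
    """Wall-clock hours to process `pages` at a sustained rate."""
    return pages / pages_per_min / 60


cpu_estimates = {name: cpu_range(rate) for name, rate in GPU_PAGES_PER_MIN.items()}

# Hypothetical one-million-page job of clean PDFs on a single host.
gpu_hours = hours_for(1_000_000, GPU_PAGES_PER_MIN["clean"])      # about 139 hours
cpu_hours = hours_for(1_000_000, cpu_estimates["clean"][0])       # about 556 hours
```

Under these assumptions a clean-PDF workload lands at roughly 30 to 36 pages/min on CPU, so the choice between CPU and GPU is mostly a trade between elapsed time and per-page cost.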

Supported Languages

Core tier (34, default, air-gapped): English, French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Danish, Finnish, Romanian, Polish, Czech, Hungarian, Turkish, Vietnamese, Russian, Ukrainian, Belarusian, Bulgarian, Simplified Chinese, Traditional Chinese, Japanese, Korean, Arabic, Persian, Urdu, Uyghur, Hindi, Tamil, Telugu, Kannada, Greek, Georgian.

Extended tier (+11, opt-in): Croatian, Slovak, Norwegian, Lithuanian, Latvian, Estonian, Serbian (Latin), Bengali, Marathi, Nepali, Thai.

Activate the extended tier with OCR_LANGUAGE_TIERS=core,extended.
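
For illustration, a setting like this could be parsed as follows. The `enabled_tiers` helper and its validation behavior are hypothetical and do not describe language_config.py; the variable name, the tier names, the tier sizes, and core being the default come from this section.

```python
import os

TIER_SIZES = {"core": 34, "extended": 11}  # language counts listed above


def enabled_tiers(env=os.environ):
    """Parse OCR_LANGUAGE_TIERS into a list of known tier names."""
    raw = env.get("OCR_LANGUAGE_TIERS", "core")  # core is the documented default tier
    tiers = [t.strip().lower() for t in raw.split(",") if t.strip()]
    unknown = [t for t in tiers if t not in TIER_SIZES]
    if unknown:
        raise ValueError(f"unknown language tier(s): {unknown}")
    return tiers


tiers = enabled_tiers({"OCR_LANGUAGE_TIERS": "core,extended"})
total_languages = sum(TIER_SIZES[t] for t in tiers)
default_tiers = enabled_tiers({})
```

With both tiers enabled the total comes to the 45 languages quoted at the top of this README.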

Documentation Map

Start Here

INSTALL.md — Install on Docker, Kubernetes, or bare metal
ARCHITECTURE.md — Top-level architecture with diagrams
docs/DEPLOYMENT-DECISION-GUIDE.md — Pick the right topology (decision tree)
docs/WHITE-PAPER.md — Technical white paper (Markdown source)
docs/02-QUICKSTART-5-MINUTE-SUCCESS.md — 5-minute walkthrough
docs/04-USE-CASES.md — When to use EDCOCR

Architecture & Design

docs/00-SYSTEM-BLUEPRINT.md — System architecture
docs/01-TECH-STACK-DNA.md — Technology stack
docs/03-INFORMATION-FLOWS.md — End-to-end data flow
docs/05-INTERACTIVE-WALKTHROUGH.md — Guided tour

Reference

docs/06-CONFIGURATION-REFERENCE.md — All env vars
docs/API-REFERENCE.md — REST API
docs/08-SDK-REFERENCE.md — Python + TypeScript SDKs
docs/07-TRANSFORMS-STAMPING.md — Bates stamping, redaction

Operations

docs/10-MONITORING-OPERATIONS.md — Prometheus + Grafana
docs/FAILOVER-RUNBOOK.md — HA failover procedures
docs/09-TROUBLESHOOTING.md — Common issues
docs/cpu-vs-gpu-analysis.md — Deployment sizing

Advanced

docs/11-ML-TRAINING-GUIDE.md — LayoutLMv3 fine-tuning
docs/benchmarking-methodology.md — Performance benchmarks
docs/security-audit-checklist.md — Security review

For Contributors

CONTRIBUTING.md — How to contribute
DEVELOPMENT.md — Development guide
SECURITY.md — Reporting security issues
CHANGELOG.md — Release history

Presentation Suite

presentation/executive-summary.html — Decision-maker briefing
presentation/technical-brief.html — Engineer-audience deep dive
presentation/white-paper.html — Rendered HTML white paper
presentation/use-cases.html — Worked scenarios
presentation/index.html — Marketing landing page
presentation/slides.html — Slide deck (keyboard nav)
presentation/architecture.html — Architecture walkthrough

Where EDCOCR Is Not a Fit

We are explicit about non-goals so nobody buys the wrong tool:

Pure document understanding without provenance. If you just want a chat-with-your-PDF demo, a generative LLM with built-in OCR will get you there faster.
Real-time consumer scanning. EDCOCR optimizes for sustained throughput on a queue, not millisecond latency on a phone.
Fixed-template form auto-fill. Form-field-aware tools that understand a specific tax form's structure will out-extract a generic OCR pipeline.

See presentation/use-cases.html#not-fit for the longer treatment.

Project Status

Version 4.1.0 — Production-ready public release.

EDCOCR has been deployed in high-volume environments processing six- to seven-digit page counts. The pipeline, distributed coordinator, REST API, SDKs, Helm chart, and observability stack are all considered stable. Translation and per-span language detection are feature-flagged and default to OFF.

See CHANGELOG.md for release history and docs/known-issues.md for current open issues.

License

Apache License 2.0. See LICENSE for the full text and NOTICE for third-party attributions.

EDCOCR ships pre-built integrations with several third-party OCR, NLP, and ML libraries. Each retains its original license; restrictive license families (e.g. NLLB's CC-BY-NC-4.0) are flagged and gated by tenant policy.

Community

EDCOCR is an open, community-driven project. The fastest path forward is more eyes, more deployments, and more contributions from people outside the original team.

Ways to participate:

File a public issue. Bugs, unexpected behavior, feature ideas, documentation gaps — open an issue from the Issues tab. Templates guide you through what to include.
Start a discussion. Open-ended questions, design ideas, "how are you running this in production?" — those belong in Discussions. It's the lowest-friction surface for community Q&A.
Send a pull request. See CONTRIBUTING.md for the contribution workflow, coding conventions, and testing expectations.
Report a security issue privately. Use GitHub Security Advisories — do not file a public issue. See SECURITY.md for the full disclosure policy.

Want to join the team as a regular contributor? Send a direct message to @mattmre on GitHub. There is no application form — just tell us what you want to work on and roughly how much time you have. See §10 "Joining the Team" in CONTRIBUTING.md for what to include.

Contributors

Every commit, issue, review, and Discussion thread makes the project better. Thank you.

Documentation · API Reference · Architecture · Changelog · Presentation Suite · Discussions

EDCOCR v4.1.0 · Apache License 2.0 · Forensic-grade OCR for the day someone asks "prove it"
