miraculix95/pdf-parser-shootout

pdf-parser-shootout

Three modern document parsers, one French magazine cover. Two of them broke.

I'm building a RAG pipeline. The standard "dump PDFs into a parser, chunk the markdown, embed, store" advice works fine on scientific papers and tax forms, which are the only PDF types most parser benchmarks test.

So I tried the worst case: a real European magazine. Photo-heavy, multi-column, vertically rotated text, hand-lettered headlines, mixed languages. écoute, ZEIT Sprachen, issue 1/2025. Pages 1–10.

This repo has the outputs. Run the same test yourself in 5 minutes.

| Parser | Cover-page text recall | Reading order on TOC | Vertical rotated text | Latency (10 pp) | Cost (10 pp) |
|---|---|---|---|---|---|
| LlamaParse Agentic | ✅ complete | ✅ correct | ✅ extracted | ~30 s | $0.13 |
| Mistral OCR | ⚠ most | ✅ correct | ❌ missed | ~5 s | $0.01 |
| Docling (cpu-latest) | ❌ lost | ❌ scrambled | ❌ missed | ~216 s | $0 (CPU-bound) |

The cover that broke them

Cover of écoute 1/2025: Eiffel Tower in snow with text overlay

Look at what the parser has to do here:

  • FRANZÖSISCH runs vertically along the left edge, rotated 90°.
  • A second vertical strip beside it: EINFACH BESSER FRANZÖSISCH.
  • Brand mark écoute in lowercase italics next to a small icon.
  • Red circular NEU badge with multi-line German body inside.
  • Three feature pull-quotes in different sizes, mixed German + French.
  • Hero typography PARIS (sans-serif giant) followed by en hiver (handwritten cursive).
  • Pricing line in 6 pt with multi-country abbreviations.
  • Eiffel Tower photo dominating the background.

LlamaParse Agentic got everything: vertical labels, hero hierarchy as # H1, semantic **bold** on feature pull-quotes, even the 6 pt pricing footer with bullet separators.

Mistral OCR got most of it. It lost the vertical EINFACH BESSER FRANZÖSISCH label, replaced the handwritten en hiver with junk OCR'd from the photo background, and collapsed all bold/headline distinctions to plain text. For downstream chunking, that's lost emphasis information.

Docling's cover output is one sentence, the subtitle. Nothing else from page 1 made it through; Docling jumps straight to page 2.
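If you want a number instead of eyeballing the cover, a rough phrase-recall check is enough. The sketch below is not how the table above was produced; it is one simple way to approximate "text recall", assuming you list the phrases you expect by hand. The function names and the toy input are made up for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so layout differences don't count as misses."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())


def phrase_recall(expected: list[str], parser_output: str) -> tuple[float, list[str]]:
    """Return (fraction of expected phrases found, list of missing phrases)."""
    haystack = normalize(parser_output)
    missing = [p for p in expected if normalize(p) not in haystack]
    found = len(expected) - len(missing)
    recall = found / len(expected) if expected else 1.0
    return recall, missing


# Phrases visible on the cover, taken from the list above.
COVER_PHRASES = [
    "FRANZÖSISCH",
    "EINFACH BESSER FRANZÖSISCH",
    "écoute",
    "NEU",
    "PARIS",
    "en hiver",
]

# Toy input for illustration only. This is NOT real parser output.
toy_output = "# PARIS\n\nécoute\n\nNEU\n\nFRANZÖSISCH"

recall, missing = phrase_recall(COVER_PHRASES, toy_output)
print(f"recall={recall:.2f}, missing={missing}")
```

Plain substring matching is crude: a short phrase like NEU also matches inside longer words, so treat the score as a smoke test rather than a benchmark.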

The TOC

Page 4 is a multi-column table of contents — 3 photo tiles up top, 4 sections below. See renders/page-04-toc.png for the layout.

LlamaParse handles it cleanly: photo-tile numbers as # H1, section headers as ## H2, every entry as bold page-number title plus plain-text description.

Mistral OCR gets the content right but loses bold; some M markers OCR'd as H because the font was small.

Docling reorders things: 28 La montagne sans neige lands before TITELTHEMA, even though in the original it sits as a list item under COMPRENDRE LA FRANCE. The headline Pourquoi parle-t-on de la "France moche" ? is split mid-sentence at the line wrap. Every entry is ## H2, flattened to one hierarchy level.
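Flattened hierarchy is easy to detect mechanically. Here is a minimal sketch that counts heading levels in a parser's markdown output and flags the case where everything landed on a single level. The `min_headings` threshold is an arbitrary assumption, and headings inside fenced code would be counted too.

```python
import re
from collections import Counter

# ATX-style markdown headings: 1 to 6 '#' characters, then a space and text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+\S", re.MULTILINE)


def heading_levels(markdown: str) -> Counter:
    """Count how many headings appear at each level (1 = '#', 2 = '##', ...)."""
    return Counter(len(m.group(1)) for m in HEADING_RE.finditer(markdown))


def looks_flattened(markdown: str, min_headings: int = 5) -> bool:
    """True if there are several headings but all of them share one level."""
    levels = heading_levels(markdown)
    total = sum(levels.values())
    return total >= min_headings and len(levels) == 1
```

Run it over each file in outputs/ and compare the level counts across parsers; a single-level result on a page that visibly has sections and entries is the symptom described above.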

Why this matters

Parser output is the input to your chunker. If the parser drops headings, conflates hierarchy levels, or scrambles reading order, the chunker can't recover it. Bad chunks → bad embeddings → bad retrieval → "your RAG doesn't work" — actual culprit two stages upstream.
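To make that concrete, here is a minimal sketch of a heading-aware chunker, assuming the parser emits ATX-style markdown headings. It is not the chunker from any particular framework. Each chunk carries the path of headings above it, which is exactly the context that disappears when a parser flattens the hierarchy. The two inputs are toy strings built from the TOC labels mentioned earlier, not real parser output.

```python
import re

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)\s*$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split markdown into chunks, each tagged with its heading path."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": " > ".join(t for _, t in path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Drop headings at the same level or deeper before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, m.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


# Toy inputs for illustration only.
nested = "## COMPRENDRE LA FRANCE\n### La montagne sans neige\nEntry text.\n"
flat = "## COMPRENDRE LA FRANCE\n## La montagne sans neige\nEntry text.\n"

print(chunk_by_heading(nested)[0]["path"])
print(chunk_by_heading(flat)[0]["path"])
```

With the nested input the chunk keeps its section in the path; with the flattened input the section name is gone, and no amount of downstream cleverness brings it back.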

For non-editorial PDFs (academic papers, contracts, scans) Mistral OCR is the price-performance sweet spot — fastest, cheapest, EU-hosted. For magazines, brochures, brand books, or anything a designer touched, LlamaParse Agentic is the only one that survives. Docling on this kind of material is currently unusable; on linear academic content it might still be the right answer, but that's a different test.

A practical pipeline routes by document profile:

PDF in
   ↓
[is it editorial / multi-column / photo-heavy?]
   ↓                            ↓
LlamaParse Agentic         Mistral OCR
($0.0125 / page)           ($0.001 / page)
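A minimal sketch of that routing step, using the per-page prices from the diagram. The `DocProfile` fields and the parser labels are hypothetical names, and how you detect "editorial" (manual tagging, a layout classifier, a page-image heuristic) is left open here.

```python
from dataclasses import dataclass

# Per-page prices as quoted in the diagram above (USD).
PRICE_PER_PAGE = {"llamaparse-agentic": 0.0125, "mistral-ocr": 0.001}


@dataclass
class DocProfile:
    """Hypothetical document profile; fill the flags however suits your pipeline."""
    pages: int
    editorial: bool = False
    multi_column: bool = False
    photo_heavy: bool = False


def route(profile: DocProfile) -> str:
    """Send designer-touched layouts to LlamaParse Agentic, everything else to Mistral OCR."""
    if profile.editorial or profile.multi_column or profile.photo_heavy:
        return "llamaparse-agentic"
    return "mistral-ocr"


def estimated_cost(profile: DocProfile) -> float:
    """Estimated parsing cost in USD for the routed parser."""
    return profile.pages * PRICE_PER_PAGE[route(profile)]


magazine = DocProfile(pages=10, editorial=True, photo_heavy=True)
contract = DocProfile(pages=10)
print(route(magazine), estimated_cost(magazine))
print(route(contract), estimated_cost(contract))
```

Treating any single flag as enough to pick the expensive parser is a deliberately conservative choice: a wrongly routed contract costs a cent or so per page, a wrongly routed magazine costs you the cover.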

Reproduce in 5 minutes

The source PDF (écoute 1/2025) is copyrighted, so it's not in this repo. Drop in any 10-page PDF you have.

# 1. Mistral OCR (CLI: pip install mistral-ocr-cli or whatever wraps the API)
mistral-ocr your.pdf -o outputs/mistral-ocr/full.md --delete

# 2. LlamaParse Agentic — note the EU base_url if you're on EU
pip install llama-cloud-services
python3 scripts/run_llamaparse.py your.pdf

# 3. Docling-Inference (Docker)
docker run -d --name docling -p 8090:8080 \
  -v hf_cache:/root/.cache/huggingface \
  -v ocr_cache:/root/.EasyOCR \
  ghcr.io/aidotse/docling-inference:cpu-latest

scripts/run_docling.sh your.pdf

Each script is in scripts/. The cost of running all three on 10 pages is about $0.14 plus 4 minutes of CPU time on the Docling container.

Repo layout

outputs/
  mistral-ocr/full.md      ← Mistral OCR output for 10 pages
  llamaparse/full.md       ← LlamaParse Agentic output, with <!-- PAGE N --> markers
  docling/full.md          ← Docling cpu-latest output

per-page/
  llamaparse-pNN.md        ← Per-page splits of LlamaParse output

renders/
  cover.png                ← Page 1 (the cover)
  page-04-toc.png          ← Page 4 (the multi-column TOC)

scripts/
  run_llamaparse.py        ← Reproducibility
  run_docling.sh
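The per-page files can be regenerated from the full LlamaParse output by splitting on the page markers. This is a sketch, not the script used for this repo. It assumes each `<!-- PAGE N -->` marker precedes its page's content and that NN in the filename is a zero-padded page number.

```python
import re

# Matches markers like <!-- PAGE 4 --> and captures the page number.
PAGE_MARKER_RE = re.compile(r"<!--\s*PAGE\s+(\d+)\s*-->")


def split_pages(full_markdown: str) -> dict[int, str]:
    """Map page number to that page's markdown, dropping anything before the first marker."""
    # With one capturing group, re.split returns:
    # [preamble, "1", text_of_page_1, "2", text_of_page_2, ...]
    parts = PAGE_MARKER_RE.split(full_markdown)
    return {int(num): text.strip() for num, text in zip(parts[1::2], parts[2::2])}


def page_filename(page: int) -> str:
    """Filename in the per-page/ convention, e.g. page 4 becomes llamaparse-p04.md."""
    return f"llamaparse-p{page:02d}.md"
```

To write the files, read outputs/llamaparse/full.md, call `split_pages` on its contents, and save each value under per-page/ using `page_filename`.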

One more gotcha worth its own line

LlamaParse has separate US and EU endpoints, and your API key is region-bound. If you signed up at cloud.eu.llamaindex.ai, you must pass base_url="https://api.cloud.eu.llamaindex.ai" to the client. The default points at the US endpoint, which rejects an EU key with a 401 and a misleading "Invalid API key" message instead of telling you the region is wrong. I burned 10 minutes finding this. Hopefully you won't.
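One way to avoid the trap is to make the region explicit in your own code. The sketch below builds the client keyword arguments from a region string. The helper names are hypothetical, and it assumes the `LlamaParse` class from `llama_cloud_services` accepts `api_key` and `base_url` as described above and that the key is in the `LLAMA_CLOUD_API_KEY` environment variable.

```python
import os

# EU endpoint as given in the paragraph above.
EU_BASE_URL = "https://api.cloud.eu.llamaindex.ai"


def llamaparse_kwargs(api_key: str, region: str) -> dict:
    """Build client kwargs; only EU keys need an explicit base_url."""
    region = region.lower()
    if region not in {"us", "eu"}:
        raise ValueError(f"unknown region {region!r}, expected 'us' or 'eu'")
    kwargs = {"api_key": api_key}
    if region == "eu":
        kwargs["base_url"] = EU_BASE_URL
    return kwargs


def make_parser(region: str):
    """Create a LlamaParse client for the given region (needs llama-cloud-services)."""
    # Imported lazily so llamaparse_kwargs works without the package installed.
    from llama_cloud_services import LlamaParse

    return LlamaParse(**llamaparse_kwargs(os.environ["LLAMA_CLOUD_API_KEY"], region))
```

For US keys the helper leaves `base_url` unset and relies on the library default, so nothing here hardcodes the US endpoint.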

Source attribution

Test material: écoute magazine, issue 1/2025, published by ZEIT Sprachen GmbH, Munich. Used here for technical demonstration of document-parsing tools (transformative use). Source PDF not redistributed.

License

MIT for the scripts and the comparison. The extracted text outputs in outputs/ and per-page/ are derivative of the source magazine and provided here as small excerpts for technical comparison only.
