Three modern document parsers, one French magazine cover. Two of them broke.
I'm building a RAG pipeline. The standard "dump PDFs into a parser, chunk the markdown, embed, store" advice works fine on scientific papers and tax forms — those being the only PDF types most parser benchmarks test on.
So I tried the worst case: a real European magazine. Photo-heavy, multi-column, vertically rotated text, hand-lettered headlines, mixed languages. écoute, ZEIT Sprachen, issue 1/2025. Pages 1–10.
This repo has the outputs. Run the same test yourself in 5 minutes.
| Parser | Cover-page text recall | Reading order on TOC | Vertical rotated text | Latency (10 pp) | Cost (10 pp) |
|---|---|---|---|---|---|
| LlamaParse Agentic | ✅ complete | ✅ correct | ✅ extracted | ~30 s | $0.13 |
| Mistral OCR | ⚠ most | ✅ correct | ❌ missed | ~5 s | $0.01 |
| Docling (cpu-latest) | ❌ lost | ❌ scrambled | ❌ missed | ~216 s | $0 (CPU-bound) |
Look at what the parser has to do here:
- FRANZÖSISCH runs vertically along the left edge, rotated 90°.
- A second vertical strip beside it: EINFACH BESSER FRANZÖSISCH.
- Brand mark écoute in lowercase italics next to a small icon.
- Red circular NEU badge with multi-line German body inside.
- Three feature pull-quotes in different sizes, mixed German + French.
- Hero typography PARIS (sans-serif giant) followed by en hiver (handwritten cursive).
- Pricing line in 6 pt with multi-country abbreviations.
- Eiffel Tower photo dominating the background.
LlamaParse → outputs/llamaparse/full.md
Got everything. Vertical labels, hero hierarchy as # H1, semantic **bold** on feature pull-quotes, even the 6 pt pricing footer with bullet separators.
Mistral OCR → outputs/mistral-ocr/full.md
Got most of it. Lost the vertical EINFACH BESSER FRANZÖSISCH label. Replaced the handwritten en hiver with junk OCR'd from the photo background. Collapsed all bold/headline distinctions to plain text — for downstream chunking, that's lost emphasis information.
Docling → outputs/docling/full.md
Cover output is one sentence — the subtitle. Nothing else from page 1 made it through. Docling jumps straight to page 2.
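A rough way to put a number on cover-page recall is to check which of the strings listed above show up in each parser's output. This is a minimal sketch, not how the table was produced: the matching is a plain case-sensitive substring test, it searches the whole `full.md` rather than page 1 only, and the shorter FRANZÖSISCH label also counts as found whenever the longer strip is present.

```python
from pathlib import Path

# Strings the cover description above says are printed on page 1.
EXPECTED_COVER_STRINGS = [
    "FRANZÖSISCH",
    "EINFACH BESSER FRANZÖSISCH",
    "écoute",
    "NEU",
    "PARIS",
    "en hiver",
]


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wraps do not cause false misses."""
    return " ".join(text.split())


def cover_recall(markdown: str, expected=EXPECTED_COVER_STRINGS):
    """Return (recall, missing) for the expected strings in one parser output."""
    haystack = normalize(markdown)
    missing = [s for s in expected if normalize(s) not in haystack]
    recall = (len(expected) - len(missing)) / len(expected)
    return recall, missing


if __name__ == "__main__":
    # Paths follow the repo layout; parsers without an output file are skipped.
    for name in ("llamaparse", "mistral-ocr", "docling"):
        path = Path("outputs") / name / "full.md"
        if path.exists():
            recall, missing = cover_recall(path.read_text(encoding="utf-8"))
            print(f"{name}: recall={recall:.2f} missing={missing}")
```

Swap in the strings from your own cover if you test a different PDF.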
Page 4 is a multi-column table of contents — 3 photo tiles up top, 4 sections below. See renders/page-04-toc.png for the layout.
LlamaParse handles it cleanly: photo-tile numbers as # H1, section headers as ## H2, every entry as bold page-number title plus plain-text description.
Mistral OCR gets the content right but loses bold; some M markers OCR'd as H because the font was small.
Docling reorders things: 28 La montagne sans neige lands before TITELTHEMA, even though in the original it sits as a list item under COMPRENDRE LA FRANCE. The headline Pourquoi parle-t-on de la "France moche" ? is split mid-sentence at the line wrap. Every entry is ## H2, flattened to one hierarchy level.
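The "flattened to one hierarchy level" problem is easy to detect mechanically. The sketch below counts ATX headings per level in a markdown string. The two sample strings are made-up stand-ins that reuse titles from the TOC, not the actual parser outputs, and the regex will also count `#` comment lines inside fenced code, which is fine for magazine pages but worth knowing.

```python
import re
from collections import Counter

# One to six '#' at line start, then spaces or tabs, then some title text.
HEADING_RE = re.compile(r"^(#{1,6})[ \t]+\S", re.MULTILINE)


def heading_levels(markdown: str) -> Counter:
    """Count headings per level (1 for '#', 2 for '##', and so on)."""
    return Counter(len(m.group(1)) for m in HEADING_RE.finditer(markdown))


def is_flattened(markdown: str) -> bool:
    """True when the output has headings, but all of them at a single level."""
    return len(heading_levels(markdown)) == 1


# Illustrative samples only (hypothetical layout, titles borrowed from the TOC).
structured = (
    "# TITELTHEMA\n"
    "## COMPRENDRE LA FRANCE\n"
    "**28 La montagne sans neige**\n"
)
flattened = (
    "## 28 La montagne sans neige\n"
    "## TITELTHEMA\n"
    "## COMPRENDRE LA FRANCE\n"
)

if __name__ == "__main__":
    print(heading_levels(structured), is_flattened(structured))
    print(heading_levels(flattened), is_flattened(flattened))
```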
Parser output is the input to your chunker. If the parser drops headings, conflates hierarchy levels, or scrambles reading order, the chunker can't recover it. Bad chunks → bad embeddings → bad retrieval → "your RAG doesn't work" — actual culprit two stages upstream.
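To make the dependency concrete, here is a minimal heading-aware chunker, assuming the parser emits ATX-style markdown headings. It is a sketch, not the chunker from any particular framework, and `Chunk` and `chunk_by_headings` are names made up for this example. Each chunk carries the chain of headings above it, so that chain is only as good as the hierarchy the parser preserved.

```python
import re
from dataclasses import dataclass

HEADING_RE = re.compile(r"^(#{1,6})[ \t]+(.+?)\s*$")


@dataclass
class Chunk:
    heading_path: tuple  # titles of all headings above this text, outermost first
    text: str


def chunk_by_headings(markdown: str) -> list:
    """Split markdown at headings and tag each chunk with its heading ancestry."""
    chunks = []
    stack = []  # (level, title) pairs for the headings currently open
    body = []   # text lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(tuple(title for _, title in stack), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening this one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Feed it `"# A\nintro\n## B\ntext"` and the second chunk knows it lives under A and B. Feed it the same content with every heading at `##` and each chunk only knows its own title, which is exactly the context a retriever would have needed.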
For non-editorial PDFs (academic papers, contracts, scans) Mistral OCR is the price-performance sweet spot — fastest, cheapest, EU-hosted. For magazines, brochures, brand books, or anything a designer touched, LlamaParse Agentic is the only one that survives. Docling on this kind of material is currently unusable; on linear academic content it might still be the right answer, but that's a different test.
A practical pipeline routes by document profile:
PDF in
   ↓
[is it editorial / multi-column / photo-heavy?]
   ↓ yes                  ↓ no
LlamaParse Agentic     Mistral OCR
($0.0125 / page)       ($0.001 / page)
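The routing decision can be sketched in a few lines of Python. Everything here is illustrative: `DocumentProfile`, `choose_parser`, and `estimated_cost` are hypothetical names, the three flags have to be set by you (or by whatever classifier you trust), and the prices are the per-page figures from the diagram, which may have changed since.

```python
from dataclasses import dataclass

# Per-page prices from the diagram above; treat them as a snapshot.
PRICE_PER_PAGE = {"llamaparse-agentic": 0.0125, "mistral-ocr": 0.001}


@dataclass
class DocumentProfile:
    pages: int
    editorial: bool = False
    multi_column: bool = False
    photo_heavy: bool = False


def choose_parser(profile: DocumentProfile) -> str:
    """Send designer-made layouts to LlamaParse Agentic, the rest to Mistral OCR."""
    if profile.editorial or profile.multi_column or profile.photo_heavy:
        return "llamaparse-agentic"
    return "mistral-ocr"


def estimated_cost(profile: DocumentProfile) -> float:
    """Estimated parsing cost in USD for the whole document."""
    return round(PRICE_PER_PAGE[choose_parser(profile)] * profile.pages, 4)


if __name__ == "__main__":
    magazine = DocumentProfile(pages=10, editorial=True, photo_heavy=True)
    contract = DocumentProfile(pages=10)
    print(choose_parser(magazine), estimated_cost(magazine))
    print(choose_parser(contract), estimated_cost(contract))
```

The cheap parser is the default on purpose: a document only pays the higher rate when one of the flags says it needs it.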
The source PDF (écoute 1/2025) is copyrighted, so it's not in this repo. Drop in any 10-page PDF you have.
# 1. Mistral OCR (CLI: pip install mistral-ocr-cli or whatever wraps the API)
mistral-ocr your.pdf -o outputs/mistral-ocr/full.md --delete
# 2. LlamaParse Agentic — note the EU base_url if you're on EU
pip install llama-cloud-services
python3 scripts/run_llamaparse.py your.pdf
# 3. Docling-Inference (Docker)
docker run -d --name docling -p 8090:8080 \
-v hf_cache:/root/.cache/huggingface \
-v ocr_cache:/root/.EasyOCR \
ghcr.io/aidotse/docling-inference:cpu-latest
scripts/run_docling.sh your.pdf

Each script is in scripts/. The cost of running all three on 10 pages is about $0.14 plus 4 minutes of CPU time on the Docling container.
outputs/
  mistral-ocr/full.md   ← Mistral OCR output for 10 pages
  llamaparse/full.md    ← LlamaParse Agentic output, with <!-- PAGE N --> markers
  docling/full.md       ← Docling cpu-latest output
per-page/
  llamaparse-pNN.md     ← Per-page splits of LlamaParse output
renders/
  cover.png             ← Page 1 (the cover)
  page-04-toc.png       ← Page 4 (the multi-column TOC)
scripts/
  run_llamaparse.py     ← Reproducibility
  run_docling.sh
LlamaParse has separate US and EU endpoints, and your API key is region-bound. If you signed up at cloud.eu.llamaindex.ai, you must pass base_url="https://api.cloud.eu.llamaindex.ai" to the client. The default points at the US endpoint, which rejects an EU key with a 401 and a misleading "Invalid API key" message, with no hint that the region is the problem. Burned 10 minutes finding this. Hopefully you don't.
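One way to avoid the trap is to make the region an explicit setting instead of something you remember. The helper below is a hedged sketch, not the contents of scripts/run_llamaparse.py: `llamaparse_client_kwargs` and the `LLAMA_CLOUD_REGION` variable are made up for this example, and it assumes the client accepts `api_key` and `base_url` keyword arguments as described above. For the US region it leaves `base_url` out so the library default applies.

```python
import os

EU_BASE_URL = "https://api.cloud.eu.llamaindex.ai"  # from the note above


def llamaparse_client_kwargs(region: str, api_key: str) -> dict:
    """Build client keyword arguments for a region-bound API key."""
    region = region.strip().lower()
    if region not in {"us", "eu"}:
        raise ValueError(f"unknown region {region!r}, expected 'us' or 'eu'")
    kwargs = {"api_key": api_key}
    if region == "eu":
        # EU keys only work against the EU endpoint.
        kwargs["base_url"] = EU_BASE_URL
    return kwargs


if __name__ == "__main__":
    # LLAMA_CLOUD_REGION is a hypothetical variable name used only in this sketch.
    kwargs = llamaparse_client_kwargs(
        os.environ.get("LLAMA_CLOUD_REGION", "us"),
        os.environ.get("LLAMA_CLOUD_API_KEY", ""),
    )
    print("base_url:", kwargs.get("base_url", "(library default)"))
    # With the package installed, the kwargs would be passed to the client:
    # from llama_cloud_services import LlamaParse
    # parser = LlamaParse(**kwargs)
```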
Test material: écoute magazine, issue 1/2025, published by ZEIT Sprachen GmbH, Munich. Used here for technical demonstration of document-parsing tools (transformative use). Source PDF not redistributed.
MIT for the scripts and the comparison. The extracted text outputs in outputs/ and per-page/ are derivative of the source magazine and provided here as small excerpts for technical comparison only.
