From f4fc60385de68b0c3e096a820a9ceb94ea79f86b Mon Sep 17 00:00:00 2001
From: benchmark-bot <noreply@github.com>
Date: Thu, 16 Apr 2026 05:51:20 +0000
Subject: [PATCH] Update benchmark numbers from llm-crawler-benchmarks

Source: AIMLPM/llm-crawler-benchmarks@5b5cebc
---
 README.md          | 18 +++++++++---------
 docs/BENCHMARKS.md | 44 ++++++++++++++++++++++----------------------
 2 files changed, 31 insertions(+), 31 deletions(-)
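
Reviewer note: the "3.0x fewer chunks" figure in the hunks below can be rechecked from the chunk totals this patch introduces. The following is a minimal sketch, assuming the multiplier is the largest total divided by markcrawl's total; the helper name `chunk_ratio` is hypothetical and not part of sync_markcrawl.py or the benchmark repo.

```python
# Chunk totals copied from the updated "RAG answer quality" table in this patch.
chunks = {
    "markcrawl": 27_051,
    "scrapy+md": 42_234,
    "crawl4ai": 48_332,
    "playwright": 73_656,
    "crawlee": 74_281,
    "colly+md": 80_550,
}


def chunk_ratio(totals, tool="markcrawl"):
    """Return (tool with the most chunks, that total divided by `tool`'s total).

    Hypothetical helper for illustration only.
    """
    largest = max(totals, key=totals.get)
    return largest, totals[largest] / totals[tool]


largest, ratio = chunk_ratio(chunks)
print(f"{largest}: {ratio:.1f}x the chunks of markcrawl")
```

Under that assumption the comparison tool is colly+md and the ratio rounds to 3.0, which matches the wording in both files.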
diff --git a/README.md b/README.md
index b218f03..ed86b43 100644
--- a/README.md
+++ b/README.md
@@ -239,20 +239,20 @@ Each tool has strengths: FireCrawl excels as a hosted API, Crawl4AI has deep bro
 
 ### Benchmark results (6 tools, April 2026)
 
-**Speed:** markcrawl is fastest (14.0 pages/sec), scrapy+md second (9.3). Playwright-based tools average 1.4-2.1 pages/sec.
+**Speed:** markcrawl is fastest (12.1 pages/sec), scrapy+md second (9.5). Playwright-based tools range from 1.5 to 2.2 pages/sec.
 
-**Output cleanliness:** markcrawl has the lowest nav pollution (15 words vs 133+ for others) — less junk in your embeddings.
+**Output cleanliness:** markcrawl has the lowest nav pollution (14 words vs 208+ for others) — less junk in your embeddings.
 
-**RAG answer quality:** markcrawl scores 4.30/5 on answer quality with the fewest chunks (22,132 total, 2.1x fewer than the most), keeping embedding costs low.
+**RAG answer quality:** markcrawl scores 4.52/5 on answer quality with the fewest chunks (27,051 total, 3.0x fewer than colly+md, which produces the most), keeping embedding costs low.
 
 | Tool | Chunks/page | Answer Quality (/5) | Annual cost (100K pages, 1K queries/day) |
 |---|---|---|---|
-| **markcrawl** | **15.2** | **4.30** | **$4,505** |
-| scrapy+md | 16.4 | 4.41 | $5,464 |
-| crawl4ai | 22.5 | 4.26 | $6,960 |
-| colly+md | 29.5 | 4.29 | $7,213 |
-| playwright | 31.9 | 4.38 | $7,320 |
-| crawlee | 32.7 | 4.33 | $7,467 |
+| **markcrawl** | **18.6** | **4.52** | **$4,505** |
+| scrapy+md | 29.0 | 4.03 | $5,464 |
+| crawl4ai | 33.2 | 4.43 | $6,960 |
+| colly+md | 55.3 | 4.53 | $7,213 |
+| playwright | 50.6 | 4.42 | $7,320 |
+| crawlee | 51.0 | 4.52 | $7,467 |
 
 Full benchmark data: [docs/BENCHMARKS.md](docs/BENCHMARKS.md) | Methodology: [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks)
 </details>
diff --git a/docs/BENCHMARKS.md b/docs/BENCHMARKS.md
index c247b2a..690b242 100644
--- a/docs/BENCHMARKS.md
+++ b/docs/BENCHMARKS.md
@@ -1,9 +1,9 @@
 <!-- AUTO-GENERATED by sync_markcrawl.py — do not edit manually -->
 # MarkCrawl Benchmarks
 
-> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (14.0 pages/sec), produces the cleanest output (15 words of nav pollution vs 133+ for others), generates the fewest chunks (22,132 total, 2.1x fewer than crawlee), the lowest total RAG pipeline cost at every scale tested.
+> **Summary:** Across 6 open-source crawlers tested on 8 sites, MarkCrawl is the fastest (12.1 pages/sec), produces the cleanest output (14 words of nav pollution vs 208+ for others), generates the fewest chunks (27,051 total, 3.0x fewer than colly+md), and has the lowest total RAG pipeline cost at every scale tested.
 >
-> **Where MarkCrawl is not first:**  Answer quality is 5th (4.30/5, scrapy+md leads at 4.41).   Retrieval Hit@5 is 6th (86% vs 91% for scrapy+md).   Content recall is 7th (64% vs 97% for playwright).
+> **Where MarkCrawl is not first:** Answer quality is 2nd (4.52/5, colly+md leads at 4.53). Retrieval Hit@5 is 3rd (83% vs 87% for crawlee). Content recall is 6th (30% vs 84% for crawlee).
 
 *Last run: April 2026. Reproducible via [llm-crawler-benchmarks](https://github.com/AIMLPM/llm-crawler-benchmarks).*
 
@@ -13,12 +13,12 @@
 
 | Tool | Pages/sec |
 |---|---|
-| **markcrawl** | **14.0** |
-| scrapy+md | 9.3 |
-| colly+md | 6.6 |
-| playwright | 2.1 |
-| crawl4ai | 2.0 |
-| crawlee | 1.8 |
+| **markcrawl** | **12.1** |
+| scrapy+md | 9.5 |
+| colly+md | 4.2 |
+| playwright | 2.2 |
+| crawlee | 1.7 |
+| crawl4ai | 1.5 |
 
 MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-pool HTML extraction. Playwright-based tools (crawl4ai, crawlee) are inherently slower due to full browser rendering per page.
 
@@ -26,27 +26,27 @@ MarkCrawl uses native async I/O (httpx) with concurrent fetching and process-poo
 
 | Tool | Nav pollution (words) | Recall |
 |---|---|---|
-| **markcrawl** | **15** | **64%** |
-| scrapy+md | 133 | 68% |
-| crawl4ai | 311 | 66% |
-| colly+md | 1953 | 96% |
-| playwright | 2037 | 97% |
-| crawlee | 2207 | 97% |
+| **markcrawl** | **14** | **30%** |
+| scrapy+md | 208 | 21% |
+| crawl4ai | 418 | 71% |
+| playwright | 2710 | 82% |
+| colly+md | 2733 | 54% |
+| crawlee | 2839 | 84% |
 
 Nav pollution = boilerplate words (navigation, footer, cookie banners) that leak into extracted content. Lower is better — less junk means cleaner embeddings and fewer wasted tokens.
 
-The tradeoff: playwright captures 97% of page content but includes ~2,037 words of boilerplate per page. MarkCrawl captures 64% with 15 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
+The tradeoff: crawlee captures 84% of page content but includes ~2,839 words of boilerplate per page. MarkCrawl captures 30% with 14 words of pollution. For RAG pipelines, the cleaner output produces better embeddings despite the lower recall.
 
 ## RAG answer quality
 
 | Tool | Chunks | Answer Quality (/5) | Hit@5 | Hit@20 |
 |---|---|---|---|---|
-| **markcrawl** | **22,132** | **4.30** | **86%** | **91%** |
-| scrapy+md | 23,854 | 4.41 | 91% | 94% |
-| crawl4ai | 32,735 | 4.26 | 89% | 93% |
-| colly+md | 42,934 | 4.29 | 86% | 92% |
-| playwright | 46,439 | 4.38 | 90% | 94% |
-| crawlee | 47,560 | 4.33 | 88% | 93% |
+| **markcrawl** | **27,051** | **4.52** | **83%** | **88%** |
+| scrapy+md | 42,234 | 4.03 | 56% | 65% |
+| crawl4ai | 48,332 | 4.43 | 82% | 92% |
+| playwright | 73,656 | 4.42 | 86% | 93% |
+| crawlee | 74,281 | 4.52 | 87% | 93% |
+| colly+md | 80,550 | 4.53 | 77% | 88% |
 
 *FireCrawl's self-hosted version did not complete crawls on all sites across multiple attempts. Its scores are on a reduced set and are not directly comparable to tools that completed all sites.
 
@@ -55,7 +55,7 @@ The tradeoff: playwright captures 97% of page content but includes ~2,037 words
 - **Answer Quality** — LLM-judged score for answers generated from retrieved chunks.
 - **Hit@5 / Hit@20** — what percentage of queries find a relevant chunk in the top 5 or 20 results.
 
-**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 2.1x fewer chunks than crawlee for the same content, cutting embedding and storage costs significantly.
+**Fewer chunks = lower cost.** Each chunk requires an embedding call and vector storage. MarkCrawl produces 3.0x fewer chunks than colly+md for the same content, cutting embedding and storage costs significantly.
 
 ## Total cost of ownership