
Replace bespoke crawler and parser with Cloudflare Browser Rendering /crawl endpoint #2

Description

@brylie

Replace both the custom-built web crawler (BaseCrawler + CrawlerRunner) and the HTML-to-Markdown parser (Parser) with Cloudflare's Browser Rendering /crawl endpoint. This removes significant code we currently maintain — async HTTP fetching, link discovery, depth traversal, concurrency management, HTML storage, XPath content extraction, and HTML-to-Markdown conversion — and replaces it all with a single API call that returns Markdown directly.

By eliminating both the crawler and parser, we can focus engineering effort on higher-value work: document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.

All currently configured sites (migri, te_palvelut, kela, vero, dvv) must continue to work with minimal configuration changes.

Motivation

Our current pipeline includes two bespoke stages before vectorization:

Crawler (tapio/crawler/) — async httpx + BeautifulSoup implementation that handles:

  • Concurrent HTTP requests with semaphore-based throttling
  • Recursive link discovery and depth-limited traversal
  • Domain filtering and content-type checking
  • HTML file storage and URL-to-filepath mapping
  • Rate limiting via configurable delays

Parser (tapio/parser/) — HTML-to-Markdown converter that handles:

  • XPath-based content extraction with site-specific selectors
  • HTML-to-Markdown conversion via html2text with per-site configuration
  • YAML frontmatter generation with metadata
  • URL reverse-lookup from url_mappings.json

The Cloudflare /crawl endpoint replaces both stages with a single API call, plus adds:

  • Headless browser rendering — JavaScript-heavy sites are rendered properly
  • Multiple output formats — HTML, Markdown, and structured JSON returned directly
  • Automatic page discovery — From sitemaps, page links, or both
  • Incremental crawling — Skip pages that haven't changed (modifiedSince, maxAge)
  • robots.txt compliance — Honors directives including crawl-delay by default
  • URL pattern filtering — Include/exclude patterns for scoping crawls

By adopting this, we eliminate two entire pipeline stages and their configuration, freeing us to focus on value-added work: document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.

Current architecture to replace

Files to remove or significantly rewrite

| File | What it does | What happens to it |
| --- | --- | --- |
| tapio/crawler/crawler.py | BaseCrawler class: async httpx crawler with link following, HTML saving, URL mapping | Remove entirely |
| tapio/crawler/runner.py | CrawlerRunner: orchestrator that wraps BaseCrawler | Rewrite as Cloudflare API client |
| tapio/crawler/__init__.py | Exports BaseCrawler, CrawlerRunner | Update exports |
| tapio/parser/parser.py | Parser class: XPath content extraction, HTML-to-Markdown conversion, frontmatter generation | Remove entirely |
| tapio/parser/__init__.py | Exports Parser | Remove or gut |
| tests/crawler/test_crawler.py | Tests for BaseCrawler | Remove and replace with new tests |
| tests/crawler/test_runner.py | Tests for CrawlerRunner | Rewrite for new runner |
| tests/parser/ | Tests for Parser | Remove entirely |

Files to adapt

| File | What changes |
| --- | --- |
| tapio/config/config_models.py | Remove ParserConfig, HtmlToMarkdownConfig; adapt CrawlerConfig fields to Cloudflare params (see mapping below) |
| tapio/config/site_configs.yaml | Remove parser_config sections; update crawler_config for each site |
| tapio/cli.py | Remove parse command; update crawl command for async polling workflow; crawl now writes Markdown directly |

Files that stay unchanged

  • tapio/vectorstore/ — Vectorization pipeline consumes Markdown files, unaffected
  • tapio/services/ — RAG services are downstream, unaffected
  • tapio/app.py, tapio/factories.py — Gradio app and factory wiring, unaffected

How the current crawler works

The current pipeline is a multi-stage process:

  1. Crawl (tapio/crawler/crawler.py): BaseCrawler takes a SiteConfig, fetches HTML pages via httpx, follows links recursively up to max_depth, and saves raw HTML files to content/{site}/crawled/. It also writes a url_mappings.json file that maps file paths to original URLs.

  2. Parse (tapio/parser/parser.py): Parser reads the saved HTML files, uses XPath selectors from ParserConfig to extract targeted content areas (e.g., //div[@id="main-content"]), converts HTML to Markdown via html2text, and saves results to content/{site}/parsed/.

  3. Vectorize (tapio/vectorstore/): Reads parsed Markdown, generates embeddings, stores in ChromaDB.

  4. RAG App (tapio/app.py): Queries the vector store and generates answers via Ollama.

The Cloudflare /crawl endpoint replaces steps 1 and 2 entirely — it crawls the site and returns Markdown directly, including URL metadata per record.

Configuration mapping

Current CrawlerConfig fields map to Cloudflare /crawl parameters:

| Current field | Current default | Cloudflare parameter | Notes |
| --- | --- | --- | --- |
| max_depth | 1 | depth | Same concept: max link depth from the starting URL |
| max_concurrent | 5 | (handled by Cloudflare) | No longer needed; Cloudflare manages concurrency |
| delay_between_requests | 1.0 | (handled by Cloudflare) | No longer needed; Cloudflare respects robots.txt crawl-delay |
| (new) | | limit | Max pages to crawl (default 10, max 100,000) |
| (new) | | formats | Request ["markdown"] to get Markdown directly |
| (new) | | render | true for JS-heavy sites, false for static HTML |
| (new) | | source | "all", "sitemaps", or "links" |
| (new) | | options.includePatterns | Wildcard URL patterns to include |
| (new) | | options.excludePatterns | Wildcard URL patterns to exclude |
| (new) | | maxAge | Cache duration in seconds |
| (new) | | modifiedSince | Unix timestamp for incremental crawling |
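
As a rough illustration of the mapping above, the sketch below translates a config object into a /crawl request body. The CrawlerConfig shown here is a hypothetical stand-in written as a plain dataclass; the real model in tapio/config/config_models.py may look different, and the helper name to_crawl_payload is an assumption. The maxAge and modifiedSince parameters are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CrawlerConfig:
    """Hypothetical post-migration config; not the project's actual model."""

    max_depth: int = 1  # maps to Cloudflare "depth"
    limit: int = 10  # Cloudflare's documented default per the table above
    render: bool = True  # set to False for static HTML sites
    source: str = "all"  # "all", "sitemaps", or "links"
    formats: list[str] = field(default_factory=lambda: ["markdown"])
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)


def to_crawl_payload(base_url: str, cfg: CrawlerConfig) -> dict[str, Any]:
    """Translate the config into a /crawl request body (assumed helper)."""
    payload: dict[str, Any] = {
        "url": base_url,
        "depth": cfg.max_depth,
        "limit": cfg.limit,
        "formats": list(cfg.formats),
        "render": cfg.render,
        "source": cfg.source,
    }
    # Only send "options" when at least one pattern list is configured.
    options: dict[str, list[str]] = {}
    if cfg.include_patterns:
        options["includePatterns"] = list(cfg.include_patterns)
    if cfg.exclude_patterns:
        options["excludePatterns"] = list(cfg.exclude_patterns)
    if options:
        payload["options"] = options
    return payload
```

Keeping max_depth as the field name (rather than renaming it to depth) would let the existing --depth CLI override keep working without touching site configs.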

New environment variables

The Cloudflare API requires authentication:

  • CLOUDFLARE_ACCOUNT_ID — Your Cloudflare account ID
  • CLOUDFLARE_API_TOKEN — API token with Browser Rendering - Edit permission

These should be loaded via environment variables (not stored in config files). See API token setup.

Cloudflare /crawl endpoint overview

The endpoint works in two steps:

1. Initiate a crawl job (POST)

curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://migri.fi",
    "limit": 100,
    "depth": 2,
    "formats": ["markdown"],
    "render": true,
    "source": "all"
  }'

Response:

{
  "success": true,
  "result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
}

2. Poll for results (GET)

curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'

Response includes status (running, completed, errored, etc.) and an array of records, each containing:

  • url — The crawled page URL
  • status — Per-page status (completed, skipped, disallowed, etc.)
  • markdown — The page content as Markdown (when formats includes "markdown")
  • metadata — HTTP status, page title, final URL

Results are paginated via a cursor parameter when responses exceed 10 MB.
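
The two curl calls above could be wrapped in Python roughly as follows. This is a minimal sketch, not a finished client: the function names are made up for illustration, `client` is assumed to be an httpx.Client (or anything with compatible post/get methods), and passing the cursor as a query parameter is an assumption that should be checked against the Cloudflare documentation.

```python
import os
from typing import Any, Mapping, Optional

API_BASE = "https://api.cloudflare.com/client/v4"


def crawl_endpoint(account_id: str, job_id: Optional[str] = None) -> str:
    """Build the /crawl URL; append the job ID for status requests."""
    url = f"{API_BASE}/accounts/{account_id}/browser-rendering/crawl"
    return f"{url}/{job_id}" if job_id else url


def auth_headers(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Read the API token from the environment and build request headers."""
    token = env.get("CLOUDFLARE_API_TOKEN", "")
    if not token:
        raise RuntimeError("CLOUDFLARE_API_TOKEN is not set")
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}


def start_crawl(client: Any, account_id: str, payload: dict[str, Any]) -> str:
    """POST the crawl request and return the job ID from the "result" field."""
    response = client.post(crawl_endpoint(account_id), headers=auth_headers(), json=payload)
    response.raise_for_status()
    body = response.json()
    if not body.get("success"):
        raise RuntimeError(f"Cloudflare rejected the crawl request: {body}")
    return body["result"]


def get_job(
    client: Any, account_id: str, job_id: str, cursor: Optional[str] = None
) -> dict[str, Any]:
    """GET the job status and one page of records, returned as parsed JSON."""
    # Assumption: the pagination cursor is sent as a query parameter.
    params = {"cursor": cursor} if cursor else None
    response = client.get(crawl_endpoint(account_id, job_id), headers=auth_headers(), params=params)
    response.raise_for_status()
    return response.json()
```

Taking the client as an argument keeps these functions easy to test with a fake object, so the unit tests never need to reach the real API.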

Implementation guidance

Phase 1: Cloudflare API client

Create a new crawler implementation in tapio/crawler/ that wraps the Cloudflare /crawl REST API:

  1. POST to initiate a crawl job with the site's base_url and configuration
  2. Poll the GET endpoint until the job reaches a terminal status (completed, errored, cancelled_*)
  3. Paginate results using the cursor parameter
  4. Save the returned Markdown content to content/{site}/parsed/, prepending YAML frontmatter with the canonical source URL from the response metadata.url field (required for reference and grounding in the RAG pipeline)

Use httpx for HTTP requests (already a project dependency).
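
Steps 2 and 3 might be sketched as below. To keep the example independent of any HTTP library, the network call is injected as a `fetch_job(cursor)` callable. The response shape used here (a dict with "status", "records", and an optional "cursor" key) is an assumption for illustration; the actual JSON may nest these fields differently. The names wait_for_job and iter_records are hypothetical.

```python
import time
from typing import Any, Callable, Iterator, Optional

# fetch_job(cursor) returns one page of the job; cursor=None means the first page.
FetchJob = Callable[[Optional[str]], dict[str, Any]]


def is_terminal(status: str) -> bool:
    """True for completed, errored, and any cancelled_* status."""
    return status in ("completed", "errored") or status.startswith("cancelled_")


def wait_for_job(
    fetch_job: FetchJob,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_polls: Optional[int] = None,
) -> dict[str, Any]:
    """Poll until the job reaches a terminal status, then return that page."""
    polls = 0
    while True:
        job = fetch_job(None)
        if is_terminal(job["status"]):
            return job
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"job still {job['status']!r} after {polls} polls")
        sleep(interval)


def iter_records(fetch_job: FetchJob, first_page: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield every record, following the cursor until no more pages remain."""
    page = first_page
    while True:
        yield from page.get("records", [])
        cursor = page.get("cursor")
        if not cursor:
            return
        page = fetch_job(cursor)
```

Injecting `sleep` lets tests run the polling loop instantly, which helps with the coverage target for polling logic.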

Phase 2: Adapt configuration models

Update CrawlerConfig in tapio/config/config_models.py:

  • Remove delay_between_requests and max_concurrent (no longer needed)
  • Remove ParserConfig and HtmlToMarkdownConfig entirely (no longer needed)
  • Keep max_depth (maps to Cloudflare depth)
  • Add limit, render, source, formats, include_patterns, exclude_patterns

Update tapio/config/site_configs.yaml for each of the 5 sites:

  • Remove all parser_config sections (XPath selectors, markdown_config, etc.)
  • Add Cloudflare-appropriate crawler_config defaults

Phase 3: Update CLI and runner

Rewrite CrawlerRunner in tapio/crawler/runner.py to use the new Cloudflare client. Update the crawl CLI command in tapio/cli.py to:

  • Handle the async polling workflow
  • Display progress (pages processed, job status)
  • Support cancellation via Ctrl+C (send DELETE to cancel the job)
  • Remove the parse CLI command entirely (Cloudflare now provides parsed Markdown)
  • Save Markdown files with YAML frontmatter containing the canonical source URL
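
One possible shape for the Ctrl+C handling is a small wrapper that cancels the remote job before letting the interrupt propagate. This is only a sketch: run_cancellable is a hypothetical helper, and the `cancel` callable is assumed to issue the DELETE request (the exact URL for cancellation should be confirmed in the Cloudflare documentation).

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_cancellable(wait: Callable[[], T], cancel: Callable[[], None]) -> T:
    """Run wait(); on Ctrl+C, ask Cloudflare to cancel the job, then re-raise."""
    try:
        return wait()
    except KeyboardInterrupt:
        # Best effort: without this, the remote job would keep running
        # (and possibly keep consuming browser time) after the CLI exits.
        cancel()
        raise
```

In the CLI, `wait` would wrap the polling loop and `cancel` would wrap the DELETE call, so the command still exits with the usual interrupt behaviour after cleaning up.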

Phase 4: Remove old crawler and parser code

  • Delete BaseCrawler class and its tests
  • Delete Parser class and its tests
  • Remove ParserConfig, HtmlToMarkdownConfig from config models
  • Remove parse CLI command
  • Remove BeautifulSoup, lxml, and html2text dependencies (no longer needed)
  • Keep httpx (still used by the new Cloudflare client)
  • Update tapio/crawler/__init__.py exports

Phase 5: Source URL frontmatter

Each saved Markdown file must include YAML frontmatter with the canonical source URL so the RAG pipeline can cite and ground answers with proper references:

---
title: "Page Title"
source_url: "https://migri.fi/en/residence-permit"
crawl_timestamp: "2026-03-12T10:30:00Z"
---
# Page content as Markdown...

The source_url comes from the Cloudflare response metadata.url field for each record. The title comes from metadata.title. This metadata is critical for the downstream vectorization and RAG pipeline to provide grounded, referenceable answers.
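
A minimal sketch of producing this file content from one crawl record follows. The function name render_markdown_file is hypothetical, and the fallback to the record's top-level url when metadata.url is missing is an assumption rather than part of the requirements. Values are quoted with json.dumps, which yields valid YAML double-quoted strings and escapes any quotes in page titles.

```python
import json
from datetime import datetime, timezone
from typing import Any


def render_markdown_file(record: dict[str, Any], crawled_at: datetime) -> str:
    """Return frontmatter plus Markdown body for one crawl record.

    crawled_at should be timezone-aware; it is written out in UTC.
    """
    meta = record.get("metadata") or {}
    # Assumption: fall back to the record URL if metadata.url is absent.
    source_url = meta.get("url") or record["url"]
    title = meta.get("title") or ""
    timestamp = crawled_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

    lines = [
        "---",
        f"title: {json.dumps(title, ensure_ascii=False)}",
        f"source_url: {json.dumps(source_url, ensure_ascii=False)}",
        f"crawl_timestamp: {json.dumps(timestamp)}",
        "---",
    ]
    body = (record.get("markdown") or "").rstrip("\n")
    return "\n".join(lines) + "\n" + body + "\n"
```

The returned string can be written straight to a file under content/{site}/parsed/.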

Design notes

  • Why remove the parser? — Cloudflare returns full-page Markdown directly. While our parser used XPath selectors to extract specific content areas (e.g., //div[@id="main-content"]), maintaining site-specific XPath selectors is fragile and labor-intensive. Cloudflare's URL pattern filtering (includePatterns/excludePatterns) provides a sufficient alternative for scoping content, and full-page Markdown is generally good enough for RAG use cases. Removing the parser eliminates the ParserConfig, HtmlToMarkdownConfig, XPath selectors, html2text configuration, and the entire parse CLI command.
  • Source URL for grounding — Each Markdown file must contain a canonical source_url in its YAML frontmatter, sourced from the Cloudflare response metadata.url field. This is essential for the RAG pipeline to cite original sources and ground answers with verifiable references.
  • Error handling: Cloudflare jobs can time out (7-day max), hit account limits, or fail with an error. The new client must handle all terminal statuses: completed, cancelled_due_to_timeout, cancelled_due_to_limits, cancelled_by_user, errored.
  • Rate limits and billing: crawls with render: true consume headless browser time (billed per Cloudflare pricing), while render: false is free during beta. Configure per-site based on whether the site needs JS rendering.
  • Free plan limits — Workers Free plan: 10 minutes browser time/day. Workers Paid plan: higher limits.
  • Focus on higher-value work — By offloading crawling and parsing to Cloudflare, the team can focus on document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.
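
The error-handling note above could translate into a small guard that runs once polling finishes. This is a sketch under assumptions: CrawlJobError and ensure_completed are hypothetical names, and the messages are illustrative wording, not text returned by Cloudflare.

```python
class CrawlJobError(RuntimeError):
    """Raised when a crawl job ends in any status other than completed."""


# Terminal failure statuses listed in the design notes, with illustrative hints.
FAILURE_HINTS = {
    "errored": "the job failed on Cloudflare's side",
    "cancelled_due_to_timeout": "the job exceeded the maximum run time",
    "cancelled_due_to_limits": "the account hit a plan or usage limit",
    "cancelled_by_user": "the job was cancelled on request",
}


def ensure_completed(status: str) -> None:
    """Return silently for completed jobs; raise CrawlJobError otherwise."""
    if status == "completed":
        return
    hint = FAILURE_HINTS.get(status)
    if hint is None:
        # Unknown statuses are treated as failures rather than ignored.
        raise CrawlJobError(f"unexpected job status: {status!r}")
    raise CrawlJobError(f"{status}: {hint}")
```

Treating unknown statuses as failures means a new status added by Cloudflare surfaces as a visible error instead of a silently empty crawl.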

Acceptance criteria

  • All 5 configured sites (migri, te_palvelut, kela, vero, dvv) can be crawled using the Cloudflare /crawl endpoint
  • The crawl CLI command works with the same interface: uv run python -m tapio.cli crawl migri
  • Depth override via --depth CLI flag still works
  • Custom config path via --config still works
  • Crawl results are saved as Markdown files with YAML frontmatter (including source_url) compatible with the vectorize pipeline
  • Each Markdown file contains the canonical source URL from Cloudflare metadata.url for RAG grounding
  • CrawlerConfig model reflects Cloudflare parameters with sensible defaults
  • New tests cover the Cloudflare API client, polling logic, and error handling
  • Test coverage >= 80% for new code
  • Old BaseCrawler code and its tests are removed
  • Old Parser code, ParserConfig, HtmlToMarkdownConfig, and their tests are removed
  • The parse CLI command is removed
  • Unused dependencies (BeautifulSoup, lxml, html2text) are removed from pyproject.toml
  • Environment variables CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN are documented
  • Code passes uv run ruff check ., uv run mypy ., and uv run pyrefly check
