Replace both the custom-built web crawler (BaseCrawler + CrawlerRunner) and the HTML-to-Markdown parser (Parser) with Cloudflare's Browser Rendering /crawl endpoint. This removes significant code we currently maintain — async HTTP fetching, link discovery, depth traversal, concurrency management, HTML storage, XPath content extraction, and HTML-to-Markdown conversion — and replaces it all with a single API call that returns Markdown directly.
By eliminating both the crawler and parser, we can focus engineering effort on higher-value work: document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.
All currently configured sites (migri, te_palvelut, kela, vero, dvv) must continue to work with minimal configuration changes.
Motivation
Our current pipeline includes two bespoke stages before vectorization:
Crawler (tapio/crawler/) — async httpx + BeautifulSoup implementation that handles:
Concurrent HTTP requests with semaphore-based throttling
Recursive link discovery and depth-limited traversal
Domain filtering and content-type checking
HTML file storage and URL-to-filepath mapping
Rate limiting via configurable delays
Parser (tapio/parser/) — HTML-to-Markdown converter that handles:
XPath-based content extraction with site-specific selectors
HTML-to-Markdown conversion via html2text with per-site configuration
YAML frontmatter generation with metadata
URL reverse-lookup from url_mappings.json
The Cloudflare /crawl endpoint replaces both stages with a single API call, plus adds:
Headless browser rendering — JavaScript-heavy sites are rendered properly
Multiple output formats — HTML, Markdown, and structured JSON returned directly
Automatic page discovery — From sitemaps, page links, or both
Incremental crawling — Skip pages that haven't changed (modifiedSince, maxAge)
robots.txt compliance — Honors directives including crawl-delay by default
URL pattern filtering — Include/exclude patterns for scoping crawls
By adopting this, we eliminate two entire pipeline stages and their configuration, freeing us to focus on value-added work: document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.
Current architecture to replace
Files to remove or significantly rewrite
| File | What it does | What happens to it |
| --- | --- | --- |
| tapio/crawler/crawler.py | BaseCrawler class: async httpx crawler with link following, HTML saving, URL mapping | Remove entirely |
| tapio/crawler/runner.py | CrawlerRunner: orchestrator that wraps BaseCrawler | Rewrite as Cloudflare API client |
| tapio/crawler/__init__.py | Exports BaseCrawler, CrawlerRunner | Update exports |
| tapio/parser/parser.py | Parser class: XPath content extraction, HTML-to-Markdown conversion, frontmatter generation | Remove entirely |
| tapio/parser/__init__.py | Exports Parser | Remove or gut |
| tests/crawler/test_crawler.py | Tests for BaseCrawler | Remove and replace with new tests |
| tests/crawler/test_runner.py | Tests for CrawlerRunner | Rewrite for new runner |
| tests/parser/ | Tests for Parser | Remove entirely |
Files to adapt
| File | What changes |
| --- | --- |
| tapio/config/config_models.py | Remove ParserConfig, HtmlToMarkdownConfig; adapt CrawlerConfig fields to Cloudflare params (see mapping below) |
| tapio/config/site_configs.yaml | Remove parser_config sections; update crawler_config for each site |
| tapio/cli.py | Remove parse command; update crawl command for async polling workflow; crawl now writes Markdown directly |
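Files that stay unchanged
tapio/vectorstore/: Vectorization pipeline consumes Markdown files, unaffected
tapio/services/: RAG services are downstream, unaffected
tapio/app.py, tapio/factories.py: Gradio app and factory wiring, unaffected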
How the current crawler works
The current pipeline is a multi-stage process:
1. Crawl (tapio/crawler/crawler.py): BaseCrawler takes a SiteConfig, fetches HTML pages via httpx, follows links recursively up to max_depth, and saves raw HTML files to content/{site}/crawled/. It also writes a url_mappings.json file that maps file paths to original URLs.
2. Parse (tapio/parser/parser.py): Parser reads the saved HTML files, uses XPath selectors from ParserConfig to extract targeted content areas (e.g., //div[@id="main-content"]), converts HTML to Markdown via html2text, and saves results to content/{site}/parsed/.
3. Vectorize (tapio/vectorstore/): Reads parsed Markdown, generates embeddings, stores in ChromaDB.
4. RAG App (tapio/app.py): Queries the vector store and generates answers via Ollama.
The Cloudflare /crawl endpoint replaces steps 1 and 2 entirely — it crawls the site and returns Markdown directly, including URL metadata per record.
Configuration mapping
Current CrawlerConfig fields map to Cloudflare /crawl parameters:
| Current field | Current default | Cloudflare parameter | Notes |
| --- | --- | --- | --- |
| max_depth | 1 | depth | Same concept: maximum link depth from the starting URL |
| max_concurrent | 5 | (handled by Cloudflare) | No longer needed; Cloudflare manages concurrency |
| delay_between_requests | 1.0 | (handled by Cloudflare) | No longer needed; Cloudflare respects robots.txt crawl-delay |
| (new) | (none) | limit | Max pages to crawl (default 10, max 100,000) |
| (new) | (none) | formats | Request ["markdown"] to get Markdown directly |
| (new) | (none) | render | true for JS-heavy sites, false for static HTML |
| (new) | (none) | source | "all", "sitemaps", or "links" |
| (new) | (none) | options.includePatterns | Wildcard URL patterns to include |
| (new) | (none) | options.excludePatterns | Wildcard URL patterns to exclude |
| (new) | (none) | maxAge | Cache duration in seconds |
| (new) | (none) | modifiedSince | Unix timestamp for incremental crawling |
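As a rough illustration of the mapping, the sketch below turns a post-migration config object into a /crawl request body. The `CrawlerConfig` dataclass, its snake_case field names, and the `to_crawl_payload` helper are hypothetical (the project's real config models may differ), and the `url` key for the starting URL is an assumption that should be checked against the Cloudflare documentation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class CrawlerConfig:
    """Hypothetical post-migration config; names are illustrative only."""

    max_depth: int = 1
    limit: int = 10
    render: bool = False
    source: str = "all"
    formats: List[str] = field(default_factory=lambda: ["markdown"])
    include_patterns: List[str] = field(default_factory=list)
    exclude_patterns: List[str] = field(default_factory=list)
    max_age: Optional[int] = None
    modified_since: Optional[int] = None


def to_crawl_payload(base_url: str, cfg: CrawlerConfig) -> Dict[str, Any]:
    """Build the JSON body for the POST request from the config."""
    payload: Dict[str, Any] = {
        "url": base_url,  # assumed parameter name for the starting URL
        "depth": cfg.max_depth,
        "limit": cfg.limit,
        "formats": list(cfg.formats),
        "render": cfg.render,
        "source": cfg.source,
    }
    # Pattern filters live under "options" in the mapping table above.
    options: Dict[str, Any] = {}
    if cfg.include_patterns:
        options["includePatterns"] = list(cfg.include_patterns)
    if cfg.exclude_patterns:
        options["excludePatterns"] = list(cfg.exclude_patterns)
    if options:
        payload["options"] = options
    # Optional incremental-crawl parameters are sent only when set.
    if cfg.max_age is not None:
        payload["maxAge"] = cfg.max_age
    if cfg.modified_since is not None:
        payload["modifiedSince"] = cfg.modified_since
    return payload
```

Leaving unset optional fields out of the body lets Cloudflare apply its own defaults instead of pinning them in our config.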
New environment variables
The Cloudflare API requires authentication:
CLOUDFLARE_ACCOUNT_ID — Your Cloudflare account ID
CLOUDFLARE_API_TOKEN — API token with Browser Rendering - Edit permission
These should be loaded via environment variables (not stored in config files). See API token setup.
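A minimal sketch of reading these variables and deriving the request URL and headers might look like the following. The helper names are hypothetical. The job URL shape follows the GET example in this issue, while using the same path without a job ID for the POST request is an assumption.

```python
import os
from typing import Dict, Mapping, Tuple

API_BASE = "https://api.cloudflare.com/client/v4"
REQUIRED_VARS = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def load_credentials(env: Mapping[str, str] = os.environ) -> Tuple[str, str]:
    """Return (account_id, api_token), failing early if either is unset."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["CLOUDFLARE_ACCOUNT_ID"], env["CLOUDFLARE_API_TOKEN"]


def crawl_url(account_id: str, job_id: str = "") -> str:
    """URL of the crawl collection, or of one job when job_id is given."""
    base = f"{API_BASE}/accounts/{account_id}/browser-rendering/crawl"
    return f"{base}/{job_id}" if job_id else base


def auth_headers(token: str) -> Dict[str, str]:
    """Bearer-token header, as used in the curl example."""
    return {"Authorization": f"Bearer {token}"}
```

Failing at startup with the names of the missing variables is usually easier to debug than a 401 response partway through a crawl.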
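Cloudflare /crawl endpoint overview
The endpoint works in two steps:
1. Initiate a crawl job (POST)
Response:
{ "success": true, "result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e" }
2. Poll for results (GET)
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
-H 'Authorization: Bearer <apiToken>'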
Response includes status (running, completed, errored, etc.) and an array of records, each containing:
url — The crawled page URL
status — Per-page status (completed, skipped, disallowed, etc.)
markdown — The page content as Markdown (when formats includes "markdown")
metadata — HTTP status, page title, final URL
Results are paginated via a cursor parameter when responses exceed 10 MB.
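One way to consume paginated results is sketched below. It assumes each response page is available as a dict with a `records` list and an optional `cursor` value, and that a missing or empty cursor marks the last page. The exact response envelope should be confirmed against the Cloudflare documentation. `fetch_page` is a hypothetical callable that would perform the GET request with the given cursor.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Optional

Page = Dict[str, Any]
Record = Dict[str, Any]


def iter_records(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[Record]:
    """Yield every record across all pages, following the cursor."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("records", [])
        cursor = page.get("cursor")
        if not cursor:  # assumed: no cursor means no further pages
            return


def completed_pages(records: Iterable[Record]) -> List[Record]:
    """Keep only pages that were crawled and actually carry Markdown."""
    return [
        r for r in records
        if r.get("status") == "completed" and r.get("markdown")
    ]
```

Passing the HTTP call in as a function keeps the pagination logic testable without network access.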
Implementation guidance
Phase 1: Cloudflare API client
Create a new crawler implementation in tapio/crawler/ that wraps the Cloudflare /crawl REST API:
POST to initiate a crawl job with the site's base_url and configuration
Poll the GET endpoint until the job reaches a terminal status (completed, errored, cancelled_*)
Paginate results using the cursor parameter
Save the returned Markdown content to content/{site}/parsed/, prepending YAML frontmatter with the canonical source URL from the response metadata.url field (required for reference and grounding in the RAG pipeline)
Use httpx for HTTP requests (already a project dependency).
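The polling step could be sketched roughly as follows. `get_status` is a hypothetical callable that would issue the GET request (for example via httpx) and return the job's status string. The interval and timeout defaults are placeholders rather than recommended values.

```python
import time
from typing import Callable


def is_terminal(status: str) -> bool:
    """Terminal statuses as described above: completed, errored, cancelled_*."""
    return status in ("completed", "errored") or status.startswith("cancelled_")


def wait_for_job(
    get_status: Callable[[], str],
    interval: float = 5.0,      # placeholder polling interval in seconds
    timeout: float = 3600.0,    # placeholder client-side limit in seconds
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal status and return that status."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if is_terminal(status):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"crawl job still '{status}' after {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` is a design choice that lets tests run instantly. Production code would simply use the defaults.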
Phase 2: Adapt configuration models
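Update CrawlerConfig in tapio/config/config_models.py:
Remove delay_between_requests and max_concurrent (no longer needed)
Remove ParserConfig and HtmlToMarkdownConfig entirely (no longer needed)
Keep max_depth (maps to Cloudflare depth)
Add limit, render, source, formats, include_patterns, exclude_patterns
Update tapio/config/site_configs.yaml for each of the 5 sites:
Remove parser_config sections (XPath selectors, markdown_config, etc.)
Update crawler_config defaults
Phase 3: Update CLI and runner
Rewrite CrawlerRunner in tapio/crawler/runner.py to use the new Cloudflare client. Update the crawl CLI command in tapio/cli.py to: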
Handle the async polling workflow
Display progress (pages processed, job status)
Support cancellation via Ctrl+C (send DELETE to cancel the job)
Remove the parse CLI command entirely (Cloudflare now provides parsed Markdown)
Save Markdown files with YAML frontmatter containing the canonical source URL
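For the Ctrl+C requirement, one possible shape is a small wrapper that cancels the remote job before letting the interrupt propagate. Both callables are hypothetical: `wait` would block on polling, and `cancel` would send the DELETE request for the job.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_cancel(wait: Callable[[], T], cancel: Callable[[], None]) -> T:
    """Run wait(); on Ctrl+C, cancel the remote job and re-raise."""
    try:
        return wait()
    except KeyboardInterrupt:
        # Without this, the job would keep running on Cloudflare's side
        # after the CLI process exits.
        cancel()
        raise
```

Re-raising keeps the usual CLI behavior (non-zero exit on interrupt) while still cleaning up the job.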
Phase 4: Remove old crawler and parser code
Delete BaseCrawler class and its tests
Delete Parser class and its tests
Remove ParserConfig, HtmlToMarkdownConfig from config models
Remove parse CLI command
Remove BeautifulSoup, lxml, and html2text dependencies (no longer needed)
Keep httpx (still used by the new Cloudflare client)
Update tapio/crawler/__init__.py exports
Phase 5: Source URL frontmatter
Each saved Markdown file must include YAML frontmatter with the canonical source URL so the RAG pipeline can cite and ground answers with proper references.
The source_url comes from the Cloudflare response metadata.url field for each record. The title comes from metadata.title. This metadata is critical for the downstream vectorization and RAG pipeline to provide grounded, referenceable answers.
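A minimal sketch of producing such a file from one record is shown below. The `render_markdown_file` helper is hypothetical, and falling back to the record's top-level `url` when `metadata.url` is absent is an assumption. Values are written as JSON-style double-quoted strings, which are also valid YAML scalars, so no YAML library is needed.

```python
import json
from typing import Any, Dict


def render_markdown_file(record: Dict[str, Any]) -> str:
    """Return file contents: YAML frontmatter followed by the page Markdown."""
    meta = record.get("metadata") or {}
    source_url = meta.get("url") or record["url"]  # assumed fallback
    title = meta.get("title") or ""
    frontmatter = (
        "---\n"
        # json.dumps quotes and escapes the value safely for YAML.
        f"source_url: {json.dumps(source_url, ensure_ascii=False)}\n"
        f"title: {json.dumps(title, ensure_ascii=False)}\n"
        "---\n\n"
    )
    return frontmatter + record.get("markdown", "")
```

Quoting every value avoids surprises with titles that contain colons, quotes, or leading special characters.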
Design notes
Why remove the parser? Cloudflare returns full-page Markdown directly. Our parser used XPath selectors to extract specific content areas (e.g., //div[@id="main-content"]), but maintaining site-specific XPath selectors is fragile and labor-intensive. Cloudflare's URL pattern filtering (includePatterns/excludePatterns) provides a sufficient alternative for scoping content, and full-page Markdown is generally good enough for RAG use cases. Removing the parser eliminates the ParserConfig, HtmlToMarkdownConfig, XPath selectors, html2text configuration, and the entire parse CLI command.
Source URL for grounding: Each Markdown file must contain a canonical source_url in its YAML frontmatter, sourced from the Cloudflare response metadata.url field. This is essential for the RAG pipeline to cite original sources and ground answers with verifiable references.
Error handling: Cloudflare jobs can time out (7-day max), hit account limits, or error. The new client must handle all terminal statuses: completed, cancelled_due_to_timeout, cancelled_due_to_limits, cancelled_by_user, errored.
Rate limits and billing: render: true crawls consume headless browser time (billed per Cloudflare pricing). render: false is free during beta. Configure render per site, based on whether the site needs JS rendering.
Focus on higher-value work: By offloading crawling and parsing to Cloudflare, the team can focus on document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.
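To make the error-handling note concrete, the sketch below maps each terminal status listed above to an outcome. The `CrawlJobError` exception and the message wording are hypothetical. Whether a user cancellation should count as a failure is a design decision left open here.

```python
class CrawlJobError(RuntimeError):
    """Raised when a crawl job ends in any status other than completed."""


# Messages are illustrative; the status names come from the note above.
FAILURE_MESSAGES = {
    "errored": "the job failed on Cloudflare's side",
    "cancelled_due_to_timeout": "the job exceeded its maximum run time",
    "cancelled_due_to_limits": "the job hit an account limit",
    "cancelled_by_user": "the job was cancelled by the user",
}


def check_final_status(status: str) -> None:
    """Return silently for completed jobs; raise for every other status."""
    if status == "completed":
        return
    if status in FAILURE_MESSAGES:
        raise CrawlJobError(f"{status}: {FAILURE_MESSAGES[status]}")
    # Unknown statuses are treated as errors rather than silently accepted.
    raise CrawlJobError(f"unexpected job status: {status!r}")
```

Raising on unknown statuses means a future change in Cloudflare's status vocabulary surfaces as a visible failure instead of an empty crawl.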
Acceptance criteria
Crawling goes through the Cloudflare /crawl endpoint
crawl CLI command works with the same interface: uv run python -m tapio.cli crawl migri
--depth CLI flag still works
--config still works
Saved Markdown files include YAML frontmatter (source_url) compatible with the vectorize pipeline
source_url is taken from metadata.url for RAG grounding
CrawlerConfig model reflects Cloudflare parameters with sensible defaults
BaseCrawler code and its tests are removed
Parser code, ParserConfig, HtmlToMarkdownConfig, and their tests are removed
parse CLI command is removed
Dependencies that are no longer needed are removed from pyproject.toml
CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN are documented
uv run ruff check ., uv run mypy ., and uv run pyrefly check pass