Replace bespoke crawler and parser with Cloudflare Browser Rendering /crawl endpoint

Replace both the custom-built web crawler (`BaseCrawler` + `CrawlerRunner`) and the HTML-to-Markdown parser (`Parser`) with Cloudflare's Browser Rendering [/crawl endpoint](https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/). This removes significant code we currently maintain — async HTTP fetching, link discovery, depth traversal, concurrency management, HTML storage, XPath content extraction, and HTML-to-Markdown conversion — and replaces it all with a single API call that returns Markdown directly.

By eliminating both the crawler and parser, we can focus engineering effort on higher-value work: document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.

All currently configured sites (migri, te_palvelut, kela, vero, dvv) must continue to work with minimal configuration changes.

## Motivation

Our current pipeline includes two bespoke stages before vectorization:

**Crawler** (`tapio/crawler/`) — async httpx + BeautifulSoup implementation that handles:
- Concurrent HTTP requests with semaphore-based throttling
- Recursive link discovery and depth-limited traversal
- Domain filtering and content-type checking
- HTML file storage and URL-to-filepath mapping
- Rate limiting via configurable delays

**Parser** (`tapio/parser/`) — HTML-to-Markdown converter that handles:
- XPath-based content extraction with site-specific selectors
- HTML-to-Markdown conversion via html2text with per-site configuration
- YAML frontmatter generation with metadata
- URL reverse-lookup from `url_mappings.json`

The Cloudflare `/crawl` endpoint replaces both stages with a single API call, plus adds:

- **Headless browser rendering** — JavaScript-heavy sites are rendered properly
- **Multiple output formats** — HTML, Markdown, and structured JSON returned directly
- **Automatic page discovery** — From sitemaps, page links, or both
- **Incremental crawling** — Skip pages that haven't changed (`modifiedSince`, `maxAge`)
- **robots.txt compliance** — Honors directives including crawl-delay by default
- **URL pattern filtering** — Include/exclude patterns for scoping crawls

Adopting it removes two entire pipeline stages, along with their configuration and tests.

## Current architecture to replace

### Files to remove or significantly rewrite

| File                            | What it does                                                                                   | What happens to it                    |
| ------------------------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------- |
| `tapio/crawler/crawler.py`      | `BaseCrawler` class — async httpx crawler with link following, HTML saving, URL mapping        | **Remove entirely**                   |
| `tapio/crawler/runner.py`       | `CrawlerRunner` — orchestrator that wraps `BaseCrawler`                                        | **Rewrite** as Cloudflare API client  |
| `tapio/crawler/__init__.py`     | Exports `BaseCrawler`, `CrawlerRunner`                                                         | **Update exports**                    |
| `tapio/parser/parser.py`        | `Parser` class — XPath content extraction, HTML-to-Markdown conversion, frontmatter generation | **Remove entirely**                   |
| `tapio/parser/__init__.py`      | Exports `Parser`                                                                               | **Remove or gut**                     |
| `tests/crawler/test_crawler.py` | Tests for `BaseCrawler`                                                                        | **Remove and replace** with new tests |
| `tests/crawler/test_runner.py`  | Tests for `CrawlerRunner`                                                                      | **Rewrite** for new runner            |
| `tests/parser/`                 | Tests for `Parser`                                                                             | **Remove entirely**                   |

### Files to adapt

| File                             | What changes                                                                                                         |
| -------------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| `tapio/config/config_models.py`  | Remove `ParserConfig`, `HtmlToMarkdownConfig`; adapt `CrawlerConfig` fields to Cloudflare params (see mapping below) |
| `tapio/config/site_configs.yaml` | Remove `parser_config` sections; update `crawler_config` for each site                                               |
| `tapio/cli.py`                   | Remove `parse` command; update `crawl` command for async polling workflow; crawl now writes Markdown directly        |

### Files that stay unchanged

- `tapio/vectorstore/` — Vectorization pipeline consumes Markdown files, unaffected
- `tapio/services/` — RAG services are downstream, unaffected
- `tapio/app.py`, `tapio/factories.py` — Gradio app and factory wiring, unaffected

## How the current crawler works

The current pipeline is a multi-stage process:

1. **Crawl** (`tapio/crawler/crawler.py`): `BaseCrawler` takes a `SiteConfig`, fetches HTML pages via httpx, follows links recursively up to `max_depth`, and saves raw HTML files to `content/{site}/crawled/`. It also writes a `url_mappings.json` file that maps file paths to original URLs.

2. **Parse** (`tapio/parser/parser.py`): `Parser` reads the saved HTML files, uses XPath selectors from `ParserConfig` to extract targeted content areas (e.g., `//div[@id="main-content"]`), converts HTML to Markdown via html2text, and saves results to `content/{site}/parsed/`.

3. **Vectorize** (`tapio/vectorstore/`): Reads parsed Markdown, generates embeddings, stores in ChromaDB.

4. **RAG App** (`tapio/app.py`): Queries the vector store and generates answers via Ollama.

The Cloudflare `/crawl` endpoint replaces steps 1 and 2 entirely — it crawls the site and returns Markdown directly, including URL metadata per record.

## Configuration mapping

Current `CrawlerConfig` fields map to Cloudflare `/crawl` parameters:

| Current field            | Current default | Cloudflare parameter      | Notes                                                         |
| ------------------------ | --------------- | ------------------------- | ------------------------------------------------------------- |
| `max_depth`              | 1               | `depth`                   | Same concept — max link depth from starting URL               |
| `max_concurrent`         | 5               | *(handled by Cloudflare)* | No longer needed — Cloudflare manages concurrency             |
| `delay_between_requests` | 1.0             | *(handled by Cloudflare)* | No longer needed — Cloudflare respects robots.txt crawl-delay |
| *(new)*                  | —               | `limit`                   | Max pages to crawl (default 10, max 100,000)                  |
| *(new)*                  | —               | `formats`                 | Request `["markdown"]` to get Markdown directly               |
| *(new)*                  | —               | `render`                  | `true` for JS-heavy sites, `false` for static HTML            |
| *(new)*                  | —               | `source`                  | `"all"`, `"sitemaps"`, or `"links"`                           |
| *(new)*                  | —               | `options.includePatterns` | Wildcard URL patterns to include                              |
| *(new)*                  | —               | `options.excludePatterns` | Wildcard URL patterns to exclude                              |
| *(new)*                  | —               | `maxAge`                  | Cache duration in seconds                                     |
| *(new)*                  | —               | `modifiedSince`           | Unix timestamp for incremental crawling                       |
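As a rough illustration of the mapping above, the sketch below translates a trimmed-down config object into a `/crawl` request body. The `CrawlerConfig` shape shown here, its default values, and the helper name `to_crawl_payload` are assumptions for illustration only; the real model in `tapio/config/config_models.py` may look different, and a plain dataclass is used just to keep the example dependency-free.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class CrawlerConfig:
    """Hypothetical post-migration config; defaults are placeholders."""

    max_depth: int = 1       # maps to Cloudflare `depth`
    limit: int = 10          # Cloudflare default per the table above
    render: bool = False     # set True for JS-heavy sites
    source: str = "all"      # "all", "sitemaps", or "links"
    formats: list[str] = field(default_factory=lambda: ["markdown"])
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)


def to_crawl_payload(base_url: str, config: CrawlerConfig) -> dict[str, Any]:
    """Translate the site config into a JSON body for POST /crawl."""
    payload: dict[str, Any] = {
        "url": base_url,
        "depth": config.max_depth,
        "limit": config.limit,
        "formats": list(config.formats),
        "render": config.render,
        "source": config.source,
    }
    # Patterns live under `options`; omit the key when nothing is configured.
    options: dict[str, list[str]] = {}
    if config.include_patterns:
        options["includePatterns"] = list(config.include_patterns)
    if config.exclude_patterns:
        options["excludePatterns"] = list(config.exclude_patterns)
    if options:
        payload["options"] = options
    return payload
```

Keeping the translation in one pure function means the snake_case field names used in YAML never leak into the HTTP layer, and the mapping can be unit-tested without network access.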

### New environment variables

The Cloudflare API requires authentication:

- `CLOUDFLARE_ACCOUNT_ID` — Your Cloudflare account ID
- `CLOUDFLARE_API_TOKEN` — API token with **Browser Rendering - Edit** permission

These should be loaded via environment variables (not stored in config files). See [API token setup](https://developers.cloudflare.com/browser-rendering/rest-api/).
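One possible way to load these credentials and fail early when they are missing is sketched below. The helper names (`load_credentials`, `crawl_endpoint`, `auth_headers`) are hypothetical; the URL layout is taken from the curl examples in this issue.

```python
import os
from collections.abc import Mapping
from typing import Optional

API_BASE = "https://api.cloudflare.com/client/v4"
REQUIRED_VARS = ("CLOUDFLARE_ACCOUNT_ID", "CLOUDFLARE_API_TOKEN")


def load_credentials(env: Optional[Mapping[str, str]] = None) -> tuple[str, str]:
    """Return (account_id, api_token), raising if either variable is unset."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["CLOUDFLARE_ACCOUNT_ID"], env["CLOUDFLARE_API_TOKEN"]


def crawl_endpoint(account_id: str) -> str:
    """Build the /crawl URL for the given account."""
    return f"{API_BASE}/accounts/{account_id}/browser-rendering/crawl"


def auth_headers(api_token: str) -> dict[str, str]:
    """Headers sent with every request to the endpoint."""
    return {
        "Authorization": f"Bearer {api_token}",
        "Content-Type": "application/json",
    }
```

Accepting an optional `env` mapping is a small design choice that lets tests pass a plain dict instead of patching `os.environ`.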

## Cloudflare /crawl endpoint overview

The endpoint works in two steps:

### 1. Initiate a crawl job (POST)

```bash
curl -X POST 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl' \
  -H 'Authorization: Bearer <apiToken>' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://migri.fi",
    "limit": 100,
    "depth": 2,
    "formats": ["markdown"],
    "render": true,
    "source": "all"
  }'
```

Response:
```json
{
  "success": true,
  "result": "c7f8s2d9-a8e7-4b6e-8e4d-3d4a1b2c3f4e"
}
```

### 2. Poll for results (GET)

```bash
curl -X GET 'https://api.cloudflare.com/client/v4/accounts/{account_id}/browser-rendering/crawl/{job_id}' \
  -H 'Authorization: Bearer <apiToken>'
```

The response includes the job status (`running`, `completed`, `errored`, etc.) and an array of `records`, each containing:
- `url` — The crawled page URL
- `status` — Per-page status (`completed`, `skipped`, `disallowed`, etc.)
- `markdown` — The page content as Markdown (when `formats` includes `"markdown"`)
- `metadata` — HTTP status, page title, final URL

Results are paginated via a `cursor` parameter when responses exceed 10 MB.
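A minimal sketch of how a client might walk the paginated results follows. It assumes that each page is a dict with a `records` list and that a `cursor` value is present only when more pages remain; both points should be checked against the Cloudflare reference. `fetch_page` is a hypothetical callable that would perform the GET (for example via `httpx`) and return the job object.

```python
from collections.abc import Callable, Iterable, Iterator
from typing import Any

Record = dict[str, Any]


def iter_records(fetch_page: Callable[[Any], dict[str, Any]]) -> Iterator[Record]:
    """Yield every record of a job, following `cursor` from page to page.

    `fetch_page(cursor)` receives None for the first page and the previous
    page's cursor afterwards.
    """
    cursor = None
    while True:
        page = fetch_page(cursor)
        yield from page.get("records", [])
        cursor = page.get("cursor")
        if cursor is None:  # assumption: no cursor means this was the last page
            break


def completed_markdown(records: Iterable[Record]) -> Iterator[Record]:
    """Keep only pages that were crawled successfully and carry Markdown."""
    for record in records:
        if record.get("status") == "completed" and record.get("markdown"):
            yield record
```

Because both functions are generators, records can be written to disk as each page arrives rather than holding a full crawl in memory.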

## Implementation guidance

### Phase 1: Cloudflare API client

Create a new crawler implementation in `tapio/crawler/` that wraps the Cloudflare `/crawl` REST API:

1. **POST** to initiate a crawl job with the site's `base_url` and configuration
2. **Poll** the GET endpoint until the job reaches a terminal status (`completed`, `errored`, `cancelled_*`)
3. **Paginate** results using the `cursor` parameter
4. **Save** the returned Markdown content to `content/{site}/parsed/`, prepending YAML frontmatter with the canonical source URL from the response `metadata.url` field (required for reference and grounding in the RAG pipeline)

Use `httpx` for HTTP requests (already a project dependency).
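The polling step could look roughly like the sketch below. The set of terminal statuses is the one listed in this issue; the function name `wait_for_job`, the default interval, and the local timeout are assumptions. `get_status` is a hypothetical callable that would issue the GET request and return the job's current status string.

```python
import time
from collections.abc import Callable

TERMINAL_STATUSES = frozenset(
    {
        "completed",
        "errored",
        "cancelled_due_to_timeout",
        "cancelled_due_to_limits",
        "cancelled_by_user",
    }
)


def wait_for_job(
    get_status: Callable[[], str],
    *,
    interval: float = 5.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal status and return that status.

    Raises TimeoutError if the local deadline passes first. This deadline is
    a client-side guard and is separate from Cloudflare's own job timeout.
    """
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"crawl job still {status!r} after {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays, which helps with the coverage target for the polling logic.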

### Phase 2: Adapt configuration models

Update the models in `tapio/config/config_models.py`:

- Remove `delay_between_requests` and `max_concurrent` from `CrawlerConfig` (Cloudflare handles both)
- Remove `ParserConfig` and `HtmlToMarkdownConfig` entirely (no longer needed)
- Keep `max_depth` (maps to Cloudflare `depth`)
- Add `limit`, `render`, `source`, `formats`, `include_patterns`, `exclude_patterns` to `CrawlerConfig`

Update `tapio/config/site_configs.yaml` for each of the 5 sites:

- Remove all `parser_config` sections (XPath selectors, markdown_config, etc.)
- Add Cloudflare-appropriate `crawler_config` defaults

### Phase 3: Update CLI and runner

Rewrite `CrawlerRunner` in `tapio/crawler/runner.py` to use the new Cloudflare client. Update the `crawl` CLI command in `tapio/cli.py` to:

- Handle the async polling workflow
- Display progress (pages processed, job status)
- Support cancellation via Ctrl+C (send DELETE to cancel the job)
- Save Markdown files with YAML frontmatter containing the canonical source URL

Also remove the `parse` CLI command entirely, since Cloudflare now provides parsed Markdown.
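For the Ctrl+C requirement, one simple approach is to wrap the blocking wait and cancel the remote job before re-raising, as sketched here. Both callables are hypothetical: `wait` would be the polling loop, and `cancel` would send the DELETE request for the job.

```python
from collections.abc import Callable


def run_with_cancel(wait: Callable[[], str], cancel: Callable[[], None]) -> str:
    """Run the blocking wait; on Ctrl+C, cancel the remote job and re-raise.

    Re-raising keeps the usual CLI behaviour (non-zero exit) while making
    sure the job does not keep consuming browser time after the user quits.
    """
    try:
        return wait()
    except KeyboardInterrupt:
        cancel()
        raise
```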

### Phase 4: Remove old crawler and parser code

- Delete `BaseCrawler` class and its tests
- Delete `Parser` class and its tests
- Remove `ParserConfig`, `HtmlToMarkdownConfig` from config models
- Remove `parse` CLI command
- Remove BeautifulSoup, lxml, and html2text dependencies (no longer needed)
- Keep httpx (still used by the new Cloudflare client)
- Update `tapio/crawler/__init__.py` exports

### Phase 5: Source URL frontmatter

Each saved Markdown file must include YAML frontmatter with the canonical source URL so the RAG pipeline can cite and ground answers with proper references:

```yaml
---
title: "Page Title"
source_url: "https://migri.fi/en/residence-permit"
crawl_timestamp: "2026-03-12T10:30:00Z"
---
# Page content as Markdown...
```

The `source_url` comes from the Cloudflare response `metadata.url` field for each record. The `title` comes from `metadata.title`. This metadata is critical for the downstream vectorization and RAG pipeline to provide grounded, referenceable answers.
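A minimal sketch of rendering one record into this file format is shown below. The function name `render_document` and the fallback to the record's top-level `url` when `metadata.url` is absent are assumptions. Values are quoted with `json.dumps`, relying on JSON string literals also being accepted as YAML double-quoted scalars, so titles containing quotes or colons stay valid.

```python
import json
from datetime import datetime, timezone
from typing import Any


def render_document(record: dict[str, Any], crawled_at: datetime) -> str:
    """Return the Markdown for one crawl record with YAML frontmatter prepended."""
    metadata = record.get("metadata", {})
    # Assumption: fall back to the record URL if metadata.url is missing.
    source_url = metadata.get("url") or record["url"]
    fields = {
        "title": metadata.get("title", ""),
        "source_url": source_url,
        "crawl_timestamp": crawled_at.astimezone(timezone.utc).strftime(
            "%Y-%m-%dT%H:%M:%SZ"
        ),
    }
    # ensure_ascii=False keeps characters such as "ä" and "ö" readable.
    lines = [
        f"{key}: {json.dumps(value, ensure_ascii=False)}"
        for key, value in fields.items()
    ]
    return "---\n" + "\n".join(lines) + "\n---\n" + record["markdown"]
```

Pass a timezone-aware `datetime` for `crawled_at`; a naive value would be interpreted as local time by `astimezone`.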

## Design notes

- **Why remove the parser?** — Cloudflare returns full-page Markdown directly. While our parser used XPath selectors to extract specific content areas (e.g., `//div[@id="main-content"]`), maintaining site-specific XPath selectors is fragile and labor-intensive. Cloudflare's URL pattern filtering (`includePatterns`/`excludePatterns`) provides a sufficient alternative for scoping content, and full-page Markdown is generally good enough for RAG use cases. Removing the parser eliminates the `ParserConfig`, `HtmlToMarkdownConfig`, XPath selectors, html2text configuration, and the entire parse CLI command.
- **Source URL for grounding** — Each Markdown file must contain a canonical `source_url` in its YAML frontmatter, sourced from the Cloudflare response `metadata.url` field. This is essential for the RAG pipeline to cite original sources and ground answers with verifiable references.
- **Error handling** — Cloudflare jobs can time out (7-day max), hit account limits, or error. The new client must handle all terminal statuses: `completed`, `cancelled_due_to_timeout`, `cancelled_due_to_limits`, `cancelled_by_user`, `errored`.
- **Rate limits and billing** — `render: true` crawls consume headless browser time (billed per [Cloudflare pricing](https://developers.cloudflare.com/browser-rendering/platform/pricing/)). `render: false` is free during beta. Configure per-site based on whether the site needs JS rendering.
- **Free plan limits** — Workers Free plan: 10 minutes browser time/day. Workers Paid plan: higher limits.
- **Focus on higher-value work** — By offloading crawling and parsing to Cloudflare, the team can focus on document vectorization, RAG orchestration, multi-agent workflows, and user-facing features.
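To make the error-handling note concrete, the sketch below turns a terminal status into either a clean return or an exception. The exception class, the hint texts, and the function name `ensure_completed` are illustrative assumptions; only the status strings come from this issue.

```python
class CrawlJobError(RuntimeError):
    """Raised when a crawl job ends in any terminal status other than `completed`."""


# Short, human-readable hints for the CLI; wording is illustrative only.
FAILURE_HINTS = {
    "errored": "the job failed on Cloudflare's side",
    "cancelled_due_to_timeout": "the job exceeded its maximum run time",
    "cancelled_due_to_limits": "the account hit a Browser Rendering limit",
    "cancelled_by_user": "the job was cancelled",
}


def ensure_completed(job_id: str, status: str) -> None:
    """Return silently for `completed`; raise CrawlJobError for anything else."""
    if status == "completed":
        return
    hint = FAILURE_HINTS.get(status, "unexpected status")
    raise CrawlJobError(f"crawl job {job_id} ended with status {status!r}: {hint}")
```

Treating unknown statuses as failures is a deliberately conservative choice, so that a status added to the API later does not get mistaken for success.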

## Cloudflare documentation

- [/crawl endpoint documentation](https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/) — Full API reference with examples
- [Browser Rendering REST API — Before you begin](https://developers.cloudflare.com/browser-rendering/rest-api/) — API token setup
- [Browser Rendering pricing](https://developers.cloudflare.com/browser-rendering/platform/pricing/) — Billing details
- [Crawl endpoint limits](https://developers.cloudflare.com/browser-rendering/platform/limits/) — Free vs. paid plan limits
- [robots.txt and sitemaps best practices](https://developers.cloudflare.com/browser-rendering/rest-api/crawl-endpoint/#robotstxt-and-bot-protection) — Bot behavior
- [Browser Rendering FAQ and troubleshooting](https://developers.cloudflare.com/browser-rendering/platform/faq/) — Common issues

## Acceptance criteria

- [ ] All 5 configured sites (migri, te_palvelut, kela, vero, dvv) can be crawled using the Cloudflare `/crawl` endpoint
- [ ] The `crawl` CLI command works with the same interface: `uv run python -m tapio.cli crawl migri`
- [ ] Depth override via `--depth` CLI flag still works
- [ ] Custom config path via `--config` still works
- [ ] Crawl results are saved as Markdown files with YAML frontmatter (including `source_url`) compatible with the vectorize pipeline
- [ ] Each Markdown file contains the canonical source URL from Cloudflare `metadata.url` for RAG grounding
- [ ] `CrawlerConfig` model reflects Cloudflare parameters with sensible defaults
- [ ] New tests cover the Cloudflare API client, polling logic, and error handling
- [ ] Test coverage >= 80% for new code
- [ ] Old `BaseCrawler` code and its tests are removed
- [ ] Old `Parser` code, `ParserConfig`, `HtmlToMarkdownConfig`, and their tests are removed
- [ ] The `parse` CLI command is removed
- [ ] Unused dependencies (BeautifulSoup, lxml, html2text) are removed from `pyproject.toml`
- [ ] Environment variables `CLOUDFLARE_ACCOUNT_ID` and `CLOUDFLARE_API_TOKEN` are documented
- [ ] Code passes `uv run ruff check .`, `uv run mypy .`, and `uv run pyrefly check`

