
feat(index): add crawl source — auto-discover and index all pages on a site #105

Open
aafaq-rashid-comprinno wants to merge 2 commits into StarTrail-org:main from aafaq-rashid-comprinno:feat/crawl-source-clean

aafaq-rashid-comprinno commented Jun 24, 2026


Summary

Adds a crawl source type that discovers the pages of a website automatically by following links breadth-first (BFS) from a start URL, bounded by max_pages and max_depth.

Usage

source:
  type: crawl
  start_url: https://comprinno.net/
  max_pages: 150
  max_depth: 3
  exclude_patterns:
    - "?p="
    - "/page/"
    - "/2025/"
    - "/2026/"

embed:
  model: Qwen/Qwen3-VL-Embedding-2B
  device: auto

output: ./comprinno_index

Then:

pixelrag index build

This crawls comprinno.net (~150 pages), screenshots every page, embeds all chunks with Qwen3-VL, and builds a searchable FAISS index — fully automated.

Changes (from main only)

  • New file: index/src/pixelrag_index/sources/crawl.py — BFS crawler with domain filtering
  • sources/__init__.py: Register CrawlSource in SOURCES dict
  • config.py: Fix bug where Path().expanduser() mangled URLs containing ://
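
For reviewers who want to see the config.py bug in isolation, here is a minimal sketch. It uses PurePosixPath so the behaviour is the same on any OS. The helper name resolve_location is hypothetical and is not the code in this PR; it only illustrates one way to guard URLs before path expansion.

```python
from pathlib import Path, PurePosixPath


def resolve_location(value: str) -> str:
    """Hypothetical helper: pass URLs through untouched, expand '~' for local paths."""
    if "://" in value:
        # Anything with a scheme separator is treated as a URL, not a filesystem path.
        return value
    return str(Path(value).expanduser())


# POSIX path normalisation collapses the '//' after the scheme.
mangled = str(PurePosixPath("https://x.com"))
print(mangled)
print(resolve_location("https://x.com"))
```

The first print shows the mangled form produced by path normalisation; the second shows the URL surviving when it never passes through Path.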

Features

  • BFS traversal with max_pages and max_depth limits
  • Same-domain only by default (stay_on_domain: true)
  • Auto-filters junk URLs: feeds, wp-json, wp-content, assets, xmlrpc
  • Custom exclude_patterns list for site-specific filtering
  • No new dependencies (stdlib urllib + re only)
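
The features above can be sketched roughly as follows. This is not the contents of crawl.py; it is a simplified illustration using only urllib.parse and re. The function name crawl, the injected fetch callable, and the exact depth semantics (the start URL is depth 0, links are not followed from pages at max_depth) are assumptions made for the example.

```python
import re
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse

# Crude href extraction; a real crawler may need a proper HTML parser.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def crawl(start_url, fetch, max_pages=50, max_depth=3,
          stay_on_domain=True, exclude_patterns=()):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    start_url = urldefrag(start_url).url
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])  # FIFO queue gives BFS order
    pages = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages.append(url)
        if depth >= max_depth:
            continue  # assumed semantics: do not expand links past max_depth
        for href in HREF_RE.findall(html):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, fragment dropped
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if stay_on_domain and parsed.netloc != domain:
                continue
            if link in seen or any(p in link for p in exclude_patterns):
                continue
            seen.add(link)
            queue.append((link, depth + 1))
    return pages


# In-memory stand-in for a website, so the sketch runs without network access.
site = {
    "https://example.com/": '<a href="/a">A</a> <a href="/b#top">B</a> '
                            '<a href="https://other.org/">X</a> <a href="/?p=12">S</a>',
    "https://example.com/a": '<a href="/c">C</a> <a href="/">Home</a>',
    "https://example.com/b": "",
    "https://example.com/c": '<a href="/d">D</a>',
    "https://example.com/d": "",
}
found = crawl("https://example.com/", site.get, max_pages=10, max_depth=2,
              exclude_patterns=["?p="])
```

Passing fetch in as a parameter keeps the traversal logic testable offline; in practice it could wrap urllib.request.urlopen.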

Tested on comprinno.net

  • Discovers 150+ real content pages (services, case studies, blogs)
  • Filters out WordPress shortlinks (?p=), pagination, date archives
  • All 23 existing tests pass
  • Lint clean
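
To illustrate the filtering described above, here is a rough sketch of a URL filter. The names DEFAULT_JUNK, ASSET_RE, and is_excluded are hypothetical, and plain substring matching for exclude_patterns is an assumption; the PR may match patterns differently.

```python
import re
from urllib.parse import urlparse

# Assumed built-in markers for non-content URLs (feeds, WordPress internals).
DEFAULT_JUNK = ("/feed", "/wp-json", "/wp-content", "/xmlrpc")
# Assumed reading of "assets": static files identified by extension.
ASSET_RE = re.compile(r"\.(css|js|png|jpe?g|gif|svg|ico|pdf|zip|woff2?)$", re.IGNORECASE)


def is_excluded(url, exclude_patterns=()):
    """Return True if the URL should be skipped by the crawler."""
    path = urlparse(url).path
    if ASSET_RE.search(path):
        return True
    # Substring checks are deliberately crude; '/feed' would also match '/feedback'.
    if any(marker in path for marker in DEFAULT_JUNK):
        return True
    # Site-specific patterns are matched against the full URL, query string included.
    return any(pattern in url for pattern in exclude_patterns)


patterns = ["?p=", "/page/", "/2025/", "/2026/"]
for candidate in (
    "https://comprinno.net/?p=123",
    "https://comprinno.net/blog/page/2/",
    "https://comprinno.net/2025/01/some-post/",
    "https://comprinno.net/services/",
):
    print(candidate, is_excluded(candidate, patterns))
```

The patterns list is the one from the Usage config; the specific paths after the domain are made up for the example.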

…a site

New source adapter that BFS-crawls a website from a start URL, follows
same-domain links up to a configurable depth/max_pages, and yields each
discovered page for the index pipeline.

Also fixes a bug in make_source() where Path().expanduser() mangled URLs
by collapsing '//' to '/' (e.g. https://x.com → https:/x.com).

Usage:
  source:
    type: crawl
    start_url: https://example.com/
    max_pages: 50
    max_depth: 3

Features:
- BFS traversal with configurable depth and page limit
- Stays on same domain by default
- Auto-filters non-content URLs (feeds, wp-json, assets)
- Custom exclude_patterns for site-specific filtering
- No new dependencies (stdlib urllib + regex)

vercel Bot commented Jun 24, 2026


@aafaq-rashid-comprinno is attempting to deploy a commit to the andylizf's projects Team on Vercel.

A member of the Team first needs to authorize it.

8 tests covering BFS discovery, domain filtering, max_pages/depth
limits, exclude_patterns, asset filtering, and URL mangling fix.