feat(index): add crawl source — auto-discover and index all pages on a site by aafaq-rashid-comprinno · Pull Request #105 · StarTrail-org/PixelRAG

aafaq-rashid-comprinno · 2026-06-24T15:46:01Z

Summary

Adds a crawl source type that automatically discovers all pages on a website via BFS link-following.

Usage

source:
  type: crawl
  start_url: https://comprinno.net/
  max_pages: 150
  max_depth: 3
  exclude_patterns:
    - "?p="
    - "/page/"
    - "/2025/"
    - "/2026/"

embed:
  model: Qwen/Qwen3-VL-Embedding-2B
  device: auto

output: ./comprinno_index

Then:

pixelrag index build

This crawls comprinno.net (~150 pages), screenshots every page, embeds all chunks with Qwen3-VL, and builds a searchable FAISS index — fully automated.

Changes (from main only)

New file: index/src/pixelrag_index/sources/crawl.py — BFS crawler with domain filtering
sources/__init__.py: Register CrawlSource in SOURCES dict
config.py: Fix bug where Path().expanduser() mangled URLs containing ://
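
For reviewers who want to see the URL mangling in isolation, here is a minimal sketch. It reproduces the collapse with `PurePosixPath` so the result is the same on every OS, then shows one possible guard. The helper name `resolve_location` is hypothetical and is not the code in `config.py`; the `"://"` check is an assumption about how the fix might distinguish URLs from filesystem paths.

```python
import os.path
from pathlib import PurePosixPath

# Path normalisation collapses the '//' after the scheme, which corrupts URLs.
mangled = str(PurePosixPath("https://x.com"))  # 'https:/x.com'


def resolve_location(value: str) -> str:
    """Hypothetical helper: leave URLs untouched, expand '~' only for paths."""
    if "://" in value:
        # Looks like a URL (assumed heuristic), so skip path handling entirely.
        return value
    return os.path.expanduser(value)
```

With a guard like this, `start_url` values reach the crawler unchanged while local paths such as the `output` directory still get `~` expansion.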

Features

BFS traversal with max_pages and max_depth limits
Same-domain only by default (stay_on_domain: true)
Auto-filters junk URLs: feeds, wp-json, wp-content, assets, xmlrpc
Custom exclude_patterns list for site-specific filtering
No new dependencies (stdlib urllib + re only)
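
The traversal described in the list above can be sketched as follows. This is an illustration under stated assumptions, not the contents of `crawl.py`: the function names `crawl` and `fetch_html`, the `fetch` parameter, and the regex-based link extraction are all hypothetical. It uses only the standard library, in line with the "no new dependencies" point.

```python
import re
from collections import deque
from typing import Callable, Iterator
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Simplistic href extraction (assumption: the real crawler may parse differently).
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)


def fetch_html(url: str) -> str:
    """Download a page body as text using only the standard library."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(
    start_url: str,
    max_pages: int = 50,
    max_depth: int = 3,
    stay_on_domain: bool = True,
    fetch: Callable[[str], str] = fetch_html,
) -> Iterator[str]:
    """Yield page URLs in breadth-first order, bounded by page count and depth."""
    domain = urlparse(start_url).netloc
    start = urldefrag(start_url).url
    queue = deque([(start, 0)])  # (url, depth) pairs; FIFO order gives BFS
    seen = {start}
    yielded = 0
    while queue and yielded < max_pages:
        url, depth = queue.popleft()
        try:
            html = fetch(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        yield url
        yielded += 1
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for href in HREF_RE.findall(html):
            link = urldefrag(urljoin(url, href)).url  # absolute URL, no #fragment
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https"):
                continue
            if stay_on_domain and parsed.netloc != domain:
                continue
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
```

Passing `fetch` as a parameter keeps the sketch testable without network access: a test can supply a function that returns canned HTML for each URL.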

Tested on comprinno.net

Discovers 150+ real content pages (services, case studies, blogs)
Filters out WordPress shortlinks (?p=), pagination, date archives
All 23 existing tests pass
Lint clean
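
The filtering behaviour reported above (shortlinks, pagination, date archives, plus the built-in junk list) might look roughly like the sketch below. Everything here is an assumption for illustration: the name `is_excluded`, the exact junk segments, the asset extension list, and the choice to treat `exclude_patterns` entries as plain substrings rather than regular expressions.

```python
from typing import Iterable
from urllib.parse import urlparse

# Assumed built-in junk list: path segments that never hold content pages.
JUNK_SEGMENTS = {"feed", "wp-json", "wp-content", "xmlrpc.php"}
# Assumed static-asset extensions; the real list may differ.
ASSET_EXTENSIONS = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".ico", ".pdf")


def is_excluded(url: str, exclude_patterns: Iterable[str] = ()) -> bool:
    """Return True if the URL should be skipped by the crawler."""
    path = urlparse(url).path.lower()
    segments = set(path.strip("/").split("/"))
    if segments & JUNK_SEGMENTS:
        return True  # feeds, REST endpoints, uploads, XML-RPC
    if path.endswith(ASSET_EXTENSIONS):
        return True  # static assets rather than pages
    # Site-specific patterns, matched as substrings of the full URL (assumption).
    return any(pattern in url for pattern in exclude_patterns)
```

Matching whole path segments instead of raw substrings avoids false positives such as a `/feedback/` page being dropped because it contains `feed`.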

…a site

New source adapter that BFS-crawls a website from a start URL, follows same-domain links up to a configurable depth/max_pages, and yields each discovered page for the index pipeline.

Also fixes a bug in make_source() where Path().expanduser() mangled URLs by collapsing '//' to '/' (e.g. https://x.com → https:/x.com).

Usage:

source:
  type: crawl
  start_url: https://example.com/
  max_pages: 50
  max_depth: 3

Features:

- BFS traversal with configurable depth and page limit
- Stays on same domain by default
- Auto-filters non-content URLs (feeds, wp-json, assets)
- Custom exclude_patterns for site-specific filtering
- No new dependencies (stdlib urllib + regex)

vercel · 2026-06-24T15:46:06Z

@aafaq-rashid-comprinno is attempting to deploy a commit to the andylizf's projects Team on Vercel.

A member of the Team first needs to authorize it.

test: add crawl source tests

2f65a44

8 tests covering BFS discovery, domain filtering, max_pages/depth limits, exclude_patterns, asset filtering, and URL mangling fix.

aafaq-rashid-comprinno wants to merge 2 commits into StarTrail-org:main from aafaq-rashid-comprinno:feat/crawl-source-clean
