feat: add headless crawl mode for cron/automation usage #77

Open
MiquelGomezCorral wants to merge 1 commit into PhialsBasement:main from MiquelGomezCorral:feat/headless-crawl-mode

@MiquelGomezCorral

What

Adds a --crawl URL flag that runs a crawl without starting the web server, then exits. No browser, no Flask, no HTTP round-trips needed.

Also adds:

  • --port / -p — configure port without env var
  • --no-browser — suppress the auto-open browser tab on normal server start
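The flags above, together with the --crawl-format and --output options used in the examples under Why, could be declared with argparse roughly as follows. This is a minimal sketch, not the code from the diff: the default port of 5000, the `json` default format, and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser for the flags described in this PR (illustrative only)."""
    parser = argparse.ArgumentParser(prog="main.py")
    parser.add_argument("--crawl", metavar="URL",
                        help="crawl URL without starting the web server, then exit")
    parser.add_argument("--crawl-format", choices=["json", "csv"],
                        default="json")  # default value is an assumption
    parser.add_argument("--output", metavar="FILE",
                        help="write results to FILE instead of stdout")
    parser.add_argument("--port", "-p", type=int,
                        default=5000)  # default value is an assumption
    parser.add_argument("--no-browser", action="store_true",
                        help="do not open a browser tab on server start")
    return parser


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--crawl", "https://example.com", "--crawl-format", "csv"]
)
print(args.crawl, args.crawl_format, args.output, args.port, args.no_browser)
```

When --crawl is absent, `args.crawl` is `None`, which gives a natural branch point between headless mode and the normal server path.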

Why

Makes it possible to drive LibreCrawl from a cron job or CI script without running a persistent web server or scripting HTTP API calls:

```sh
# cron: crawl nightly, save JSON report
0 2 * * * python main.py --crawl https://example.com --output /reports/nightly.json

# pipe to jq
python main.py --crawl https://example.com | jq '[.data[] | select(.status_code >= 400)]'

# CSV
python main.py --crawl https://example.com --crawl-format csv --output report.csv
```
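For scripts that would rather not depend on jq, the same filter can be written in Python. This sketch assumes only what the jq expression above implies: the JSON report is an object whose `data` key holds a list of per-URL records with a numeric `status_code`. The `url` field, the `failing_pages` name, and the sample values are made up for illustration.

```python
import json


def failing_pages(report: dict) -> list:
    """Return the records in report["data"] whose status_code is 400 or higher."""
    return [
        row for row in report.get("data", [])
        # A missing or null status_code is treated as 0, so it is never selected.
        if (row.get("status_code") or 0) >= 400
    ]


# Hypothetical report; real LibreCrawl output may carry more fields.
sample = json.loads(
    '{"data": [{"url": "/", "status_code": 200},'
    ' {"url": "/gone", "status_code": 404}]}'
)
print(json.dumps(failing_pages(sample)))
```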

Progress is printed to stdout; results go to --output file or stdout if omitted. Exit code is 0 on success, 1 on error.
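The output and exit-code contract described above can be sketched like this. It is an illustration under stated assumptions, not the PR's implementation: `emit_results` and `run` are hypothetical names, `crawl` is a stand-in callable for the real crawl, rows are assumed to be flat dicts, and the JSON envelope with a `data` key is inferred from the jq example.

```python
import csv
import io
import json
import sys


def emit_results(rows, fmt="json", output=None):
    """Serialize rows as JSON or CSV; write to `output` if given, else stdout."""
    if fmt == "csv":
        buf = io.StringIO()
        # Use the union of all keys as columns so ragged rows still serialize.
        fields = sorted({key for row in rows for key in row})
        writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        text = buf.getvalue()
    else:
        text = json.dumps({"data": rows}, indent=2) + "\n"
    if output:
        with open(output, "w", encoding="utf-8", newline="") as fh:
            fh.write(text)
    else:
        sys.stdout.write(text)
    return text


def run(crawl, fmt="json", output=None):
    """Return 0 on success and 1 on any error, matching the exit codes above."""
    try:
        emit_results(crawl(), fmt, output)
        return 0
    except Exception as exc:  # any failure maps to exit code 1
        print(f"error: {exc}", file=sys.stderr)
        return 1
```

A caller would finish with `sys.exit(run(...))`, so that cron or a CI step can act on the exit status.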

Changes

  • main.py only — one file, no new dependencies
  • Reuses the existing WebCrawler, SettingsManager, and export helpers already in the codebase
  • Normal server mode is completely unaffected

New --crawl URL flag bypasses the web server entirely:
it runs WebCrawler directly, prints progress to stdout,
and exits with results (JSON or CSV via --crawl-format).

--output saves results to a file instead of stdout.
The --no-browser and --port flags are also added so that
normal server startup is more composable in automated
environments.