Fetchlee is a configurable web crawler built with TypeScript and Puppeteer. It focuses on link discovery and metadata extraction pipelines driven by JSON task files.
- Headless/non-headless crawling with Puppeteer
- Persistent URL frontier based on SQLite (queued/processing/visited/failed)
- Configurable crawl rules and metadata extraction rules
- Page interaction engine (click, type, scroll, waitFor, etc.)
- Optional metadata persistence to ArangoDB
- Optional Tor-based crawling mode
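
The four frontier states listed above can be pictured as a small state machine. The following is only an in-memory sketch under that assumption: `MemoryFrontier` and its method names are hypothetical, and Fetchlee's actual frontier is SQLite-backed and may behave differently.

```typescript
// Hypothetical in-memory stand-in for a URL frontier (not Fetchlee's SQLite implementation).
type UrlState = "queued" | "processing" | "visited" | "failed";

interface FrontierEntry {
  url: string;
  state: UrlState;
}

class MemoryFrontier {
  private entries: FrontierEntry[] = [];

  // Look up an entry by URL; returns undefined when the URL is unknown.
  private find(url: string): FrontierEntry | undefined {
    for (const entry of this.entries) {
      if (entry.url === url) return entry;
    }
    return undefined;
  }

  // Add a URL as "queued" unless it is already known in any state.
  enqueue(url: string): boolean {
    if (this.find(url) !== undefined) return false;
    this.entries.push({ url: url, state: "queued" });
    return true;
  }

  // Claim the oldest queued URL and mark it as "processing".
  next(): string | undefined {
    for (const entry of this.entries) {
      if (entry.state === "queued") {
        entry.state = "processing";
        return entry.url;
      }
    }
    return undefined;
  }

  // Record the outcome of a URL that is currently "processing".
  finish(url: string, ok: boolean): void {
    const entry = this.find(url);
    if (entry === undefined || entry.state !== "processing") {
      throw new Error("URL is not being processed: " + url);
    }
    entry.state = ok ? "visited" : "failed";
  }

  // Report the current state of a URL, if it is known.
  stateOf(url: string): UrlState | undefined {
    const entry = this.find(url);
    return entry === undefined ? undefined : entry.state;
  }
}
```

Keeping every URL in the table regardless of state is what gives deduplication: a URL that was already visited or failed is not queued again unless the history is cleared.
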
This project is under active development. Task format and crawler behavior are still evolving.
- Node.js 18+
- npm
- (Optional) Tor + Privoxy for --use_tor
- (Optional) ArangoDB for --use_database
npm install
Use ts-node directly:
npx ts-node main.ts crawl \
-c my_collection \
-o ./output \
-e ./tasks/sample_task.json \
-l ./your_links_file.txt \
--headless
Build TypeScript:
npm run build
Run compiled JavaScript:
node dist/main.js crawl \
-c my_collection \
-o ./output \
-e ./tasks/sample_task.json \
-l ./your_links_file.txt \
--headless
You can also use:
npm run start -- crawl -c my_collection -o ./output -e ./tasks/sample_task.json -l ./your_links_file.txt --headless
Global options:
- -c, --coll_name <string>: collection name (default: default_host_name)
- -o, --output <path>: output directory
- -e, --task <path>: task JSON path
- -l, --links <path>: seed links file path
- -h, --HELP: extended help
crawl command options:
- --headless: run the browser in headless mode
- -t, --use_tor: route crawling through Tor (requires Tor/Privoxy setup)
- -d, --delay <number>: delay between processed URLs, in milliseconds
- --use_database: save extracted metadata to ArangoDB
- --frontier_state <path>: path to the SQLite frontier DB
- --clear_history: clear the URL history for the selected collection in the frontier
- --browser_config <path>: path to the browser JSON config
The parsing command currently exists only as a placeholder and is not fully implemented.
A task file defines:
- crawl_rules: URL pattern matching and next-link extraction rules
- metadata_extraction: extraction rules for target pages
- links_transformation (optional): URL transformation rules
- interactions (optional): pre/post interaction steps
Example task files:
- tasks/sample_task.json
- tasks/emerald_test_task.json
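
As a rough illustration of the task file's top-level keys described above, a task could be checked before a crawl starts. This is a minimal sketch, not Fetchlee's own loader: `TaskFile`, `missingTaskKeys`, and `parseTask` are hypothetical names, the inner shape of each section is not specified here (so values are typed as `unknown`), and treating the two non-optional keys as required is an assumption.

```typescript
// Top-level keys named in this README; value shapes are unspecified, hence `unknown`.
interface TaskFile {
  crawl_rules: unknown;
  metadata_extraction: unknown;
  links_transformation?: unknown;
  interactions?: unknown;
}

// Assumption: the keys not marked "(optional)" must be present.
const REQUIRED_KEYS = ["crawl_rules", "metadata_extraction"];

// Return the required top-level keys that are missing from a parsed task object.
function missingTaskKeys(task: unknown): string[] {
  if (typeof task !== "object" || task === null || Array.isArray(task)) {
    return REQUIRED_KEYS.slice();
  }
  const record = task as { [key: string]: unknown };
  return REQUIRED_KEYS.filter((key) => record[key] === undefined);
}

// Parse JSON text and throw a descriptive error when required keys are absent.
function parseTask(jsonText: string): TaskFile {
  const parsed: unknown = JSON.parse(jsonText);
  const missing = missingTaskKeys(parsed);
  if (missing.length > 0) {
    throw new Error("Task file is missing: " + missing.join(", "));
  }
  return parsed as TaskFile;
}
```

Failing early with the names of the missing keys is usually friendlier than letting the crawler start and stop on the first page.
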
A plain text file with one URL per line.
Example:
https://www.emerald.com/journals/pages/journals_a-z
Create .env in project root:
DATABASE_TYPE=arango
ARANGO_URL=http://localhost:8529
ARANGO_DB=crawler_db
ARANGO_COLLECTION=crawled_data
ARANGO_USER=root
ARANGO_PASSWORD=your_password
Then run the crawler with --use_database.
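
To make the role of these variables concrete, here is a hedged sketch of how they might be collected into a typed config object. `ArangoConfig` and `loadArangoConfig` are hypothetical names, and both the validation rules and the requirement that every variable be non-empty are assumptions; Fetchlee's own loading code may differ.

```typescript
// Hypothetical typed view of the ArangoDB settings from the .env file.
interface ArangoConfig {
  url: string;
  database: string;
  collection: string;
  user: string;
  password: string;
}

// Build a config from an environment-like object, reporting all missing variables at once.
function loadArangoConfig(env: { [key: string]: string | undefined }): ArangoConfig {
  if (env["DATABASE_TYPE"] !== "arango") {
    throw new Error('DATABASE_TYPE must be "arango" for this sketch');
  }

  const missing: string[] = [];
  // Read one variable, recording its name if it is unset or empty.
  const read = (name: string): string => {
    const value = env[name];
    if (value === undefined || value === "") {
      missing.push(name);
      return "";
    }
    return value;
  };

  const config: ArangoConfig = {
    url: read("ARANGO_URL"),
    database: read("ARANGO_DB"),
    collection: read("ARANGO_COLLECTION"),
    user: read("ARANGO_USER"),
    password: read("ARANGO_PASSWORD"),
  };

  if (missing.length > 0) {
    throw new Error("Missing environment variables: " + missing.join(", "));
  }
  return config;
}
```

In a Node.js process the function would typically be called as `loadArangoConfig(process.env)` once the .env file has been loaded into the environment.
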
Install packages:
sudo apt-get install tor privoxy
Add to /etc/tor/torrc:
ControlPort 9051
CookieAuthentication 0
(If needed) allow cookie access:
sudo chmod +r /run/tor/control.authcookie
Add to /etc/privoxy/config:
forward-socks5 / 127.0.0.1:9050 .
Start services:
sudo service tor start
sudo service privoxy start
For each collection, Fetchlee creates:
- jsons/: extracted metadata
- htmls/: saved HTML pages
- remaining_links.txt: copied seed file
- Respect website terms of service and robots policies.
- Add sensible delays and scope your crawl patterns responsibly.