Fetchlee

Fetchlee is a configurable web crawler built with TypeScript and Puppeteer. It is focused on link discovery + metadata extraction pipelines driven by JSON task files.

Features

Headless/non-headless crawling with Puppeteer
Persistent URL frontier based on SQLite (queued/processing/visited/failed)
Configurable crawl rules and metadata extraction rules
Page interaction engine (click, type, scroll, waitFor, etc.)
Optional metadata persistence to ArangoDB
Optional Tor-based crawling mode

Project Status

This project is under active development. Task format and crawler behavior are still evolving.

Requirements

Node.js 18+
npm
(Optional) Tor + Privoxy for --use_tor
(Optional) ArangoDB for --use_database

Installation

npm install

Running In Development

Use ts-node directly:

npx ts-node main.ts crawl \
  -c my_collection \
  -o ./output \
  -e ./tasks/sample_task.json \
  -l ./your_links_file.txt \
  --headless

Building And Running Production Build

Build TypeScript:

npm run build

Run compiled JavaScript:

node dist/main.js crawl \
  -c my_collection \
  -o ./output \
  -e ./tasks/sample_task.json \
  -l ./your_links_file.txt \
  --headless

You can also use:

npm run start -- crawl -c my_collection -o ./output -e ./tasks/sample_task.json -l ./your_links_file.txt --headless

CLI

Global options:

-c, --coll_name <string>: collection name (default: default_host_name)
-o, --output <path>: output directory
-e, --task <path>: task JSON path
-l, --links <path>: seed links file path
-h, --HELP: extended help

crawl command options:

--headless: run browser in headless mode
-t, --use_tor: route crawling through Tor (requires Tor/Privoxy setup)
-d, --delay <number>: delay between processed URLs in ms
--use_database: save extracted metadata to ArangoDB
--frontier_state <path>: path to SQLite frontier DB
--clear_history: clear URLs history for selected collection in frontier
--browser_config <path>: path to browser JSON config

parsing command currently exists as a placeholder and is not fully implemented.

Task Configuration

A task file defines:

crawl_rules: URL pattern matching and next-link extraction rules
metadata_extraction: extraction rules for target pages
links_transformation (optional): URL transformation rules
interactions (optional): pre/post interaction steps

Example task files:

tasks/sample_task.json
tasks/emerald_test_task.json

Seed Links File

A plain text file with one URL per line.

Example:

https://www.emerald.com/journals/pages/journals_a-z

ArangoDB Configuration

Create .env in project root:

DATABASE_TYPE=arango

ARANGO_URL=http://localhost:8529
ARANGO_DB=crawler_db
ARANGO_COLLECTION=crawled_data
ARANGO_USER=root
ARANGO_PASSWORD=your_password

Then run crawler with --use_database.

Tor And Privoxy Setup (Optional)

Install packages:

sudo apt-get install tor privoxy

Tor config

Add to /etc/tor/torrc:

ControlPort 9051
CookieAuthentication 0

(If needed) allow cookie access:

sudo chmod +r /run/tor/control.authcookie

Privoxy config

Add to /etc/privoxy/config:

forward-socks5 / 127.0.0.1:9050 .

Start services:

sudo service tor start
sudo service privoxy start

Output

For each collection, Fetchlee creates:

jsons/: extracted metadata
htmls/: saved HTML pages
remaining_links.txt: copied seed file

Notes

Respect website terms of service and robots policies.
Add sensible delays and scope your crawl patterns responsibly.

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
src		src
tasks		tasks
.gitignore		.gitignore
README.md		README.md
config.ts		config.ts
main.ts		main.ts
package.json		package.json
tsconfig.json		tsconfig.json
your_links_file.txt		your_links_file.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fetchlee

Features

Project Status

Requirements

Installation

Running In Development

Building And Running Production Build

CLI

Task Configuration

Seed Links File

ArangoDB Configuration

Tor And Privoxy Setup (Optional)

Tor config

Privoxy config

Output

Notes

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Fetchlee

Features

Project Status

Requirements

Installation

Running In Development

Building And Running Production Build

CLI

Task Configuration

Seed Links File

ArangoDB Configuration

Tor And Privoxy Setup (Optional)

Tor config

Privoxy config

Output

Notes

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages