crawley

A pythonic crawling / scraping framework for Python 3, built on `asyncio` + `httpx`.

crawley lets you crawl websites and extract structured data with a tiny, declarative API. This is the modernized release: the legacy eventlet / elixir stack has been replaced by asyncio, httpx and SQLAlchemy 2.x.

📖 Documentation: https://jmg.github.io/crawley/ — or run mkdocs serve locally (see Development).

Features

High-speed asynchronous crawler powered by asyncio + httpx.
A modern, ergonomic scraping API (fetch, Document, CSS/XPath, extract).
Extract data with your favourite tool: XPath, CSS selectors or PyQuery (a jQuery-like API).
Politeness built in: robots.txt, per-host rate limiting and retries with exponential backoff.
Persist to relational databases (SQLite, PostgreSQL, MySQL, Oracle) via SQLAlchemy 2.x, to MongoDB / CouchDB, or export to JSON / XML / CSV.
Cookie handling and proxies out of the box.
A small DSL to define scrapers declaratively.
Command line tools (crawley startproject, crawley run, ...).
Optional visual scraping browser (PySide6).

Requirements

Python 3.9+

Install

~$ pip install crawley            # core (httpx, lxml, pyquery, cssselect)
~$ pip install "crawley[sql]"     # + SQLAlchemy for relational storage
~$ pip install "crawley[mongo]"   # + pymongo
~$ pip install "crawley[gui]"     # + PySide6 visual browser
~$ pip install "crawley[dev]"     # tests + linters

From a checkout:

~$ pip install -e ".[dev]"

Quick start (as a library)

import asyncio
from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor


class QuotesScraper(BaseScraper):
    # only pages matching these patterns are scraped ("%" is a wildcard)
    matching_urls = ["%quotes.toscrape.com%"]

    def scrape(self, response):
        for quote in response.html.xpath("//div[@class='quote']"):
            text = quote.xpath(".//span[@class='text']")[0].text
            author = quote.xpath(".//small[@class='author']")[0].text
            print(author, "->", text)


class QuotesCrawler(BaseCrawler):
    start_urls = ["https://quotes.toscrape.com/"]
    scrapers = [QuotesScraper]
    max_depth = 2
    extractor = XPathExtractor      # or CSSExtractor / PyQueryExtractor


# Synchronous entry point:
QuotesCrawler().run()

# ...or await it from your own event loop:
# asyncio.run(QuotesCrawler().start())
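
The `matching_urls` patterns above use `%` as a wildcard. One way to picture that matching is to translate each pattern into a regular expression. This is a minimal sketch of the idea, not crawley's actual implementation; the helper name `url_matches` is hypothetical, and it assumes `%` may stand for any run of characters, including none.

```python
import re


def url_matches(pattern: str, url: str) -> bool:
    """Return True if `url` matches `pattern`, where "%" matches any run of characters.

    Hypothetical helper for illustration; not part of crawley's API.
    """
    # Escape every literal chunk, then join the chunks with ".*" (one per "%").
    regex = ".*".join(re.escape(chunk) for chunk in pattern.split("%"))
    return re.fullmatch(regex, url) is not None


print(url_matches("%quotes.toscrape.com%", "https://quotes.toscrape.com/page/2/"))
print(url_matches("%quotes.toscrape.com%", "https://example.com/"))
print(url_matches("%", "https://pypi.org/"))
```

Under this reading, a bare `"%"` pattern matches every URL, which is how the `pypiScraper` example further down uses it.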

Need a one-off request?

from crawley.toolbox import request

response = request("https://example.com")
print(response.status_code, response.html.xpath("//title")[0].text)

Modern scraping API (`crawley.scraping`)

For "just scrape this page" use cases there's a small, ergonomic API (à la parsel / requests-html) built on the same httpx + lxml stack. Selectors accept an optional ::text or ::attr(name) suffix.

from crawley.scraping import fetch

doc = fetch("https://quotes.toscrape.com/")

doc.title                       # -> "Quotes to Scrape"
doc.css_first("h1").text        # first match (an Element)
doc.css("span.text::text")      # list of texts
doc.css("a::attr(href)")        # list of (absolute) hrefs
doc.links()                     # de-duplicated absolute links

# Declarative extraction: a string selector -> one value, [selector] -> a list
doc.extract({
    "quote":  "span.text::text",
    "author": "small.author::text",
    "tags":   ["a.tag::text"],
})
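
To make the selector suffixes and the `extract` shape rule concrete, here is a small standard-library sketch. It is not crawley's code: `split_selector`, `extract` and the `select` callable are hypothetical names, and returning `None` for a field with no match is an assumption rather than documented behaviour.

```python
import re

# Matches a trailing "::text" or "::attr(name)" suffix.
_SUFFIX = re.compile(r"::(?:(text)|attr\(([^)]+)\))\s*$")


def split_selector(selector: str):
    """Split "a::attr(href)" into ("a", "attr", "href").

    Returns (css, kind, arg) where kind is "text", "attr" or None.
    """
    match = _SUFFIX.search(selector)
    if match is None:
        return selector.strip(), None, None
    css = selector[: match.start()].strip()
    if match.group(1):
        return css, "text", None
    return css, "attr", match.group(2).strip()


def extract(spec, select):
    """Apply a {"field": selector} spec; `select(selector)` returns a list of matches.

    A plain string yields the first match (None if there is none, by assumption);
    a one-element list yields every match.
    """
    result = {}
    for field, selector in spec.items():
        if isinstance(selector, list):
            result[field] = select(selector[0])
        else:
            matches = select(selector)
            result[field] = matches[0] if matches else None
    return result


# A fake page: selector -> matches, so the sketch runs without any network access.
fake_page = {
    "span.text::text": ["Quote one", "Quote two"],
    "a.tag::text": ["life", "love"],
}
record = extract(
    {"quote": "span.text::text", "tags": ["a.tag::text"], "author": "small.author::text"},
    lambda selector: fake_page.get(selector, []),
)
print(split_selector("a::attr(href)"))
print(record)
```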

Fetch many pages concurrently, or scrape a URL in one call:

import asyncio
from crawley.scraping import afetch_all, scrape

scrape("https://example.com", {"title": "h1::text"})

docs = asyncio.run(afetch_all(["https://a.com", "https://b.com"]))
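
For readers curious how concurrent fetching of this kind is commonly structured, the sketch below bounds concurrency with an `asyncio.Semaphore` and collects results with `asyncio.gather`. It is an illustration under stated assumptions, not how `afetch_all` is implemented: `fetch_all`, `fake_fetch` and `max_concurrency` are hypothetical names, and the fetcher is a stand-in so the example runs offline.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrency=5):
    """Run `fetch_one(url)` for every URL, at most `max_concurrency` at a time.

    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        # Only `max_concurrency` coroutines can be inside this block at once.
        async with semaphore:
            return await fetch_one(url)

    return await asyncio.gather(*(bounded(url) for url in urls))


async def fake_fetch(url):
    # Stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return f"<html>{url}</html>"


pages = asyncio.run(fetch_all(["https://a.com", "https://b.com"], fake_fetch))
print(pages)
```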

The same shortcuts (response.css, response.css_first, response.extract, response.doc) are available on the crawler's response object inside scrape().

Spiders (callbacks, items, rules, JS)

For full crawls there's a Scrapy-style Spider: yield Requests (or response.follow(...)) to navigate and dicts/Items to emit data, with item pipelines, rule-based crawling and optional JavaScript rendering. See docs/spiders.md.

from crawley.spider import Spider

class BlogSpider(Spider):
    start_urls = ["https://example.com/blog/"]

    def parse(self, response):                       # default callback
        for href in response.css("a.post::attr(href)"):
            yield response.follow(href, callback=self.parse_post)
        nxt = response.css_first("a.next::attr(href)")
        if nxt:
            yield response.follow(nxt)               # follows pagination

    def parse_post(self, response):
        yield {"title": response.css_first("h1").text, "url": response.url}

BlogSpider().run()
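
The callback model can be pictured as a loop that pulls requests off a queue, calls the callback, and sorts whatever it yields into "more requests" or "items". The toy engine below is only a sketch of that control flow over an in-memory site; `Request`, `crawl` and the callback signatures are local stand-ins, not crawley's classes.

```python
from collections import deque


class Request:
    """Local stand-in for a request: a URL plus the callback that handles it."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def crawl(start_url, parse, pages):
    """Drive generator callbacks over `pages`, an in-memory {url: payload} site."""
    queue = deque([Request(start_url, parse)])
    seen = {start_url}
    items = []
    while queue:
        request = queue.popleft()
        for result in request.callback(request.url, pages[request.url]):
            if isinstance(result, Request):
                # Follow each known URL once.
                if result.url not in seen and result.url in pages:
                    seen.add(result.url)
                    queue.append(result)
            else:
                # Anything that is not a Request is treated as scraped data.
                items.append(result)
    return items


SITE = {"/blog/": ["/blog/1", "/blog/2"], "/blog/1": "First post", "/blog/2": "Second post"}


def parse_post(url, title):
    yield {"title": title, "url": url}


def parse(url, links):
    for href in links:
        yield Request(href, parse_post)


items = crawl("/blog/", parse, SITE)
print(items)
```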

Item pipelines: crawley.pipelines.ItemPipeline + DropItem.
Rule-based: CrawlSpider + Rule(LinkExtractor(allow=..., deny=...)).
Sitemaps: SitemapSpider(sitemap_urls=[...]).
JavaScript: render_js = True (install crawley[js] + playwright install).
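
As a rough picture of what an item pipeline with drop semantics does, the sketch below passes each item through a chain of stages and discards any item a stage rejects. The `DropItem` class here is a local stand-in defined for the example, and the `process_item` method name is an assumption; crawley's real `ItemPipeline` and `DropItem` may differ, so check docs/spiders.md for the actual interface.

```python
class DropItem(Exception):
    """Local stand-in: raised by a stage to discard the current item."""


class RequireTitle:
    def process_item(self, item):
        if not item.get("title"):
            raise DropItem("missing title")
        return item


class StripTitle:
    def process_item(self, item):
        # Return a cleaned copy rather than mutating the input.
        return {**item, "title": item["title"].strip()}


def run_pipelines(items, stages):
    """Pass each item through every stage in order; skip items a stage drops."""
    kept = []
    for item in items:
        try:
            for stage in stages:
                item = stage.process_item(item)
        except DropItem:
            continue
        kept.append(item)
    return kept


scraped = [
    {"title": "  Hello  ", "url": "https://example.com/blog/1"},
    {"title": "", "url": "https://example.com/blog/2"},
]
cleaned = run_pipelines(scraped, [RequireTitle(), StripTitle()])
print(cleaned)
```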

Quick start (as a framework / CLI)

1. Start a new project

~$ crawley startproject myproject
~$ cd myproject

2. Write your models (`myproject/models.py`)

from crawley.persistance import Entity, UrlEntity, Field, Unicode

class Package(Entity):
    updated = Field(Unicode(255))
    package = Field(Unicode(255))
    description = Field(Unicode(255))

3. Write your scrapers (`myproject/crawlers.py`)

from crawley.crawlers import BaseCrawler
from crawley.scrapers import BaseScraper
from crawley.extractors import XPathExtractor
from models import *

class pypiScraper(BaseScraper):
    matching_urls = ["%"]

    def scrape(self, response):
        for tr in response.html.xpath("//table/tr"):
            Package(package=tr[1].text, description=tr[2].text)

class pypiCrawler(BaseCrawler):
    start_urls = ["https://pypi.org/"]
    scrapers = [pypiScraper]
    max_depth = 0
    extractor = XPathExtractor

4. Configure `settings.py` and run

~$ crawley run

Other commands: crawley syncdb, crawley migratedb, crawley shell <url>, crawley browser <url>.

Extractors

| Extractor | `response.html` is... | Query with |
|---|---|---|
| `XPathExtractor` | an `lxml` tree | `.xpath(...)` |
| `CSSExtractor` | an `lxml` tree | `.getroot().cssselect(...)` |
| `PyQueryExtractor` | a `PyQuery` object | `pq("div.foo")` |
| `RawExtractor` | the raw html `str` | anything you like |

Politeness

Crawl responsibly with a few class attributes (see docs/politeness.md):

class PoliteCrawler(BaseCrawler):
    start_urls = ["https://example.com/"]
    respect_robots = True             # honour robots.txt (+ Crawl-delay)
    crawl_delay = 1.0                 # >= 1s between requests to the same host
    max_concurrency_per_host = 2      # at most 2 concurrent requests per host
    max_retries = 3                   # retry 429/5xx + network errors...
    retry_backoff = 0.5               # ...with exponential backoff + jitter
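
To show what honouring robots.txt and Crawl-delay involves, here is a standalone sketch using Python's standard `urllib.robotparser`. It parses an inline robots.txt so it runs offline. Whether crawley uses this module is not stated here, and combining the configured delay with the robots.txt delay by taking the larger of the two is an assumption; the user agent `my-crawler` is hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory lines instead of fetching

allowed = parser.can_fetch("my-crawler", "https://example.com/blog/")
blocked = parser.can_fetch("my-crawler", "https://example.com/private/data")
robots_delay = parser.crawl_delay("my-crawler")  # None when no Crawl-delay is set

# Assumption: wait for whichever delay is longer.
configured_delay = 1.0
effective_delay = max(configured_delay, robots_delay or 0.0)

print(allowed, blocked, effective_delay)
```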

Retries honour the Retry-After header, and on_robots_blocked(url) lets you react to disallowed URLs.
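
The retry timing can be sketched as follows. This is one plausible reading of "exponential backoff + jitter" with a Retry-After override, not crawley's exact formula: `retry_delay`, `max_delay` and the jitter range are assumptions, and only the delay-in-seconds form of Retry-After is handled.

```python
import random


def retry_delay(attempt, backoff=0.5, retry_after=None, max_delay=60.0):
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None:
        try:
            # The server's hint wins when it is a plain number of seconds.
            return min(float(retry_after), max_delay)
        except ValueError:
            pass  # the HTTP-date form is not handled in this sketch
    base = backoff * (2 ** (attempt - 1))  # 0.5, 1.0, 2.0, ...
    jitter = random.uniform(0, base)       # spread out simultaneous retries
    return min(base + jitter, max_delay)


for attempt in (1, 2, 3):
    print(attempt, round(retry_delay(attempt), 2))
print(retry_delay(1, retry_after="3"))
```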

Development

~$ pip install -e ".[dev]"
~$ pytest          # run the (hermetic) test suite
~$ ruff check crawley
~$ pip install -e ".[docs]" && mkdocs serve   # docs preview

The test suite spins up a local HTTP server, so it never hits the network.
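
A hermetic test of this kind can be built with the standard library alone. The sketch below is not taken from crawley's test suite; `FixtureHandler` and `fetch_from_local_server` are hypothetical names. It serves one fixed page on a free localhost port, fetches it, and shuts the server down.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class FixtureHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><head><title>Fixture</title></head></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep test output quiet


def fetch_from_local_server():
    server = HTTPServer(("127.0.0.1", 0), FixtureHandler)  # port 0 = pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = f"http://127.0.0.1:{server.server_port}/"
        opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
        with opener.open(url, timeout=5) as response:
            return response.status, response.read().decode("utf-8")
    finally:
        server.shutdown()
        server.server_close()


status, html = fetch_from_local_server()
print(status, html)
```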

Examples

Runnable, documented scripts live in examples/:

| File | Shows |
|---|---|
| `01_scraping_quickstart.py` | The scraping API: `fetch`, CSS/XPath, `extract`. |
| `02_crawler.py` | A crawler that follows pagination. |
| `03_polite_crawler.py` | `robots.txt`, rate limiting and retries. |
| `04_persistence_json.py` | Persisting scraped data to JSON. |
| `05_concurrent_fetch.py` | Concurrent fetching with `afetch_all`. |

~$ python examples/01_scraping_quickstart.py

Every example is exercised by the test suite against a local server, so they stay in sync with the code.

License

GPL v3
