Scheduled purge of "hasn't been crawled in N days" #168

@seanstory

Problem Description

You can schedule partial crawls and run them with purge_crawl_enabled: false. This is helpful if your site has two parts: one that's small and changes frequently (frequent schedule, restrictive crawl rules) and one that's more static (infrequent schedule, broader crawl rules). If the two crawls do not overlap, you'd want to disable the purge phase for both, so that crawling one set doesn't delete the other. But that leaves you with stale pages in the index that never get cleaned out.

Rather than running a full crawl with a purge phase, we could add the ability to schedule a purge-only crawl: effectively just a cron that runs a delete_by_query matching docs where last_crawled_at is older than <date-math>.
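
To make the idea concrete, here is a minimal sketch of the request body such a purge task might send to Elasticsearch's `_delete_by_query` endpoint. The function name `build_purge_query` and the `max_age_days` parameter are hypothetical and only for illustration; the one thing taken from this issue is the `last_crawled_at` field. It is not how the crawler currently works.

```python
import json


def build_purge_query(max_age_days: int, field: str = "last_crawled_at") -> dict:
    """Build a delete_by_query body matching docs not crawled in `max_age_days` days.

    Hypothetical helper: the name and parameters are assumptions, not crawler API.
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be a positive number of days")
    # A range query with Elasticsearch date math: "now-30d" means 30 days before now.
    return {"query": {"range": {field: {"lt": f"now-{max_age_days}d"}}}}


if __name__ == "__main__":
    # Print the body that would be sent as POST /<index>/_delete_by_query
    print(json.dumps(build_purge_query(30), indent=2))
```

Keeping the cutoff as date math (rather than a precomputed timestamp) means the same scheduled task definition stays valid on every run, which fits the cron-style scheduling described above.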

Proposed Solution

Add a way to do bin/crawl schedule <purge-task>
