Problem Description
You can schedule partial crawls and run them with purge_crawl_enabled: false. This is helpful when a site has two parts: one that's small and changes frequently (frequent schedule, restrictive crawl rules) and one that's more static (infrequent schedule, broader crawl rules). If the two crawls do not overlap, you'd want to disable the purge crawl for both, so that one crawl doesn't delete the documents indexed by the other. But that leaves stale pages in the index with nothing to clean them out.
Rather than running a full crawl with a purge phase, we could add the ability to schedule a purge-only crawl: effectively a cron job that runs a delete_by_query matching docs whose last_crawled_at is older than <date-math>.
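To make the idea concrete, here is a rough sketch of what such a purge task might send to Elasticsearch. It is not the crawler's implementation: the function names, the example index name, and the 30d cutoff are all made up for illustration, and authentication is left out. It only assumes the standard _delete_by_query endpoint with a range query on last_crawled_at.

```python
import json
import urllib.request


def build_purge_query(max_age: str, field: str = "last_crawled_at") -> dict:
    """Build a delete_by_query body matching docs last crawled before now - max_age.

    max_age is an Elasticsearch date-math offset such as "30d" or "12h".
    (Hypothetical helper, not part of the crawler.)
    """
    return {"query": {"range": {field: {"lt": f"now-{max_age}"}}}}


def purge_stale_docs(es_url: str, index: str, max_age: str) -> dict:
    """POST the purge query to <es_url>/<index>/_delete_by_query.

    Assumes an unauthenticated cluster; a real task would add credentials.
    """
    body = json.dumps(build_purge_query(max_age)).encode("utf-8")
    request = urllib.request.Request(
        f"{es_url}/{index}/_delete_by_query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    # Print the request body only; nothing is sent to a cluster here.
    print(json.dumps(build_purge_query("30d"), indent=2))
```

A scheduled purge task would then amount to calling something like purge_stale_docs("http://localhost:9200", "my-crawler-index", "30d") on a cron, with the cutoff chosen to be longer than the slowest crawl schedule so that pages still being crawled are never deleted.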
Proposed Solution
Add a way to run bin/crawl schedule <purge-task>.