Skip to content

fix: Add is_finished and correct is_empty semantics in RequestQueueClient#1982

Open
Mantisus wants to merge 7 commits into
apify:masterfrom
Mantisus:queue-client-is-finished
Open

fix: Add is_finished and correct is_empty semantics in RequestQueueClient#1982
Mantisus wants to merge 7 commits into
apify:masterfrom
Mantisus:queue-client-is-finished

Conversation

@Mantisus

Copy link
Copy Markdown
Collaborator

Description

  • Adds an is_finished method to the RequestQueueClient interface. It's not a breaking change. The base class defaults is_finished to is_empty when it isn't overridden. Splits the queue predicates. is_empty means there are no requests available to fetch. is_finished means all requests are fully processed (nothing to fetch and nothing in progress).
  • Implements is_finished in all built-in RequestQueueClient.
  • Updates is_empty in all built-in RequestQueueClient to match the new meaning.
  • Affects the AutoscaledPool. It now skips scheduling new tasks when no requests are available to fetch. For example, while the only request is being processed.

Issues

Testing

  • Added a test for RequestQueue that checks the is_empty and is_finished values through the queue lifecycle. It covers all built-in clients.
  • Added a regression test that checks the AutoscaledPool doesn't schedule new tasks while the only request is being processed.

@Mantisus Mantisus requested review from janbuchar and vdusek June 21, 2026 21:23
@Mantisus Mantisus self-assigned this Jun 21, 2026

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sound, the only thing I'm mildly concerned about is the removal of the is_empty return value caching — I'd feel better if I knew some historical context and why it's safe to do this.

Comment thread src/crawlee/storage_clients/_file_system/_request_queue_client.py

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Could you please implement apify/apify-sdk-python#987 as well?

That way, we can verify that it works on the platform too.

I'd like this to be part of SDK v4, built on top of Crawlee v1.8 with this included.

@Mantisus

Mantisus commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator Author

Thanks! Could you please implement apify/apify-sdk-python#987 as well?

Sure.

@janbuchar janbuchar left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks about right to me. @vdusek please review it too 🙂

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments

Comment thread src/crawlee/storage_clients/_base/_request_queue_client.py
Comment thread src/crawlee/storage_clients/_redis/_request_queue_client.py Outdated
Comment thread src/crawlee/storage_clients/_file_system/_request_queue_client.py Outdated
Comment thread src/crawlee/storage_clients/_memory/_request_queue_client.py Outdated
Comment thread tests/unit/crawlers/_basic/test_basic_crawler.py Outdated
Mantisus and others added 3 commits June 23, 2026 14:35
Co-authored-by: Vlada Dusek <v.dusek96@gmail.com>
@Mantisus Mantisus requested a review from vdusek June 23, 2026 12:28

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more comments (three nits and one thing that might have a performance impact)

@override
async def is_finished(self) -> bool:
# If the queue is not empty, it is not finished
if not await self.is_empty():

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_finished calls await self.is_empty() here, which opens a DB session, and then opens a second session below for get_metadata() and the remaining queries. Since the autoscaled pool polls is_finished/is_empty while scheduling requests, this could have a performance impact.

Maybe we could handle the empty check directly inside is_finished, using only a single session?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting a session from the pool is quite cheap, but it can actually impact performance if the pool is small. However, this will only become apparent when the queue is empty but not yet finished.

But merging is_empty and get_metadata into a single session in is_finished will make the method harder to read.

If we need to optimize these methods, I would consider removing the _add_buffer_record calls in is_empty and is_finished, just to update accessed_at in the metadata. This will have a greater impact on performance.

Comment thread src/crawlee/storage_clients/_redis/_request_queue_client.py Outdated
Comment thread tests/unit/crawlers/_basic/test_basic_crawler.py
Comment thread tests/unit/crawlers/_basic/test_basic_crawler.py

@vdusek vdusek left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably use a fix prefix for the PR title. It does add a new method, true, but only because its absence is the root cause of the bug. Something like this:

fix: add is_finished and correct is_empty semantics in RequestQueueClient

would be better in my opinion.

@Mantisus Mantisus changed the title feat: add is_finished method to RequestQueueClient interface fix: add is_finished method to RequestQueueClient interface Jun 25, 2026
@Mantisus Mantisus changed the title fix: add is_finished method to RequestQueueClient interface fix: Add is_finished method to RequestQueueClient interface Jun 25, 2026
@Mantisus Mantisus changed the title fix: Add is_finished method to RequestQueueClient interface fix: Add is_finished and correct is_empty semantics in RequestQueueClient Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add an is_finished method to the StorageClient interface Crawler can get softlocked when consuming the RequestQueue

4 participants