Skip to content

fix(agent): stop idempotency waiters from blocking thread-pool workers#2932

Merged
zastrowm merged 3 commits into
strands-agents:mainfrom
BV-Venky:fix/idempotency-waiter-deadlock
Jun 24, 2026
Merged

fix(agent): stop idempotency waiters from blocking thread-pool workers#2932
zastrowm merged 3 commits into
strands-agents:mainfrom
BV-Venky:fix/idempotency-waiter-deadlock

Conversation

@BV-Venky

@BV-Venky BV-Venky commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Description

Duplicate agent invocations that share an idempotency_token (THROW mode) wait for the in-flight primary to finish and then receive its result. That wait was implemented as asyncio.to_thread(event.wait), which parks a worker from the event loop's default thread pool for the entire duration of the primary - one parked worker per waiting duplicate, doing no work.

This makes a retry storm dangerous when enough number of retries get triggered

  • Thread/resource pressure - every waiting duplicate ties up a thread until the primary completes.
  • Deadlock - when duplicates fill the default pool (min(32, cpu+4) workers), the primary can no longer obtain a worker for its own asyncio.to_thread model call (e.g. BedrockModel). The primary then never completes, so it never wakes the waiters, and the event loop wedges permanently with no timeout. On an 8-core host this triggers at ~12 concurrent same-token calls on a shared loop.

The fix removes the parked thread entirely: a waiting duplicate now registers an asyncio.Future and awaits it. On completion the primary resolves each waiter's future on its own loop via call_soon_threadsafe. No waiter holds a thread, so duplicates can no longer starve the executor the primary depends on.

Related Issues

Documentation PR

Not required — no public API or documented behavior change.

Type of Change

Bug fix

Public API Changes

None. From the caller's perspective behavior is unchanged: a duplicate still blocks until the primary finishes and then yields the same result (and still fires callback_handler with it). The change is purely how the wait is implemented internally.

Testing

Existing idempotency unit and integration tests pass unchanged; added async unit tests covering cross-thread future resolution, the register-after-settle race, and waiter cancellation. Separately verified in a resource-capped container that previously-hanging configurations now complete and that waiting duplicates park zero threads.

  • I ran hatch run prepare

Checklist

  • I have read the CONTRIBUTING document
  • I have reviewed and understand every line of code in this PR
  • My change is focused and reasonably small; I have split unrelated work into separate PRs
  • I have added any necessary tests that prove my fix is effective
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@github-actions github-actions Bot added size/m bug Something isn't working area-async Related to asynchronous flows or multi-threading area-agent Related to the agent class or general agent questions python Pull requests that update python code labels Jun 24, 2026
@BV-Venky

Copy link
Copy Markdown
Contributor Author

CC: @zastrowm

Comment thread strands-py/src/strands/agent/_concurrency.py Outdated
Comment thread strands-py/src/strands/agent/_concurrency.py Outdated
Comment thread strands-py/src/strands/agent/_concurrency.py Outdated
@codecov

codecov Bot commented Jun 24, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@BV-Venky

Copy link
Copy Markdown
Contributor Author

@zastrowm . Have refactored basis your suggestions.
Please Re-review and let me know if anything else needs to be changed

zastrowm
zastrowm previously approved these changes Jun 24, 2026
Comment thread strands-py/src/strands/agent/_concurrency.py
@github-actions

Copy link
Copy Markdown
Contributor

Assessment: Approve

Clean, well-scoped fix. Replacing asyncio.to_thread(event.wait) with a concurrent.futures.Future bridged per-loop via asyncio.wrap_future correctly removes the parked-worker-per-duplicate behavior that could starve the executor the primary depends on. The cross-loop and cancellation semantics are sound and the rationale is well documented in the docstring.

Review notes
  • Design: The shield(wrap_future(...)) pattern is the right tool here — each waiter binds to its own loop, and shielding prevents one cancelled duplicate from stranding the rest. I verified the closed-loop edge case (a waiter whose loop closes before the primary settles) is already guarded by CPython's _chain_future (if dest_loop.is_closed(): return), so there's no log noise or error path to worry about.
  • Testing: Good coverage of the new behavior — cross-thread resolution, register-after-settle, and cancel-one-doesn't-strand-others. One gap noted inline: settle()'s idempotency guards are unreachable via the controller and untested (the Codecov miss).
  • API/standards: All symbols are private (_-prefixed); no public API change. Lint clean, naming/logging/docstrings follow repo conventions.

Nicely done — the explanatory docstring on _InflightInvocation makes the concurrency model easy to follow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-agent Related to the agent class or general agent questions area-async Related to asynchronous flows or multi-threading bug Something isn't working python Pull requests that update python code size/m

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants