fix(discogs): in-flight dedup in @async_cached (LML#544 / follow-up to #537) #545
Merged
…ght Future (LML#544)
The wrapper's check-then-set straddled an await of the underlying network call: concurrent L1 misses for the same key all entered the fallthrough seam independently, multiplying API and semaphore pressure. A per-cache dict[str, asyncio.Future] now coalesces followers onto the leader's fetch. Exception, None, skip_cache, and evict_cached paths preserve existing semantics; followers receive the leader's value (with cached=True applied when the result shape supports it) or re-raise its exception. Follow-up to LML#537's L1 race finding.
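A minimal sketch of the coalescing idea described in this commit, not the code from discogs/memory_cache.py. The decorator name `coalesced`, the single-string key, and the plain dict standing in for L1 are assumptions made for illustration. Cancellation handling, skip_cache, evict_cached, and the cached=True marking are deliberately left out.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def coalesced(fn: Callable[[str], Awaitable[Any]]) -> Callable[[str], Awaitable[Any]]:
    """Hypothetical decorator: at most one fetch per key per in-flight window."""
    cache: dict[str, Any] = {}                    # stand-in for the L1 cache
    inflight: dict[str, asyncio.Future] = {}      # key -> the leader's pending future

    @functools.wraps(fn)
    async def wrapper(key: str) -> Any:
        if key in cache:
            return cache[key]
        existing = inflight.get(key)
        if existing is not None:
            # Follower: wait for the leader's outcome instead of fetching again.
            return await existing
        # Leader: register the future BEFORE the first await, so later callers
        # for the same key see it and join rather than starting their own fetch.
        fut: asyncio.Future = asyncio.get_running_loop().create_future()
        inflight[key] = fut
        try:
            value = await fn(key)
        except Exception as exc:
            if not fut.done():
                fut.set_exception(exc)
                fut.exception()  # mark as retrieved in case there are no followers
            raise
        else:
            if not fut.done():
                fut.set_result(value)
            cache[key] = value
            return value
        finally:
            inflight.pop(key, None)
            if not fut.done():
                fut.cancel()  # leader was cancelled; do not leave followers waiting

    return wrapper


async def demo() -> tuple[int, list]:
    calls = 0

    @coalesced
    async def fetch_release(key: str) -> dict:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulated network call
        return {"id": key}

    results = await asyncio.gather(*(fetch_release("r1") for _ in range(5)))
    return calls, results
```

In `demo`, five concurrent callers for the same key trigger a single call to the wrapped function. Without the `inflight` map, each caller would miss the cache and fetch independently.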
…ion, cache cleanup, telemetry, test hardening
Fixes a cross-request cancellation cascade: the leader's CancelledError was being broadcast via future.set_exception to followers in unrelated request contexts (e.g., one client's timeout poisoning concurrent requests on the same key). Now cancellation cancels the future; followers detect it and either retry (if their own task wasn't cancelled) or propagate.
Also:
- drops _lml_inflight in clear_all_caches so orphan futures don't survive a clear
- reorders set_result before the cache write so L1 write failures don't poison followers
- narrows BaseException to Exception (with a separate CancelledError arm) so SystemExit/KeyboardInterrupt/GeneratorExit propagate cleanly
- guards set_result/set_exception with future.done()
- records a memory_cache_inflight_join event so dedup firings are observable
- tightens the cached_count assertion to == n-1
- switches sleep-based test timing to asyncio.Event
- asserts exception-instance identity on raise broadcast
- adds tests for leader cancellation and clear_all_caches inflight reset
Plan doc rewritten to reflect the corrected cancellation semantics.
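The leader-side rules in this commit (cancel the future on CancelledError, a separate Exception arm, done() guards, set_result before the cache write) can be sketched roughly as follows. The function `lead_fetch`, its parameters, and the dict-based cache are hypothetical names for illustration. The real wrapper lives inside @async_cached and may differ in detail.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def lead_fetch(
    key: str,
    fetch: Callable[[str], Awaitable[Any]],
    cache: dict[str, Any],
    inflight: dict[str, asyncio.Future],
) -> Any:
    """Hypothetical leader path: run the fetch and settle the shared future."""
    fut: asyncio.Future = asyncio.get_running_loop().create_future()
    inflight[key] = fut
    try:
        value = await fetch(key)
    except asyncio.CancelledError:
        # Cancel the future instead of broadcasting the leader's CancelledError
        # through set_exception, so followers can tell "leader was cancelled"
        # apart from a genuine fetch failure.
        if not fut.done():
            fut.cancel()
        raise
    except Exception as exc:
        # Only ordinary errors are broadcast; SystemExit, KeyboardInterrupt and
        # GeneratorExit are not caught here and propagate untouched.
        if not fut.done():
            fut.set_exception(exc)
            fut.exception()  # mark as retrieved in case nobody is waiting
        raise
    else:
        # Wake followers first, then write the cache, so a failing cache write
        # cannot poison callers that already have a good value coming.
        if not fut.done():
            fut.set_result(value)
        try:
            cache[key] = value
        except Exception:
            pass  # a real implementation would record the write failure
        return value
    finally:
        inflight.pop(key, None)


async def demo_leader_cancel() -> tuple[bool, dict, dict]:
    cache: dict[str, Any] = {}
    inflight: dict[str, asyncio.Future] = {}
    started = asyncio.Event()

    async def slow_fetch(key: str) -> str:
        started.set()
        await asyncio.sleep(10)
        return key

    task = asyncio.create_task(lead_fetch("k", slow_fetch, cache, inflight))
    await started.wait()          # Event-based sync instead of sleep-based timing
    shared = inflight["k"]
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return shared.cancelled(), inflight, cache
```

`demo_leader_cancel` checks the cancellation arm: after the leader is cancelled, the shared future is cancelled rather than carrying an exception, and the in-flight entry is gone.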
…t-shape, write-fail telemetry, plan doc, robust test sync
Converts the follower's cancellation-retry from recursion (return await wrapper(...)) to a while-True loop, eliminating the RecursionError risk under sustained cancellation cascades and making the retry contract local. Adds memory_cache_inflight_join, memory_cache_inflight_retry_after_cancel, and memory_cache_write_failed to init_cache_stats(extra_keys=...) at both lookup-router entrypoints so PostHog/Sentry payload shapes stay stable across requests. Records memory_cache_write_failed in the L1 write-failure except clause so silent cache-layer corruption surfaces in dashboards. Rewrites the evict_cached paragraph in the plan doc to describe what actually happens (leader write-back repopulates L1 after evict). Replaces the single sleep(0) in test_leader_cancellation_does_not_cascade_to_followers with a bounded poll on the leader future's awaiter callbacks for deterministic sync, and adds test_follower_own_cancellation_propagates to pin the cancelling()>0 re-raise branch.
…ates follower cancellations from siblings
Wraps the follower's await in asyncio.shield(existing) so one follower's external task cancellation does not call _fut_waiter.cancel() on the SHARED leader future and cascade CancelledError to every other follower. Without shield, surviving followers retry, find the cancelled-and-still-pinned future in the in-flight map (the leader has not yet popped it), await it (which on a done future does NOT yield to the event loop), receive CancelledError, and `continue` again, producing a synchronous infinite loop that wedges the event loop until process kill. Reproduced in situ: asyncio.wait_for on a second follower never returns. Adds test_cancelling_one_follower_does_not_cascade_to_siblings (two followers, cancel one, assert the other receives the leader's value within 1s wall-clock). Verified: without shield the new test hangs; with shield it passes.
Closes #544. Follow-up to #537.
Summary
- asyncio.Future coalescing inside @async_cached. Concurrent callers for the same key share one fetch; followers receive the leader's value (or re-raise its exception). The wrapped function now runs once per key per in-flight window.
- discogs/memory_cache.py. No call-site changes. New test class TestAsyncCachedInFlightDedup covers concurrent same-key coalescing, distinct-key isolation, exception propagation, None semantics, cached=True on followers, skip_cache boundary, in-flight cleanup, and evict-mid-flight safety.

Design + risk analysis in docs/plans/lml-537-l1-inflight-dedup.md.

Test plan
- … (tests/unit/test_memory_cache.py::TestAsyncCachedInFlightDedup).
- … test_memory_cache.py unit tests pass unchanged.
- … tests/integration/test_memory_cache.py integration tests pass unchanged.
- … (discogs/service.py, discogs/fallthrough.py, discogs/cache_service.py, discogs/router.py, request telemetry).
- ruff check + ruff format --check clean.
- mypy discogs/memory_cache.py clean.
- … lml.discogs.semaphore p95 and the cache hit / miss per-method histogram in Sentry for the next 24h. Expect duplicate-release_id traces (same id within seconds) to drop.

Out of scope
- get_release predicate widening on artwork columns (#537, "investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix", Cause #1): separate ticket.
- … (cache_warm_histogram.py writer-source split, cache_miss_provenance.py at-miss-time anchoring): captured in the #537 thread.