fix(discogs): in-flight dedup in @async_cached (LML#544 / follow-up to #537) by jakebromberg · Pull Request #545 · WXYC/library-metadata-lookup

jakebromberg · 2026-06-13T01:22:53Z

Closes #544. Follow-up to #537.

Summary

Adds per-cache in-flight asyncio.Future coalescing inside @async_cached. Concurrent callers for the same key share one fetch; followers receive the leader's value (or re-raise its exception). The wrapped function now runs once per key per in-flight window.
Closes the non-atomic check-then-set race documented in investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix #537's latest comment as Cause Artwork fallback: use Discogs artist/label images when album cover is missing #2. Production traces show release 28184356 fetched 3× at 13:05:41/44/45 and release 3657075 fetched 3× at 13:05:34/39/40 — both within the L1 TTL window. The race produces those duplicates; this PR coalesces them.
Single-file change to discogs/memory_cache.py. No call-site changes. New test class TestAsyncCachedInFlightDedup covers concurrent same-key coalescing, distinct-key isolation, exception propagation, None semantics, cached=True on followers, skip_cache boundary, in-flight cleanup, and evict-mid-flight safety.

Design + risk analysis in docs/plans/lml-537-l1-inflight-dedup.md.

Test plan

8 new unit tests pass (tests/unit/test_memory_cache.py::TestAsyncCachedInFlightDedup).
All 46 existing test_memory_cache.py unit tests pass unchanged.
All 7 tests/integration/test_memory_cache.py integration tests pass unchanged.
All 286 discogs-adjacent unit tests pass (no regression in discogs/service.py, discogs/fallthrough.py, discogs/cache_service.py, discogs/router.py, request telemetry).
Full unit test suite green (2463 / 2463).
ruff check + ruff format --check clean.
mypy discogs/memory_cache.py clean.
Verified in staging: post-merge, watch lml.discogs.semaphore p95 and the cache hit / miss per-method histogram in Sentry for the next 24h. Expect duplicate-release_id traces (same id within seconds) to drop.

Out of scope

get_release predicate widening on artwork columns (investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix #537 Cause Add comprehensive test suite with 97% coverage #1) — separate ticket.
Bumping Discogs semaphore from 5 (investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix #537 Cause feat: artwork fallback to artist/label images #3) — separate ticket.
Probe methodology fixes (cache_warm_histogram.py writer-source split, cache_miss_provenance.py at-miss-time anchoring) — captured in investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix #537 thread.

…ght Future (LML#544) The wrapper's check-then-set straddled an await of the underlying network call: concurrent L1 misses for the same key all entered the fallthrough seam independently, multiplying API and semaphore pressure. A per-cache dict[str, asyncio.Future] now coalesces followers onto the leader's fetch. Exception, None, skip_cache, and evict_cached paths preserve existing semantics; followers receive the leader's value (with cached=True applied when the result shape supports it) or re-raise its exception. Follow-up to LML#537's L1 race finding.

…ion, cache cleanup, telemetry, test hardening Fixes a cross-request cancellation cascade: the leader's CancelledError was being broadcast via future.set_exception to followers in unrelated request contexts (e.g., one client's timeout poisoning concurrent requests on the same key). Now cancellation cancels the future, followers detect it and either retry (if their own task wasn't cancelled) or propagate. Also drops _lml_inflight in clear_all_caches so orphan futures don't survive a clear, reorders set_result before cache write so L1 write failures don't poison followers, narrows BaseException to Exception (with separate CancelledError arm) so SystemExit/KeyboardInterrupt/GeneratorExit propagate cleanly, guards set_result/set_exception with future.done(), records a memory_cache_inflight_join event so dedup firings are observable, tightens cached_count assertion to ==n-1, switches sleep-based test timing to asyncio.Event, asserts exception-instance identity on raise broadcast, and adds tests for leader cancellation and clear_all_caches inflight reset. Plan doc rewritten to reflect the corrected cancellation semantics.

…t-shape, write-fail telemetry, plan doc, robust test sync Converts the follower's cancellation-retry from recursion (return await wrapper(...)) to a while-True loop, eliminating the RecursionError risk under sustained cancellation cascades and making the retry contract local. Adds memory_cache_inflight_join, memory_cache_inflight_retry_after_cancel, and memory_cache_write_failed to init_cache_stats(extra_keys=...) at both lookup-router entrypoints so PostHog/Sentry payload shapes stay stable across requests. Records memory_cache_write_failed in the L1 write-failure except clause so silent cache-layer corruption surfaces in dashboards. Rewrites the evict_cached paragraph in the plan doc to describe what actually happens (leader write-back repopulates L1 after evict). Replaces the single sleep(0) in test_leader_cancellation_does_not_cascade_to_followers with a bounded poll on the leader future's awaiter callbacks for deterministic sync, and adds test_follower_own_cancellation_propagates to pin the cancelling()>0 re-raise branch.

…ates follower cancellations from siblings Wraps the follower's await in asyncio.shield(existing) so one follower's external task cancellation does not call _fut_waiter.cancel() on the SHARED leader future and cascade CancelledError to every other follower. Without shield, surviving followers retry, find the cancelled-and-still-pinned future in the in-flight map (the leader has not yet popped it), await it — which on a done future does NOT yield to the event loop — receive CancelledError, and `continue` again, producing a synchronous infinite loop that wedges the event loop until process kill. Reproduced in situ: asyncio.wait_for on a second follower never returns. Adds test_cancelling_one_follower_does_not_cascade_to_siblings (two followers, cancel one, assert the other receives the leader's value within 1s wall-clock). Verified: without shield the new test hangs; with shield it passes.

jakebromberg mentioned this pull request Jun 13, 2026

investigate: Discogs cache hit ratio plateaued at ~50% (search_releases_by_track at 34%) after write_release fix #537

Open

jakebromberg added 3 commits June 12, 2026 18:49

jakebromberg merged commit 9815abf into main Jun 13, 2026
11 checks passed

jakebromberg deleted the lml-537-l1-dedup branch June 13, 2026 03:24

jakebromberg mentioned this pull request Jun 13, 2026

fix(memory_cache): close async_cached check-then-set race with per-key in-flight dedup #541

Closed

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(discogs): in-flight dedup in @async_cached (LML#544 / follow-up to #537)#545

fix(discogs): in-flight dedup in @async_cached (LML#544 / follow-up to #537)#545
jakebromberg merged 4 commits into
mainfrom
lml-537-l1-dedup

jakebromberg commented Jun 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jakebromberg commented Jun 13, 2026

Summary

Test plan

Out of scope

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant