fix: version-aware datatree cache (stale-on-append)#122
Open
lhoupert wants to merge 2 commits into
Open
Conversation
Datatree opens were cached keyed only by src_path, in-process via
functools.cache and in Redis, so an append to an existing store was invisible
until TTL and flapped 500/400 across replicas with no working invalidation.
Fold a store-version token into the cache key: HEAD the root zarr.json (zarr v3
consolidated metadata) and key on {src_path}#{etag}, in both the in-process LRU
and Redis. An append rewrites zarr.json -> ETag changes -> new key -> fresh
read; every replica computes the same key; stale keys self-expire. Probe
failure falls back to the bare src_path key (a render never fails because the
probe did). This also defeats the in-process functools.cache pinning.
Performance guards (kept here because the probe runs per render, so without them
the fix would regress prod):
- _get_store is memoized -- no Boto3CredentialProvider/S3Store rebuild per request.
- _store_version is throttled to one HEAD per version_probe_ttl per path,
collapsing the per-tile HEAD storm a single viewport would cause.
- the in-process memo is bounded lru_cache(maxsize=64) instead of unbounded
functools.cache, so version-keying can't leak one datatree per append.
Config: version_probe_ttl added to EOPFCacheSettings; helm now also emits
TITILER_EOPF_CACHE_METADATA_TTL (the datatree TTL the reader reads) from
cache.ttl.datasets, closing the config mismatch the issue flagged.
Split out of #119 per review (cache correctness + performance only; the
datetime sel fix and the perf-invariant test suite ship separately).
Refs #118 (cache half; issue closes when the sel fix lands too).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Contributor
Author
|
FYI @emmanuelmathot |
Guard-clause early-return for the no-redis path instead of nesting the whole body under `if redis configured`. Inlines the single-use pool local and only computes cache_key on the redis path (it was unused when redis is disabled). Pure readability — behavior identical; 103 tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Collaborator
|
TBH I personally thinks that this adds a lot of complexity to the code for really corner cases. With the latest version of titiler-eopf, I believe we could almost remove the REDIS cache and only have a per-replica TTL (set to max 1h). It's unlikely that data change will appear often IMO |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes the cache half of #118 (the datetime
selfix is in a separate PR).Split out of #119 per review — this PR is the irreducible correctness fix plus the performance guards that keep the per-render version probe cheap. The deterministic perf-invariant test suite ships separately (stacked PR) so this one stays focused.
Problem
Datatree opens were cached keyed only by
src_path— in-process viafunctools.cacheand in Redis — so an append to an existing store was invisible until TTL and flapped 500/400 across ~40 replicas with no working invalidation (/cache/invalidateSCAN 504s).Fix
Fold a store-version token into the cache key: HEAD the root
zarr.json(zarr v3 consolidated metadata) and key on{src_path}#{etag}, in both the in-process LRU and Redis. An append rewriteszarr.json→ ETag changes → new key → fresh read; every replica computes the same key; stale keys self-expire. A failed probe falls back to the baresrc_pathkey (a render never fails because the probe did). This also defeats the in-processfunctools.cachepinning — the stickiest stale layer per replica.Performance (kept here on purpose — the probe runs per render)
_get_storeis memoized → noBoto3CredentialProvider/S3Storerebuild per request._store_versionis throttled to one HEAD perversion_probe_ttlper path, collapsing the per-tile HEAD storm a viewport would cause.lru_cache(maxsize=64)instead of unboundedfunctools.cache, so version-keying can't leak one datatree per append.Config
version_probe_ttladded toEOPFCacheSettings.TITILER_EOPF_CACHE_METADATA_TTL(the datatree TTL the reader actually reads) fromcache.ttl.datasets, closing the config mismatch the issue flagged.Tests
src_pathfallback; HEAD-failure → None.uv run pytest→ 103 passed; pre-commit (ruff, ruff-format, isort, mypy) clean.Refs #118 (cache half; close the issue once the sel PR also lands).
🤖 Generated with Claude Code