Skip to content

fix: version-aware datatree cache (stale-on-append)#122

Open
lhoupert wants to merge 2 commits into
mainfrom
fix/issue-118-stale-cache
Open

fix: version-aware datatree cache (stale-on-append)#122
lhoupert wants to merge 2 commits into
mainfrom
fix/issue-118-stale-cache

Conversation

@lhoupert

Copy link
Copy Markdown
Contributor

Closes the cache half of #118 (the datetime sel fix is in a separate PR).

Split out of #119 per review — this PR is the irreducible correctness fix plus the performance guards that keep the per-render version probe cheap. The deterministic perf-invariant test suite ships separately (stacked PR) so this one stays focused.

Problem

Datatree opens were cached keyed only by src_path — in-process via functools.cache and in Redis — so an append to an existing store was invisible until TTL and flapped 500/400 across ~40 replicas with no working invalidation (/cache/invalidate SCAN 504s).

Fix

Fold a store-version token into the cache key: HEAD the root zarr.json (zarr v3 consolidated metadata) and key on {src_path}#{etag}, in both the in-process LRU and Redis. An append rewrites zarr.json → ETag changes → new key → fresh read; every replica computes the same key; stale keys self-expire. A failed probe falls back to the bare src_path key (a render never fails because the probe did). This also defeats the in-process functools.cache pinning — the stickiest stale layer per replica.

Performance (kept here on purpose — the probe runs per render)

  • _get_store is memoized → no Boto3CredentialProvider/S3Store rebuild per request.
  • _store_version is throttled to one HEAD per version_probe_ttl per path, collapsing the per-tile HEAD storm a viewport would cause.
  • the in-process memo is a bounded lru_cache(maxsize=64) instead of unbounded functools.cache, so version-keying can't leak one datatree per append.

Config

  • version_probe_ttl added to EOPFCacheSettings.
  • Helm now also emits TITILER_EOPF_CACHE_METADATA_TTL (the datatree TTL the reader actually reads) from cache.ttl.datasets, closing the config mismatch the issue flagged.

Tests

  • Append → new key + reread; probe-failure → src_path fallback; HEAD-failure → None.
  • uv run pytest → 103 passed; pre-commit (ruff, ruff-format, isort, mypy) clean.

Refs #118 (cache half; close the issue once the sel PR also lands).

🤖 Generated with Claude Code

Datatree opens were cached keyed only by src_path, in-process via
functools.cache and in Redis, so an append to an existing store was invisible
until TTL and flapped 500/400 across replicas with no working invalidation.

Fold a store-version token into the cache key: HEAD the root zarr.json (zarr v3
consolidated metadata) and key on {src_path}#{etag}, in both the in-process LRU
and Redis. An append rewrites zarr.json -> ETag changes -> new key -> fresh
read; every replica computes the same key; stale keys self-expire. Probe
failure falls back to the bare src_path key (a render never fails because the
probe did). This also defeats the in-process functools.cache pinning.

Performance guards (kept here because the probe runs per render, so without them
the fix would regress prod):
- _get_store is memoized -- no Boto3CredentialProvider/S3Store rebuild per request.
- _store_version is throttled to one HEAD per version_probe_ttl per path,
  collapsing the per-tile HEAD storm a single viewport would cause.
- the in-process memo is bounded lru_cache(maxsize=64) instead of unbounded
  functools.cache, so version-keying can't leak one datatree per append.

Config: version_probe_ttl added to EOPFCacheSettings; helm now also emits
TITILER_EOPF_CACHE_METADATA_TTL (the datatree TTL the reader reads) from
cache.ttl.datasets, closing the config mismatch the issue flagged.

Split out of #119 per review (cache correctness + performance only; the
datetime sel fix and the perf-invariant test suite ship separately).

Refs #118 (cache half; issue closes when the sel fix lands too).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@lhoupert

Copy link
Copy Markdown
Contributor Author

FYI @emmanuelmathot

Guard-clause early-return for the no-redis path instead of nesting the whole
body under `if redis configured`. Inlines the single-use pool local and only
computes cache_key on the redis path (it was unused when redis is disabled).
Pure readability — behavior identical; 103 tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vincentsarago

Copy link
Copy Markdown
Collaborator

TBH I personally thinks that this adds a lot of complexity to the code for really corner cases. With the latest version of titiler-eopf, I believe we could almost remove the REDIS cache and only have a per-replica TTL (set to max 1h).

It's unlikely that data change will appear often IMO

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants