fix(cluster-labels): content-based vocab cache invalidation (survive reinstall) #323
Open
lstein wants to merge 1 commit into
…reinstall)

The vocab embedding cache and per-album labels cache were invalidated by comparing file mtimes against the source vocab files. A `pip` reinstall rewrites the bundled `cluster_vocab.txt` with a fresh mtime even when its contents are byte-for-byte identical, so the next startup discarded a valid cache and forced a multi-second CLIP re-encode of the whole vocabulary.

Switch both caches to purely content-based invalidation:

- `_read_cached_vocab`: drop the mtime gate. The exact phrase set is already stamped in the `.npz` and compared, which catches every real change (edits, additions, user-extras deletion/rename) regardless of mtime direction.
- `_read_cached_labels`: drop the vocab-mtime gate; rely on the stored `vocab_fingerprint` content hash already compared below. The umap.npz and source-embeddings mtime checks stay (those track album-local regeneration).

The per-image in-process LRU keeps its mtime key on purpose: it is process-scoped, so a reinstall never reuses it anyway.

Tests: repurpose the two "mtime bump -> rebuild" tests to assert "content edit -> rebuild", and add regression tests proving a pure `touch` with identical content does NOT rebuild either cache.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem
The encoded tagging vocabulary was being rebuilt from scratch on first startup after a reinstall, even though the cached embeddings were still on disk and valid.

Root cause: both the vocab-embedding cache (`_read_cached_vocab`) and the per-album labels cache (`_read_cached_labels`) were invalidated by comparing the cache file's mtime against the source vocab files. A `pip` reinstall rewrites the bundled `cluster_vocab.txt` with a fresh mtime even when its contents are byte-for-byte identical, so the mtime gate tripped and forced a multi-second CLIP re-encode of the entire vocabulary, despite the content being unchanged.

(The build was triggered legitimately, by a surviving per-device `autotaggingEnabled` preference or a second browser tab; that part is working as designed. This PR fixes the wasteful rebuild.)

Change
Switch both caches to purely content-based invalidation:

- `_read_cached_vocab`: drop the mtime gate. The exact phrase set is already stamped into the `.npz` and compared (`set(phrases) != set(current_phrases)`), which catches every real change (edits, additions, user-extras deletion/rename) regardless of mtime direction.
- `_read_cached_labels`: drop the vocab-mtime gate; rely on the stored `vocab_fingerprint` (sha256 content hash) already compared just below. The `umap.npz` and source-embeddings mtime checks stay, since those track album-local regeneration, not reinstalls.

The per-image in-process LRU (`compute_image_label`) intentionally keeps its mtime key: it's process-scoped, so a reinstall never reuses it anyway, and mtime is cheaper than hashing the vocabulary on that hot path.

Nothing here deletes the cache when autotagging is turned off: once encoded, the vocabulary stays cached across reinstalls for later use.
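For reviewers who want the shape of the two content checks without opening the diff, here is a minimal sketch. It is not the code in this PR: `fingerprint_vocab`, `vocab_cache_is_valid`, and `labels_cache_is_valid` are hypothetical names, the order-independent hashing scheme is an assumption, and the real `_read_cached_vocab` reads the stamped phrases from the `.npz` rather than taking them as arguments.

```python
import hashlib
from typing import Iterable


def fingerprint_vocab(phrases: Iterable[str]) -> str:
    """Hypothetical helper: sha256 hex digest of the phrase set.

    Sorting and de-duplicating first makes the digest independent of
    phrase order (an assumption, not necessarily what the project does).
    """
    digest = hashlib.sha256()
    for phrase in sorted(set(phrases)):
        digest.update(phrase.encode("utf-8"))
        digest.update(b"\n")  # separator so "ab" + "c" differs from "a" + "bc"
    return digest.hexdigest()


def vocab_cache_is_valid(cached_phrases: Iterable[str],
                         current_phrases: Iterable[str]) -> bool:
    """Vocab-embedding cache check: compare phrase sets, ignore mtimes."""
    return set(cached_phrases) == set(current_phrases)


def labels_cache_is_valid(stored_fingerprint: str,
                          current_phrases: Iterable[str]) -> bool:
    """Labels cache check: compare the stored content hash, ignore mtimes."""
    return stored_fingerprint == fingerprint_vocab(current_phrases)
```

Neither check consults a file timestamp, so a reinstall that rewrites identical bytes leaves both caches valid, while any edit, addition, or removal of a phrase invalidates them.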
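Tests
- Repurposed the two "mtime bump -> rebuild" tests to assert "content edit -> rebuild".
- Added regression tests proving that a pure `touch` with identical content does not rebuild either the vocab-embedding cache or the per-album labels cache.

All green locally: 384 backend + 356 frontend tests pass, `make lint` clean.

🤖 Generated with Claude Code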
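The property the regression tests rely on can be reproduced in isolation. The sketch below is an illustration only, not one of the tests in this PR: `file_sha256` and `touch_then_edit` are hypothetical helpers, and the file written here is a throwaway stand-in for the bundled `cluster_vocab.txt`. It bumps the mtime with `os.utime` to mimic a reinstall that rewrites identical bytes, then makes a real content edit.

```python
import hashlib
import os
from pathlib import Path
from typing import Tuple


def file_sha256(path: Path) -> str:
    """Hypothetical helper: sha256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def touch_then_edit(directory: Path) -> Tuple[bool, bool, bool]:
    """Return (mtime_changed_by_touch, hash_changed_by_touch, hash_changed_by_edit)."""
    vocab = directory / "cluster_vocab.txt"  # stand-in file, not the real one
    vocab.write_text("a dog\na cat\n", encoding="utf-8")
    mtime_before = vocab.stat().st_mtime_ns
    hash_before = file_sha256(vocab)

    # Simulate a reinstall: same bytes, mtime moved 10 seconds forward.
    bumped = mtime_before + 10_000_000_000
    os.utime(vocab, ns=(bumped, bumped))
    mtime_changed = vocab.stat().st_mtime_ns != mtime_before
    hash_changed_by_touch = file_sha256(vocab) != hash_before

    # A real content edit must still be detected.
    vocab.write_text("a dog\na cat\na bird\n", encoding="utf-8")
    hash_changed_by_edit = file_sha256(vocab) != hash_before

    return mtime_changed, hash_changed_by_touch, hash_changed_by_edit
```

Called on a temporary directory, the function reports that the touch changes the mtime but not the hash, and that the edit changes the hash. An mtime gate would rebuild in both cases; a content gate rebuilds only in the second.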