fix: anchor note GUIDs to Puzzle ID to prevent duplicates on re-import #19
Merged
Successive Anki imports of updated .apkg files were creating duplicate cards instead of updating existing ones. The root cause was that genanki's default GUID hashes all note fields together, so any change to Rating or Popularity generated a new GUID and Anki treated the note as new.

Add PuzzleNote, a genanki.Note subclass that overrides guid to call genanki.guid_for(self.fields[0]), keying solely on the Puzzle ID. Model ID (1757360269638) and deck IDs (SHA1 of deck name) were already stable across builds and required no changes.

Add tests/build_apkg_test.py covering:
- GUID stability when mutable fields change
- GUID uniqueness across different Puzzle IDs
- Model ID constant value
- Deck ID determinism and distinctness per sub-deck
- Integration: note GUIDs are consistent across two builds from the same SAMPLE_CARDS

Migration note: this is a one-time breaking change for existing collections. Users must delete their old puzzle decks and reimport; subsequent deliveries will update correctly.

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
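The commit message above says deck IDs are derived from the SHA1 of the deck name. The sketch below shows one way such a derivation could look. The exact truncation used in build_apkg.py is not shown in this PR, so the function name deck_id_for, the choice of the first 8 hex digits, and the sub-deck names are all assumptions made for illustration.

```python
import hashlib


def deck_id_for(name: str) -> int:
    """Hypothetical: derive a deterministic integer ID from a deck name."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Assumption: keep 8 hex digits (32 bits) to get a compact integer.
    return int(digest[:8], 16)


parent = "♟️ Optimized Chess Puzzles"
# Sub-deck names below are made up for the example.
names = [parent, f"{parent}::1000-1100", f"{parent}::1100-1200"]
ids = [deck_id_for(n) for n in names]

# The same name always maps to the same ID, build after build.
same_name_same_id = deck_id_for(parent) == ids[0]
# Different names map to different IDs (collisions are possible in
# principle with a truncated hash, but very unlikely for a few decks).
all_distinct = len(set(ids)) == len(ids)
print(same_name_same_id, all_distinct)
```

Because the ID depends only on the name, renaming a deck would change its ID; that trade-off is inherent to any name-derived scheme.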
The new build_apkg_test.py imports build_apkg which depends on genanki. genanki is declared in requirements-build.txt (which already extends requirements.txt via -r), so switching the install target makes genanki available to the full test suite without duplicating the dependency. https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
Each sub-deck and the parent deck now carry a plain-text description computed at build time from the puzzle rows, mirroring what lichess_optimized_puzzles_datasets reports at generation time:

    847 puzzles
    Rating: 1000–1100, average 1048
    Popularity: average 84%
    23 themes: fork (89) · pin (76) · mateIn1 (65) · deflection (52) · ...

Implementation:
- _build_description(rows) aggregates count, ELO range/average, popularity average, and top-15 themes sorted by frequency.
- build_from_csvs now reads each CSV into a list first so stats can be computed before the genanki.Deck object is created. All rows are also accumulated for the parent deck's aggregate description.
- An explicit parent deck (♟️ Optimized Chess Puzzles) is created with the aggregate description across all sub-decks; genanki.Deck accepts a description= keyword argument.
- build_sample follows the same pattern with SAMPLE_CARDS.
- 12 new tests cover description content and edge cases (missing fields, theme sorting, singular/plural grammar).

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
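The aggregation described above could be sketched roughly as follows. This is not the PR's _build_description; it is a hypothetical build_description_sketch that assumes each row is a dict with "Rating", "Popularity", and "Themes" keys, that themes are space-separated within one field, and that rows with missing values are simply skipped. It also omits the trailing "· ..." marker shown in the example output.

```python
from collections import Counter


def build_description_sketch(rows, top_n=15):
    """Hypothetical sketch: summarize puzzle rows as a plain-text description."""
    count = len(rows)
    lines = [f"{count} puzzle{'' if count == 1 else 's'}"]

    # Skip rows whose numeric fields are missing or blank (assumed behavior).
    ratings = [int(r["Rating"]) for r in rows if r.get("Rating")]
    pops = [int(r["Popularity"]) for r in rows if r.get("Popularity")]
    if ratings:
        avg = round(sum(ratings) / len(ratings))
        lines.append(f"Rating: {min(ratings)}–{max(ratings)}, average {avg}")
    if pops:
        lines.append(f"Popularity: average {round(sum(pops) / len(pops))}%")

    # Count every theme occurrence, then keep the most frequent ones.
    themes = Counter(t for r in rows for t in (r.get("Themes") or "").split())
    if themes:
        top = " · ".join(f"{name} ({n})" for name, n in themes.most_common(top_n))
        label = "theme" if len(themes) == 1 else "themes"
        lines.append(f"{len(themes)} {label}: {top}")
    return "\n".join(lines)


rows = [
    {"Rating": "1000", "Popularity": "80", "Themes": "fork pin"},
    {"Rating": "1100", "Popularity": "90", "Themes": "fork"},
]
description = build_description_sketch(rows)
print(description)
```

Computing the text from a plain list of rows is what makes it necessary to read each CSV fully before the deck object is created, as the second implementation bullet notes.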
build_from_csvs had 21 local variables (R0914) after adding n_subdecks. Replaced it with len(decks) - 1 inline in the final print statement. https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
Coverage = unique themes in sampled deck / unique themes in full ELO
tranche before filtering, computed at CSV generation time.
lichess_optimized_puzzles_datasets.py:
- report_theme_coverage() now returns a stats dict
{selected, unique_themes_sample, unique_themes_tranche, coverage_pct}
in addition to printing the existing report.
- extract_tranches() collects those dicts and writes puzzles_stats.json
alongside the puzzle CSVs so build_apkg.py can read them.
build_apkg.py:
- Add _load_deck_stats(csv_dir) helper to read puzzles_stats.json.
- _build_description() accepts an optional coverage float and appends
"(74.3% of tranche themes)" inline with the theme list when present.
- build_from_csvs() loads stats once and passes coverage_pct per deck.
- Move media path to module constant MEDIA_PATH to stay within pylint's
20-local-variable limit (was 22 after new locals were added).
5 new tests cover coverage display, absence when None, and JSON loading.
https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
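As a rough illustration of the coverage definition at the top of this commit message, the sketch below computes the ratio for a tiny made-up tranche. The function name theme_coverage is hypothetical, rows are assumed to be dicts with a space-separated "Themes" field, and "selected" is assumed to mean the number of sampled puzzles; only the four dict keys come from the commit message.

```python
def theme_coverage(sample_rows, tranche_rows):
    """Hypothetical sketch: share of the tranche's unique themes seen in the sample."""

    def unique_themes(rows):
        # Themes are assumed to be space-separated within one field.
        return {t for r in rows for t in r.get("Themes", "").split()}

    sample = unique_themes(sample_rows)
    tranche = unique_themes(tranche_rows)
    # Guard against an empty tranche to avoid dividing by zero.
    pct = round(100 * len(sample) / len(tranche), 1) if tranche else 0.0
    return {
        "selected": len(sample_rows),
        "unique_themes_sample": len(sample),
        "unique_themes_tranche": len(tranche),
        "coverage_pct": pct,
    }


# Made-up tranche of three puzzles; the sample keeps the first two.
tranche = [
    {"Themes": "fork pin"},
    {"Themes": "mateIn1"},
    {"Themes": "deflection fork"},
]
stats = theme_coverage(tranche[:2], tranche)
print(stats)
```

Here the sample contains 3 of the tranche's 4 unique themes, so coverage_pct is 75.0.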
Instead of writing puzzles_stats.json and reading it back in a second run, build_full() passes coverage stats directly from extract_tranches() to build_from_csvs() in memory.

lichess_optimized_puzzles_datasets.py:
- extract_tranches() return type is now Dict[str, Dict] (was None).
- return all_stats added at the end (json.dump is kept for standalone use).

build_apkg.py:
- build_from_csvs() gains an optional deck_stats parameter; when provided it is used directly, skipping _load_deck_stats() and the JSON file.
- build_full(csv_dir, output) chains download_puzzle_db(), decompress_zst(), extract_tranches(), and build_from_csvs() in one process. The lichess module is imported lazily so pandas/chess are not loaded in normal mode.
- --full CLI flag triggers the full pipeline.

The JSON sidecar (puzzles_stats.json) is still written by extract_tranches for backward compatibility with separate invocations; the full pipeline simply bypasses it.

2 new tests: apkg produced correctly with injected stats, and _load_deck_stats is not called when deck_stats is provided.

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
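The "use in-memory stats when provided, otherwise fall back to the sidecar" behavior can be sketched in isolation. The helpers load_deck_stats and resolve_deck_stats below are hypothetical stand-ins, not the PR's functions; the "1000-1100" key and the layout of the JSON file are assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def load_deck_stats(csv_dir):
    """Hypothetical loader: read puzzles_stats.json if present, else return {}."""
    path = Path(csv_dir) / "puzzles_stats.json"
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def resolve_deck_stats(csv_dir, deck_stats=None):
    """Prefer stats passed in memory; fall back to the JSON sidecar."""
    if deck_stats is not None:
        return deck_stats  # full pipeline: no file access at all
    return load_deck_stats(csv_dir)  # separate invocations read the sidecar


with tempfile.TemporaryDirectory() as tmp:
    sidecar = {"1000-1100": {"coverage_pct": 74.3}}
    (Path(tmp) / "puzzles_stats.json").write_text(json.dumps(sidecar), encoding="utf-8")

    # No stats passed: the sidecar file is read.
    from_file = resolve_deck_stats(tmp)
    # Stats passed: the sidecar is ignored even though it exists.
    in_memory = resolve_deck_stats(tmp, deck_stats={"1000-1100": {"coverage_pct": 80.0}})

print(from_file, in_memory)
```

Checking for None rather than truthiness means an explicitly passed empty dict still bypasses the file, which keeps the two modes clearly separated.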
Summary
- Add PuzzleNote(genanki.Note) subclass that overrides guid to hash only fields[0] (the Puzzle ID) instead of all fields
- Update _row_to_note() to return PuzzleNote instead of genanki.Note
- Add tests/build_apkg_test.py with 12 tests covering GUID stability, Model ID constancy, Deck ID determinism, and integration
Problem
When a Lichess puzzle's Rating or Popularity changes between builds, genanki's default GUID (which hashes all fields) changes too. Anki then treats the re-imported note as new, duplicating every updated puzzle in the user's collection on each delivery.
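The failure mode can be reproduced with a small stand-in sketch. It deliberately does not use genanki's actual hashing: guid_from is a hypothetical helper that hashes whatever values it is given with SHA-256, which is enough to show why keying on all fields breaks note identity while keying on the Puzzle ID alone does not. The puzzle ID and field values are made up.

```python
import hashlib


def guid_from(*values: str) -> str:
    """Hypothetical stand-in for a GUID function: hash the given values."""
    joined = "\x1f".join(values)  # unit separator keeps field boundaries unambiguous
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:10]


# The same puzzle in two builds; only Rating and Popularity differ.
build_1 = ["00sHx", "1048", "84"]
build_2 = ["00sHx", "1061", "86"]

# Keying on all fields: the GUID changes, so Anki would see a new note.
all_fields_changed = guid_from(*build_1) != guid_from(*build_2)

# Keying on fields[0] (the Puzzle ID) only: the GUID is stable.
id_only_stable = guid_from(build_1[0]) == guid_from(build_2[0])

print(all_fields_changed, id_only_stable)
```

In the PR itself the same idea is expressed by overriding guid on a genanki.Note subclass so that it calls genanki.guid_for(self.fields[0]).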
Root cause
Three identifiers must be stable across builds for Anki to update rather than duplicate:
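- Note GUID: genanki's default hashes all fields, so it changed whenever Rating or Popularity changed; now keyed on the Puzzle ID only
- Model ID: hardcoded (1757360269638), no change needed
- Deck ID: SHA1 of the deck name, no change needed
Migration note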
This is a one-time breaking change for users with an existing collection. Their notes carry the old multi-field GUID. The first import with the new package will not match those GUIDs and will create duplicates once. Users must:
- Delete their existing ♟️ Optimized Chess Puzzles decks
- Reimport the new .apkg
All subsequent deliveries will update cleanly.
Test plan
- All tests in tests/build_apkg_test.py pass (pytest tests/build_apkg_test.py -v)
- test_guid_is_stable_across_builds: same Puzzle ID with different Rating/Popularity → same GUID
- test_different_puzzle_ids_give_different_guids: distinct Puzzle IDs → distinct GUIDs
- test_model_id_is_hardcoded: MODEL_ID == 1757360269638
- test_all_subdeck_ids_are_stable and test_all_subdeck_ids_are_distinct: deck IDs reproducible and unique
- test_apkg_contains_expected_entries: output zip contains collection.anki2, collection.anki21, meta
https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
Generated by Claude Code