
fix: anchor note GUIDs to Puzzle ID to prevent duplicates on re-import #19

Merged
SKOHscripts merged 6 commits into main from claude/anki-guid-puzzle-id-4pdvG on May 28, 2026

@SKOHscripts

Owner

Summary

  • Add PuzzleNote(genanki.Note) subclass that overrides guid to hash only fields[0] (the Puzzle ID), instead of all fields
  • Update _row_to_note() to return PuzzleNote instead of genanki.Note
  • Add tests/build_apkg_test.py with 12 tests covering GUID stability, Model ID constancy, Deck ID determinism, and integration

Problem

When a Lichess puzzle's Rating or Popularity changes between builds, genanki's default GUID (which hashes all fields) changes too. Anki then treats the re-imported note as new, duplicating every updated puzzle in the user's collection on each delivery.
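
The failure mode can be reproduced without Anki. The sketch below is illustrative only: `stand_in_guid` is a hypothetical stand-in built on `hashlib`, not genanki's actual GUID algorithm, and the field layout (Puzzle ID first, then Rating and Popularity) and sample values are assumptions.

```python
import hashlib


def stand_in_guid(*values: str) -> str:
    """Hypothetical stand-in for a GUID hash; NOT genanki's real algorithm."""
    joined = "__".join(values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16]


# Assumed field layout: Puzzle ID first, then the mutable fields.
build_1 = ["00sHx", "1760", "93"]  # Puzzle ID, Rating, Popularity
build_2 = ["00sHx", "1802", "95"]  # same puzzle after a rating update

# Hashing every field: the GUID changes as soon as Rating changes.
all_fields_changed = stand_in_guid(*build_1) != stand_in_guid(*build_2)

# Hashing only fields[0]: the GUID survives the update.
id_only_stable = stand_in_guid(build_1[0]) == stand_in_guid(build_2[0])

print(all_fields_changed, id_only_stable)
```

With the all-fields hash, the second build looks like a brand-new note to the importer; with the ID-only hash, it is recognised as the same note.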

Root cause

Three identifiers must be stable across builds for Anki to update rather than duplicate:

  1. Note GUID — was hashing all fields (bug fixed here)
  2. Model ID — already hardcoded (1757360269638), no change needed
  3. Deck IDs — already derived from SHA1 of deck name, no change needed
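
For the third identifier, a name-derived deck ID might look roughly like the sketch below. The helper name `deck_id_from_name`, the 32-bit truncation, and the sub-deck names are assumptions for illustration; the PR only states that deck IDs come from a SHA1 of the deck name.

```python
import hashlib


def deck_id_from_name(name: str) -> int:
    """Hypothetical helper: map a deck name to a stable integer ID."""
    digest = hashlib.sha1(name.encode("utf-8")).hexdigest()
    # Keep 8 hex digits (32 bits); the truncation width is an assumption.
    return int(digest[:8], 16)


parent = "♟️ Optimized Chess Puzzles"
# Sub-deck names are made up; "Parent::Child" is Anki's nesting convention.
ids = [deck_id_from_name(f"{parent}::{sub}") for sub in ("1000-1100", "1100-1200")]
print(ids)
```

Because the ID depends only on the name, rebuilding the package on another machine or another day yields the same deck IDs, provided the deck names themselves do not change.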

Migration note

This is a one-time breaking change for users with an existing collection. Their notes carry the old multi-field GUID, so importing the new package on top of them will not match those GUIDs and will create duplicates once. To avoid this, users must:

  1. Delete their existing ♟️ Optimized Chess Puzzles decks
  2. Reimport the new .apkg

All subsequent deliveries will update cleanly.

Test plan

  • All 12 tests in tests/build_apkg_test.py pass (pytest tests/build_apkg_test.py -v)
  • test_guid_is_stable_across_builds: same PuzzleID with different Rating/Popularity → same GUID
  • test_different_puzzle_ids_give_different_guids: distinct Puzzle IDs → distinct GUIDs
  • test_model_id_is_hardcoded: MODEL_ID == 1757360269638
  • test_all_subdeck_ids_are_stable and test_all_subdeck_ids_are_distinct: deck IDs reproducible and unique
  • test_apkg_contains_expected_entries: output zip contains collection.anki2, collection.anki21, meta

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL


Generated by Claude Code

claude added 6 commits May 28, 2026 08:01
Successive Anki imports of updated .apkg files were creating duplicate
cards instead of updating existing ones.  The root cause was that
genanki's default GUID hashes all note fields together, so any change
to Rating or Popularity generated a new GUID and Anki treated the note
as new.

Add PuzzleNote, a genanki.Note subclass that overrides guid to call
genanki.guid_for(self.fields[0]) — keying solely on the Puzzle ID.
Model ID (1757360269638) and deck IDs (SHA1 of deck name) were already
stable across builds and required no changes.

Add tests/build_apkg_test.py covering:
- GUID stability when mutable fields change
- GUID uniqueness across different Puzzle IDs
- Model ID constant value
- Deck ID determinism and distinctness per sub-deck
- Integration: note GUIDs are consistent across two builds from
  the same SAMPLE_CARDS

Migration note: this is a one-time breaking change for existing
collections.  Users must delete their old puzzle decks and reimport;
subsequent deliveries will update correctly.

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL

The new build_apkg_test.py imports build_apkg, which depends on genanki.
genanki is declared in requirements-build.txt (which already extends
requirements.txt via -r), so switching the install target makes genanki
available to the full test suite without duplicating the dependency.

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL

Each sub-deck and the parent deck now carry a plain-text description
computed at build time from the puzzle rows, mirroring what
lichess_optimized_puzzles_datasets reports at generation time:

  847 puzzles
  Rating: 1000–1100, average 1048
  Popularity: average 84%

  23 themes: fork (89) · pin (76) · mateIn1 (65) · deflection (52) · ...

Implementation:
- _build_description(rows) aggregates count, ELO range/average,
  popularity average, and top-15 themes sorted by frequency.
- build_from_csvs now reads each CSV into a list first so stats can be
  computed before the genanki.Deck object is created.  All rows are also
  accumulated for the parent deck's aggregate description.
- An explicit parent deck (♟️ Optimized Chess Puzzles) is created with
  the aggregate description across all sub-decks; genanki.Deck accepts
  a description= keyword argument.
- build_sample follows the same pattern with SAMPLE_CARDS.
- 12 new tests cover description content and edge cases (missing fields,
  theme sorting, singular/plural grammar).
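
A rough sketch of the aggregation described above, not the code from this PR. It assumes rows are dicts with Lichess-style "Rating", "Popularity" and "Themes" keys (themes space-separated); the function name and exact output format are illustrative.

```python
from collections import Counter


def build_description(rows, top_n=15):
    """Sketch: plain-text deck description from puzzle rows (dicts)."""
    count = len(rows)
    lines = [f"{count} puzzle{'' if count == 1 else 's'}"]

    # Skip rows whose Rating or Popularity is missing or empty.
    ratings = [int(r["Rating"]) for r in rows if r.get("Rating")]
    if ratings:
        avg = round(sum(ratings) / len(ratings))
        lines.append(f"Rating: {min(ratings)}–{max(ratings)}, average {avg}")

    pops = [int(r["Popularity"]) for r in rows if r.get("Popularity")]
    if pops:
        lines.append(f"Popularity: average {round(sum(pops) / len(pops))}%")

    # Count every theme occurrence, then keep the most frequent ones.
    themes = Counter(t for r in rows for t in (r.get("Themes") or "").split())
    if themes:
        top = " · ".join(f"{name} ({n})" for name, n in themes.most_common(top_n))
        label = "theme" if len(themes) == 1 else "themes"
        lines.append("")
        lines.append(f"{len(themes)} {label}: {top}")
    return "\n".join(lines)


rows = [
    {"Rating": "1000", "Popularity": "80", "Themes": "fork short"},
    {"Rating": "1100", "Popularity": "90", "Themes": "fork pin"},
]
print(build_description(rows))
```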

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL

build_from_csvs had 21 local variables (pylint R0914, too-many-locals) after adding n_subdecks.
Replaced it with len(decks) - 1 inline in the final print statement.

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL

Coverage = unique themes in sampled deck / unique themes in full ELO
tranche before filtering, computed at CSV generation time.

lichess_optimized_puzzles_datasets.py:
- report_theme_coverage() now returns a stats dict
  {selected, unique_themes_sample, unique_themes_tranche, coverage_pct}
  in addition to printing the existing report.
- extract_tranches() collects those dicts and writes puzzles_stats.json
  alongside the puzzle CSVs so build_apkg.py can read them.

build_apkg.py:
- Add _load_deck_stats(csv_dir) helper to read puzzles_stats.json.
- _build_description() accepts an optional coverage float and appends
  "(74.3% of tranche themes)" inline with the theme list when present.
- build_from_csvs() loads stats once and passes coverage_pct per deck.
- Move media path to module constant MEDIA_PATH to stay within pylint's
  20-local-variable limit (was 22 after new locals were added).

5 new tests cover coverage display, absence when None, and JSON loading.
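
The coverage ratio defined above can be sketched as follows. This is an assumption-laden illustration: `theme_coverage` is a hypothetical name, rows are assumed to be dicts with a space-separated "Themes" field, "selected" is assumed to mean the number of sampled puzzles, and rounding to one decimal is a guess based on the "74.3%" example.

```python
def theme_coverage(sample_rows, tranche_rows):
    """Sketch: share of the tranche's unique themes present in the sample."""

    def unique_themes(rows):
        return {t for r in rows for t in (r.get("Themes") or "").split()}

    sample = unique_themes(sample_rows)
    tranche = unique_themes(tranche_rows)
    # Guard against an empty tranche to avoid dividing by zero.
    pct = round(100 * len(sample) / len(tranche), 1) if tranche else 0.0
    return {
        "selected": len(sample_rows),
        "unique_themes_sample": len(sample),
        "unique_themes_tranche": len(tranche),
        "coverage_pct": pct,
    }


tranche = [
    {"Themes": "fork pin"},
    {"Themes": "mateIn1"},
    {"Themes": "deflection fork"},
]
sample = tranche[:2]  # pretend the sampler kept the first two puzzles
stats = theme_coverage(sample, tranche)
suffix = f"({stats['coverage_pct']}% of tranche themes)"
print(stats, suffix)
```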

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL

Instead of writing puzzles_stats.json and reading it back in a second
run, build_full() passes coverage stats directly from extract_tranches()
to build_from_csvs() in memory.

lichess_optimized_puzzles_datasets.py:
- extract_tranches() return type is now Dict[str, Dict] (was None).
- return all_stats added at the end (json.dump is kept for standalone use).

build_apkg.py:
- build_from_csvs() gains an optional deck_stats parameter; when provided
  it is used directly, skipping _load_deck_stats() and the JSON file.
- build_full(csv_dir, output) chains download_puzzle_db(), decompress_zst(),
  extract_tranches(), and build_from_csvs() in one process.  lichess module
  is imported lazily so pandas/chess are not loaded in normal mode.
- --full CLI flag triggers the full pipeline.

The JSON sidecar (puzzles_stats.json) is still written by extract_tranches
for backward-compat with separate invocations; the full pipeline simply
bypasses it.

2 new tests: apkg produced correctly with injected stats, and
_load_deck_stats is not called when deck_stats is provided.
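
The "use in-memory stats when given, otherwise fall back to the sidecar" behaviour might be sketched like this. The function names, the empty-dict result for a missing file, and the sample stats are assumptions, not the PR's actual code.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, Optional

STATS_FILENAME = "puzzles_stats.json"


def load_deck_stats(csv_dir: Path) -> Dict[str, Dict]:
    """Read the JSON sidecar; return {} when it is missing (assumed behaviour)."""
    path = csv_dir / STATS_FILENAME
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def resolve_deck_stats(
    csv_dir: Path, deck_stats: Optional[Dict[str, Dict]] = None
) -> Dict[str, Dict]:
    """Prefer in-memory stats; touch the sidecar only when none are given."""
    if deck_stats is not None:
        return deck_stats
    return load_deck_stats(csv_dir)


with tempfile.TemporaryDirectory() as tmp:
    csv_dir = Path(tmp)
    sidecar = {"1000-1100": {"coverage_pct": 74.3}}
    (csv_dir / STATS_FILENAME).write_text(json.dumps(sidecar), encoding="utf-8")

    from_file = resolve_deck_stats(csv_dir)
    in_memory = resolve_deck_stats(csv_dir, {"1000-1100": {"coverage_pct": 80.0}})

print(from_file, in_memory)
```

Checking `deck_stats is not None` rather than truthiness means an explicitly passed empty dict still bypasses the file, which keeps the full pipeline from silently picking up a stale sidecar.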

https://claude.ai/code/session_01VAUnQCt5CM2TVpRbsQbSBL
@SKOHscripts SKOHscripts marked this pull request as ready for review May 28, 2026 08:37
@SKOHscripts SKOHscripts merged commit fe4f194 into main May 28, 2026
14 checks passed
@SKOHscripts SKOHscripts deleted the claude/anki-guid-puzzle-id-4pdvG branch May 28, 2026 08:37

2 participants