Problem
Content is stored at {YYYY}/{MM}/{DD}/{url_hash}.{ext} where url_hash = SHA256(url)[:16]. This has two issues:
- Collision: Same URL fetched twice on the same day writes to the same file path, silently overwriting content. If the page changed between fetches, the first source's archived content is lost.
- Wrong abstraction: The hash is derived from the URL, not the content, so it provides neither deduplication nor collision safety.
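To make the overwrite concrete, here is a minimal sketch of the current path scheme. The function name `legacy_path` and its signature are hypothetical, and it assumes `SHA256(url)[:16]` means the first 16 hex characters of the digest.

```python
import hashlib
from datetime import date


def legacy_path(url: str, day: date, ext: str) -> str:
    """Build {YYYY}/{MM}/{DD}/{url_hash}.{ext} (hypothetical helper)."""
    # Assumption: url_hash is the first 16 hex chars of SHA-256 over the URL.
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{day:%Y/%m/%d}/{url_hash}.{ext}"


# Two fetches of the same URL on the same day map to the same path,
# regardless of what the page returned each time.
first = legacy_path("https://example.com/page", date(2024, 1, 2), "html")
second = legacy_path("https://example.com/page", date(2024, 1, 2), "html")
print(first == second)
```

Because the content never enters the path, the second write replaces the first even when the bytes differ.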
Solution
Switch to content-addressable storage (CAS), like Git's object store:
- Hash the actual content (SHA-256 of file bytes), not the URL
- Path structure: {hash[:2]}/{hash[2:4]}/{hash}.{ext} (two levels of two-character directories)
- Natural deduplication: identical content stored once, different content never collides
- Idempotent writes: check if file exists before writing
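The points above could be sketched roughly as follows. This is an illustration under assumptions, not the final implementation: `cas_path` is named in this proposal, but its signature here is a guess, and `content_hash`, `save_content`, and the `root` parameter are hypothetical names.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_hash(data: bytes) -> str:
    """Return the full SHA-256 hex digest of the content bytes."""
    return hashlib.sha256(data).hexdigest()


def cas_path(root: Path, digest: str, ext: str) -> Path:
    """Build {hash[:2]}/{hash[2:4]}/{hash}.{ext} under root."""
    return root / digest[:2] / digest[2:4] / f"{digest}.{ext}"


def save_content(root: Path, data: bytes, ext: str) -> tuple[str, Path]:
    """Store data under its content hash; do nothing if it already exists."""
    digest = content_hash(data)
    path = cas_path(root, digest, ext)
    if path.exists():
        # Identical content is already stored, so the write is skipped.
        return digest, path
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the target directory, then rename into place
    # so readers never observe a partially written object.
    fd, tmp = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return digest, path
```

Saving the same bytes twice returns the same path and leaves a single file on disk; saving different bytes produces a different path, so neither write can clobber the other.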
Changes needed
- Drop url_hash column entirely (the URL itself is already stored)
- Add content_hash and html_content_hash columns (set after fetch, when content is available)
- Replace path_root property with cas_path() deriving paths from content hashes
- Replace save_archived_content/read_archived_content with CAS equivalents
- Alembic migration including file migration from old paths to new CAS paths
- Update API schema, frontend types, and tests
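The file-migration step might look something like the sketch below for a single file. `migrate_file` and its parameters are hypothetical, and updating the database row (for example, setting content_hash) is left to the surrounding Alembic migration.

```python
import hashlib
import shutil
from pathlib import Path


def migrate_file(root: Path, old_rel: str) -> str:
    """Move one legacy file to its CAS location; return its content hash."""
    old = root / old_rel
    digest = hashlib.sha256(old.read_bytes()).hexdigest()
    # old.suffix keeps the original extension, including the leading dot.
    new = root / digest[:2] / digest[2:4] / f"{digest}{old.suffix}"
    if new.exists():
        # Identical content was already migrated; drop the duplicate.
        old.unlink()
    else:
        new.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(old), str(new))
    return digest
```

The returned digest is what a migration would store in the new column for the corresponding row.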