Migrate to content-addressable storage for archived sources #145

@monneyboi

Problem

Content is stored at {YYYY}/{MM}/{DD}/{url_hash}.{ext} where url_hash = SHA256(url)[:16]. This has two issues:

  1. Collision: the same URL fetched twice on the same day maps to the same file path, so the second fetch silently overwrites the first. If the page changed between fetches, the first source's archived content is lost.
  2. Wrong abstraction: the hash is derived from the URL, not the content, so it provides neither deduplication nor collision safety.
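
To make the collision concrete, here is a minimal sketch of the current path scheme. The function name `legacy_path` and the example URL are hypothetical, and it assumes `[:16]` means the first 16 hex characters of the digest; the real implementation may differ.

```python
import hashlib
from datetime import date


def legacy_path(url: str, fetched_on: date, ext: str) -> str:
    """Hypothetical reconstruction of the current URL-based path scheme."""
    # Assumption: url_hash is the first 16 hex characters of SHA-256(url).
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return f"{fetched_on:%Y/%m/%d}/{url_hash}.{ext}"


# Two fetches of the same URL on the same day resolve to one path,
# regardless of what the page returned each time.
day = date(2024, 1, 15)
first = legacy_path("https://example.org/page", day, "html")
second = legacy_path("https://example.org/page", day, "html")
assert first == second
```

Because the content never enters the path, nothing distinguishes the two fetches on disk.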

Solution

Switch to content-addressable storage (CAS), like Git's object store:

  • Hash the actual content (SHA-256 of file bytes), not the URL
  • Path structure: {hash[:2]}/{hash[2:4]}/{hash}.{ext} (2 levels of 2-char dirs)
  • Natural deduplication: identical content stored once, different content never collides
  • Idempotent writes: check if file exists before writing
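
The points above could be sketched roughly as follows. The names `content_hash` and `save_content`, and the `cas_path` signature, are assumptions for illustration, not the project's actual API. Writing through a temporary file before renaming is also an addition here, intended to avoid leaving a half-written file at the final path.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def content_hash(data: bytes) -> str:
    """SHA-256 of the file bytes, as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


def cas_path(root: Path, digest: str, ext: str) -> Path:
    """Build {hash[:2]}/{hash[2:4]}/{hash}.{ext} under a storage root."""
    return root / digest[:2] / digest[2:4] / f"{digest}.{ext}"


def save_content(root: Path, data: bytes, ext: str) -> Path:
    """Store data at its content-addressed path; a no-op if already present."""
    path = cas_path(root, content_hash(data), ext)
    if path.exists():
        # Identical content is already stored, so there is nothing to write.
        return path
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the target directory, then rename it
    # into place so readers never observe a partially written file.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent)
    with os.fdopen(fd, "wb") as tmp_file:
        tmp_file.write(data)
    os.replace(tmp_name, path)
    return path
```

Calling `save_content` twice with the same bytes returns the same path and writes only once; different bytes always land at different paths.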

Changes needed

  • Drop url_hash column entirely (the URL itself is already stored)
  • Add content_hash and html_content_hash columns (set after fetch, when content is available)
  • Replace path_root property with cas_path() deriving paths from content hashes
  • Replace save_archived_content/read_archived_content with CAS equivalents
  • Alembic migration including file migration from old paths to new CAS paths
  • Update API schema, frontend types, and tests
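
For the file-migration part, one possible per-file step is sketched below. The helper name `migrate_file` and its arguments are hypothetical. A real Alembic migration would presumably call something like it for each existing row and store the returned digest in the new `content_hash` column; that database side is omitted here.

```python
import hashlib
import shutil
from pathlib import Path


def migrate_file(root: Path, old_rel_path: str) -> str:
    """Move one legacy file into the CAS layout and return its content hash."""
    old_path = root / old_rel_path
    digest = hashlib.sha256(old_path.read_bytes()).hexdigest()
    ext = old_path.suffix.lstrip(".")
    new_path = root / digest[:2] / digest[2:4] / f"{digest}.{ext}"
    if new_path.exists():
        # Same bytes were already migrated from another legacy path,
        # so the old copy is a duplicate and can be dropped.
        old_path.unlink()
    else:
        new_path.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(old_path), str(new_path))
    return digest
```

Files that were already overwritten under the old scheme cannot be recovered by this step; it only relocates what is still on disk.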

Labels: loom (Poliloom core project issues)