fix(index): write article_id into tile manifests, stop guessing from dir names#83

Open
andylizf wants to merge 2 commits into main from fix/article-id-manifest

@andylizf


Problem

The embed pipeline extracted article_id by parsing tile directory names ("3104240.png.tiles" → int("3104240")). This worked for URLs (directories named by position index) but broke for:

  • PDFs: the directory is named after the filename stem (e.g. "report.png.tiles" → int("report") → ValueError). The GPU embedder silently skips the tile; the CPU embedder uses a hash fallback, so the ID no longer matches articles.json.
  • Local files: same problem (filename stems aren't numeric).
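
The failure is easy to reproduce in isolation. The sketch below assumes directory names follow the `<stem>.png.tiles` shape shown above; `parse_article_id_from_dir` is a hypothetical stand-in for illustration, not the embedder's actual helper.

```python
def parse_article_id_from_dir(dir_name: str) -> int:
    """Hypothetical stand-in: parse the text before the first dot as an int."""
    stem = dir_name.split(".", 1)[0]
    return int(stem)  # raises ValueError when the stem is not numeric


# Numeric directory name (URL-style, named by position index): parses fine.
print(parse_article_id_from_dir("3104240.png.tiles"))

# Filename-stem directory name (PDF or local file): parsing fails.
try:
    parse_article_id_from_dir("report.png.tiles")
except ValueError as exc:
    print(f"cannot derive article_id: {exc}")
```

Whether the caller swallows that ValueError (skip) or substitutes another value (hash fallback) is what made the two embedders diverge.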

Root cause: article_id was never explicitly communicated from the pipeline to embed — embed had to reverse-engineer it from the filesystem.

Fix

  1. Pipeline writes article_id into tile manifests (tiles.json + chunks.json) after rendering. This is the authoritative source — embed reads it from the manifest, not the directory name.

  2. render_pdf gains a stem parameter so the pipeline names PDF tile directories by position index (like URLs), making directory names consistent. Standalone pixelshot CLI still defaults to the filename stem.

  3. Backward compatible: embed falls back to directory name parsing when article_id is absent from the manifest (existing large-scale indexes like the Wikipedia corpus where dirs are already numeric).
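
The lookup order described in the list above can be sketched as follows. This is a minimal illustration, not the code in embed.py: the function name `resolve_article_id`, the top-level `"article_id"` key, and the order in which the two manifests are tried are all assumptions. It returns None for a non-numeric directory with no manifest, whereas the real embedders skip or hash in that case.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_article_id(tile_dir: Path) -> Optional[int]:
    """Manifest first, directory name second; None if neither yields an id."""
    for manifest_name in ("tiles.json", "chunks.json"):
        manifest_path = tile_dir / manifest_name
        if not manifest_path.is_file():
            continue
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        # Assumption: the manifest is a JSON object with a top-level "article_id".
        if isinstance(manifest, dict) and manifest.get("article_id") is not None:
            return int(manifest["article_id"])
    # Backward-compatible fallback for indexes built before the field existed.
    stem = tile_dir.name.split(".", 1)[0]
    return int(stem) if stem.isdigit() else None
```

With this shape, a directory such as `report.png.tiles` resolves correctly as soon as its manifest carries an `article_id`, and older numeric directories keep working without any manifest change.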

Follows the sidecar-metadata pattern used by ColPali, LEANN (passage_id_scheme + ids.txt), and Rulin Shao's MassiveDS.

What changed

| File | Change |
| --- | --- |
| pipelines.py | Write article_id into manifests after rendering; pass stem=str(idx) to render_pdf |
| pdf.py + render.py | Add stem parameter to render_pdf |
| embed.py | Read article_id from tiles.json/chunks.json first, fallback to dir name |
| embed_cpu.py | Same |
| tests/test_article_id.py | 5 tests: manifest wins, fallback works, non-numeric handled, override, multi-article |

…dir names

The embed pipeline extracted article_id by parsing the tile directory name
(e.g. "3104240.png.tiles" → int("3104240")). This broke for PDFs (directory
named after the filename stem, e.g. "report.png.tiles" → int("report") fails)
and for any non-numeric directory name. GPU embed skipped the tile silently;
CPU embed used a hash fallback that produced IDs misaligned with articles.json.

Root cause: article_id was never explicitly communicated from the pipeline to
the embed stage — embed had to reverse-engineer it from the filesystem.

Fix: the pipeline now writes article_id into tiles.json and chunks.json after
rendering. Embed reads it from the manifest first, falling back to directory
name parsing for backward compatibility with existing large-scale indexes
(e.g. the Wikipedia corpus where dir names are already numeric).

Also: render_pdf gains a stem parameter so the pipeline can name PDF tile
directories by position index (like URLs), making directory names consistent
across all source types.
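
As a rough illustration of the stem override, the sketch below shows how an optional stem could select the tile directory name. `tile_dir_for` is a hypothetical helper, not the real render_pdf signature; only the existence of a stem parameter and the `stem=str(idx)` call come from this PR.

```python
from pathlib import Path
from typing import Optional


def tile_dir_for(source: Path, out_root: Path, stem: Optional[str] = None) -> Path:
    """Hypothetical helper: choose the tile directory for a rendered source."""
    # Default mirrors the standalone behaviour: fall back to the filename stem.
    name = stem if stem is not None else source.stem
    return out_root / f"{name}.png.tiles"


sources = [Path("docs/report.pdf"), Path("docs/notes.pdf")]
for idx, src in enumerate(sources):
    # The pipeline passes the position index, so directory names are numeric.
    print(tile_dir_for(src, Path("out"), stem=str(idx)))
```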

Follows the same sidecar-metadata pattern used by ColPali (JSONL manifest
mapping FAISS IDs to doc metadata) and LEANN (passage_id_scheme +
ids.txt/offset map). See also Rulin Shao's MassiveDS (shared shard IDs).

…path

Addresses review of #83 — the manifest contract was a leaky abstraction on the
default GPU chunks path; it only worked because dir names happened to be numeric.

Three fixes:

1. chunk.py now propagates article_id from tiles.json into chunks.json. Before,
   the pipeline wrote article_id into chunks.json at Stage 1 — but for URLs that
   file doesn't exist until Stage 2 (chunk.py), so the write was a silent no-op,
   and chunk.py rebuilt chunks.json without article_id. The GPU embedder
   (scan_shard_chunks, reads only chunks.json) therefore never saw it and always
   fell back to dir-name parsing → non-numeric dirs (PDF/local rendered
   standalone) were silently skipped. chunk.py is the right place: it already
   reads tiles.json, so it carries article_id forward.

2. scan_shard_chunks (GPU) now falls back to the sibling tiles.json before the
   dir name, matching the CPU embedder — defense in depth for chunks.json built
   before this change.

3. embed_cpu non-numeric fallback uses a stable sha1-based id instead of the
   builtin hash(), which is salted per-process (PYTHONHASHSEED) and produced a
   different article_id every build → index misaligned with articles.json and
   non-reproducible.
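
A minimal sketch of a stable sha1-based id, as in fix 3. The function name `stable_article_id` and the truncation to 63 bits are assumptions made for illustration; the commit does not say how the digest is reduced to an integer.

```python
import hashlib


def stable_article_id(name: str, bits: int = 63) -> int:
    """Deterministic id: the same name gives the same id in every process."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    # Assumption: keep the top `bits` bits of the first 8 bytes so the id
    # fits in a signed 64-bit slot.
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


# Unlike the builtin hash() on str, this does not depend on PYTHONHASHSEED,
# so two separate builds agree on the id for "report".
print(stable_article_id("report"))
```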

Tests rewritten to exercise the real flow (render tiles.json -> real chunk ->
real GPU/CPU scan) instead of a hand-built chunks.json the pipeline never emits,
and to assert the fallback id is the exact stable value.
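
The end-to-end flow those tests exercise can be sketched roughly as below. `build_chunks` and `scan_article_id` are hypothetical stand-ins for chunk.py and scan_shard_chunks, and the manifest layout (`"tiles"`, `"chunks"`, top-level `"article_id"`) is assumed for illustration.

```python
import json
import tempfile
from pathlib import Path


def build_chunks(tile_dir: Path) -> None:
    """Hypothetical chunk step: rebuild chunks.json, carrying article_id forward."""
    tiles = json.loads((tile_dir / "tiles.json").read_text(encoding="utf-8"))
    chunks = {"chunks": [{"tile": t} for t in tiles.get("tiles", [])]}
    if "article_id" in tiles:
        chunks["article_id"] = tiles["article_id"]
    (tile_dir / "chunks.json").write_text(json.dumps(chunks), encoding="utf-8")


def scan_article_id(tile_dir: Path) -> int:
    """Hypothetical scan step: read article_id from chunks.json only."""
    chunks = json.loads((tile_dir / "chunks.json").read_text(encoding="utf-8"))
    return int(chunks["article_id"])


with tempfile.TemporaryDirectory() as tmp:
    tile_dir = Path(tmp) / "report.png.tiles"  # deliberately non-numeric
    tile_dir.mkdir()
    manifest = {"article_id": 42, "tiles": ["tile_0.png", "tile_1.png"]}
    (tile_dir / "tiles.json").write_text(json.dumps(manifest), encoding="utf-8")
    build_chunks(tile_dir)
    # The id survives the chunk step even though the dir name is not numeric.
    assert scan_article_id(tile_dir) == 42
```

Starting from tiles.json rather than a hand-written chunks.json is the point: the id only reaches the scan step if the chunk step actually propagates it.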