fix(index): write article_id into tile manifests, stop guessing from dir names by andylizf · Pull Request #83 · StarTrail-org/PixelRAG

andylizf · 2026-06-23T13:00:41Z

Problem

The embed pipeline extracted article_id by parsing tile directory names ("3104240.png.tiles" → int("3104240")). This worked for URLs (dirs named by position index) but broke for:

- PDFs: the directory is named after the filename stem (e.g. "report.png.tiles" → int("report") → ValueError). The GPU embedder silently skips the tile; the CPU embedder uses a hash fallback, so the resulting ID is misaligned with articles.json.
- Local files: same problem (filename stems aren't numeric).

Root cause: article_id was never explicitly communicated from the pipeline to embed — embed had to reverse-engineer it from the filesystem.
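The failure mode can be reproduced in a few lines. The sketch below is only a guess at what dir-name parsing looked like, based on the examples above; `article_id_from_dir` is a hypothetical name, not a function taken from the repository.

```python
def article_id_from_dir(dir_name: str) -> int:
    """Hypothetical reconstruction: parse an article id out of a tile directory name."""
    stem = dir_name.split(".", 1)[0]  # "3104240.png.tiles" -> "3104240"
    return int(stem)                  # raises ValueError for non-numeric stems


# Numeric stems (URL-style, position-indexed dirs) parse fine.
print(article_id_from_dir("3104240.png.tiles"))

# Filename stems (PDFs, local files) do not.
try:
    article_id_from_dir("report.png.tiles")
except ValueError as exc:
    print(f"cannot derive article_id: {exc}")
```

Whether the caller skips the tile or substitutes some other ID on that ValueError, the result no longer matches the ID recorded in articles.json.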

Fix

- Pipeline writes article_id into tile manifests (tiles.json + chunks.json) after rendering. This is the authoritative source: embed reads it from the manifest, not the directory name.
- render_pdf gains a stem parameter so the pipeline names PDF tile directories by position index (like URLs), making directory names consistent. Standalone pixelshot CLI still defaults to the filename stem.
- Backward compatible: embed falls back to directory name parsing when article_id is absent from the manifest (existing large-scale indexes like the Wikipedia corpus where dirs are already numeric).
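The manifest-first lookup with a dir-name fallback could look roughly like the sketch below. It assumes each manifest is a JSON object with a top-level `article_id` key; the actual manifest schema, the lookup order between the two files, and the function name `resolve_article_id` are assumptions for illustration, not the code in embed.py.

```python
import json
from pathlib import Path
from typing import Optional


def resolve_article_id(tile_dir: Path) -> Optional[int]:
    """Return the article id for a tile directory, or None if it cannot be determined.

    Assumption: manifests are JSON objects with a top-level "article_id" key.
    """
    # 1. Prefer the explicit value written by the pipeline.
    for manifest_name in ("tiles.json", "chunks.json"):
        manifest_path = tile_dir / manifest_name
        if not manifest_path.is_file():
            continue
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        if isinstance(manifest, dict) and manifest.get("article_id") is not None:
            return int(manifest["article_id"])

    # 2. Backward-compatible fallback: legacy indexes use numeric directory names.
    stem = tile_dir.name.split(".", 1)[0]
    try:
        return int(stem)
    except ValueError:
        return None  # the caller decides how to handle an unresolvable id
```

Returning None rather than raising keeps the decision about non-numeric directories (skip, warn, or derive an ID another way) in the caller.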

Follows the sidecar-metadata pattern used by ColPali, LEANN (passage_id_scheme + ids.txt), and Rulin Shao's MassiveDS.

What changed

| File | Change |
| --- | --- |
| `pipelines.py` | Write `article_id` into manifests after rendering; pass `stem=str(idx)` to `render_pdf` |
| `pdf.py` + `render.py` | Add `stem` parameter to `render_pdf` |
| `embed.py` | Read `article_id` from tiles.json/chunks.json first, fallback to dir name |
| `embed_cpu.py` | Same |
| `tests/test_article_id.py` | 5 tests: manifest wins, fallback works, non-numeric handled, override, multi-article |

…dir names

The embed pipeline extracted article_id by parsing the tile directory name (e.g. "3104240.png.tiles" → int("3104240")). This broke for PDFs (directory named after the filename stem, e.g. "report.png.tiles" → int("report") fails) and for any non-numeric directory name. GPU embed skipped the tile silently; CPU embed used a hash fallback that produced IDs misaligned with articles.json.

Root cause: article_id was never explicitly communicated from the pipeline to the embed stage, so embed had to reverse-engineer it from the filesystem.

Fix: the pipeline now writes article_id into tiles.json and chunks.json after rendering. Embed reads it from the manifest first, falling back to directory name parsing for backward compatibility with existing large-scale indexes (e.g. the Wikipedia corpus where dir names are already numeric).

Also: render_pdf gains a stem parameter so the pipeline can name PDF tile directories by position index (like URLs), making directory names consistent across all source types.

Follows the same sidecar-metadata pattern used by ColPali (JSONL manifest mapping FAISS IDs to doc metadata) and LEANN (passage_id_scheme + ids.txt/offset map). See also Rulin Shao's MassiveDS (shared shard IDs).
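As a rough illustration of the write side, the sketch below stamps an ID into whichever manifests already exist in a tile directory. It assumes the manifests are JSON objects and that a top-level `article_id` key is where the value lives; `write_article_id` is a hypothetical helper, not the function in pipelines.py.

```python
import json
from pathlib import Path
from typing import List


def write_article_id(tile_dir: Path, article_id: int) -> List[str]:
    """Stamp article_id into each manifest that exists; return the names updated.

    Assumption: manifests are JSON objects, so a top-level key can be added.
    """
    updated: List[str] = []
    for name in ("tiles.json", "chunks.json"):
        path = tile_dir / name
        if not path.is_file():
            continue  # manifest not produced yet, nothing to stamp
        manifest = json.loads(path.read_text(encoding="utf-8"))
        manifest["article_id"] = article_id
        path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
        updated.append(name)
    return updated
```

Returning the list of updated files makes a missing manifest visible to the caller instead of leaving it a silent skip.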


…path

Addresses review of #83: the manifest contract was a leaky abstraction on the default GPU chunks path; it only worked because dir names happened to be numeric. Three fixes:

1. chunk.py now propagates article_id from tiles.json into chunks.json. Before, the pipeline wrote article_id into chunks.json at Stage 1, but for URLs that file doesn't exist until Stage 2 (chunk.py), so the write was a silent no-op, and chunk.py rebuilt chunks.json without article_id. The GPU embedder (scan_shard_chunks, which reads only chunks.json) therefore never saw it and always fell back to dir-name parsing, so non-numeric dirs (PDF/local rendered standalone) were silently skipped. chunk.py is the right place: it already reads tiles.json, so it carries article_id forward.
2. scan_shard_chunks (GPU) now falls back to the sibling tiles.json before the dir name, matching the CPU embedder. This is defense in depth for chunks.json built before this change.
3. embed_cpu non-numeric fallback uses a stable sha1-based id instead of the builtin hash(), which is salted per-process (PYTHONHASHSEED) and produced a different article_id every build, leaving the index misaligned with articles.json and non-reproducible.

Tests rewritten to exercise the real flow (render tiles.json -> real chunk -> real GPU/CPU scan) instead of a hand-built chunks.json the pipeline never emits, and to assert the fallback id is the exact stable value.
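A stable fallback ID of the kind described in the third fix can be sketched as follows. The commit does not say how the sha1 digest is turned into an integer, so the truncation to 63 bits (to fit a signed 64-bit index ID) and the name `stable_fallback_id` are assumptions made for illustration.

```python
import hashlib


def stable_fallback_id(dir_name: str, bits: int = 63) -> int:
    """Derive a deterministic, non-negative integer id from a directory name.

    Assumption: the id is the first 8 bytes of the sha1 digest, reduced to
    `bits` bits. Unlike the builtin hash() on str, the result does not depend
    on PYTHONHASHSEED, so it is identical across processes and builds.
    """
    digest = hashlib.sha1(dir_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (1 << bits)


# Same input always yields the same id, in this process and any other.
first = stable_fallback_id("report.png.tiles")
second = stable_fallback_id("report.png.tiles")
print(first == second)
```

Truncating the digest trades a small collision risk for an ID that fits common integer index types; a real implementation may choose a different width.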

andylizf wants to merge 2 commits into main from fix/article-id-manifest
