Summary
When --source-type local is used with a directory of image files (PNG/JPG), Stage 1 of the pipeline correctly classifies them as image_docs but then does nothing with them — there is no render step for local images. Only URL docs (via CDP/Chromium) and PDFs (via poppler) get rendered to the {idx}.png.tiles/ directory structure that Stage 2 (chunk) expects.
The result is that Stage 2 finds no tile directories and produces zero embeddings, effectively making --source-type local a silent no-op for image files. There is no warning or error — it just produces an empty/unusable index.
Reproduction
mkdir /tmp/local_images
cp chart.png artwork.jpg /tmp/local_images/
pixelrag index build --source-type local --source /tmp/local_images --output /tmp/test_idx
# Appears to succeed, but the index contains 0 chunks.
# Stage 2 (chunk) finds no .tiles/ directories under tiles/
Why this is a feature gap, not a simple bug
The current pipeline has two concrete render backends:
- render_urls(): CDP/Chromium screenshot of web URLs
- render_pdf(): poppler-based PDF → PNG
Local images require a third render path: copy/resize the image and produce the {idx}.png.tiles/tile_0000.png + tiles.json structure. This is non-trivial:
- Large images (e.g. fine-art scans at 5906×8268px) need resizing before embedding to avoid VRAM exhaustion during the embedding step.
- The tiles.json manifest must be written with correct page_height, viewport_width, tile_height, and tiles fields.
Proposed scope (if maintainers want this)
# In pipelines.py, after the PDF render loop:
if image_docs:
    for idx, doc in image_docs:
        _render_local_image(
            src=doc.path,
            dst_dir=tiles_dir / f"{idx}.png.tiles",
            max_width=4000,  # cap large images to avoid VRAM pressure
        )
    logger.info("  Rendered %d local images", len(image_docs))
We implemented a standalone reference workaround (prepare_tiles.py, ~100 lines) that does resize + tile conversion + manifest generation correctly, and are happy to contribute it as a PR if this direction is wanted.
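For illustration only (this is not the prepare_tiles.py mentioned above), a helper matching the `_render_local_image(src, dst_dir, max_width)` call could look roughly like the sketch below. It assumes Pillow is available and that a single tile per image is acceptable. The top-level manifest keys are the ones listed in this issue, but the shape of each `tiles` entry (`file`, `y`, `height`) is a guess and would need to match whatever Stage 2 actually reads. `scaled_size` and `build_manifest` are hypothetical helper names.

```python
import json
from pathlib import Path


def scaled_size(width: int, height: int, max_width: int = 4000) -> tuple[int, int]:
    """Return (width, height) capped at max_width, preserving aspect ratio."""
    if width <= max_width:
        return width, height
    scale = max_width / width
    return max_width, max(1, round(height * scale))


def build_manifest(width: int, height: int, tile_name: str = "tile_0000.png") -> dict:
    """Single-tile manifest. The per-tile entry layout is an assumption."""
    return {
        "page_height": height,
        "viewport_width": width,
        "tile_height": height,
        "tiles": [{"file": tile_name, "y": 0, "height": height}],
    }


def _render_local_image(src: Path, dst_dir: Path, max_width: int = 4000) -> dict:
    """Resize src if needed and write tile_0000.png plus tiles.json into dst_dir."""
    from PIL import Image  # Pillow, imported lazily so the helpers above need no extra deps

    dst_dir.mkdir(parents=True, exist_ok=True)
    with Image.open(src) as img:
        size = scaled_size(img.width, img.height, max_width)
        if size != img.size:
            img = img.resize(size, Image.LANCZOS)
        # Normalise palette/CMYK/alpha inputs to RGB before writing the PNG tile.
        img.convert("RGB").save(dst_dir / "tile_0000.png")
    manifest = build_manifest(*size)
    (dst_dir / "tiles.json").write_text(json.dumps(manifest, indent=2))
    return manifest
```

With the 5906×8268 scan mentioned earlier and max_width=4000, `scaled_size` gives 4000×5600. Very tall images might still be better split into several tiles, which this sketch does not attempt.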
Minimal alternative
If --source-type local is intentionally URL/PDF-only for now, at least fail loudly instead of producing a silent empty index:
if image_docs:
    logger.warning(
        "  %d local image files were skipped; local image rendering is not yet "
        "supported. Use --source-type url or pdf instead.",
        len(image_docs),
    )
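A complementary guard could sit at the start of Stage 2, so that an empty tiles directory is an error regardless of which source type caused it. The sketch below is hypothetical: `count_tile_dirs` and `require_tiles` are made-up names, and it assumes the rendered `{idx}.png.tiles` directories live directly under one tiles directory.

```python
from pathlib import Path


def count_tile_dirs(tiles_dir: Path) -> int:
    """Count rendered `*.tiles` directories directly under tiles_dir."""
    if not tiles_dir.is_dir():
        return 0
    return sum(
        1 for p in tiles_dir.iterdir() if p.is_dir() and p.name.endswith(".tiles")
    )


def require_tiles(tiles_dir: Path) -> int:
    """Raise instead of letting the chunk stage build an empty index."""
    n = count_tile_dirs(tiles_dir)
    if n == 0:
        raise RuntimeError(
            f"No *.tiles directories found under {tiles_dir}; "
            "Stage 1 rendered nothing, so the index would be empty."
        )
    return n
```

Raising here, rather than only warning in Stage 1, would also catch other cases where rendering silently produces nothing.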
Environment
- PixelRAG v0.2.1, Python 3.12
- Local images: PNG and JPEG, various sizes
- Discovered while trialling a mixed local corpus (trading charts + digital artwork + diagrams).
Related
Bug 1 (int() crash on non-numeric filename-stem IDs) is a separate, clean fix submitted as a PR.