Skip to content

Tsai1030/visual-pdf2rag

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Visual PDF2RAG

Visual PDF2RAG is a PDF-to-RAG parser for research papers and knowledge-base documents. It converts PDFs into clean Markdown, structured RAG chunks, cropped figures/tables, and VLM-ready image descriptions.

Keywords: PDF to RAG, PDF parser, Markdown extraction, Docling, PyMuPDF, visual RAG, document AI, figure extraction, table extraction, knowledge base ingestion.

Quickstart / validation

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --with pytest pytest tests/ -q
uv run kbparse ingest ./examples/pdfs ./output --parser fake --provider mock
uv run kbparse validate ./output/<doc_id>

Parse a real text-layer PDF with PyMuPDF

PyMuPDF is the lightweight first real parser. It is intended for text-layer PDFs and smoke testing the pipeline. It currently extracts text blocks, renders page images, and extracts embedded images as pending visual assets. Complex layout, OCR, high-fidelity tables, and scanned PDFs should be handled by Docling / Marker adapters.

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_pymupdf --parser pymupdf --provider mock
uv run kbparse validate ./output_pymupdf/pymupdf_smoke

Expected output:

Validation OK

Expected files:

output_pymupdf/<doc_id>/
  document.json
  document.md
  chunks.jsonl
  quality_report.json
  parse_artifacts/pages/page_0001.png
  assets/images/...

assets/images/ appears only when the PDF contains embedded images.

Parse with Docling adapter

Docling is an optional heavier parser for richer layout extraction. The current KBParse Docling adapter runs Docling, stores the raw Docling JSON artifact, maps Docling body/layout references directly into canonical KBParse elements when available, crops Docling picture regions into assets/figures/, crops Docling table regions into assets/tables/, preserves table cells / HTML metadata when available, associates nearby captions, and then lets KBParse generate document.md, chunks.jsonl, and validation reports from document.json.

Install / run with the optional dependency:

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --extra docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mock
uv run kbparse validate ./output_docling/pymupdf_smoke

Alternative one-shot install style:

uv run --with docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mock

Expected files:

output_docling/<doc_id>/
  document.json
  document.md
  chunks.jsonl
  quality_report.json
  parse_artifacts/docling_document.json
  assets/figures/p0001_fig001.png   # when Docling detects picture/figure regions
  assets/tables/p0001_table001.png  # when Docling detects table regions

Picture smoke test:

uv run --with docling kbparse ingest ./examples/pdfs/docling_picture_smoke.pdf ./output_docling_picture --parser docling --provider mock
uv run kbparse validate ./output_docling_picture/docling_picture_smoke

Table smoke test:

uv run --with docling kbparse ingest ./examples/pdfs/docling_table_smoke.pdf ./output_docling_table --parser docling --provider mock
uv run kbparse validate ./output_docling_table/docling_table_smoke

Note: Docling may classify simple drawn/vector tables as pictures depending on the source PDF and Docling model behavior. The adapter now uses table-like captions such as 表 1 / Table 1 as a conservative fallback signal, so those picture nodes become table_image elements and are cropped into assets/tables/.

Image / table-image enrichment

Parsing does not require API keys. With --provider mock, KBParse creates deterministic descriptions for tests and local smoke runs.

For real image descriptions, use the OpenAI-compatible provider:

cp .env.example .env
# Fill KBPARSE_VLM_API_KEY locally. Do not commit .env.

uv run --with docling kbparse ingest ./your.pdf ./output --parser docling --provider mock
uv run kbparse enrich-images ./output/<doc_id> --provider openai-compatible
uv run kbparse validate ./output/<doc_id>

Supported environment variables:

KBPARSE_VLM_API_KEY=...
KBPARSE_VLM_BASE_URL=https://api.openai.com/v1
KBPARSE_VLM_MODEL=gpt-4o-mini

OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_VISION_MODEL are also accepted as aliases. Any OpenAI-compatible /v1/chat/completions endpoint that supports image_url data URLs can be used, including local VLM servers.

The enrichment stage updates only derived fields:

alt_text_short
description_long
enrichment.provider / model / confidence / needs_human_review / visual_category

It preserves original asset_path, page, bbox, captions, and parser evidence. Markdown then renders completed descriptions as:

![Transformer 架構圖](assets/figures/p0003_fig001.png)

> 圖片摘要:此圖說明 encoder 與 decoder 的注意力流程。

chunks.jsonl is rebuilt after enrichment. The chunk metadata keeps asset_path / related_assets, while text_for_embedding still excludes raw image paths.

Current Docling adapter limitations:

  • It maps Docling body.children references for text/caption/table/picture nodes, including page number and normalized bbox when prov exists.
  • It converts Docling table cells into basic Markdown tables for canonical table elements.
  • It preserves Docling table cell JSON and table HTML metadata when Docling provides them.
  • It crops Docling table regions from the source PDF into assets/tables/ using Docling bbox coordinates.
  • It uses table-like captions such as 表 1 / Table 1 as a conservative fallback to classify Docling picture nodes as table_image and store them in assets/tables/.
  • It crops Docling picture regions from the source PDF into assets/figures/ using Docling bbox coordinates.
  • It associates the nearest same-page caption with figure and table elements as caption_nearby.
  • It creates pending visual elements so VLM enrichment can run later with either the mock provider or a real OpenAI-compatible provider.
  • It falls back to exported Markdown only when no mappable Docling body elements exist.
  • It does not yet implement a general visual table classifier when there is no table-like caption.
  • Future work should add richer heading levels, stronger visual classifier heuristics, and better caption-to-table/figure matching.

Manual checks

  • Open output/<doc_id>/document.md and confirm images render.
  • Check assets/figures/, assets/tables/, or assets/images/ contains generated assets when the source has images/tables.
  • Check chunks.jsonl visual chunks include asset_path.
  • Search text_for_embedding and confirm it does not contain assets/ or raw image extensions like .png / .jpg.

MVP scope

The current MVP implements a deterministic fake parser, a PyMuPDF text-layer parser skeleton, an optional Docling adapter with direct body/text/table/picture mapping, figure crops, table crops, table cell/HTML metadata preservation, canonical document.json, Markdown export with standard image syntax, structured chunk building, mock VLM enrichment, real OpenAI-compatible VLM image/table-image enrichment, validation, and quality reports.

Not yet in scope

  • OCR for scanned PDFs.
  • General visual table classification when Docling emits a picture and there is no table-like caption.
  • Marker adapter.
  • Provider-specific adapters beyond OpenAI-compatible endpoints.
  • Vector database ingestion.

About

Visual PDF-to-RAG parser that turns research papers and docs into Markdown, structured chunks, cropped figures/tables, and VLM-ready image descriptions.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages