Visual PDF2RAG

Visual PDF2RAG is a PDF-to-RAG parser for research papers and knowledge-base documents. It converts PDFs into clean Markdown, structured RAG chunks, cropped figures/tables, and VLM-ready image descriptions.

Keywords: PDF to RAG, PDF parser, Markdown extraction, Docling, PyMuPDF, visual RAG, document AI, figure extraction, table extraction, knowledge base ingestion.

Quickstart / validation

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --with pytest pytest tests/ -q
uv run kbparse ingest ./examples/pdfs ./output --parser fake --provider mock
uv run kbparse validate ./output/<doc_id>

Parse a real text-layer PDF with PyMuPDF

PyMuPDF is the lightweight first real parser. It is intended for text-layer PDFs and smoke testing the pipeline. It currently extracts text blocks, renders page images, and extracts embedded images as pending visual assets. Complex layout, OCR, high-fidelity tables, and scanned PDFs should be handled by Docling / Marker adapters.

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_pymupdf --parser pymupdf --provider mock
uv run kbparse validate ./output_pymupdf/pymupdf_smoke

Expected output:

Validation OK

Expected files:

output_pymupdf/<doc_id>/
  document.json
  document.md
  chunks.jsonl
  quality_report.json
  parse_artifacts/pages/page_0001.png
  assets/images/...

assets/images/ appears only when the PDF contains embedded images.

Parse with Docling adapter

Docling is an optional heavier parser for richer layout extraction. The current KBParse Docling adapter runs Docling, stores the raw Docling JSON artifact, maps Docling body/layout references directly into canonical KBParse elements when available, crops Docling picture regions into assets/figures/, crops Docling table regions into assets/tables/, preserves table cells / HTML metadata when available, associates nearby captions, and then lets KBParse generate document.md, chunks.jsonl, and validation reports from document.json.

Install / run with the optional dependency:

cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --extra docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mock
uv run kbparse validate ./output_docling/pymupdf_smoke

Alternative one-shot install style:

uv run --with docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mock

Expected files:

output_docling/<doc_id>/
  document.json
  document.md
  chunks.jsonl
  quality_report.json
  parse_artifacts/docling_document.json
  assets/figures/p0001_fig001.png   # when Docling detects picture/figure regions
  assets/tables/p0001_table001.png  # when Docling detects table regions

Picture smoke test:

uv run --with docling kbparse ingest ./examples/pdfs/docling_picture_smoke.pdf ./output_docling_picture --parser docling --provider mock
uv run kbparse validate ./output_docling_picture/docling_picture_smoke

Table smoke test:

uv run --with docling kbparse ingest ./examples/pdfs/docling_table_smoke.pdf ./output_docling_table --parser docling --provider mock
uv run kbparse validate ./output_docling_table/docling_table_smoke

Note: Docling may classify simple drawn/vector tables as pictures depending on the source PDF and Docling model behavior. The adapter now uses table-like captions such as 表 1 / Table 1 as a conservative fallback signal, so those picture nodes become table_image elements and are cropped into assets/tables/.

Image / table-image enrichment

Parsing does not require API keys. With --provider mock, KBParse creates deterministic descriptions for tests and local smoke runs.

For real image descriptions, use the OpenAI-compatible provider:

cp .env.example .env
# Fill KBPARSE_VLM_API_KEY locally. Do not commit .env.

uv run --with docling kbparse ingest ./your.pdf ./output --parser docling --provider mock
uv run kbparse enrich-images ./output/<doc_id> --provider openai-compatible
uv run kbparse validate ./output/<doc_id>

Supported environment variables:

KBPARSE_VLM_API_KEY=...
KBPARSE_VLM_BASE_URL=https://api.openai.com/v1
KBPARSE_VLM_MODEL=gpt-4o-mini

OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_VISION_MODEL are also accepted as aliases. Any OpenAI-compatible /v1/chat/completions endpoint that supports image_url data URLs can be used, including local VLM servers.

The enrichment stage updates only derived fields:

alt_text_short
description_long
enrichment.provider / model / confidence / needs_human_review / visual_category

It preserves original asset_path, page, bbox, captions, and parser evidence. Markdown then renders completed descriptions as:

![Transformer 架構圖](assets/figures/p0003_fig001.png)

> 圖片摘要：此圖說明 encoder 與 decoder 的注意力流程。

chunks.jsonl is rebuilt after enrichment. The chunk metadata keeps asset_path / related_assets, while text_for_embedding still excludes raw image paths.

Current Docling adapter limitations:

It maps Docling body.children references for text/caption/table/picture nodes, including page number and normalized bbox when prov exists.
It converts Docling table cells into basic Markdown tables for canonical table elements.
It preserves Docling table cell JSON and table HTML metadata when Docling provides them.
It crops Docling table regions from the source PDF into assets/tables/ using Docling bbox coordinates.
It uses table-like captions such as 表 1 / Table 1 as a conservative fallback to classify Docling picture nodes as table_image and store them in assets/tables/.
It crops Docling picture regions from the source PDF into assets/figures/ using Docling bbox coordinates.
It associates the nearest same-page caption with figure and table elements as caption_nearby.
It creates pending visual elements so VLM enrichment can run later with either the mock provider or a real OpenAI-compatible provider.
It falls back to exported Markdown only when no mappable Docling body elements exist.
It does not yet implement a general visual table classifier when there is no table-like caption.
Future work should add richer heading levels, stronger visual classifier heuristics, and better caption-to-table/figure matching.

Manual checks

Open output/<doc_id>/document.md and confirm images render.
Check assets/figures/, assets/tables/, or assets/images/ contains generated assets when the source has images/tables.
Check chunks.jsonl visual chunks include asset_path.
Search text_for_embedding and confirm it does not contain assets/ or raw image extensions like .png / .jpg.

MVP scope

The current MVP implements a deterministic fake parser, a PyMuPDF text-layer parser skeleton, an optional Docling adapter with direct body/text/table/picture mapping, figure crops, table crops, table cell/HTML metadata preservation, canonical document.json, Markdown export with standard image syntax, structured chunk building, mock VLM enrichment, real OpenAI-compatible VLM image/table-image enrichment, validation, and quality reports.

Not yet in scope

OCR for scanned PDFs.
General visual table classification when Docling emits a picture and there is no table-like caption.
Marker adapter.
Provider-specific adapters beyond OpenAI-compatible endpoints.
Vector database ingestion.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
.github/workflows		.github/workflows
examples		examples
src/kbparse		src/kbparse
tests		tests
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Visual PDF2RAG

Quickstart / validation

Parse a real text-layer PDF with PyMuPDF

Parse with Docling adapter

Image / table-image enrichment

Manual checks

MVP scope

Not yet in scope

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Visual PDF2RAG

Quickstart / validation

Parse a real text-layer PDF with PyMuPDF

Parse with Docling adapter

Image / table-image enrichment

Manual checks

MVP scope

Not yet in scope

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages