Visual PDF2RAG is a PDF-to-RAG parser for research papers and knowledge-base documents. It converts PDFs into clean Markdown, structured RAG chunks, cropped figures/tables, and VLM-ready image descriptions.
Keywords: PDF to RAG, PDF parser, Markdown extraction, Docling, PyMuPDF, visual RAG, document AI, figure extraction, table extraction, knowledge base ingestion.
cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --with pytest pytest tests/ -q
uv run kbparse ingest ./examples/pdfs ./output --parser fake --provider mock
uv run kbparse validate ./output/<doc_id>PyMuPDF is the lightweight first real parser. It is intended for text-layer PDFs and smoke testing the pipeline. It currently extracts text blocks, renders page images, and extracts embedded images as pending visual assets. Complex layout, OCR, high-fidelity tables, and scanned PDFs should be handled by Docling / Marker adapters.
cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_pymupdf --parser pymupdf --provider mock
uv run kbparse validate ./output_pymupdf/pymupdf_smokeExpected output:
Validation OK
Expected files:
output_pymupdf/<doc_id>/
document.json
document.md
chunks.jsonl
quality_report.json
parse_artifacts/pages/page_0001.png
assets/images/...
assets/images/ appears only when the PDF contains embedded images.
Docling is an optional heavier parser for richer layout extraction. The current KBParse Docling adapter runs Docling, stores the raw Docling JSON artifact, maps Docling body/layout references directly into canonical KBParse elements when available, crops Docling picture regions into assets/figures/, crops Docling table regions into assets/tables/, preserves table cells / HTML metadata when available, associates nearby captions, and then lets KBParse generate document.md, chunks.jsonl, and validation reports from document.json.
Install / run with the optional dependency:
cd /mnt/c/Users/pijh1/Desktop/kbparse
uv run --extra docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mock
uv run kbparse validate ./output_docling/pymupdf_smokeAlternative one-shot install style:
uv run --with docling kbparse ingest ./examples/pdfs/pymupdf_smoke.pdf ./output_docling --parser docling --provider mockExpected files:
output_docling/<doc_id>/
document.json
document.md
chunks.jsonl
quality_report.json
parse_artifacts/docling_document.json
assets/figures/p0001_fig001.png # when Docling detects picture/figure regions
assets/tables/p0001_table001.png # when Docling detects table regions
Picture smoke test:
uv run --with docling kbparse ingest ./examples/pdfs/docling_picture_smoke.pdf ./output_docling_picture --parser docling --provider mock
uv run kbparse validate ./output_docling_picture/docling_picture_smokeTable smoke test:
uv run --with docling kbparse ingest ./examples/pdfs/docling_table_smoke.pdf ./output_docling_table --parser docling --provider mock
uv run kbparse validate ./output_docling_table/docling_table_smokeNote: Docling may classify simple drawn/vector tables as pictures depending on the source PDF and Docling model behavior. The adapter now uses table-like captions such as 表 1 / Table 1 as a conservative fallback signal, so those picture nodes become table_image elements and are cropped into assets/tables/.
Parsing does not require API keys. With --provider mock, KBParse creates deterministic descriptions for tests and local smoke runs.
For real image descriptions, use the OpenAI-compatible provider:
cp .env.example .env
# Fill KBPARSE_VLM_API_KEY locally. Do not commit .env.
uv run --with docling kbparse ingest ./your.pdf ./output --parser docling --provider mock
uv run kbparse enrich-images ./output/<doc_id> --provider openai-compatible
uv run kbparse validate ./output/<doc_id>Supported environment variables:
KBPARSE_VLM_API_KEY=...
KBPARSE_VLM_BASE_URL=https://api.openai.com/v1
KBPARSE_VLM_MODEL=gpt-4o-mini
OPENAI_API_KEY, OPENAI_BASE_URL, and OPENAI_VISION_MODEL are also accepted as aliases. Any OpenAI-compatible /v1/chat/completions endpoint that supports image_url data URLs can be used, including local VLM servers.
The enrichment stage updates only derived fields:
alt_text_short
description_long
enrichment.provider / model / confidence / needs_human_review / visual_category
It preserves original asset_path, page, bbox, captions, and parser evidence. Markdown then renders completed descriptions as:

> 圖片摘要:此圖說明 encoder 與 decoder 的注意力流程。chunks.jsonl is rebuilt after enrichment. The chunk metadata keeps asset_path / related_assets, while text_for_embedding still excludes raw image paths.
Current Docling adapter limitations:
- It maps Docling
body.childrenreferences for text/caption/table/picture nodes, including page number and normalized bbox whenprovexists. - It converts Docling table cells into basic Markdown tables for canonical table elements.
- It preserves Docling table cell JSON and table HTML metadata when Docling provides them.
- It crops Docling table regions from the source PDF into
assets/tables/using Docling bbox coordinates. - It uses table-like captions such as
表 1/Table 1as a conservative fallback to classify Docling picture nodes astable_imageand store them inassets/tables/. - It crops Docling picture regions from the source PDF into
assets/figures/using Docling bbox coordinates. - It associates the nearest same-page caption with figure and table elements as
caption_nearby. - It creates pending visual elements so VLM enrichment can run later with either the mock provider or a real OpenAI-compatible provider.
- It falls back to exported Markdown only when no mappable Docling body elements exist.
- It does not yet implement a general visual table classifier when there is no table-like caption.
- Future work should add richer heading levels, stronger visual classifier heuristics, and better caption-to-table/figure matching.
- Open
output/<doc_id>/document.mdand confirm images render. - Check
assets/figures/,assets/tables/, orassets/images/contains generated assets when the source has images/tables. - Check
chunks.jsonlvisual chunks includeasset_path. - Search
text_for_embeddingand confirm it does not containassets/or raw image extensions like.png/.jpg.
The current MVP implements a deterministic fake parser, a PyMuPDF text-layer parser skeleton, an optional Docling adapter with direct body/text/table/picture mapping, figure crops, table crops, table cell/HTML metadata preservation, canonical document.json, Markdown export with standard image syntax, structured chunk building, mock VLM enrichment, real OpenAI-compatible VLM image/table-image enrichment, validation, and quality reports.
- OCR for scanned PDFs.
- General visual table classification when Docling emits a picture and there is no table-like caption.
- Marker adapter.
- Provider-specific adapters beyond OpenAI-compatible endpoints.
- Vector database ingestion.