Batch conversion, asset extraction, RAG-ready Markdown, JSONL chunks, and cleaner AI document pipelines for Microsoft MarkItDown.
MarkItDown Plus is an enhancement toolkit built on top of Microsoft MarkItDown. It adds folder conversion, recursive processing, optional parallel workers, Markdown cleanup, multiple chunking strategies, lightweight asset extraction, conversion manifests, and JSONL output for RAG workflows.
This project is independent and is not affiliated with Microsoft. It is designed as a companion CLI for the Microsoft MarkItDown ecosystem.
Microsoft MarkItDown is excellent for converting individual files to Markdown. MarkItDown Plus focuses on the next step: turning many documents into clean, AI-ready project output.
Key features:
- Batch convert files and folders
- Recursive directory conversion
- Parallel conversion with --workers
- Optional tqdm progress with --progress
- RAG-ready JSONL chunk export
- Chunk strategies: heading, fixed, semantic-lite
- Markdown cleanup for common PDF/document artifacts
- Basic asset extraction for DOCX / PPTX / XLSX / HTML
- manifest.json, failed.json, and large-run JSONL manifest streaming
- Unicode-safe output filenames
- PayPal funding link included through GitHub Sponsors/Funding
Install the package:

    pip install markitdown-plus

For progress bars:

    pip install "markitdown-plus[progress]"

For development tests and coverage:

    pip install -e ".[dev]"
    pytest

Convert a folder:

    markitdown-plus convert ./docs --output ./out

Convert recursively:

    markitdown-plus convert ./docs --output ./out --recursive

Convert only specific file types:

    markitdown-plus convert ./docs --output ./out --types pdf,docx,pptx,xlsx,html,csv

Clean Markdown and export RAG chunks:

    markitdown-plus convert ./docs --output ./out --clean --rag

Use parallel workers:

    markitdown-plus convert ./docs --output ./out --recursive --workers 4 --progress

Use auto worker count:

    markitdown-plus convert ./docs --output ./out --workers 0

Extract assets when supported:

    markitdown-plus convert ./docs --output ./out --extract-assets

Use a specific chunking strategy:

    markitdown-plus convert ./docs --output ./out --rag --chunk-strategy semantic-lite

A normal batch run creates:

    out/
      markdown/
        report.md
      metadata/
        report.json
      manifest.json

With RAG enabled:

    out/
      markdown/
        report.md
      chunks/
        report.jsonl
      metadata/
        report.json
      manifest.json

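As a rough illustration of what could end up in chunks/report.jsonl, the sketch below splits Markdown at headings, keeps the heading path for each chunk, and writes one JSON object per line. It is not MarkItDown Plus's implementation: the function names (chunk_by_heading, chunk_fixed, to_jsonl), the record fields (heading_path, text), and the character-based size for fixed chunks are all assumptions made for this example, and the real schema may differ.

```python
import json
import re

# ATX headings only ("# Title"); fenced code blocks are not special-cased here.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_heading(markdown: str) -> list[dict]:
    """Split Markdown at headings, recording the heading path of each chunk."""
    chunks = []
    stack = []  # (level, title) pairs for the current heading trail
    lines = []  # body lines collected since the last heading

    def flush():
        text = "\n".join(lines).strip()
        if text:
            path = [title for _, title in stack]
            chunks.append({"heading_path": path, "text": text})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Leave any section at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            lines.append(line)
    flush()
    return chunks


def chunk_fixed(text: str, size: int = 200) -> list[str]:
    """Cut text into pieces of at most `size` characters, ignoring headings."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


sample = "# Report\nIntro text.\n\n## Findings\nFirst finding.\n"
print(to_jsonl(chunk_by_heading(sample)), end="")
```

Keeping the heading path on every record lets a retrieval step show where a chunk came from, which is the main reason a heading-aware split tends to suit structured documents better than fixed-size cuts.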
With asset extraction enabled:

    out/
      markdown/
        report.md
      assets/
        report_img_001.png
        report_img_002.jpg
      metadata/
        report.json
      manifest.json

For very large jobs, MarkItDown Plus avoids huge manifest.json files by streaming records:

    out/
      manifest.json
      manifest-records.jsonl
      failed.jsonl

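The benefit of a streamed layout is that neither the writer nor a later reader has to hold every record in memory at once. A minimal sketch of that pattern follows. The helper names (append_records, iter_records) and the record fields (source, status) are assumptions for illustration, not the tool's actual schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def append_records(path: Path, records: Iterable[dict]) -> int:
    """Append one JSON object per line and return how many were written."""
    count = 0
    with path.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def iter_records(path: Path) -> Iterator[dict]:
    """Yield records one at a time, skipping blank lines."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Demo with hypothetical records written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    manifest = Path(tmp) / "manifest-records.jsonl"
    append_records(manifest, [
        {"source": "a.pdf", "status": "ok"},
        {"source": "b.docx", "status": "failed"},
    ])
    failed = [r["source"] for r in iter_records(manifest) if r["status"] == "failed"]
    print(failed)
```

Because each line is a complete JSON document, a run can append results as files finish, and a consumer can filter failures without parsing one large array.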
Chunk strategies:

heading (default): preserves Markdown heading paths and is best for most structured documents.

    markitdown-plus convert ./docs -o ./out --rag --chunk-strategy heading

fixed: creates stable chunk sizes and ignores heading boundaries. Useful for embedding pipelines that prefer consistent lengths.

    markitdown-plus convert ./docs -o ./out --rag --chunk-strategy fixed

semantic-lite: dependency-free, rule-based topical splitting. It starts new chunks at obvious semantic cues such as headings, summary, conclusion, recommendations, and other section-like paragraphs.

    markitdown-plus convert ./docs -o ./out --rag --chunk-strategy semantic-lite

--extract-assets currently supports lightweight extraction for:

- .docx
- .pptx
- .xlsx
- .html / .htm local image references

PDF image extraction is intentionally left for a later version because reliable PDF asset extraction requires heavier format-specific dependencies.
When assets are extracted, MarkItDown Plus appends an Extracted Assets section to the generated Markdown and records asset metadata in the file-level metadata JSON.
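For the Office formats, lightweight extraction is feasible because .docx, .pptx, and .xlsx files are ZIP archives that conventionally keep embedded images under word/media/, ppt/media/, and xl/media/. The sketch below uses only the standard library and mimics the report_img_001.png naming shown in the output layout. The function name extract_media and its exact behavior are assumptions for illustration, not MarkItDown Plus's code.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath

# Conventional media folders inside Office Open XML archives.
MEDIA_DIRS = ("word/media/", "ppt/media/", "xl/media/")


def extract_media(archive_path: Path, assets_dir: Path) -> list[Path]:
    """Copy embedded media out of an Office archive into assets_dir."""
    assets_dir.mkdir(parents=True, exist_ok=True)
    stem = archive_path.stem
    written = []
    with zipfile.ZipFile(archive_path) as archive:
        members = sorted(
            name for name in archive.namelist()
            if name.startswith(MEDIA_DIRS) and not name.endswith("/")
        )
        for index, name in enumerate(members, start=1):
            suffix = PurePosixPath(name).suffix.lower()
            target = assets_dir / f"{stem}_img_{index:03d}{suffix}"
            target.write_bytes(archive.read(name))
            written.append(target)
    return written


# Demo: build a tiny stand-in archive rather than a real document.
with tempfile.TemporaryDirectory() as tmp:
    docx = Path(tmp) / "report.docx"
    with zipfile.ZipFile(docx, "w") as z:
        z.writestr("word/document.xml", "<w:document/>")
        z.writestr("word/media/image1.png", b"fake png bytes")
    print([p.name for p in extract_media(docx, Path(tmp) / "assets")])
```

PDFs have no comparable media folder to copy from, which is consistent with the note above that PDF extraction needs heavier, format-specific tooling.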
Convert one file directly:

    markitdown-plus single report.pdf -o report.md

Clean an existing Markdown file:

    markitdown-plus clean dirty.md -o clean.md

Chunk an existing Markdown file:

    markitdown-plus chunk clean.md -o chunks.jsonl --chunk-strategy fixed

To set up a development checkout:

    git clone https://github.com/lamguo/markitdown-plus.git
    cd markitdown-plus
    pip install -e ".[dev]"
    pytest

The test configuration includes a coverage gate:

    pytest --cov=markitdown_plus --cov-fail-under=85

Optional property and benchmark tests are included. They are skipped automatically if hypothesis or pytest-benchmark is not installed.
Suggested topics for the repository:
- markitdown
- microsoft-markitdown
- markdown
- rag
- llm
- document-conversion
- pdf-to-markdown
- docx-to-markdown
- batch-conversion
- jsonl
- asset-extraction
- ai-tools
If MarkItDown Plus helps you save time or build better AI document pipelines, you can support development here:
- Star this repository
- Support via PayPal: https://www.paypal.me/lamguo
Thank you for supporting open-source development.
MIT License.
| Project | Description |
|---|---|
| recallgate | Token-efficient memory gate for AI coding agents |
| markitdown-plus | Batch document conversion toolkit |
| trendpilot-ai | Curated AI tools, workflows, and templates |
Built from real use, for real use.