xberg-io/html-to-markdown

Xberg

html-to-markdown

Fast, robust HTML → Markdown for 16 languages. A tiered converter that picks the safest, fastest path per input without losing content.

What and Why?

html-to-markdown converts real-world HTML — unclosed tags, CDATA, custom elements, malformed entities, nested tables, mixed encodings — into clean CommonMark (or Djot) without losing content, from one Rust core with native bindings for 16 languages.

It routes each input through three tiers: a single-pass byte scanner for clean HTML, a tolerant DOM walker for complex inputs, and an html5ever repair pass for malformed HTML — with byte-identical output across tiers, enforced by a 116-snapshot oracle and per-group performance gates in CI. The dispatcher is invisible: the same convert() call works regardless of which tier runs.
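The routing idea can be illustrated with a toy sketch. This is not the library's dispatcher: the criteria below (balanced tags, presence of a table) are assumptions chosen for illustration, the names `choose_tier` and `TagBalanceChecker` are hypothetical, and a real byte scanner would not run a parser just to classify its input.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: records whether tags open and close in order."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open non-void tags
        self.mismatch = False    # set when a close tag does not match the stack
        self.saw_table = False   # stand-in signal for "complex" input

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.saw_table = True
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return  # e.g. <br/> produces a start and an end event
        if not self.open_tags or self.open_tags.pop() != tag:
            self.mismatch = True


def choose_tier(html: str) -> str:
    """Pick the cheapest tier that this toy heuristic considers safe."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    if checker.mismatch or checker.open_tags:
        return "repair"   # malformed: unclosed or mismatched tags
    if checker.saw_table:
        return "dom"      # well-formed but structurally complex
    return "scanner"      # clean input: fastest path


for sample in ["<p>Hello <b>world</b></p>",
               "<table><tr><td>x</td></tr></table>",
               "<p>unclosed <b>tag"]:
    print(choose_tier(sample))
```

The point of the design is that callers never see this choice: whichever tier is selected, the converted output is expected to be the same.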

Features

| Feature | Description |
| --- | --- |
| 16 languages, one Rust core | Rust, Python, Node.js, WASM, Java, Go, C#, PHP, Ruby, Elixir, R, Dart, Kotlin (Android), Swift, Zig, and a C ABI |
| Tiered dispatch | Byte scanner → DOM walker → html5ever repair, with byte-equal output across tiers |
| Real-HTML robust | Unclosed tags, CDATA, custom elements, malformed entities, nested tables, and mixed encodings, all handled without losing content |
| GFM tables | Padded cells, alignment, and pipe escaping |
| Djot output | Set `output_format = "djot"` to emit Djot instead of Markdown |
| Metadata extraction | Parse `<head>` into structured metadata (Open Graph, Twitter, JSON-LD, microdata, RDFa, header hierarchy) |
| Inline images | Opt-in mirroring of data URIs and remote image references |
| Visitor API | Feature-gated traversal to transform the converted Markdown AST |
| Configurable preprocessing | Standard, strict, and lenient presets, or build your own |
| Fast | 19–116 MB/s on the Wikipedia/mdream corpus; per-group regression thresholds enforced on every PR |
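To make the "GFM tables" row concrete, here is a minimal sketch of what padded cells, alignment markers, and pipe escaping look like in GitHub Flavored Markdown output. It is an independent illustration, not the converter's code; `escape_cell` and `render_gfm_table` are hypothetical names.

```python
def escape_cell(text: str) -> str:
    """Escape literal pipes so they are not read as column separators."""
    return text.replace("|", "\\|")


def render_gfm_table(headers, rows, align=None):
    """Render a GFM table with cells padded to a common column width.

    align holds one of "left", "center", "right" per column.
    """
    align = align or ["left"] * len(headers)
    cells = [[escape_cell(c) for c in headers]]
    cells += [[escape_cell(c) for c in row] for row in rows]

    # Each column is as wide as its longest cell, with a minimum of 3
    # so the delimiter row always has room for dashes and colons.
    widths = [max(3, *(len(row[i]) for row in cells))
              for i in range(len(headers))]

    def line(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    def rule(kind, width):
        if kind == "center":
            return ":" + "-" * (width - 2) + ":"
        if kind == "right":
            return "-" * (width - 1) + ":"
        return "-" * width  # no colon: the renderer's default alignment

    delimiter = "| " + " | ".join(rule(a, w) for a, w in zip(align, widths)) + " |"
    return "\n".join([line(cells[0]), delimiter] + [line(r) for r in cells[1:]])


print(render_gfm_table(["Name", "Op"], [["or", "a|b"]], ["left", "right"]))
# | Name | Op   |
# | ---- | ---: |
# | or   | a\|b |
```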

⭐ Star this repo to show your support — it helps others discover html-to-markdown.

Quick Start

convert() is the single entry point — it returns a structured result with content, warnings, and optional metadata.
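As a rough sketch of how a caller might consume such a result, the stand-in below mirrors the three parts named above (content, warnings, optional metadata). Everything here is hypothetical: `ConversionResult`, its field types, and `describe` are illustrative assumptions, not the library's actual types, so check the per-language README for the real signature of convert().

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConversionResult:
    """Hypothetical stand-in for a structured conversion result."""
    content: str                                        # converted Markdown
    warnings: list[str] = field(default_factory=list)   # non-fatal issues
    metadata: Optional[dict] = None                      # present only if requested


def describe(result: ConversionResult) -> str:
    """Summarize a result without assuming metadata is present."""
    parts = [f"{len(result.content)} chars of Markdown"]
    if result.warnings:
        parts.append(f"{len(result.warnings)} warning(s)")
    if result.metadata is not None:
        parts.append("metadata keys: " + ", ".join(sorted(result.metadata)))
    return "; ".join(parts)


# A hand-built result, standing in for what a real call might return.
example = ConversionResult(
    content="# Title\n\nHello",
    warnings=["unclosed <p> repaired"],
    metadata={"title": "Title"},
)
print(describe(example))
```

Treating warnings and metadata as optional extras keeps the common case simple: a caller that only needs the Markdown reads one field and ignores the rest.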

Language Packages

Rust
cargo add html-to-markdown-rs

See Rust README for full documentation.

Python
pip install html-to-markdown

See Python README for full documentation.

Node.js
npm install @kreuzberg/html-to-markdown

See Node.js README for full documentation.

Go
go get github.com/xberg-io/html-to-markdown/packages/go/v3

See Go README for full documentation.

Java

Available on Maven Central as dev.kreuzberg:html-to-markdown. See Java README for the dependency snippet and current version.

C#
dotnet add package KreuzbergDev.HtmlToMarkdown

See C# README for full documentation.

Ruby
gem install html-to-markdown

See Ruby README for full documentation.

PHP

This is a native PHP extension (Rust ext-php-rs), so install it with PIE — not composer require:

pie install xberg-io/html-to-markdown

See PHP README for full documentation.

Elixir

Add {:html_to_markdown, "~> 3.6"} to your mix.exs dependencies. See Elixir README for full documentation.

R
install.packages("htmltomarkdown", repos = "https://xberg-io.r-universe.dev")

See R README for full documentation.

Dart / Flutter
dart pub add h2m

See Dart README for full documentation.

Kotlin (Android)

Available on Maven Central as dev.kreuzberg:html-to-markdown-android. See Kotlin README for the dependency snippet and current version.

Swift

Add via Swift Package Manager. See Swift README for full documentation.

Zig

See Zig README for installation and usage.

WebAssembly
npm install @kreuzberg/html-to-markdown-wasm

See WebAssembly README for full documentation.

C/C++ (FFI)

Pre-built .so, .dll, and .dylib libraries are available from GitHub Releases. See FFI crate for full documentation.

CLI
cargo install html-to-markdown-cli
brew install xberg-io/tap/html-to-markdown

See CLI usage for full documentation.

AI Coding Assistants

Install the html-to-markdown plugin from the xberg-io/plugins marketplace. It ships the html-to-markdown agent skills and works with every major coding agent. Find your harness below.

Claude Code
/plugin marketplace add xberg-io/plugins
/plugin install html-to-markdown@kreuzberg

Codex CLI
/plugins add https://github.com/xberg-io/plugins

Then search for html-to-markdown and select Install Plugin.

Cursor

Settings → Plugins → Add from URL → https://github.com/xberg-io/plugins, then select html-to-markdown.

Gemini CLI
gemini extensions install https://github.com/xberg-io/plugins

Factory Droid
droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install html-to-markdown@kreuzberg

GitHub Copilot CLI
copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install html-to-markdown@kreuzberg

opencode

Add the package to opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@kreuzberg/opencode-html-to-markdown"]
}

Documentation

Full guides, the convert() API for every binding, tier architecture, the metadata and visitor APIs, and performance benchmarks live at docs.html-to-markdown.xberg.io.

Part of Kreuzberg.dev

  • Kreuzberg — document intelligence: text, tables, metadata from 91+ formats with optional OCR.
  • Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
  • crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
  • html-to-markdown — fast, lossless HTML→Markdown engine.
  • liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
  • tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
  • alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

Contributing

Contributions welcome! See CONTRIBUTING.md for setup instructions and guidelines.

License

MIT License — see LICENSE for details.

About

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
