Xberg

Extract text, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages at native speeds without needing a GPU.

What and Why?

Xberg is a document-intelligence framework with a Rust core and native bindings for 16 languages. It turns documents, images, audio, and source code into clean, structured text — extracting tables, metadata, transcripts, and code intelligence from 96 file formats and 306 programming languages.

Modern AI and RAG pipelines need fast, reliable extraction without a GPU or a stack of heavyweight dependencies. Xberg delivers that from a single Rust core: SIMD-accelerated parsing, pure-Rust PDF, streaming for multi-GB files, and consistent output across every binding. Run it as a library, CLI, REST API, or MCP server.

OCR (Tesseract, PaddleOCR, EasyOCR, and VLM across 143 vision providers), Whisper audio/video transcription, chunking, language detection, embeddings, and structured LLM extraction are all built in.

Features

Feature	Description
96 file formats	PDF, Office, images, HTML/XML, email, archives, and academic formats across 8 categories
306 languages	Code intelligence — functions, classes, imports, symbols, docstrings — via tree-sitter
Polyglot	Native bindings for Rust, Python, Node.js, WebAssembly, Ruby, Go, Java, Kotlin, C#, PHP, Elixir, R, Dart, Swift, Zig, and C
OCR	Tesseract (incl. WASM), PaddleOCR, EasyOCR, and VLM OCR across 143 vision providers — extensible via plugins
Transcription	Whisper ONNX transcripts for MP3, M4A, WAV, WebM, and MP4 audio tracks
LLM intelligence	Structured JSON extraction, embeddings, and VLM OCR through liter-llm, including local engines
Deployment	Use as a library, CLI tool, REST API server, or MCP server
High performance	Rust core with pure-Rust PDF, SIMD optimizations, full parallelism, and streaming for multi-GB files
Token-efficient output	TOON wire format uses ~30–50% fewer tokens than JSON for LLM/RAG pipelines
Extensible	Plugin system for custom OCR backends, validators, post-processors, extractors, and renderers
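
To illustrate the token-efficient output row above, here is a minimal sketch of a TOON-style tabular encoding compared with JSON. It is not Xberg's encoder: the `to_tabular` helper and the `pages` sample data are hypothetical, the encoder handles only flat, uniform rows with no escaping, and it compares character counts rather than tokens, so it does not verify the ~30–50% figure.

```python
import json


def to_tabular(name, rows):
    """Encode a list of flat, uniform dicts in a TOON-style tabular layout.

    Simplified on purpose: every row must have the same keys in the same
    order, and values must not contain commas, newlines, or quotes
    (a real encoder has to escape those).
    """
    if not rows:
        return f"{name}[0]:"
    fields = list(rows[0].keys())
    for row in rows:
        if list(row.keys()) != fields:
            raise ValueError("rows must share the same keys in the same order")
    cols = ",".join(fields)
    # Field names are written once in the header instead of once per row.
    header = f"{name}[{len(rows)}]{{{cols}}}:"
    lines = ["  " + ",".join(str(row[f]) for f in fields) for row in rows]
    return "\n".join([header, *lines])


# Hypothetical sample data, not real Xberg output.
pages = [
    {"page": 1, "words": 412, "lang": "en"},
    {"page": 2, "words": 388, "lang": "en"},
    {"page": 3, "words": 97, "lang": "de"},
]

as_json = json.dumps({"pages": pages})
as_tabular = to_tabular("pages", pages)

print(as_tabular)
print("JSON chars:", len(as_json), "tabular chars:", len(as_tabular))
```

The saving comes from writing each field name once in the header rather than repeating it in every row, so it grows with the number of uniform rows.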

Supported Formats

96 file formats across 8 categories, including Office documents, images (OCR-enabled), web and structured data, email, archives, academic, and audio/video, plus code intelligence for 306 programming languages. See the format reference for the complete list.

⭐ Star this repo to show your support — it helps others discover Xberg.

Quick Start

Language Packages

Python

pip install xberg

uv add xberg

See Python README for full documentation.

Node.js

npm install @xberg/node

See Node.js README for full documentation.

Rust

cargo add xberg

See Rust README for full documentation.

Go

go get github.com/xberg-io/xberg

See Go README for full documentation.

Java

Available on Maven Central as io.xberg:xberg. See Java README for the dependency snippet and current version.

C#

dotnet add package Xberg

See C# README for full documentation.

Ruby

gem install xberg

See Ruby README for full documentation.

PHP

composer require xberg-io/xberg

See PHP README for full documentation.

Elixir

Add {:xberg, "~> 5.0"} to your mix.exs dependencies. See Elixir README for full documentation.

WebAssembly

npm install @xberg/wasm

See WebAssembly README for full documentation.

R

Install from r-universe. See R README for full documentation.

Kotlin (Android)

Available on Maven Central as io.xberg:xberg-android. See Kotlin README for the dependency snippet and current version.

Swift

Add via Swift Package Manager. See Swift README for full documentation.

Dart / Flutter

dart pub add xberg

See Dart README for full documentation.

Zig

Add via zig fetch. See Zig README for full documentation.

C/C++ (FFI)

Build from source as part of this workspace. See C (FFI) README for full documentation.

CLI

brew install xberg-io/tap/xberg

See CLI usage for full documentation.

Docker

docker pull ghcr.io/xberg-io/xberg:latest

See Docker guide for API, CLI, and MCP server modes.

MCP Server

Run Xberg as a Model Context Protocol server. The prebuilt binaries (Homebrew, install.sh, Docker) include it; from source, enable the mcp feature.

# Prebuilt (Homebrew / install.sh / Docker) — MCP is included
brew install xberg-io/tap/xberg
xberg mcp                                   # stdio (default)

# From source — enable the mcp feature
cargo install xberg-cli --features mcp
xberg mcp

# HTTP transport instead of stdio
xberg mcp --transport http --host 127.0.0.1 --port 8001
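
If you launch the server from a script rather than a shell, the commands above can be assembled programmatically. The following is a sketch only: `mcp_argv` and `start_mcp` are hypothetical helper names, and the only flags used are the ones shown above (`--transport`, `--host`, `--port`). For stdio the sketch passes no transport flag, since the commands above treat stdio as the default.

```python
import shutil
import subprocess


def mcp_argv(transport="stdio", host="127.0.0.1", port=8001):
    """Build the argument list for `xberg mcp`, mirroring the commands above."""
    argv = ["xberg", "mcp"]
    if transport == "http":
        argv += ["--transport", "http", "--host", host, "--port", str(port)]
    elif transport != "stdio":
        raise ValueError(f"unsupported transport: {transport!r}")
    return argv


def start_mcp(**kwargs):
    """Start the server as a child process.

    Returns None when no `xberg` binary is on PATH, so callers can fall
    back to an install step instead of crashing.
    """
    if shutil.which("xberg") is None:
        return None
    return subprocess.Popen(mcp_argv(**kwargs))


print(mcp_argv())
print(mcp_argv("http", port=8001))
```

Passing the arguments as a list avoids shell quoting issues when the host or port comes from configuration.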

Add it to an MCP client (Claude Desktop claude_desktop_config.json, Cursor .cursor/mcp.json):

{
  "mcpServers": {
    "xberg": { "command": "xberg", "args": ["mcp"] }
  }
}
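
If the client config already lists other servers, the entry has to be merged in rather than pasted over the file. A minimal sketch, assuming the config is valid JSON with an optional top-level `mcpServers` object; `add_xberg_server` is a hypothetical helper, and the demo path is a temporary stand-in for your real config file.

```python
import json
import tempfile
from pathlib import Path


def add_xberg_server(config_path):
    """Merge the xberg entry into an MCP client config, keeping other servers."""
    path = Path(config_path)
    # Start from the existing config if there is one, else from an empty object.
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    servers = config.setdefault("mcpServers", {})
    servers["xberg"] = {"command": "xberg", "args": ["mcp"]}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo against a throwaway file that already lists another (made-up) server.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / ".cursor" / "mcp.json"
    demo.parent.mkdir(parents=True)
    demo.write_text(
        json.dumps({"mcpServers": {"other": {"command": "other-server"}}}),
        encoding="utf-8",
    )
    merged = add_xberg_server(demo)
    on_disk = json.loads(demo.read_text(encoding="utf-8"))

print(sorted(merged["mcpServers"]))
```

Back up the real config file before running something like this against it, and restart the client afterwards so it picks up the change.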

See the MCP integration guide for tools, resources, prompts, HTTP transport, and configuration.

AI Coding Assistants

Install the Xberg plugin from the xberg-io/plugins marketplace. It ships the Xberg agent skills (extraction APIs, OCR backends, configuration, language conventions) and works with the coding agents listed below. Find your harness and follow its steps.

Claude Code

/plugin marketplace add xberg-io/plugins
/plugin install xberg@xberg

Codex CLI

/plugins add https://github.com/xberg-io/plugins

Then search for xberg and select Install Plugin.

Cursor

Settings → Plugins → Add from URL → https://github.com/xberg-io/plugins, then select xberg.

Gemini CLI

gemini extensions install https://github.com/xberg-io/plugins

Factory Droid

droid plugin marketplace add https://github.com/xberg-io/plugins
droid plugin install xberg@xberg

GitHub Copilot CLI

copilot plugin marketplace add https://github.com/xberg-io/plugins
copilot plugin install xberg@xberg

opencode

Add the package to opencode.json:

{
  "$schema": "https://opencode.ai/config.json",
  "plugin": ["@xberg/opencode-xberg"]
}

Documentation

Full guides, API references for every binding, and the complete format and configuration reference live at xberg.io. Try it in the browser with the live demo.

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

Join our Discord community for questions and discussion.

Part of Xberg.dev

Xberg Enterprise — managed extraction API with SDKs, dashboards, and observability.
crawlberg — web crawling and scraping with HTML→Markdown and headless-Chrome fallback.
html-to-markdown — fast, lossless HTML→Markdown engine.
liter-llm — universal LLM API client with native bindings for 14 languages and 143 providers.
tree-sitter-language-pack — tree-sitter grammars and code-intelligence primitives.
alef — the polyglot binding generator that produces every per-language binding across the 5 polyglot repos.

License

MIT License. See LICENSE for the full text.

