strutex

Structured Text Extraction — Extract structured JSON from documents using LLMs

pip install strutex

The Simplest Example

from strutex import extract
from strutex.schemas import INVOICE_US

invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)

That's it: two imports and one call. No custom schema to write.

Schemas are required — but you have options:

Built-in schemas — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)

Native types — Object, String, Number, Array (lightweight, no dependencies)

Pydantic models — Full type safety and validation

What You Can Do

Level	Features	When to use
Basic	`extract()`, schemas	Most use cases — just extract data
Reliability	`verify=True`, validators	Production — ensure accuracy
Scale	caching, async, batch	High volume — reduce costs
Architecture	agentic, router, fallback	Complex reasoning & robustness
Extensibility	plugins, hooks, CLI	Advanced — extend anything

Most users only need Level 1. The rest is there when you need it.

Level 1: Basic Extraction

With Pydantic (recommended)

import strutex
from pydantic import BaseModel

class Receipt(BaseModel):
    store: str
    date: str
    total: float

receipt = strutex.extract("receipt.jpg", model=Receipt)

With Native Schema

from strutex import extract, Object, String, Number

schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

result = extract("invoice.pdf", schema=schema)

With Built-in Schemas

from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING

invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)

Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.

Level 2: Reliability Features

Optional Double-Check

Set verify=True to have the LLM review its own output in a second pass. This is optional and aims to improve accuracy:

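import strutex

# ContractSchema stands for any schema you define yourself (for example, a Pydantic model)
result = strutex.extract(
    "contract.pdf",
    model=ContractSchema,
    verify=True  # LLM reviews its own output
)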
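The idea behind a review pass can be sketched without any LLM. The snippet below is a minimal illustration and not strutex's implementation: `extract_fn`, `review_fn`, and the stub functions are hypothetical stand-ins for the first model call and the review call.

```python
from typing import Any, Callable, Dict

Record = Dict[str, Any]

def extract_with_review(
    document: str,
    extract_fn: Callable[[str], Record],
    review_fn: Callable[[str, Record], Record],
) -> Record:
    """Run a first-pass extraction, then let a reviewer correct it."""
    draft = extract_fn(document)
    corrections = review_fn(document, draft)
    # Corrected fields override the draft; untouched fields pass through.
    return {**draft, **corrections}

# Stub "models" so the sketch runs offline (hypothetical, for illustration only).
def fake_extract(document: str) -> Record:
    return {"party": "ACME Corp", "term_months": 21}

def fake_review(document: str, draft: Record) -> Record:
    # Pretend the reviewer re-read the document and fixed one field.
    return {"term_months": 12} if "12 months" in document else {}

reviewed = extract_with_review(
    "Term: 12 months. Party: ACME Corp", fake_extract, fake_review
)
print(reviewed)
```

The reviewer only returns the fields it wants to change, which keeps the second pass cheap to merge and easy to log.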

Choosing a Provider

Create a provider instance for full control over API keys and configuration:

from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US

# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))

# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))

# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))

# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))

result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)

Note: String providers like provider="gemini" are convenience shortcuts that assume the required credentials are already set in environment variables. For production, explicit provider instances are recommended.
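One way to keep credentials explicit is to read the key yourself and fail early when it is missing. This is a small sketch under assumptions: `require_env` is a hypothetical helper, and `MY_LLM_API_KEY` is a made-up variable name, not one that strutex is known to read.

```python
import os

def require_env(name: str) -> str:
    """Return an environment variable's value or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# MY_LLM_API_KEY is a hypothetical name, set here only so the demo runs.
os.environ["MY_LLM_API_KEY"] = "demo-key"
api_key = require_env("MY_LLM_API_KEY")
# The key can then be passed explicitly, e.g. GeminiProvider(api_key=api_key).
```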

Level 3: Scale Features

Caching (reduce API costs)

from strutex import DocumentProcessor
from strutex.cache import SQLiteCache

processor = DocumentProcessor(
    provider="gemini",
    cache=SQLiteCache("cache.db")
)
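To see why a cache cuts API costs, here is a minimal standalone sketch of a content-addressed cache on top of `sqlite3`. It is not how `SQLiteCache` is implemented; `TinyCache`, `cached_extract`, and `expensive_extract` are hypothetical names, and the key scheme (hash of file bytes plus prompt) is an assumption.

```python
import hashlib
import json
import sqlite3

class TinyCache:
    """Store JSON results keyed by a hash of the document bytes and the prompt."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT)"
        )

    @staticmethod
    def make_key(content: bytes, prompt: str) -> str:
        digest = hashlib.sha256()
        digest.update(content)
        digest.update(b"\x00")  # separator so content and prompt cannot blur together
        digest.update(prompt.encode("utf-8"))
        return digest.hexdigest()

    def get(self, key: str):
        row = self.db.execute(
            "SELECT value FROM cache WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0]) if row else None

    def set(self, key: str, value) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO cache (key, value) VALUES (?, ?)",
            (key, json.dumps(value)),
        )
        self.db.commit()

call_log = []  # records every "paid" call so the saving is visible

def expensive_extract(content: bytes, prompt: str) -> dict:
    call_log.append(prompt)
    return {"total": 42.0}

def cached_extract(cache: TinyCache, content: bytes, prompt: str) -> dict:
    key = cache.make_key(content, prompt)
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = expensive_extract(content, prompt)
    cache.set(key, result)
    return result

cache = TinyCache()
first = cached_extract(cache, b"%PDF-demo", "Extract")
second = cached_extract(cache, b"%PDF-demo", "Extract")
```

The second call is served from SQLite, so the expensive function runs only once for identical input.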

Async Processing

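import asyncio
from strutex import DocumentProcessor, Object, String, Number

# Any schema works here; this one mirrors the native schema example
schema = Object(properties={
    "invoice_number": String,
    "total": Number,
})

async def main():
    processor = DocumentProcessor(provider="anthropic")
    # Both documents are processed concurrently
    results = await asyncio.gather(
        processor.aprocess("doc1.pdf", "Extract", schema),
        processor.aprocess("doc2.pdf", "Extract", schema)
    )
    return results

results = asyncio.run(main())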
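With many documents, an unbounded `asyncio.gather` can run into provider rate limits. A common pattern is to cap concurrency with a semaphore. The sketch below uses a hypothetical `fake_aprocess` coroutine in place of a real processor call, so it runs without any API access.

```python
import asyncio

async def fake_aprocess(path: str) -> dict:
    """Stand-in for an async extraction call (hypothetical)."""
    await asyncio.sleep(0.01)
    return {"file": path}

async def process_all(paths, limit: int = 2):
    # At most `limit` documents are in flight at any moment.
    semaphore = asyncio.Semaphore(limit)

    async def worker(path: str) -> dict:
        async with semaphore:
            return await fake_aprocess(path)

    # gather returns results in the same order as the input paths.
    return await asyncio.gather(*(worker(p) for p in paths))

batch_results = asyncio.run(process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"]))
```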

Level 4: Advanced Architectures

Move beyond simple extraction with specialized processors for complex workflows.

Agentic RAG (Self-Correcting)

The AgenticProcessor uses a planner-actor-optimizer loop to solve complex queries by actively searching, reading, and correcting itself.

import asyncio
from strutex import AgenticProcessor

async def main():
    processor = AgenticProcessor()
    # Automatically plans, searches, and compiles answer
    return await processor.aprocess(
        file_path="handbook.pdf",
        prompt="What is the policy for jury duty based on the employee handbook?"
    )

result = asyncio.run(main())

Specialized Processors

Compose robust pipelines using built-in strategies:

FallbackProcessor: Switch providers if primary fails.
RouterProcessor: Route to different models based on document type.
EnsembleProcessor: Query multiple models and vote on the best answer.
PrivacyProcessor: Redact PII locally before sending to cloud LLMs.

See Advanced Processors Documentation for details.
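Two of the strategies listed above, fallback and ensemble voting, can be illustrated in plain Python. This is a conceptual sketch only: `with_fallback`, `majority_vote`, and the stub providers are hypothetical and say nothing about how `FallbackProcessor` or `EnsembleProcessor` are built.

```python
import json
from collections import Counter

def with_fallback(providers, document):
    """Try each provider in order and return the first successful result."""
    errors = []
    for provider in providers:
        try:
            return provider(document)
        except Exception as exc:  # a real pipeline would catch narrower errors
            errors.append(exc)
    raise RuntimeError(f"All {len(providers)} providers failed: {errors}")

def majority_vote(answers):
    """Return the answer that most providers agree on."""
    # Dicts are unhashable, so votes are counted on a canonical JSON string.
    counts = Counter(json.dumps(a, sort_keys=True) for a in answers)
    winner, _ = counts.most_common(1)[0]
    return json.loads(winner)

# Stub providers so the sketch runs offline.
def flaky(document):
    raise TimeoutError("primary unavailable")

def stable(document):
    return {"total": 100}

def outlier(document):
    return {"total": 10}

fallback_result = with_fallback([flaky, stable], "invoice text")
voted = majority_vote([stable("x"), outlier("x"), stable("x")])
```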

Level 5: Extensibility

Plugin System

Everything is pluggable. Just inherit from a base class:

Type	Purpose	Examples
`Provider`	LLM backends	Gemini, OpenAI, Claude, Ollama
`Extractor`	Document parsing	PDF, Image OCR, Excel
`Validator`	Output validation	Schema, sum checks, date formats
`SecurityPlugin`	Input/output protection	Injection detection, sanitization
`Postprocessor`	Data transformation	Date/number normalization

from strutex.plugins import Provider, Extractor, Validator
# ValidationResult (used below) must be imported too; its import path is not shown here

# Custom LLM Provider
class MyProvider(Provider):
    """Auto-registered as 'myprovider'"""
    def process(self, file_path, prompt, schema, mime_type, **kwargs):
        # Call your LLM API
        ...

# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
    """Handle .docx files"""
    mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]

    def extract(self, file_path: str) -> str:
        # Parse .docx and return text
        ...

# Custom Validator
class TotalValidator(Validator):
    """Verify line items sum to total"""
    def validate(self, data, schema, context):
        items_sum = sum(item["amount"] for item in data.get("items", []))
        return ValidationResult(
            valid=abs(items_sum - data["total"]) < 0.01,
            message="Line items must sum to total"
        )
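The sum check in `TotalValidator` compares floats with a tolerance. If amounts arrive as floats or strings, doing the arithmetic in `Decimal` avoids binary rounding noise. The helper below is a standalone sketch; `totals_match` is a hypothetical name and not part of strutex.

```python
from decimal import Decimal

def totals_match(items, total, tolerance=Decimal("0.01")) -> bool:
    """Check that line-item amounts add up to the stated total."""
    # Going through str() keeps 0.1 as Decimal("0.1") instead of its binary expansion.
    items_sum = sum(
        (Decimal(str(item["amount"])) for item in items), Decimal("0")
    )
    return abs(items_sum - Decimal(str(total))) < tolerance

data = {"items": [{"amount": 0.1}, {"amount": 0.2}], "total": 0.3}
ok = totals_match(data["items"], data["total"])
mismatch = totals_match(data["items"], 0.35)
```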

CLI Commands

strutex plugins list              # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider

For Distributable Packages

# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"
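Entry points declared this way can be inspected with the standard library. The sketch below lists whatever is registered under the `strutex.providers` group in the current environment; `discover` is a hypothetical helper, and how strutex itself loads these entries is not shown here.

```python
from importlib.metadata import entry_points

def discover(group: str) -> dict:
    """Map entry-point names to their 'module:attr' targets without importing them."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.value for ep in selected}

providers = discover("strutex.providers")
print(providers)  # empty if no package registers providers in this environment
```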

Hooks System

Inject logic at any point in the processing pipeline:

from strutex import DocumentProcessor

processor = DocumentProcessor(provider="gemini")

@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
    """Modify prompt before sending to LLM"""
    return {"prompt": prompt + "\nBe precise and thorough."}

@processor.on_post_process
def normalize_dates(result, context):
    """Transform output after extraction"""
    if "date" in result:
        # parse_date is a placeholder for your own date-parsing helper
        result["date"] = parse_date(result["date"])
    return result

@processor.on_error
def handle_rate_limit(error, file_path, context):
    """Custom error handling"""
    if "rate limit" in str(error).lower():
        return {"error": "Rate limited, please retry"}
    return None  # Propagate other errors
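The `normalize_dates` hook above calls `parse_date`, which the example does not define. A minimal stand-in using only the standard library might look like this. The accepted formats are an assumption (day-first for slashed dates); adjust them to your documents.

```python
from datetime import datetime

# Formats tried in order; slashed and dotted dates are treated as day-first.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")

def parse_date(value: str) -> str:
    """Return the date as ISO 8601 (YYYY-MM-DD), or the input unchanged if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return value

record = {"date": "05/03/2024", "total": 12.5}
if "date" in record:
    record["date"] = parse_date(record["date"])
```

Returning the original string on failure keeps the hook from raising on unexpected input; a stricter pipeline could raise instead.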

Optional Extras

pip install strutex[cli]          # CLI commands
pip install strutex[ocr]          # OCR support
pip install strutex[rag]          # RAG support (Qdrant, FastEmbed)
pip install strutex[langchain]    # LangChain integration
pip install strutex[llamaindex]   # LlamaIndex integration
pip install strutex[all]          # Everything

Supported Formats

Format	Extensions	Method
PDF	`.pdf`	Text extraction with fallback chain
Images	`.png`, `.jpg`, `.tiff`	Direct vision or OCR
Excel	`.xlsx`, `.xls`	Converted to structured text
Text	`.txt`, `.csv`	Direct input

Full Feature List

Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
Hooks — Callbacks and decorators for pre/post processing pipeline
CLI Tooling — strutex plugins list|info|refresh commands
Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
Universal Document Support — PDFs, images, Excel, and custom formats
Schema-Driven Extraction — Define your output structure, get consistent JSON
Verification & Self-Correction — Built-in audit loop for high accuracy
Security First — Built-in input sanitization and output validation
RAG Capabilities — Built-in Retrieval-Augmented Generation with Qdrant and FastEmbed
Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
Caching — Memory, SQLite, and file-based caching
Async & Batch — Process multiple documents in parallel
Streaming — Real-time extraction feedback

Documentation

📚 Read the Docs

Roadmap

See ROADMAP.md for the full development plan.

Recent releases:

v0.1.0 — Core functionality
v0.2.0 — Plugin registry + Security layer
v0.3.0 — Plugin System v2
v0.6.0 — Built-in Schemas & Logging
v0.7.0 — Providers & Retries
v0.8.0 — Async, Batch, Cache, Verification
v0.8.1 — Documentation & Coverage Fixes
v1.3.7 — Agentic RAG, Advanced Processors, & Full Async Support

License

This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.

Contributing

Contributions welcome! Priority areas:

New plugins — Providers, extractors, validators
Documentation — Examples and tutorials
Testing — Expand test coverage
