Structured Text Extraction — Extract structured JSON from documents using LLMs
pip install strutexfrom strutex import extract
from strutex.schemas import INVOICE_US
invoice = extract("invoice.pdf", model=INVOICE_US)
print(invoice.invoice_number, invoice.total)That's it. Three lines. No custom schema to write.
Schemas are required — but you have options:
- Built-in schemas — 10+ ready-to-use (invoices, receipts, shipping docs, resumes)
- Native types —
Object,String,Number,Array(lightweight, no dependencies)- Pydantic models — Full type safety and validation
| Level | Features | When to use |
|---|---|---|
| Basic | extract(), schemas |
Most use cases — just extract data |
| Reliability | verify=True, validators |
Production — ensure accuracy |
| Scale | caching, async, batch | High volume — reduce costs |
| Architecture | agentic, router, fallback | Complex reasoning & robustness |
| Extensibility | plugins, hooks, CLI | Advanced — extend anything |
Most users only need Level 1. The rest is there when you need it.
import strutex
from pydantic import BaseModel
class Receipt(BaseModel):
store: str
date: str
total: float
receipt = strutex.extract("receipt.jpg", model=Receipt)from strutex import extract, Object, String, Number
schema = Object(properties={
"invoice_number": String,
"total": Number,
})
result = extract("invoice.pdf", schema=schema)from strutex import extract
from strutex.schemas import INVOICE_US, BILL_OF_LADING
invoice = extract("invoice.pdf", model=INVOICE_US)
bol = extract("bl.pdf", model=BILL_OF_LADING)Available: INVOICE_GENERIC, INVOICE_US, INVOICE_EU, RECEIPT, PURCHASE_ORDER, BILL_OF_LADING, RESUME, BANK_STATEMENT, etc.
Ask the LLM to validate its own answers automatically — adds accuracy, completely optional:
result = strutex.extract(
"contract.pdf",
model=ContractSchema,
verify=True # LLM reviews its own output
)Create a provider instance for full control over API keys and configuration:
from strutex import DocumentProcessor
from strutex import GeminiProvider, OpenAIProvider, AnthropicProvider, OllamaProvider
from strutex.schemas import INVOICE_US
# Google Gemini
processor = DocumentProcessor(provider=GeminiProvider(api_key="your-key"))
# OpenAI
processor = DocumentProcessor(provider=OpenAIProvider(api_key="your-key", model="gpt-4o"))
# Anthropic Claude
processor = DocumentProcessor(provider=AnthropicProvider(api_key="your-key"))
# Local with Ollama (no API key needed)
processor = DocumentProcessor(provider=OllamaProvider(model="llama3"))
result = processor.process("doc.pdf", "Extract data", model=INVOICE_US)Note: String providers like
provider="gemini"are convenience shortcuts that assume correct environment variables. For production, explicit provider instances are recommended.
from strutex import DocumentProcessor
from strutex.cache import SQLiteCache
processor = DocumentProcessor(
provider="gemini",
cache=SQLiteCache("cache.db")
)import asyncio
from strutex import DocumentProcessor
async def main():
processor = DocumentProcessor(provider="anthropic")
results = await asyncio.gather(
processor.aprocess("doc1.pdf", "Extract", schema),
processor.aprocess("doc2.pdf", "Extract", schema)
)
asyncio.run(main())Move beyond simple extraction with specialized processors for complex workflows.
The AgenticProcessor uses a planner-actor-optimizer loop to solve complex queries by actively searching, reading, and correcting itself.
from strutex import AgenticProcessor
processor = AgenticProcessor()
# Automatically plans, searches, and compiles answer
result = await processor.aprocess(
file_path="handbook.pdf",
prompt="What is the policy for jury duty based on the employee handbook?"
)Compose robust pipelines using built-in strategies:
FallbackProcessor: Switch providers if primary fails.RouterProcessor: Route to different models based on document type.EnsembleProcessor: Query multiple models and vote on the best answer.PrivacyProcessor: Redact PII locally before sending to cloud LLMs.
See Advanced Processors Documentation for details.
Everything is pluggable. Just inherit from a base class:
| Type | Purpose | Examples |
|---|---|---|
Provider |
LLM backends | Gemini, OpenAI, Claude, Ollama |
Extractor |
Document parsing | PDF, Image OCR, Excel |
Validator |
Output validation | Schema, sum checks, date formats |
SecurityPlugin |
Input/output protection | Injection detection, sanitization |
Postprocessor |
Data transformation | Date/number normalization |
from strutex.plugins import Provider, Extractor, Validator
# Custom LLM Provider
class MyProvider(Provider):
"""Auto-registered as 'myprovider'"""
def process(self, file_path, prompt, schema, mime_type, **kwargs):
# Call your LLM API
...
# Custom Document Extractor
class WordExtractor(Extractor, name="word"):
"""Handle .docx files"""
mime_types = ["application/vnd.openxmlformats-officedocument.wordprocessingml.document"]
def extract(self, file_path: str) -> str:
# Parse .docx and return text
...
# Custom Validator
class TotalValidator(Validator):
"""Verify line items sum to total"""
def validate(self, data, schema, context):
items_sum = sum(item["amount"] for item in data.get("items", []))
return ValidationResult(
valid=abs(items_sum - data["total"]) < 0.01,
message="Line items must sum to total"
)strutex plugins list # List all plugins
strutex plugins list --type provider
strutex plugins info gemini --type provider# pyproject.toml
[project.entry-points."strutex.providers"]
my_provider = "my_package:MyProvider"Inject logic at any point in the processing pipeline:
from strutex import DocumentProcessor
processor = DocumentProcessor(provider="gemini")
@processor.on_pre_process
def add_instructions(file_path, prompt, schema, mime_type, context):
"""Modify prompt before sending to LLM"""
return {"prompt": prompt + "\nBe precise and thorough."}
@processor.on_post_process
def normalize_dates(result, context):
"""Transform output after extraction"""
if "date" in result:
result["date"] = parse_date(result["date"])
return result
@processor.on_error
def handle_rate_limit(error, file_path, context):
"""Custom error handling"""
if "rate limit" in str(error).lower():
return {"error": "Rate limited, please retry"}
return None # Propagate other errorspip install strutex[cli] # CLI commands
pip install strutex[ocr] # OCR support
pip install strutex[rag] # RAG support (Qdrant, FastEmbed)
pip install strutex[langchain] # LangChain integration
pip install strutex[llamaindex] # LlamaIndex integration
pip install strutex[all] # Everything| Format | Extensions | Method |
|---|---|---|
.pdf |
Text extraction with fallback chain | |
| Images | .png, .jpg, .tiff |
Direct vision or OCR |
| Excel | .xlsx, .xls |
Converted to structured text |
| Text | .txt, .csv |
Direct input |
Click to expand all features
- Plugin System v2 — Auto-registration via inheritance, lazy loading, entry points
- Hooks — Callbacks and decorators for pre/post processing pipeline
- CLI Tooling —
strutex plugins list|info|refreshcommands - Multi-Provider LLM Support — Gemini, OpenAI, Anthropic, Ollama, Groq, Langdock
- Universal Document Support — PDFs, images, Excel, and custom formats
- Schema-Driven Extraction — Define your output structure, get consistent JSON
- Verification & Self-Correction — Built-in audit loop for high accuracy
- Security First — Built-in input sanitization and output validation
- RAG Capabilities — Built-in Retrieval-Augmented Generation with Qdrant and FastEmbed
- Framework Integrations — LangChain, LlamaIndex, Haystack compatibility
- Caching — Memory, SQLite, and file-based caching
- Async & Batch — Process multiple documents in parallel
- Streaming — Real-time extraction feedback
See ROADMAP.md for the full development plan.
Recent releases:
- v0.1.0 — Core functionality
- v0.2.0 — Plugin registry + Security layer
- v0.3.0 — Plugin System v2
- v0.6.0 — Built-in Schemas & Logging
- v0.7.0 — Providers & Retries
- v0.8.0 — Async, Batch, Cache, Verification
- v0.8.1 — Documentation & Coverage Fixes
- v1.3.7 — Agentic RAG, Advanced Processors, & Full Async Support
This project is licensed under the GNU General Public License v3.0 — see LICENSE for details.
Contributions welcome! Priority areas:
- New plugins — Providers, extractors, validators
- Documentation — Examples and tutorials
- Testing — Expand test coverage