Agentic PDF Search via MCP — Inspired by PageIndex
Vectorless, reasoning-based document retrieval that thinks like a human
PageIndex Light MCP brings agentic search to your PDF documents through the Model Context Protocol (MCP). Instead of ranking chunks by vector similarity, it has an LLM reason over per-page summaries to pick the relevant pages, much as a person would skim a table of contents before opening a page.
Inspired by VectifyAI/PageIndex and pageindex-mcp.
- Agentic Search — LLM-powered semantic search through document structure
- MCP Sampling — Native MCP protocol sampling support
- LLM Fallback — Auto-fallback to OpenAI-compatible APIs for non-sampling clients
- OCR Fallback — Automatic OCR for scanned PDFs
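
The choice between MCP Sampling and the LLM fallback can be pictured as a small router. This is a minimal sketch, not the project's actual code: `LLMRouter`, `Complete`, and both attribute names are hypothetical, and it assumes each provider can be wrapped as a plain "prompt in, text out" function.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Assumed shape of a provider: takes a prompt, returns the model's text.
Complete = Callable[[str], str]


@dataclass
class LLMRouter:
    """Hypothetical helper: prefer MCP sampling, else the fallback API."""

    sampling: Optional[Complete] = None  # set when the MCP client supports sampling
    fallback: Optional[Complete] = None  # set when PAGEINDEX_LLM_* is configured

    def complete(self, prompt: str) -> str:
        if self.sampling is not None:
            return self.sampling(prompt)
        if self.fallback is not None:
            return self.fallback(prompt)
        raise RuntimeError("no LLM available: client lacks sampling and no fallback is configured")


# Usage with stand-in providers (no network calls are made here).
router = LLMRouter(sampling=None, fallback=lambda p: "fallback:" + p)
answer = router.complete("summarize page 1")
```

Keeping both providers behind one function type means the summarization and search steps do not need to know which one answered.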
| Tool | Description |
|---|---|
| `get_index` | Get PDF index with semantic search support |
| `get_detail` | Retrieve detailed content of a specific page |
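
MCP clients invoke tools with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such requests for the two tools. The envelope follows the MCP specification, but the argument names (`path`, `query`, `page`) and the file path are assumptions for illustration, not the server's documented schema.

```python
import json


def tool_call(request_id: int, name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 `tools/call` request for an MCP server."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": name, "arguments": arguments},
        }
    )


# Argument names and values below are hypothetical.
index_request = tool_call(1, "get_index", {"path": "/docs/report.pdf", "query": "revenue by region"})
detail_request = tool_call(2, "get_detail", {"path": "/docs/report.pdf", "page": 12})
```

In practice an MCP client library sends these for you; the raw shape is mainly useful when debugging traffic.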
flowchart TB
subgraph Input
A[PDF File] --> B{Text Extraction}
end
subgraph TextExtraction["Text Extraction"]
B -->|Success| C[Raw Text]
B -->|Empty/Minimal| D{OCR Configured?}
D -->|Yes| E[Vision LLM OCR]
D -->|No| C
E --> C
end
subgraph Indexing
C --> F[LLM Summarization]
F -->|Per Page| G[Page Summaries]
G --> H[(Cached Index)]
end
subgraph Search["Agentic Search"]
I[User Query] --> J{Has Query?}
J -->|No| K[Return Full Index]
J -->|Yes| L[LLM Reasoning]
H --> L
L --> M[Ranked Results]
end
subgraph LLMProvider["LLM Provider"]
N{MCP Sampling?}
N -->|Supported| O[MCP Client LLM]
N -->|Not Supported| P[Fallback LLM API]
end
F -.-> N
L -.-> N
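
The flow above can be sketched in a few functions. This is an illustration under stated assumptions rather than the server's implementation: the `MIN_CHARS` threshold for "empty/minimal" text, the in-memory cache, and every function name are hypothetical, and the OCR, summarization, and ranking steps are passed in as callables so the sketch runs without an LLM.

```python
from typing import Callable, Dict, List, Optional, Tuple

MIN_CHARS = 20  # assumed cutoff for "empty/minimal" extracted text

Index = Dict[int, str]  # 1-based page number -> page summary


def page_text(extracted: str, ocr: Optional[Callable[[], str]]) -> str:
    """Keep extracted text unless it is near-empty and OCR is configured."""
    if len(extracted.strip()) >= MIN_CHARS or ocr is None:
        return extracted
    return ocr()


_cache: Dict[str, Index] = {}  # assumed cache, keyed by a PDF identifier


def get_index(pdf_key: str, pages: List[str], summarize: Callable[[str], str]) -> Index:
    """Summarize each page once and reuse the cached index afterwards."""
    if pdf_key not in _cache:
        _cache[pdf_key] = {n: summarize(t) for n, t in enumerate(pages, start=1)}
    return _cache[pdf_key]


def search(
    index: Index,
    query: Optional[str],
    rank: Callable[[Index, str], List[int]],
) -> List[Tuple[int, str]]:
    """Without a query return the full index; otherwise return ranked pages."""
    if not query:
        return sorted(index.items())
    return [(n, index[n]) for n in rank(index, query)]
```

Here `rank` stands in for the LLM reasoning step: it receives the summaries and the query and returns page numbers in order of relevance.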
Add to your MCP config:
{
"mcpServers": {
"pageindex": {
"command": "uv",
"args": ["run", "--directory", "/path/to/pageindex-light-mcp", "server.py"],
"env": {
"PAGEINDEX_LLM_BASE_URL": "https://api.openai.com/v1",
"PAGEINDEX_LLM_API_KEY": "sk-xxx",
"PAGEINDEX_LLM_MODEL": "gpt-4o-mini",
"PAGEINDEX_OCR_BASE_URL": "https://api.openai.com/v1",
"PAGEINDEX_OCR_API_KEY": "sk-xxx",
"PAGEINDEX_OCR_MODEL": "gpt-4o-mini"
}
}
}
}

Both configurations are optional and independent:
| Variable | Purpose | Required |
|---|---|---|
| `PAGEINDEX_LLM_*` | Fallback for non-Sampling MCP clients | Optional |
| `PAGEINDEX_OCR_*` | Fallback for scanned PDFs (when text extraction fails) | Optional |
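
Because the two groups are independent, a server could load each one separately and treat a group as disabled when it is not set. The sketch below shows one way to do that. `Endpoint` and `load_endpoint` are hypothetical names, and requiring all three variables of a group is an assumption; the actual server may apply defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    api_key: str
    model: str


def load_endpoint(prefix: str, env: Mapping[str, str] = os.environ) -> Optional[Endpoint]:
    """Return an Endpoint if all three <prefix>_* variables are set, else None."""
    values = [env.get(f"{prefix}_{name}") for name in ("BASE_URL", "API_KEY", "MODEL")]
    if not all(values):
        return None  # assumption: a partially configured group counts as disabled
    return Endpoint(*values)


llm = load_endpoint("PAGEINDEX_LLM")  # None when the LLM fallback is not configured
ocr = load_endpoint("PAGEINDEX_OCR")  # None when the OCR fallback is not configured
```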
# LLM Config — Used when MCP client doesn't support Sampling
PAGEINDEX_LLM_BASE_URL=https://api.openai.com/v1
PAGEINDEX_LLM_API_KEY=sk-xxx
PAGEINDEX_LLM_MODEL=gpt-4o-mini
# OCR Config — Used when PDF text extraction returns empty/minimal content
PAGEINDEX_OCR_BASE_URL=https://api.openai.com/v1
PAGEINDEX_OCR_API_KEY=sk-xxx
PAGEINDEX_OCR_MODEL=gpt-4o-mini # Any vision-capable model

MIT License