This repo is an end-to-end agentic RAG system: ingest PDFs, store embeddings in Postgres/pgvector, answer questions through an OpenAI-compatible API for OpenWebUI, and trace activity in Arize Phoenix.
- Indexer CLI (Docling → chunk → embed → store in pgvector)
- Backend API (FastAPI +
/v1/chat/completions+ agent tools) - Evaluator CLI (RAGAS metrics over a generated/curated test set)
- Observability (Phoenix traces for retrieval + tool calls)
| Area | Choice | Where |
|---|---|---|
| REST API contract | OpenAI-compatible (/v1/chat/completions, /v1/models) |
src/agentic_rag/backend/api/v1/ |
| Chunking strategy | Heading-first contextual chunking with optional LLM context | src/agentic_rag/indexer/chunking.py |
| Embedding model | qwen3-embedding:0.6b (Ollama) |
.env.example |
| LLM model | qwen3:1.7b (Ollama) |
.env.example |
| Retrieval | Hybrid (pgvector + Postgres full-text) with RRF fusion | src/agentic_rag/backend/rag/retriever.py |
| Re-ranking | LLM reranker (Ollama) | src/agentic_rag/backend/rag/reranker.py |
| Agent prompts | Jinja2 prompts synced to Phoenix | src/agentic_rag/prompts/ |
- Start the stack
- Drop PDFs into
data/raw/ - Run the indexer
- Open OpenWebUI and chat
- Open Phoenix and inspect traces
- Run evaluation and review RAGAS scores
git clone <repo-url>
cd agentic-rag
cp .env.example .env
docker compose up -d
curl http://localhost:8000/healthOn first launch the ollama-init service automatically pulls the models
defined in .env (LLM_MODEL and EMBEDDING_MODEL), and the backend
applies SQL migrations on startup — no manual steps required.
The repo includes PDPL (Personal Data Protection Law) documents in data/sample/.
To index them so the chatbot can answer questions:
docker compose exec backend agentic-index --source data/sample/Then open http://localhost:3000 (OpenWebUI) and ask questions like "What is PDPL?" or "What are the rules for transferring personal data outside the Kingdom?"
Mac with host Ollama (Metal GPU): Use the compose override to skip the containerised Ollama and its init job:
docker compose -f docker-compose.yml -f docker-compose.mac.yml up -dYou must pull the models yourself:
ollama pull qwen3:1.7b && ollama pull qwen3-embedding:0.6b
Full reset: To wipe all data and start fresh:
docker compose down -v # removes containers + volumes docker compose up -d # recreates everything
Local Development (outside Docker): The
.env.exampleuses Docker service names (postgres,ollama,phoenix). If running locally without Docker, update these tolocalhostin your.envfile. Note: if Docker Compose is running, Postgres is on host port 5433, not 5432:DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5433/ragdb OLLAMA_BASE_URL=http://localhost:11434 PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006/v1/traces
Put PDFs in data/raw/ then:
agentic-index --source data/raw/Index versioning: If you change the embedding model, tokenizer, or chunking settings, bump
INDEX_VERSION (in .env) and re-run the indexer. This keeps retrieval aligned to the
correct embedding space.
Chunking modes:
| Mode | Command | What it does | When to use |
|---|---|---|---|
fast |
agentic-index --source data/raw/ |
Structured prefix only ([Doc: ...][Section: ...]) |
Default. Fast, deterministic, good for most documents |
llm |
agentic-index --source data/raw/ --mode llm |
Prefix + LLM-generated context summary per chunk | When embedding quality matters more than indexing speed. Uses first 6000 chars of the document for context, so works best with focused documents. Non-deterministic. |
For PDFs, Docling extracts page count and the chunker estimates page numbers per chunk based on character offsets. Markdown files don't have page numbers.
- OpenWebUI:
http://localhost:3000 - Backend API:
http://localhost:8000 - Ollama:
http://localhost:11434
The backend exposes:
GET /v1/modelsPOST /v1/chat/completionsGET /docs— interactive Swagger UI
- Default (portable): Use the Ollama container.
OLLAMA_BASE_URL=http://ollama:11434 - Optional (Mac speed): Use host Ollama with Metal acceleration:
docker compose -f docker-compose.yml -f docker-compose.mac.yml up -d# Make sure evaluator model is pulled
ollama pull qwen3:4b
# 1. Generate a synthetic test set from indexed chunks
agentic-eval generate --num-samples 10 --output eval_testset.json
# 2. Run retrieval + answer pipeline and compute RAGAS metrics
agentic-eval evaluate --testset eval_testset.json --output eval_results.json
# 3. Pretty-print the results
agentic-eval report --results eval_results.jsonRun evaluations on a schedule to monitor retrieval quality over time:
agentic-eval monitor --testset eval_testset.json --output-dir eval_runs --interval-seconds 3600Set --skip-ragas for faster retrieval-only monitoring.
Note on evaluation data: agentic-eval generate creates a synthetic Q/A dataset from random chunks. If you need curated ground-truth, provide a JSON file in the same format (question, ground_truth, and optional metadata) and pass it to agentic-eval evaluate.
RAGAS evaluation uses a separate evaluator model (EVAL_MODEL, default: qwen3:4b) to avoid self-evaluation bias — the chat model does not judge its own output. Pull it before running evaluation:
# If using Docker:
docker compose exec ollama ollama pull qwen3:4b
# If running Ollama locally:
ollama pull qwen3:4bOverride the evaluator model via EVAL_MODEL in .env if needed.
Phoenix UI: http://localhost:6006
What to check:
- retrieved chunks and scores
- tool call sequence (retriever → rerank → response)
Prompts are stored as Jinja2 templates in src/agentic_rag/prompts/. Some are only used in optional modes (agent mode, LLM chunking, or eval generation).
| Template | Used by | Purpose |
|---|---|---|
system_prompt.j2 |
Chat endpoint | System instructions for the chat model |
user_prompt.j2 |
Chat endpoint, evaluator | Main RAG prompt: injects query + retrieved context |
context_generation_template.j2 |
Indexer (--mode llm) |
Generates contextual summaries per chunk (Anthropic-style) |
reranker_template.j2 |
LLM reranker | Scores chunk relevance to a query |
researcher_backstory.j2 |
CrewAI researcher agent | Agent persona and instructions |
writer_backstory.j2 |
CrewAI writer agent | Agent persona and instructions |
qa_generation_template.j2 |
Evaluator (testset generation) | Generates synthetic Q/A pairs from chunks |
scope_anchors.txt |
Scope gate | Anchor phrases used to classify in-scope queries |
Phoenix sync: When PHOENIX_PROMPT_SYNC=true (default in .env.example), the backend and CLI tools push all templates to Phoenix on startup and tag them with PHOENIX_PROMPT_TAG (default: development). In production, set the tag to production to version prompts in the Phoenix UI.
When PHOENIX_PROMPT_SYNC=false, prompts are served from the local .j2 files only. Disable sync during local development to avoid unnecessary Phoenix calls.
In production (ENVIRONMENT=prod), PromptRegistry.render() and get_template() fetch the tagged prompt from Phoenix first and fall back to local if Phoenix is unreachable.
Phoenix checklist:
- Set
ENVIRONMENT=prodandPHOENIX_PROMPT_TAG=demo(in.envor your shell) - Start backend or CLI
- In Phoenix UI, confirm prompts exist under the tag
- Edit a prompt, re-run a query, and confirm the response changes
RRF weights: Configure in .env with RRF_WEIGHT_VECTOR and RRF_WEIGHT_KEYWORD.
Reranker settings:
| Setting | Default | Notes |
|---|---|---|
TOP_K_RERANK |
5 | Final number of chunks returned after reranking |
RERANKER_TIMEOUT |
30s | Total timeout; falls back to retrieval order on expiry |
TOP_K_RETRIEVAL |
10 | Candidates from hybrid search before reranking |
The reranker is only active in agent mode (CrewAI path). Fast RAG skips it entirely.
Each response includes structured citations with complete source metadata. The backend returns an AgentResponse with a citations array containing:
Citation Schema:
{
"document_id": "uuid",
"chunk_id": "uuid",
"file_name": "document.pdf",
"page_number": 12,
"section_path": "Introduction > Overview",
"chunk_text": "Retrieved text snippet...",
"score": 0.92
}Fields:
document_id: UUID of source documentchunk_id: UUID of specific chunkfile_name: Original filenamepage_number: Page number (null if unavailable)section_path: Hierarchical section location (e.g., "Chapter 1 > Section 1.2")chunk_text: Actual retrieved textscore: Relevance score (0.0-1.0)
The agent's text response typically includes inline citations.
| Service | Port | Notes |
|---|---|---|
| Backend API | 8000 | FastAPI (/v1/chat/completions) |
| OpenWebUI | 3000 | Chat frontend |
| PostgreSQL | 5432 | pgvector store |
| Ollama | 11434 | local LLM + embeddings |
| Phoenix | 6006 | tracing dashboard |
Port notes:
- PostgreSQL is mapped to host port 5433 (not 5432) to avoid conflicts with a local Postgres. When connecting from outside Docker, use
localhost:5433. Inside Docker, services usepostgres:5432. - OpenWebUI is mapped to host port 3000 (container port 8080).
OpenWebUI integration: Configure OPENAI_API_BASE_URL=http://backend:8000/v1 and OPENAI_API_KEY=dummy. OpenWebUI will discover models via /v1/models.
Session persistence: The API returns an X-Session-Id header. Reuse it on subsequent requests to keep conversation memory.
Health & service status: GET /health checks database, Ollama, and Phoenix. If DB or Ollama are down, status is unhealthy. If Phoenix is down, status is degraded.
- No API authentication — the backend API (port 8000) has no auth layer. OpenWebUI (port 3000) is the intended user-facing entry point and provides its own authentication. In production, remove the backend port mapping and place it behind a reverse proxy or API gateway.
- PDFs with complex tables/scans depend heavily on Docling parsing quality.
- Retrieval quality depends on chunking + embedding model choice.
- First launch may take several minutes while Ollama models are downloaded.
Backend can’t reach Ollama
- Check
OLLAMA_BASE_URLand that theollamaservice is up.
Mac GPU Ollama (optional override)
Use the compose override:
docker compose -f docker-compose.yml -f docker-compose.mac.yml up -dNo results retrieved
- Confirm the indexer ran successfully and vectors are in Postgres.
- Check DB connection string and schema migration ran.
# 1. Install the project in editable mode
pip install -e ".[dev,eval]"
# 2. Start Postgres (pgvector), Ollama, and Phoenix however you prefer,
# then point your .env at localhost (see Quick start note above).
# 3. Run the database migrations manually
psql "$DATABASE_URL" -f migrations/001_init_extensions.sql
psql "$DATABASE_URL" -f migrations/002_create_tables.sql
psql "$DATABASE_URL" -f migrations/003_create_indexes.sql
# 4. Pull the required Ollama models
ollama pull qwen3:1.7b
ollama pull qwen3-embedding:0.6b
# 5. Start the backend
agentic-apiAPI docs are available at http://localhost:8000/docs (Swagger UI).
pip install -e ".[dev,eval]"
ruff check src/ tests/
pytest -v
mypy src/agentic_ragInstall test dependencies before running pytest:
pip install -e ".[dev,eval]"MIT