Agentic RAG (OpenWebUI + pgvector + Ollama)

This repo is an end-to-end agentic RAG system: ingest PDFs, store embeddings in Postgres/pgvector, answer questions through an OpenAI-compatible API for OpenWebUI, and trace activity in Arize Phoenix.

What’s included

Indexer CLI (Docling → chunk → embed → store in pgvector)
Backend API (FastAPI + /v1/chat/completions + agent tools)
Evaluator CLI (RAGAS metrics over a generated/curated test set)
Observability (Phoenix traces for retrieval + tool calls)

Design choices (at a glance)

Area	Choice	Where
REST API contract	OpenAI-compatible (`/v1/chat/completions`, `/v1/models`)	`src/agentic_rag/backend/api/v1/`
Chunking strategy	Heading-first contextual chunking with optional LLM context	`src/agentic_rag/indexer/chunking.py`
Embedding model	`qwen3-embedding:0.6b` (Ollama)	`.env.example`
LLM model	`qwen3:1.7b` (Ollama)	`.env.example`
Retrieval	Hybrid (pgvector + Postgres full-text) with RRF fusion	`src/agentic_rag/backend/rag/retriever.py`
Re-ranking	LLM reranker (Ollama)	`src/agentic_rag/backend/rag/reranker.py`
Agent prompts	Jinja2 prompts synced to Phoenix	`src/agentic_rag/prompts/`

Demo flow

Start the stack
Drop PDFs into data/raw/
Run the indexer
Open OpenWebUI and chat
Open Phoenix and inspect traces
Run evaluation and review RAGAS scores

Quick start

git clone <repo-url>
cd agentic-rag
cp .env.example .env

docker compose up -d
curl http://localhost:8000/health

On first launch the ollama-init service automatically pulls the models defined in .env (LLM_MODEL and EMBEDDING_MODEL), and the backend applies SQL migrations on startup — no manual steps required.

Index the sample documents

The repo includes PDPL (Personal Data Protection Law) documents in data/sample/. To index them so the chatbot can answer questions:

docker compose exec backend agentic-index --source data/sample/

Then open http://localhost:3000 (OpenWebUI) and ask questions like "What is PDPL?" or "What are the rules for transferring personal data outside the Kingdom?"

Mac with host Ollama (Metal GPU): Use the compose override to skip the containerised Ollama and its init job:
docker compose -f docker-compose.yml -f docker-compose.mac.yml up -d
You must pull the models yourself: ollama pull qwen3:1.7b && ollama pull qwen3-embedding:0.6b

Full reset: To wipe all data and start fresh:

docker compose down -v          # removes containers + volumes
docker compose up -d            # recreates everything

Local Development (outside Docker): The .env.example uses Docker service names (postgres, ollama, phoenix). If running locally without Docker, update these to localhost in your .env file. Note: if Docker Compose is running, Postgres is on host port 5433, not 5432:
DATABASE_URL=postgresql+asyncpg://postgres:postgres@localhost:5433/ragdb
OLLAMA_BASE_URL=http://localhost:11434
PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006/v1/traces

Index documents

Put PDFs in data/raw/ then:

agentic-index --source data/raw/

Index versioning: If you change the embedding model, tokenizer, or chunking settings, bump INDEX_VERSION (in .env) and re-run the indexer. This keeps retrieval aligned to the correct embedding space.

Chunking modes:

Mode	Command	What it does	When to use
`fast`	`agentic-index --source data/raw/`	Structured prefix only (`[Doc: ...][Section: ...]`)	Default. Fast, deterministic, good for most documents
`llm`	`agentic-index --source data/raw/ --mode llm`	Prefix + LLM-generated context summary per chunk	When embedding quality matters more than indexing speed. Uses first 6000 chars of the document for context, so works best with focused documents. Non-deterministic.

For PDFs, Docling extracts page count and the chunker estimates page numbers per chunk based on character offsets. Markdown files don't have page numbers.

Chat (OpenWebUI)

OpenWebUI: http://localhost:3000
Backend API: http://localhost:8000
Ollama: http://localhost:11434

The backend exposes:

GET /v1/models
POST /v1/chat/completions
GET /docs — interactive Swagger UI

Ollama modes

Default (portable): Use the Ollama container. OLLAMA_BASE_URL=http://ollama:11434
Optional (Mac speed): Use host Ollama with Metal acceleration:

docker compose -f docker-compose.yml -f docker-compose.mac.yml up -d

Evaluate (RAGAS)

# Make sure evaluator model is pulled
ollama pull qwen3:4b

# 1. Generate a synthetic test set from indexed chunks
agentic-eval generate --num-samples 10 --output eval_testset.json

# 2. Run retrieval + answer pipeline and compute RAGAS metrics
agentic-eval evaluate --testset eval_testset.json --output eval_results.json

# 3. Pretty-print the results
agentic-eval report --results eval_results.json

Continuous evaluation (monitoring)

Run evaluations on a schedule to monitor retrieval quality over time:

agentic-eval monitor --testset eval_testset.json --output-dir eval_runs --interval-seconds 3600

Set --skip-ragas for faster retrieval-only monitoring.

Note on evaluation data: agentic-eval generate creates a synthetic Q/A dataset from random chunks. If you need curated ground-truth, provide a JSON file in the same format (question, ground_truth, and optional metadata) and pass it to agentic-eval evaluate.

RAGAS evaluation uses a separate evaluator model (EVAL_MODEL, default: qwen3:4b) to avoid self-evaluation bias — the chat model does not judge its own output. Pull it before running evaluation:

# If using Docker:
docker compose exec ollama ollama pull qwen3:4b

# If running Ollama locally:
ollama pull qwen3:4b

Override the evaluator model via EVAL_MODEL in .env if needed.

Traces (Phoenix)

Phoenix UI: http://localhost:6006

What to check:

retrieved chunks and scores
tool call sequence (retriever → rerank → response)

Prompt management (Phoenix)

Prompts are stored as Jinja2 templates in src/agentic_rag/prompts/. Some are only used in optional modes (agent mode, LLM chunking, or eval generation).

Template	Used by	Purpose
`system_prompt.j2`	Chat endpoint	System instructions for the chat model
`user_prompt.j2`	Chat endpoint, evaluator	Main RAG prompt: injects query + retrieved context
`context_generation_template.j2`	Indexer (`--mode llm`)	Generates contextual summaries per chunk (Anthropic-style)
`reranker_template.j2`	LLM reranker	Scores chunk relevance to a query
`researcher_backstory.j2`	CrewAI researcher agent	Agent persona and instructions
`writer_backstory.j2`	CrewAI writer agent	Agent persona and instructions
`qa_generation_template.j2`	Evaluator (testset generation)	Generates synthetic Q/A pairs from chunks
`scope_anchors.txt`	Scope gate	Anchor phrases used to classify in-scope queries

Phoenix sync: When PHOENIX_PROMPT_SYNC=true (default in .env.example), the backend and CLI tools push all templates to Phoenix on startup and tag them with PHOENIX_PROMPT_TAG (default: development). In production, set the tag to production to version prompts in the Phoenix UI.

When PHOENIX_PROMPT_SYNC=false, prompts are served from the local .j2 files only. Disable sync during local development to avoid unnecessary Phoenix calls.

In production (ENVIRONMENT=prod), PromptRegistry.render() and get_template() fetch the tagged prompt from Phoenix first and fall back to local if Phoenix is unreachable.

Phoenix checklist:

Set ENVIRONMENT=prod and PHOENIX_PROMPT_TAG=demo (in .env or your shell)
Start backend or CLI
In Phoenix UI, confirm prompts exist under the tag
Edit a prompt, re-run a query, and confirm the response changes

Retrieval & reranker tuning

RRF weights: Configure in .env with RRF_WEIGHT_VECTOR and RRF_WEIGHT_KEYWORD.

Reranker settings:

Setting	Default	Notes
`TOP_K_RERANK`	5	Final number of chunks returned after reranking
`RERANKER_TIMEOUT`	30s	Total timeout; falls back to retrieval order on expiry
`TOP_K_RETRIEVAL`	10	Candidates from hybrid search before reranking

The reranker is only active in agent mode (CrewAI path). Fast RAG skips it entirely.

Citation format

Each response includes structured citations with complete source metadata. The backend returns an AgentResponse with a citations array containing:

Citation Schema:

{
  "document_id": "uuid",
  "chunk_id": "uuid", 
  "file_name": "document.pdf",
  "page_number": 12,
  "section_path": "Introduction > Overview",
  "chunk_text": "Retrieved text snippet...",
  "score": 0.92
}

Fields:

document_id: UUID of source document
chunk_id: UUID of specific chunk
file_name: Original filename
page_number: Page number (null if unavailable)
section_path: Hierarchical section location (e.g., "Chapter 1 > Section 1.2")
chunk_text: Actual retrieved text
score: Relevance score (0.0-1.0)

The agent's text response typically includes inline citations.

Services

Service	Port	Notes
Backend API	8000	FastAPI (`/v1/chat/completions`)
OpenWebUI	3000	Chat frontend
PostgreSQL	5432	pgvector store
Ollama	11434	local LLM + embeddings
Phoenix	6006	tracing dashboard

Port notes:

PostgreSQL is mapped to host port 5433 (not 5432) to avoid conflicts with a local Postgres. When connecting from outside Docker, use localhost:5433. Inside Docker, services use postgres:5432.
OpenWebUI is mapped to host port 3000 (container port 8080).

OpenWebUI integration: Configure OPENAI_API_BASE_URL=http://backend:8000/v1 and OPENAI_API_KEY=dummy. OpenWebUI will discover models via /v1/models.

Session persistence: The API returns an X-Session-Id header. Reuse it on subsequent requests to keep conversation memory.

Health & service status: GET /health checks database, Ollama, and Phoenix. If DB or Ollama are down, status is unhealthy. If Phoenix is down, status is degraded.

Known limitations (current)

No API authentication — the backend API (port 8000) has no auth layer. OpenWebUI (port 3000) is the intended user-facing entry point and provides its own authentication. In production, remove the backend port mapping and place it behind a reverse proxy or API gateway.
PDFs with complex tables/scans depend heavily on Docling parsing quality.
Retrieval quality depends on chunking + embedding model choice.
First launch may take several minutes while Ollama models are downloaded.

Troubleshooting

Backend can’t reach Ollama

Check OLLAMA_BASE_URL and that the ollama service is up.

Mac GPU Ollama (optional override)

Use the compose override:

docker compose -f docker-compose.yml -f docker-compose.mac.yml up -d

No results retrieved

Confirm the indexer ran successfully and vectors are in Postgres.
Check DB connection string and schema migration ran.

Local development (without Docker)

# 1. Install the project in editable mode
pip install -e ".[dev,eval]"

# 2. Start Postgres (pgvector), Ollama, and Phoenix however you prefer,
#    then point your .env at localhost (see Quick start note above).

# 3. Run the database migrations manually
psql "$DATABASE_URL" -f migrations/001_init_extensions.sql
psql "$DATABASE_URL" -f migrations/002_create_tables.sql
psql "$DATABASE_URL" -f migrations/003_create_indexes.sql

# 4. Pull the required Ollama models
ollama pull qwen3:1.7b
ollama pull qwen3-embedding:0.6b

# 5. Start the backend
agentic-api

API docs are available at http://localhost:8000/docs (Swagger UI).

Development

pip install -e ".[dev,eval]"
ruff check src/ tests/
pytest -v
mypy src/agentic_rag

Testing prerequisites

Install test dependencies before running pytest:

pip install -e ".[dev,eval]"

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github/workflows		.github/workflows
data		data
docker		docker
migrations		migrations
scripts		scripts
src/agentic_rag		src/agentic_rag
tests		tests
.dockerignore		.dockerignore
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
docker-compose.mac.yml		docker-compose.mac.yml
docker-compose.yml		docker-compose.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Agentic RAG (OpenWebUI + pgvector + Ollama)

What’s included

Design choices (at a glance)

Demo flow

Quick start

Index the sample documents

Index documents

Chat (OpenWebUI)

Ollama modes

Evaluate (RAGAS)

Continuous evaluation (monitoring)

Traces (Phoenix)

Prompt management (Phoenix)

Retrieval & reranker tuning

Citation format

Services

Known limitations (current)

Troubleshooting

Local development (without Docker)

Development

Testing prerequisites

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Agentic RAG (OpenWebUI + pgvector + Ollama)

What’s included

Design choices (at a glance)

Demo flow

Quick start

Index the sample documents

Index documents

Chat (OpenWebUI)

Ollama modes

Evaluate (RAGAS)

Continuous evaluation (monitoring)

Traces (Phoenix)

Prompt management (Phoenix)

Retrieval & reranker tuning

Citation format

Services

Known limitations (current)

Troubleshooting

Local development (without Docker)

Development

Testing prerequisites

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages