Duade10/ragtrack

RagTrack

RagTrack is a lightweight Git-like version control layer for RAG knowledge bases.

It helps you:

  • track document changes across ingestions
  • avoid stale chunks in retrieval pipelines
  • search only the latest knowledge
  • inspect what changed between document versions
  • keep source traces for retrieved chunks

RagTrack is CLI-first and designed for small VPS deployments. By default it uses local files, SQLite, SQLAlchemy, keyword search, and a small server-rendered FastAPI UI. Optional semantic search uses FastEmbed, an ONNX Runtime-based embedding provider that avoids PyTorch and CUDA dependencies. RagTrack does not use Redis, Celery, MinIO, Kubernetes, authentication, external cloud services, or a JavaScript build pipeline.

What Problem It Solves

RAG apps often treat ingestion as a one-way import. Over time, source documents change, chunks become stale, and it becomes hard to answer simple questions:

  • Which files changed?
  • Which chunks belong to the latest document version?
  • What did the last ingestion add or remove?
  • Can I search current knowledge without keeping old chunks active?

RagTrack adds a small version-control layer around local document ingestion so RAG knowledge bases are easier to inspect, refresh, and trust.

Features

  • recursive ingestion for .pdf, .docx, .txt, .md, and .markdown
  • SHA256 file hashing for change detection
  • document versions stored in SQLite
  • chunk hashes and chunk metadata stored in SQLite
  • keyword search with no embedding package required
  • optional FastEmbed semantic embeddings using BAAI/bge-small-en-v1.5
  • optional SentenceTransformers provider for users who explicitly want it
  • semantic search limited to latest document versions when embedding support is installed
  • diff summaries between latest and previous versions
  • status summaries for project health
  • local FastAPI/Jinja web interface for upload and inspection
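The SHA256 change detection listed above can be sketched roughly as follows. This is a minimal illustration, not RagTrack's actual code: the function names `file_sha256` and `has_changed` are hypothetical, and the real implementation may store and compare hashes differently.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_sha256(path: Path, block_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, read in fixed-size blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in blocks so large PDFs are never loaded into memory at once.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def has_changed(path: Path, last_known_hash: Optional[str]) -> bool:
    """Treat a file as changed when it is new or its digest differs.

    `last_known_hash` stands in for whatever hash was stored at the
    previous ingestion (hypothetical; None means the file is untracked).
    """
    return last_known_hash is None or file_sha256(path) != last_known_hash
```

Under this scheme, an unchanged file would produce the same digest on re-ingestion and could be skipped without creating a new version.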

Installation

Create and activate a virtual environment:

python -m venv .venv
.\.venv\Scripts\Activate.ps1

Install RagTrack:

pip install -e .

Install FastEmbed semantic search support only when you want embeddings:

pip install -e ".[embed]"

Install the legacy SentenceTransformers provider only if you explicitly want it:

pip install -e ".[st]"

For development:

pip install -e ".[dev]"

For development with semantic search:

pip install -e ".[dev,embed]"

Local Usage

Create a docs directory:

mkdir docs

Ingest documents:

ragtrack ingest .\docs

Search latest chunks with the default keyword mode:

ragtrack search "wifi password"
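To illustrate what "latest chunks" means here, the sketch below runs a keyword match restricted to the newest version of each document. The table and column names (`document_versions`, `chunks`, `path`, `version`, `version_id`, `text`) are assumptions made for this example and are not taken from RagTrack's schema; the real keyword search may also rank results differently.

```python
import sqlite3

# Hypothetical schema for illustration only; RagTrack's real tables may differ.
SCHEMA = """
CREATE TABLE document_versions (
    id INTEGER PRIMARY KEY,
    path TEXT NOT NULL,
    version INTEGER NOT NULL
);
CREATE TABLE chunks (
    id INTEGER PRIMARY KEY,
    version_id INTEGER NOT NULL REFERENCES document_versions(id),
    text TEXT NOT NULL
);
"""

# Keep only chunks whose version is the highest one recorded for that path,
# then apply a simple substring match on the chunk text.
LATEST_KEYWORD_QUERY = """
SELECT v.path, v.version, c.text
FROM chunks AS c
JOIN document_versions AS v ON v.id = c.version_id
WHERE v.version = (
    SELECT MAX(v2.version)
    FROM document_versions AS v2
    WHERE v2.path = v.path
)
AND c.text LIKE '%' || ? || '%'
ORDER BY v.path, c.id
"""


def search_latest(conn: sqlite3.Connection, keyword: str) -> list:
    """Return (path, version, text) rows from latest versions only."""
    return conn.execute(LATEST_KEYWORD_QUERY, (keyword,)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO document_versions (id, path, version) VALUES (?, ?, ?)",
        [(1, "docs/house_rules.pdf", 1), (2, "docs/house_rules.pdf", 2)],
    )
    conn.executemany(
        "INSERT INTO chunks (version_id, text) VALUES (?, ?)",
        [(1, "The wifi password is old"), (2, "The wifi password is new")],
    )
    for row in search_latest(conn, "wifi password"):
        print(row)
```

In this sketch the chunk from version 1 stays in the database for diffing but never appears in search results, which is the behavior the stale-chunk problem calls for.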

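Semantic search requires embedding support. Install the embed extra, then set the search mode before searching:

pip install -e ".[embed]"
export SEARCH_MODE=semantic

In PowerShell, use $env:SEARCH_MODE = "semantic" instead of export.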

Show changes between latest and previous versions:

ragtrack diff

Show changes for one document:

ragtrack diff .\docs\house_rules.pdf

Show project status:

ragtrack status

Web Interface

RagTrack includes a lightweight server-rendered web interface built with FastAPI, Jinja templates, Tailwind CDN, and HTMX. It reads from the same SQLite database and search index as the CLI.

Run the local UI:

ragtrack serve

Then open:

http://127.0.0.1:8000

Bind to all interfaces on a VPS (RagTrack has no authentication, so restrict access at the network level):

ragtrack serve --host 0.0.0.0 --port 8000

Pages included:

  • Dashboard: project counts, recent activity, tracked documents
  • Documents: upload supported files, document list, and document detail pages
  • Versions: version history
  • Search: latest-version chunk search
  • Diff: latest vs previous chunk-hash summaries
  • Audit: lightweight integrity overview
  • Settings: local runtime paths and model configuration

Screenshots can be added under docs/screenshots/ as the interface stabilizes.

Uploaded files are saved to ./docs and ingested through the same versioning, parsing, chunking, and optional search-index pipeline as the CLI.

CLI Examples

ragtrack ingest ./docs
ragtrack search "check in time"
ragtrack diff
ragtrack diff ./docs/guest_manual.docx
ragtrack status
ragtrack serve

Example diff output (counts are illustrative):

house_rules.pdf

v1 -> v2

+ Added: 3 chunks
- Removed: 1 chunk
~ Modified: 5 chunks

Docker Usage

Build the image:

docker compose build

Start the UI with docker compose up, or run one-off CLI commands with docker compose run. Both mount your local ./docs directory at /docs and your local ./.ragtrack directory for persistence:

docker compose up
docker compose run --rm ragtrack ingest /docs
docker compose run --rm ragtrack search "wifi password"
docker compose run --rm ragtrack diff
docker compose run --rm ragtrack status

The Docker setup serves the UI on http://localhost:8000 and stores SQLite and local RagTrack data in ./.ragtrack on the host.

Configuration

RagTrack reads environment variables with the RAGTRACK_ prefix:

  • RAGTRACK_PROJECT_ROOT
  • RAGTRACK_DATA_DIR
  • RAGTRACK_DATABASE_URL
  • RAGTRACK_STORAGE_DIR
  • RAGTRACK_VECTOR_INDEX_PATH
  • RAGTRACK_EMBEDDING_PROVIDER
  • RAGTRACK_EMBEDDING_MODEL
  • RAGTRACK_SEARCH_MODE
  • RAGTRACK_LOG_LEVEL

Defaults:

  • database: .ragtrack/ragtrack.db
  • search mode: keyword
  • embedding provider: fastembed
  • embedding model: BAAI/bge-small-en-v1.5
  • semantic vector index: .ragtrack/faiss.json
  • vector metadata: .ragtrack/faiss_meta.json

Short environment variable aliases are also supported:

  • EMBEDDING_PROVIDER
  • SEARCH_MODE

Supported embedding providers:

  • fastembed
  • sentence-transformers

Supported search modes:

  • keyword
  • semantic
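The variable names, defaults, and allowed values above could be resolved along these lines. This is a hedged sketch: `resolve_setting` is a hypothetical helper, and the lookup order (prefixed variable first, then the short alias, then the default) is an assumption, since the precedence is not stated here.

```python
import os
from typing import Mapping, Optional

# Defaults and allowed values as listed in this README.
DEFAULTS = {"SEARCH_MODE": "keyword", "EMBEDDING_PROVIDER": "fastembed"}
ALLOWED = {
    "SEARCH_MODE": {"keyword", "semantic"},
    "EMBEDDING_PROVIDER": {"fastembed", "sentence-transformers"},
}


def resolve_setting(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Resolve a setting from RAGTRACK_<name>, then <name>, then the default.

    The precedence is assumed for illustration. Empty strings are treated
    the same as unset variables.
    """
    env = os.environ if env is None else env
    value = env.get(f"RAGTRACK_{name}") or env.get(name) or DEFAULTS[name]
    if value not in ALLOWED[name]:
        raise ValueError(f"Unsupported {name}: {value!r}")
    return value


if __name__ == "__main__":
    print(resolve_setting("SEARCH_MODE"))
    print(resolve_setting("EMBEDDING_PROVIDER"))
```

Validating against the allowed values up front means a typo such as SEARCH_MODE=semantc fails loudly instead of silently falling back to keyword search.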

Why FastEmbed

FastEmbed is the recommended semantic embedding provider for RagTrack because it is CPU-first, ONNX Runtime-based, and avoids the PyTorch, Transformers, CUDA, and NVIDIA dependency chain that makes installs heavy on small VPS machines. This keeps the default RagTrack experience lightweight while still allowing semantic retrieval when users opt in.

V1 Limitations

  • diff uses chunk hashes and estimates modified chunks as min(added, removed)
  • semantic search is optional and requires pip install -e ".[embed]"
  • SentenceTransformers remains available through pip install -e ".[st]"
  • when semantic search is enabled, the single local vector index is rebuilt after each CLI or UI ingestion
  • only latest document versions are searchable
  • no background workers
  • web UI upload is synchronous in V1
  • no authentication
  • no remote object storage
  • no hosted vector database
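The chunk-hash diff estimate mentioned in the first limitation can be sketched as below. The function name `diff_summary` is hypothetical, and the sketch reports raw added and removed counts next to the estimate; whether RagTrack subtracts the modified estimate from those counts is not stated here.

```python
from typing import Dict, Iterable


def diff_summary(
    previous_hashes: Iterable[str], latest_hashes: Iterable[str]
) -> Dict[str, int]:
    """Summarize a version change from chunk hashes alone.

    A hash present only in the latest version counts as added, and a hash
    present only in the previous version counts as removed. Hashes cannot
    say which old chunk became which new chunk, so the number of modified
    chunks is estimated as min(added, removed).
    """
    previous, latest = set(previous_hashes), set(latest_hashes)
    added = len(latest - previous)
    removed = len(previous - latest)
    return {
        "added": added,
        "removed": removed,
        "modified": min(added, removed),
    }


if __name__ == "__main__":
    # Placeholder strings stand in for real chunk hashes.
    print(diff_summary(["a", "b", "c"], ["a", "x", "y", "z"]))
```

The estimate is a heuristic: an edited chunk produces one removed hash and one added hash, so pairing them up gives an upper bound on how many chunks were edited rather than replaced outright.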

Roadmap

  • richer diff output with chunk previews
  • configurable chunking strategies
  • parser metadata improvements
  • incremental vector index updates
  • optional Qdrant or pgvector backend
  • export/import project snapshots
  • document restore and pinning
  • better source citation formatting

Development

Run tests:

python -m pytest

Project layout:

ragtrack/
|-- ragtrack/
|   |-- cli/
|   |-- db/
|   |-- models/
|   |-- services/
|   |-- parsers/
|   |-- chunking/
|   |-- embeddings/
|   |-- vectorstore/
|   `-- storage/
|-- tests/
|-- pyproject.toml
|-- Dockerfile
|-- docker-compose.yml
`-- README.md
